‘VALL-E’ is mind-blowing! A 3-second sample is enough for it, and what comes next is very interesting


First came AI tools that create images from text, then chatbots that answer questions like a human, and now a new artificial intelligence model has joined them. While work on artificial intelligence continues around the world, researchers at Microsoft have announced a new model that can generate speech from text. The model is called “VALL-E,” a name that brings to mind DALL-E, OpenAI’s artificial intelligence program that creates images from text.


According to Ars Technica, Microsoft researchers announced VALL-E, a new artificial intelligence model that can generate speech from text, on Thursday. The report says VALL-E can closely mimic a person’s voice when given a three-second audio sample. It reportedly does this in a way that tries to preserve the emotion in the speaker’s tone.

Microsoft calls VALL-E a “neural codec language model” and says it leverages a technology called EnCodec, which Meta announced in October 2022.

Microsoft reportedly trained VALL-E on an audio library containing 60,000 hours of English-language speech from more than 7,000 speakers.

It is speculated that VALL-E could be used, alongside other AI models, for high-quality text-to-speech applications and audio content creation. But because the model can closely mimic voices, it can also be made to say things the speaker never said.

The ethics statement shared by the researchers underlines that VALL-E carries potential risks of misuse.


Numerous audio samples from VALL-E have been published on GitHub. Some of them are quite surprising. In some, VALL-E appears to pick up on the acoustic environment and emotion of the sample. For example, if the speaker’s voice echoes or the speaker sounds angry, the system seems to produce speech that matches.