Microsoft’s AI that mimics human voices

Microsoft has just presented VALL-E, an AI capable of synthesizing anyone's voice, emotions included, from an audio sample of only three seconds. A promising but frightening technology…

Microsoft is definitely counting on the artificial intelligence (AI) developed by OpenAI! The firm plans to invest 10 billion dollars in the company – on top of the billion dollars already invested in 2019 – and to integrate the ChatGPT conversational AI into its Microsoft 365 suite and its Bing search engine. And this is only the beginning! It has just published a demonstration of a new artificial intelligence tool of its own. Called VALL-E – a nod to the DALL-E image generator, also developed by OpenAI – it is capable of reproducing any voice. In itself, this is nothing new. What stands out, however, is its learning speed, since it needs only a three-second extract to “copy” a voice, as well as its ability to replicate the emotions of the person who speaks. It can even produce recordings of words and phrases that the speaker has never uttered. And this type of AI only improves over time. The results are as promising as they are worrying, opening the door to many abuses…

VALL-E: a larger-than-life voice from a 3-second extract

VALL-E is a “neural codec language model” for text-to-speech (TTS) synthesis, meaning it can synthesize a voice from written text. To achieve this, the researchers used machine learning and trained the AI on more than 60,000 hours of English speech uttered by more than 7,000 speakers reading free public-domain audiobooks available on LibriVox.
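Microsoft has not released the model itself (more on that below), but the general shape of the pipeline can be sketched. The toy Python below is purely illustrative: the NeuralCodec and CodecLanguageModel classes are hypothetical stand-ins invented for this sketch, not VALL-E's actual (unreleased) components.

```python
# Purely illustrative sketch of the "neural codec language model" pipeline:
# 1) a neural codec compresses the 3-second prompt into discrete tokens,
# 2) a language model continues that token stream, conditioned on the text,
# 3) the codec decodes the generated tokens back into a waveform.
# The classes below are toy stand-ins, not VALL-E's actual (unreleased) code.

import numpy as np

class NeuralCodec:
    """Toy codec: quantizes samples in [-1, 1] to 256 discrete tokens."""
    def encode(self, waveform: np.ndarray) -> np.ndarray:
        return np.clip((waveform + 1.0) * 127.5, 0, 255).astype(np.int64)

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        return tokens.astype(np.float64) / 127.5 - 1.0

class CodecLanguageModel:
    """Toy LM: 'continues' the prompt tokens; the real model is a Transformer
    trained on tens of thousands of hours of tokenized speech."""
    def generate(self, text_tokens: list[int], prompt_tokens: np.ndarray) -> np.ndarray:
        rng = np.random.default_rng(sum(text_tokens))
        # Sample new tokens near the prompt's distribution, so the output
        # loosely "keeps" the speaker's characteristics.
        n_new = 800 * len(text_tokens)
        base = rng.choice(prompt_tokens, size=n_new)
        noise = rng.integers(-8, 9, size=n_new)
        return np.clip(base + noise, 0, 255)

def synthesize(text: str, prompt_waveform: np.ndarray) -> np.ndarray:
    codec, lm = NeuralCodec(), CodecLanguageModel()
    prompt_tokens = codec.encode(prompt_waveform)   # the 3-second speaker prompt
    text_tokens = [ord(c) for c in text]            # crude stand-in for phonemes
    return codec.decode(lm.generate(text_tokens, prompt_tokens))

# Usage: a 3-second tone at 16 kHz stands in for the speaker prompt.
prompt = np.sin(np.linspace(0, 3 * 2 * np.pi * 220, 3 * 16000))
audio = synthesize("Hello world", prompt)
print(audio.shape)
```

The key design idea the sketch mimics is that speech is treated like text: once audio is reduced to discrete tokens, voice cloning becomes a familiar "continue the sequence" language-modeling problem.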

Microsoft has shared several of the resulting audio snippets on GitHub. The first table is divided into four columns, each containing an audio clip. The first, titled “Speaker Prompt”, is the three-second sample from which VALL-E synthesizes the voice. The second, “Ground Truth”, is a recording made by the same speaker, provided for comparison with the result obtained by Microsoft’s AI. The third, “Baseline”, is an extract produced by conventional speech synthesis. Finally, the “VALL-E” column contains the snippet spoken by Microsoft’s AI.

Further extracts and comparisons follow, showing that the artificial intelligence can generate varied vocal timbres and tones at random: the same sentence pronounced twice by the AI will not produce the same result. Likewise, it can preserve the acoustic environment of the source extract when synthesizing the “fake” voice, as well as the original emotion – Microsoft offers examples for anger, sleepiness, amusement, disgust and neutrality. For the moment, the results are quite uneven: the synthesized voice is sometimes robotic, sometimes genuinely stunning. But VALL-E will surely improve over time, as it is still in its infancy.

VALL-E: an open door to dangerous new abuses

VALL-E could be used for high-quality text-to-speech applications, for speech editing – where a person’s recording is edited and modified on the basis of a text transcription – or for creating audio content in combination with other generative AI models, for videos or 3D animation for example. However, unlike ChatGPT and DALL-E, which are publicly accessible, Microsoft has not shared the code of its AI, precisely to avoid abuses. It is therefore not possible to test it yourself at the moment.

This is because VALL-E raises questions of morality, ethics and security: would such a tool not be dangerous if it were open to the general public? It is in any case a question that rightly preoccupies Microsoft, which explains: “Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.”
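As a rough illustration of what such a detection model could look like, here is a minimal Python sketch: a binary classifier over crude spectral features. The feature extraction and the logistic-regression model are assumptions made for this example, not Microsoft's actual method, and the "training data" here is synthetic noise standing in for real labeled clips.

```python
# Minimal sketch of a real-vs-synthesized audio detector.
# The feature choice (mean log-magnitude spectrum) and the classifier are
# illustrative assumptions, not Microsoft's actual detection approach.

import numpy as np
from sklearn.linear_model import LogisticRegression

def spectral_features(waveform: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Average log-magnitude spectrum over frames: a crude audio fingerprint."""
    frames = [waveform[i:i + n_fft] for i in range(0, len(waveform) - n_fft, n_fft)]
    spectra = np.abs(np.fft.rfft(np.asarray(frames), axis=1))
    return np.log1p(spectra).mean(axis=0)

# Hypothetical training set: clips labeled 0 (real) or 1 (synthesized).
# Random noise is used here only so the example runs end to end.
rng = np.random.default_rng(0)
real = [rng.normal(size=16000) for _ in range(20)]
fake = [rng.normal(size=16000) * 0.5 for _ in range(20)]

X = np.array([spectral_features(w) for w in real + fake])
y = np.array([0] * len(real) + [1] * len(fake))

detector = LogisticRegression(max_iter=1000).fit(X, y)

# Score a new clip: probability that it was machine-generated.
clip = rng.normal(size=16000)
print("P(synthesized) =", detector.predict_proba([spectral_features(clip)])[0, 1])
```

In practice, a production detector would be trained on large sets of genuine recordings and VALL-E outputs, but the principle is the same: learn the statistical fingerprints that synthesis leaves behind.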

The abuses enabled by artificial intelligence are nothing new. Just look at deepfakes – photos or videos that use artificial intelligence to graft one face onto another, and thus fabricate “fake” people – used for revenge porn or fake news, the hijacking of ChatGPT as a cheating tool in schools, or the creation of copycat applications designed to scam users. Just imagine the havoc if a politician's speech were altered by this artificial intelligence… This is why it is essential to put protections in place before making VALL-E widely available. And even then, there is no guarantee that will be enough…
