AI that perfectly imitates the human voice

Microsoft has just presented VALL-E 2, the second version of its AI capable of synthesizing someone’s voice from an audio sample of only three seconds. The result is now indistinguishable from a human voice.

Microsoft presented VALL-E in January 2023. The name is a nod to the small Pixar robot. It is an artificial intelligence tool, more precisely a “neural codec language model” for text-to-speech, capable of reproducing any voice (see our article). In itself, that was not new. What was new was its speed of learning, since it needed only a three-second excerpt to “copy” a voice, as well as its ability to replicate the emotions of the person speaking. It could also generate a recording of words and sentences that the speaker had never actually said.

Microsoft has now taken a further step and has just announced, in a blog post, the second version, soberly named VALL-E 2. Until now, the AI’s output had small imperfections in wording or intonation that betrayed its artificial origin. With VALL-E 2, however, the company’s researchers believe they have achieved “human parity” for the first time, meaning that the synthesized speech cannot be distinguished from that of a real person in benchmark tests. This is a major advance in the field of speech synthesis, but one that poses a real challenge in terms of ethics and security.

VALL-E 2: a synthetic voice impossible to differentiate from a human one

To improve VALL-E’s rendering, Microsoft added two major technical innovations to the way the AI processes speech data: repetition-aware sampling and grouped code modeling. The former lets the AI convert text to speech more smoothly and naturally by avoiding repetitions of “tokens” (the small discrete units the model processes, comparable to words or parts of words in a text model). The latter increases the tool’s efficiency by packing tokens into groups, which reduces the number of items it has to process in a single input sequence. This helps speed up speech generation, “even for sentences that are traditionally difficult because of their complexity or repetitive phrases”.
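For readers who want a more concrete picture, the two ideas can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Microsoft's implementation: the function names, the window size, the repetition threshold and the `top_p` value are all hypothetical, and the "model" is reduced to a plain list of token probabilities. The sketch assumes that sampling normally draws from the most likely tokens only, and falls back to the full distribution when the chosen token already dominates the recent history.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Draw a token id from the smallest set of tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng,
                            top_p=0.8, window=10, max_ratio=0.1):
    """Hypothetical sketch: resample when a token repeats too often recently.

    window and max_ratio are illustrative values, not published settings.
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # Too repetitive: draw again from the full distribution instead.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


def group_codes(codes, group_size):
    """Pack consecutive codes into fixed-size groups, shortening the sequence."""
    return [tuple(codes[i:i + group_size])
            for i in range(0, len(codes), group_size)]
```

Calling `group_codes([1, 2, 3, 4, 5, 6], 2)` turns six items into three groups of two, which is the sense in which grouping shortens the sequence the model has to handle.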

The results are impressive! You can listen to the various demonstrations of this technology on the Microsoft page and judge for yourself. The company believes that VALL-E 2 could be used in different sectors, such as “educational learning, entertainment, journalism, self-written content, accessibility features, interactive voice response systems, translation, chatbot”. Possible uses include high-quality text-to-speech applications, speech editing (where a person’s recording is edited and modified from a text transcription), and the creation of audio content by combining VALL-E with other generative AI models, for video or 3D animation for example. Microsoft also sees the possibility of using it to help people with disabilities.

VALL-E 2: opening the door to dangerous new abuses

Such technology is incredible, but unfortunately it is not without risk. A malicious person could use it to imitate the voice of a loved one, a celebrity or a politician during a phone call, and thereby extract large sums of money or spread false or sensitive information. Abuses of artificial intelligence are nothing new. Just look at deepfakes (photos or videos that use artificial intelligence to superimpose one person’s face onto another’s, and so produce “fake” people), which are used for revenge porn or fake news.

Microsoft is well aware of this, so the Redmond firm has decided not to make VALL-E 2 accessible to the public. It also specifies that the tool was designed solely for research purposes, with no intention of later integrating it into a product or broadening access to the public. That is indeed the more prudent course…
