EMO: the incredible AI that makes photos talk and sing

Alibaba has just unveiled its new generative AI, called EMO, capable of making a person speak or sing from a simple photo with striking realism. The result is as fascinating as it is disturbing…

Artificial intelligence continues to develop, to the point of showing us that it can do almost anything! ChatGPT and its variants are already capable of generating all kinds of text and code, while Midjourney and company create made-to-measure images. But after more than a year of watching these tools improve and achieve ever more breathtaking results, the “wow” effect is beginning to fade, to the point that we are getting used to seeing AI integrated into a multitude of products and services. Yet artificial intelligence still has surprises in store for us! A few days ago, OpenAI unveiled Sora, its new AI capable of generating stunningly realistic videos from a simple text description (see our article), while Suno AI has developed a tool of the same name capable of generating music in 30 seconds from a simple text prompt (see our article). And the results, to say the least, are stunning!

It is now the turn of Alibaba's Institute for Intelligent Computing to reveal its latest feat, in a research article published on February 27. The Chinese owner of AliExpress has developed an AI model, called EMO, capable of transforming photos – more precisely, portraits coupled with soundtracks – into realistic videos thanks to “advanced audio-video synthesis”. To put it simply, any photo can start singing, with impressively precise lip-syncing. Want to make a photo of a 20-year-old Leonardo DiCaprio sing Eminem? Treat yourself!

EMO: make anyone sing anything

The realism of the generated videos can be genuinely perplexing. The subjects' cheekbones and throats move, as do their eyebrows. It even works with animated or painted characters like the Mona Lisa, who can then deliver a Shakespeare monologue with ease!

To train EMO, the researchers “created an audio-video database rich in 250 hours of content and 150 million images,” the article explains. “Audio content is rich in information regarding facial expressions, theoretically making it possible to generate a wide range of facial movements.” The feat lies in the fact that they were able to do without intermediate 3D models or facial landmarks to bring the portraits to life: EMO transforms the audio data directly into facial animation, a bit like the way Sora transposes a text description into video. “Our method guarantees very expressive and realistic animations,” the article underlines.
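
To make the “no 3D model, no landmarks” idea more concrete, here is a minimal sketch, in Python, of what such a direct audio-to-video pipeline boils down to. Everything in it – the function names, the energy-based audio feature, the toy frame generator – is an assumption invented for this illustration, not Alibaba's actual method; the point is only the structure: per-frame audio features drive the frame generator directly, with the portrait photo supplying the identity.

```python
import numpy as np

# Illustrative sketch only: every name and shape below is an assumption made
# for this example, not Alibaba's EMO code. The structure is the idea:
# audio features condition the frame generator directly, with no 3D face
# model or facial landmarks in between.

def extract_audio_features(waveform, sr, fps):
    """Cut the waveform into one window per video frame and summarize each.
    A real system would use a pretrained speech encoder; per-window energy
    serves as a stand-in feature here."""
    samples_per_frame = sr // fps
    n_frames = len(waveform) // samples_per_frame
    windows = waveform[: n_frames * samples_per_frame].reshape(n_frames, -1)
    return np.sqrt((windows ** 2).mean(axis=1))          # shape: (n_frames,)

def generate_frame(portrait, audio_feat, prev_frame):
    """Stand-in for the learned generator: produce the next frame from the
    identity photo, the current audio feature, and the previous frame
    (the latter for temporal consistency)."""
    frame = 0.9 * prev_frame + 0.1 * portrait            # keep the identity
    frame[100:120, 80:140] += float(audio_feat)          # crude mouth motion
    return np.clip(frame, 0.0, 1.0)

def animate(portrait, waveform, sr=16000, fps=25):
    """One output frame per 1/fps seconds of audio."""
    frames, prev = [], portrait
    for feat in extract_audio_features(waveform, sr, fps):
        prev = generate_frame(portrait, feat, prev)
        frames.append(prev)
    return np.stack(frames)                              # (n_frames, H, W)

portrait = np.random.rand(256, 256)     # the single input photo (grayscale toy)
speech = np.random.randn(2 * 16000)     # two seconds of fake audio
video = animate(portrait, speech)
print(video.shape)                      # (50, 256, 256)
```

In the real system the toy generator above would be a trained network, but the interface is the same: one photo and one soundtrack in, a sequence of video frames out.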

The results are stunning… and a little worrying. This technology throws the door wide open to disinformation, since it makes it possible to put any words in anyone's mouth from a single good-quality photo. And since it is even simpler to use than today's deepfake techniques, we dare not imagine the chaos this could create… The team of researchers says it is “perfectly conscious” of the ethical problems EMO could raise. For now, the AI has not been made available to the general public, and the research team is “committed to exploring methods for detecting synthetic videos”.


