Feeding artificial intelligence with data generated by AI, a risky bet

    If artificial intelligence (AI) models are repeatedly trained with data generated by AI, they begin to produce increasingly inconsistent content, a problem highlighted by several scientific studies.

    The models that underpin generative AI tools like ChatGPT, which can generate all sorts of content based on a simple query in everyday language, need to be trained on an astronomical amount of data.

    That data is often gleaned from the web, which increasingly contains images and text created by AI. This “autophagy”, in which AI feeds on AI, leads to model collapse: the models produce answers that are at first less and less original and relevant, and eventually make no sense, according to an article published at the end of July in the scientific journal Nature.

    Concretely, when this type of data is used (it is called “synthetic data” because it is generated by machines), the sample from which AI models draw to produce their answers loses richness.

    It is like repeatedly scanning an image and printing the scan. With each round, the result loses quality until it becomes illegible.
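
    The loss of richness described above can be illustrated with a toy simulation. This is only a sketch under strong assumptions, not the method used in the Nature study: the “model” here is simply a Gaussian fitted to ten numbers, each new generation is trained only on the previous model's output, and the function name simulate_collapse and its parameters are hypothetical.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation ("spread") at each generation,
    starting with the fit to the original, human-like data.
    """
    rng = random.Random(seed)
    # Generation 0: stand-in for human data, drawn from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    spreads = []
    for _ in range(generations + 1):
        # "Training": estimate the mean and spread of the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        spreads.append(sigma)
        # The next generation sees only synthetic data from this model.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    return spreads


if __name__ == "__main__":
    spreads = simulate_collapse()
    for generation in (0, 50, 100, 200):
        print(f"generation {generation:3d}: spread = {spreads[generation]:.6f}")
```

    In this toy setting the spread tends to shrink from one generation to the next, because each small sample slightly underestimates the diversity of the one before it. Real training pipelines mix in fresh and filtered data, so the sketch shows the mechanism rather than predicting what happens to any production model.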

    “Mad Cow Disease”

    Researchers from Rice and Stanford universities in the US came to the same conclusion by studying the image-generating AI models Midjourney, Dall-E and Stable Diffusion.

    They showed that the generated images became increasingly commonplace and were progressively peppered with incongruous elements as “artificial” data was added to the model, and they compared this phenomenon to “mad cow” disease.

    This epidemic, which appeared in the United Kingdom, is believed to have originated from the use of animal meal in cattle feed, obtained from uneaten parts of bovine carcasses and from the corpses of contaminated animals.

    However, companies in the artificial intelligence sector frequently use “synthetic data” to train their programs because of its ease of access, abundance and low cost compared to human-created data.

    “Untapped, high-quality, machine-readable human data sources are becoming increasingly rare,” Jathan Sadowski, a researcher specializing in new technologies at Monash University in Australia, explained to AFP.

    Without any control for several generations, a disaster scenario “would be that the model collapse syndrome poisons data quality and diversity for the entire Internet,” warned Richard Baraniuk, one of the authors of the Rice University paper, in a statement.

    Just as the mad cow crisis devastated the meat industry in the 1990s, an internet filled with AI-driven content and models gone “crazy” could threaten the future of a booming, multi-billion dollar AI industry, these scientists say.

    “The real question for researchers and the companies building AI systems is: at what point does the use of synthetic data become too great?” adds Jathan Sadowski.

    Unrealistic scenario

    But for other specialists, the problem is exaggerated and far from inevitable.

    Anthropic and Hugging Face, two rising stars in the field of artificial intelligence, confirmed to AFP that they use data generated by AI.

    The Nature article offers an interesting theoretical perspective, but not a very realistic one, according to Anton Lozhkov, a machine learning engineer at Hugging Face.

    “Training (models) on multiple synthetic data sets simply doesn’t happen in reality,” he said.

    Mr. Lozhkov acknowledges, however, that AI experts are frustrated, like everyone else, with the state of the web.

    “Part of the internet is trash,” he says, adding that his company has already made major efforts to clean up the data collected, sometimes deleting up to 90% of it.
