Do generative AIs lean right or left? Can they be influenced? The question is of great interest to political circles. Large language models, or LLMs, are deep neural networks trained on vast amounts of unlabeled text, which means they need data to train on. OpenAI's GPT-4 was trained on 40 terabytes of material, including Common Crawl, a massive corpus scraped from 15 million websites since April 2019, dialogue from YouTube videos, code from GitHub, Wikipedia, and free novels by unpublished authors. The content used to train GPT-4 therefore does not belong to OpenAI, and this has sparked a fierce battle.
Some media outlets have already reached agreements over the use of their archives, such as the Associated Press in July. Media groups like Axel Springer have gone further, integrating all of their past and future articles directly into the model's training set, a deal worth more than $10 million per year. Finally, others have chosen the legal route. At the end of December, the New York Times filed suit for copyright infringement on millions of articles. In its complaint, the American daily says it contacted Microsoft and OpenAI in April to raise concerns about the use of its intellectual property and to explore an amicable resolution, without success. It is the first large organization to engage in this battle, and it probably won't be the only one. The CEO of Condé Nast recently told a US Senate hearing that many artificial intelligence tools were built with stolen resources. Last September, Microsoft announced that if customers using these tools faced copyright infringement claims, it would compensate them and cover the associated legal costs.
While waiting to negotiate licensing deals with OpenAI or to win their cases in court, media companies are imposing a digital blockade. Data collected by the start-up Originality AI on 44 major news sites shows that 39 of them block web crawlers, including general-interest outlets like the New York Times, the Washington Post and the Guardian, magazines like The Atlantic, and specialized sites like Bleacher Report. More broadly, OpenAI's GPTBot is not welcome on 1.6% of the million most visited websites. But the major right-wing news outlets surveyed, including Fox News and Breitbart, do not block crawlers. Nor does The Free Press, the outlet of Bari Weiss, the new muse of the American conservative right, known for her public break with her former employer, the New York Times.
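This blockade rests on the robots.txt convention: OpenAI has stated that its GPTBot crawler identifies itself with the user-agent "GPTBot" and respects robots.txt rules. As a minimal sketch, not part of the original article and using a purely illustrative URL, here is how one could check with Python's standard urllib.robotparser whether a given site disallows GPTBot:

```python
# Minimal sketch: check whether a site's robots.txt disallows OpenAI's GPTBot.
# The URL below is hypothetical; swap in any news site you want to test.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")  # illustrative site
robots.read()

# Sites imposing the "digital blockade" typically publish rules such as:
#   User-agent: GPTBot
#   Disallow: /
allowed = robots.can_fetch("GPTBot", "https://www.example.com/any-article")
print("GPTBot allowed:", allowed)
```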
Is this a strategy to influence AI? The most plausible explanation is that some of these outlets have not had the time or the means to look into the issue; some did, in fact, block crawlers from their sites shortly after the publication of the Originality AI study. But the question interests the conservative sphere, which regularly accuses LLMs of leaning to the left. Even Grok, Elon Musk's anti-woke LLM, was criticized by conservatives who deemed it too progressive, for which the X boss apologized, blaming the internet content it was trained on.
Generative AI, however, is far from being so binary. On the one hand, the content produced by the media constitutes only a drop in the ocean of knowledge ingested by LLMs. On the other hand, the models then generally go through a retraining phase based on evaluations by human annotators, a process known as reinforcement learning from human feedback, or RLHF. An example? At a recent Senate hearing on artificial intelligence, Republican Senator Marsha Blackburn recited a ChatGPT-generated poem praising President Biden, then claimed it was impossible to generate a similar ode to Trump. In reality, it is entirely doable. ChatGPT is not as dogmatic as some would have you believe.
*Robin Rivaton is the CEO of Stonal and a member of the scientific council of the Foundation for Political Innovation (Fondapol).