ChatGPT: a technology that has already reached its limits?


Large language models (LLMs) are not magic. They are deep neural networks trained on vast amounts of unlabeled text, so they need data to learn from. OpenAI's stroke of genius was to dare to train its models on very large volumes of text. While the GPT-1 model had access to only 4.5 GB of text from BookCorpus, GPT-3 was trained on 570 GB of text drawn from Common Crawl, WebText, English Wikipedia, GitHub, Reddit, and a few free novels by unpublished authors. These datasets contain up to 10 trillion words. For GPT-4, OpenAI used Stack Overflow, a question-and-answer forum for developers, to improve its model's coding abilities.
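To make "training on unlabeled text" concrete, here is a minimal sketch in Python (not OpenAI's code) of the self-supervised principle behind LLMs: the model learns to predict the next word, so raw text supplies its own labels. The toy corpus and the bigram model are illustrative assumptions; real LLMs use transformer networks over billions of such examples.

```python
from collections import Counter, defaultdict

# Toy illustration of self-supervised language modeling:
# the "label" for each word is simply the word that follows it,
# so raw, unlabeled text is all the training data needed.
corpus = (
    "the model reads text and predicts the next word "
    "the next word becomes the label for training"
).split()

# Count bigram transitions: estimate P(next | current) from raw text.
transitions = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    transitions[current][nxt] += 1

def predict_next(word):
    """Return the most frequent word seen after `word` in the corpus."""
    if word not in transitions:
        return None
    return transitions[word].most_common(1)[0][0]

def generate(start, length=8):
    """Greedily generate text by chaining next-word predictions."""
    words = [start]
    for _ in range(length):
        nxt = predict_next(words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)

print(generate("the"))
```

The principle scales: replace the bigram counts with a neural network and the toy corpus with hundreds of gigabytes of text, and you have the recipe behind GPT-3.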

Outright theft?

A recent study of Google's T5 and Meta's LLaMA models shows that they were trained on Common Crawl's C4 (Colossal Clean Crawled Corpus) dataset, a massive corpus of 15 million websites scraped since April 2019. The three largest sources are patents.google.com, which hosts patents from around the world; wikipedia.org; and scribd.com, a subscription-only digital library. B-OK.org, a notorious pirated-book exchange that has since been shut down, and 27 other sites flagged for counterfeiting were also present in the dataset.

Artists, creators, news agencies, and journalists have criticized LLM publishers for using their content without permission or compensation. An ethical question arises. OpenAI's latest model is proprietary and paid, but the content used to train it does not belong to the company: it is the slow sedimentation of two decades of exchanges between Internet users around the world. Isn't that appropriation? That is the claim of Reddit, Stack Overflow, and the News/Media Alliance, an American trade group of publishers, which want to demand compensation from companies that use their data.

The risk of a feedback loop

But the even greater fear is that the source of knowledge is drying up. People used to ask for and receive help in the open, online; now they do it behind ChatGPT's closed doors. SimilarWeb reports that traffic on Stack Overflow is already down 14% since January. Tomorrow's training ground is being destroyed today. All the more so since, on March 1, OpenAI updated its terms of use to address its users' concerns: it will no longer use customer data sent via its APIs to train its models, thereby depriving itself of additional training data. Worse still, as texts produced by LLMs invade the Web, they will feed the probabilistic models behind those same LLMs and reinforce their outputs, locking them into an endless loop of self-fulfilling exchanges and validating certain errors or hallucinations. Even assuming competing engines, they would inevitably end up converging.
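This looping fear can be illustrated with a toy simulation (my own sketch, not a result from the article): fit a simple probabilistic model to data, sample from it, retrain on the samples, and repeat. The estimated distribution drifts and narrows over generations, the statistical analogue of an LLM amplifying its own errors. The Gaussian model and sample size below are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data from a standard normal distribution.
mu, sigma = 0.0, 1.0
n_samples = 50  # small samples make the drift visible quickly

for generation in range(1, 31):
    # The model generates data resembling what it was trained on...
    synthetic = rng.normal(mu, sigma, n_samples)
    # ...and the next model is trained only on that synthetic output.
    mu, sigma = synthetic.mean(), synthetic.std()
    if generation % 10 == 0:
        print(f"generation {generation:2d}: mu={mu:+.3f} sigma={sigma:.3f}")

# Over generations, sigma tends to shrink and mu wanders: the model
# progressively forgets the tails of the original distribution.
```

With only 50 samples per generation, the estimation noise compounds quickly; larger samples slow the drift but do not eliminate it, which is why a continuing supply of fresh human-written text matters so much to LLM builders.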

Sam Altman, the head of OpenAI, has himself stated that the race for ever-larger LLMs is already over, owing to the shortage of quality linguistic data as well as the high cost of computing power, burying the idea of a GPT-5. The race is instead toward training smaller models, able to run on a personal computer without going through the cloud and on personal data, so as to become a true assistant. More and more observers believe that new approaches, different from the LLM approach, will be needed for artificial intelligence to keep advancing.

The idea that these models have already seen their best days is also put forward by those who believe that LLMs will never scale up into artificial general intelligence. Geoffrey Hinton, one of the pioneers of deep learning, who left Google to speak freely about the dangers of artificial intelligence, retorts that the latest LLMs are multimodal, able to grasp images, sounds, and videos beyond text, and that they will have access to the endless stream of content we upload to social networks. A new battle is in sight, because Facebook and Twitter prohibit the scraping of their data.

* Robin Rivaton is Managing Director of Stonal and a member of the Scientific Council of the Foundation for Political Innovation (Fondapol).
