AI systems like ChatGPT and Claude have been on everyone’s lips for the past two years. In those same two years, Hollywood writers and actors staged a historic strike, in part to defend themselves against the use of their work for artificial intelligence. How necessary this critical examination of Large Language Models (LLMs) is has now become clear again with the revelation of a new source that feeds these systems’ large data sets.
AI scandal: 139,000 films and series in the data set
What do Alf and The Godfather have in common? They are the two titles that set off an investigation by The Atlantic author Alex Reisner. A person from the script department had told him that generative AI could more or less reproduce both the alien sitcom and the mafia epic. But that was just the beginning: as the journalist subsequently revealed, over 53,000 films and over 85,000 TV episodes were used for a data set called The Pile, which has been used to train AI from Microsoft, Meta, Apple and Anthropic, among others.
What is particularly surprising is where this data comes from: a website from which ripped subtitle files can be downloaded. There are over 9 million different entries there. Drawing on his technical expertise, Reisner was then able to trace the files that ended up in the AI systems and revealed something astonishing: every Oscar-nominated film from 1950 to 2016 is present in the data set, as well as 616 episodes of The Simpsons, 170 episodes of Seinfeld and prestige series like Twin Peaks, The Wire, Breaking Bad and The Sopranos.
AI on trial: Creative people behind titles like Game of Thrones and Breaking Bad are defending themselves
Just last year, the same author reported in The Atlantic that 183,000 books had been used as training data in the same way. As Variety reported at the time, prominent authors such as George R. R. Martin (Game of Thrones) and others had already begun to take legal action against it. The proceedings are still ongoing.
Breaking Bad creator Vince Gilligan even suspected at the time that his masterpiece of a series had probably long since been baked into the AI machinery, without anyone asking him. In an open letter (via Variety), he wrote, among other things:
I’m sure every word from Breaking Bad was squeezed in there somewhere. I just don’t remember giving my permission. […] Perhaps these companies have figured out that it would be better to ask for forgiveness rather than permission.
The legal situation regarding AI development and its data sets is still unclear in many areas. Can LLMs use copyrighted material without a license? And conversely, are AI creations eligible for copyright protection if they essentially just remix human-created material? All of this will have to be decided in court in the near future. And to protect the countless creatives who make our films and series, sooner rather than later.