Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they’re extracting general patterns and concepts - the “Bob Dylan-ness” or “Hemingway-ness” - not copying specific text or images.
This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in “vector space”. When generating new content, the AI isn’t recreating copyrighted works, but producing new expressions inspired by the concepts it’s learned.
This is fundamentally different from copying a book or song. It’s more like the long-standing artistic tradition of being influenced by others’ work. The law has always recognized that ideas themselves can’t be owned - only particular expressions of them.
Moreover, there’s precedent for this kind of use being considered “transformative” and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.
While it’s understandable that creators feel uneasy about this new technology, labeling it “theft” is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn’t make the current use of copyrighted works for AI training illegal or unethical.
For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744
Basing your argument around how the model or training system works doesn’t seem like the best way to frame your point to me. It invites a lot of mucking about in the details of how the systems do or don’t work, how humans learn, and what “learning” and “knowledge” actually are.
I’m a human as far as I know, and it’s trivial for me to regurgitate my training data. I regularly say things that are either directly references to things I’ve heard, or accidentally copy them, sometimes with errors.
Would you argue that I’m just a statistical collage of the things I’ve experienced, seen or read? My brain has as many copies of my training data in it as the AI model, namely zero, but “Captain Picard of the USS Enterprise sat down for a rousing game of chess with his friend Sherlock Holmes, and then Shakespeare came in dressed like Mickey mouse and said ‘to be or not to be, that is the question, for tis nobler in the heart’ or something”. Direct copies of someone else’s work, as well as multiple copyright infringements.
I’m also shit at drawing with perspective. It comes across like a drunk toddler trying their hand at cubism.
Arguing about how the model works or the deficiencies of it to justify treating it differently just invites fixing those issues and repeating the same conversation later. What if we make one that does work how humans do in your opinion? Or it properly actually extracts the information in a way that isn’t just statistically inferred patterns, whatever the distinction there is? Does that suddenly make it different?
You don’t need to get bogged down in the muck of the technical to say that even if you conceed every technical point, we can still say that a non-sentient machine learning system can be held to different standards with regards to copyright law than a sentient person. A person gets to buy a book, read it, and then carry around that information in their head and use it however they want. Not-A-Person does not get to read a book and hold that information without consent of the author.
Arguing why it’s bad for society for machines to mechanise the production of works inspired by others is more to the point.
Computers think the same way boats swim. Arguing about the difference between hands and propellers misses the point that you don’t want a shrimp boat in your swimming pool. I don’t care why they’re different, or that it technically did or didn’t violate the “free swim” policy, I care that it ruins the whole thing for the people it exists for in the first place.
I think all the AI stuff is cool, fun and interesting. I also think that letting it train on everything regardless of the creators wishes has too much opportunity to make everything garbage. Same for letting it produce content that isn’t labeled or cited.
If they can find a way to do and use the cool stuff without making things worse, they should focus on that.
Another good question is why AIs do not mindlessly regurgitate source material. The reason is that they have access to so much copyrighted material. If they were trained on only one book, they would constantly regurgitate material from that one book. Because it’s trained on many (millions) books, it’s able to get creative. So the argument of OpenAI really boils down to: “we are not breaking copyright law, because we have used sufficient copyrighted material to avoid directly infringing on copyright”.
I agree, but the fact that shills for this technology are also wrong about it is at least interesting.
Rhetorically speaking, I don’t know if that’s useless.
I do like this point a lot.
I do miss when the likes of cleverbot was just a fun novelty on the Internet.