Training "AI" On Public Data Is Totally Fine And Not Stealing.

31337@sh.itjust.works · 5 months ago

Training "AI" On Public Data Is Totally Fine And Not Stealing.

wewbull@feddit.uk · 5 months ago

Define “public”.

Publicly available is not the same as public domain. You should respect the copyright, especially of small creators. I’m of the opinion that an ML model is a derivative work, and so if you’ve trawled every website under the sun for data to feed your model you’ve violated copyright.

VoterFrog@lemmy.world · 5 months ago

There are multiple facets here that all kinda get mashed together when people discuss this topic and the publicly available/public domain difference kinda gets at that.

An AI company downloading a publicly available work isn’t a violation of copyright law. Copyright gives the owner exclusive right to distribute their work. Publishing it for anybody to download is them exercising that right.
Of course, if the work isn’t publicly available and the AI company got it, someone probably did violate copyright laws, likely the people who distributed the data set to the company because they’re not supposed to be passing around the work without the owner’s permission.
All that is to say, downloading something isn’t making a copy. Sending the work is making a copy, as far as copyright is concerned. Whether the person downloading it is going to use it for something profitable doesn’t really change anything there. Only if they were to become the sender at some later point does it matter. In other words, there’s no violation of copyright law by the company that can really occur during the whole “training” phase of AI development.
Beyond that, AI isn’t in the business of serving copies of works. They might come close in some specific instances, but that’s more a technical problem that developers want to fix than a fundamental purpose of these models.
The only real case that might work against them is whether or not the works they produce are derivative… But derivative/transformative has a pretty strict legal definition. It’s not enough to show that the work was used in the creation of a new work. You can, for example, create a word cloud of your favorite book, analyze the tone of news articles to help you trade stocks, or produce an image containing the most prominent color in every frame of a movie. None of these could exist without deriving from a copyrighted work but none of them count as a legally derivative work.
I chose those examples because they are basic statistical analyses not far from what AI training involves. There are a lot of aspects of a work that are not covered by copyright: style, structure, factual information. These are the kinds of things that AI is mostly interested in replicating.
So I don’t think we’re going to see a lot of success in taking down AI companies with copyright. We might see some small scale success when an AI crosses a line here or there. But unless a judge radically alters the bounds of copyright law, to everyone’s detriment, their opponents are going to have an uphill battle to fight here.
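
To make the word-cloud example above concrete, here is a minimal Python sketch. It is only an illustration, not anything an AI company actually runs, and `word_counts` is a made-up helper name. It reduces a text to word frequencies, and two sentences with opposite meanings come out identical, which is the sense in which these statistics are derived from a work without reproducing its expression.

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count lowercase word occurrences; word order is thrown away."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


a = "the dog bit the man"
b = "the man bit the dog"

# Same counts, different sentences: the original cannot be rebuilt from this.
print(word_counts(a) == word_counts(b))  # True
print(word_counts(a).most_common(1))     # [('the', 2)]
```

Whether model training is closer to this than to storing a copy is exactly what the rest of the thread disagrees about; the sketch only shows the word-cloud end of the spectrum.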

CptBread@lemmy.world · 5 months ago

An AI model could be seen as an efficient but lossy compression scheme, especially when it comes to images… And a compressed jpeg of an image is still seen as a copy so why would an AI model trained on reproducing it be different?

BluesF@lemmy.world · 5 months ago

Are you suggesting that the model itself is a compressed version of its training data? I think it requires some stretches of how training works to accept that.

FrenziedFelidFanatic@yiffit.net · 5 months ago

It depends on how much you compress the jpeg. If it gets compressed down to 4 pixels, it cannot be seen as infringement. Technically, the word cloud is lossy compression too: it has all of the information of the text, but none of the structure. I think it depends largely on how well you can reconstruct the original from the data. A word cloud, for instance, cannot be used to reconstruct the original. Nor can a compressed jpeg, ofc; that’s the definition of lossy. But most of the information is still there, so a casual observer can quickly glean the gist of the image. There is a line somewhere between finding the average color of a work (compression down to one pixel) and jpeg compression levels.

Is the line where the main idea of the work becomes obscured? Surely not, since a summary hardly infringes on the copyright of a book. I don’t know where this line should be drawn (personally, I feel very Stallman-esque about copyright: IP is not a coherent concept), but if we want to put rules on these things, we need to define them well, which requires venturing into the domain of information theory (what percentage of the entropy in the original is part of the redistributed work, for example), but I don’t know how realistic that is in the context of law.
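
As a rough illustration of the information-theory angle, the sketch below computes the empirical Shannon entropy of a byte string in Python. `shannon_entropy_bits` is a hypothetical helper, and per-byte symbol entropy is only a crude stand-in for how much of the original survives in a derived work. Treat it as an assumption for illustration, not a proposed legal test.

```python
import math
from collections import Counter


def shannon_entropy_bits(data: bytes) -> float:
    """Empirical Shannon entropy in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    # H = sum over symbols of p * log2(1 / p), with p = count / n
    return sum((c / n) * math.log2(n / c) for c in counts.values())


print(shannon_entropy_bits(b"aaaa"))  # 0.0  (one symbol, nothing to learn)
print(shannon_entropy_bits(b"abab"))  # 1.0  (two equally likely symbols)
print(shannon_entropy_bits(b"abcd"))  # 2.0  (four equally likely symbols)
```

A one-pixel average color sits near the zero-entropy end and a lightly compressed jpeg keeps most of the original's information. This symbol-level measure ignores structure, though, so it says nothing about where a legal line should fall.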

deaf_fish@lemm.ee · 5 months ago

For personal or public use, I’m fine with it. If you use it to make money, that’s when I get upsetti spaghetti.

Treczoks@lemmy.world · 5 months ago

It would be nice if the AI industry had one big positive effect by finally reining in the overreaching copyright laws.

Admiral Patrick@lemmy.world · 5 months ago

If that were to happen, it’d only be for tech companies, not people. lol.

Treczoks@lemmy.world · 5 months ago

That might actually happen, yes.

WraithGear@lemmy.world · 5 months ago

I can agree, but any output must be instantly public domain.

31337@sh.itjust.works · 5 months ago

Yeah, I think that’s the current precedent in the US.

LibertyLizard@slrpnk.net · 5 months ago

I don’t have a problem with tech companies doing statistics on publicly available data, I have a problem with them getting rich by charging money for the collective creative works of humanity. But if they want to share their work for free, I have no issue with that.

wewbull@feddit.uk · 5 months ago

Yeah, because corporations never make money off things they make available free of charge. There’s no way this could go wrong.

Match!!@pawb.social · 5 months ago

if they’re using creative commons licenses (or other sharing licenses) then it’s fine! but the model is then also bound by the same licenses because that’s how licenses work

MajorHavoc@programming.dev · edit-2 5 months ago

This falls squarely into the trap of treating corporations as people.

People have a right to public data.

Corporations should continue to be tolerated only while they carefully walk an ever tightening fine line of acceptable behavior.

Xeroxchasechase@lemmy.world · edit-2 5 months ago

As long as it’s licensed as Creative Commons of some sort. Copyrighted materials are copyrighted and shouldn’t be used without consent; this also protects individuals, not only corporations. (Excuse my English)

Edit: Your argument about probability and parameter size is inapplicable in my mind. The same can be said about jpeg lossy compression.

wildncrazyguy138@fedia.io · 5 months ago

Could the copywrited material consumed potentially fall under fair use? There are provisions for research purposes.

Zagorath@aussie.zone · 5 months ago

Just fyi the term is “copyrighted”, not “copywrited”. Copyright is about the right to copy, not anything about writing.

31337@sh.itjust.works · 5 months ago

Incidentally, I read this a while ago, because I was training a classifier on mostly Creative Commons licensed works: https://creativecommons.org/2023/08/18/understanding-cc-licenses-and-generative-ai/

… we believe there are strong arguments that, in most cases, using copyrighted works to train generative AI models would be fair use in the United States, and such training can be protected by the text and data mining exception in the EU. However, whether these limitations apply may depend on the particular use case.

Zagorath@aussie.zone · 5 months ago

Creative Commons would not actually help here. Even the most permissive licence, CC-BY, requires attribution. If using material for training requires a copyright licence (which is certainly not a settled question of law), CC would likely be just the same as all rights reserved.

(There’s also CC-0, but that’s basically public domain, or as near to it as an artist is legally allowed to do in their locale. So it’s basically not a Creative Commons licence.)

istanbullu@lemmy.ml · 5 months ago

I don’t get the AI hate.

evasive_chimpanzee@lemmy.world · 5 months ago

There are a lot of problems with it. Lots of people could probably tell you about security concerns and wasted energy. Also there’s the whole comically silly concept of them marketing having AI write your texts and emails for you, and then having it summarize the texts and emails you get. Just needlessly complicating things.

Conceptually, though, most people aren’t too against it. In my opinion, all the stuff they are labeling “generative AI” isn’t really “AI” or “generative”. There are lots of ways that people define AI, and without being too pedantic about definitions, the main reason I think they call it that, other than marketing, is that they are really trying to sway public opinion by controlling language. Scraping all sorts of copyrighted material, and re-jumbling it to spit out something similar, is arguably something we should prohibit as copyright infringement. It’s enough of a gray area to get away with short term. By convincing people, through the very language they use to describe it, that they aren’t just putting other people’s material in a mixer but are “generating new content”, they hope to have us roll over and sign off on what they’ve been doing.

Saying that humans create stories by jumbling together previous stories is a BS cop out, too. Obviously, we do, but humans have not given, and do not have to give, computers that same right. Also, LLMs are very complex, but they are also way way less complex than human minds. The way they put together text is closer to running a story through Google translate 10 times than it is to a human using a story for inspiration.

There are real, definite benefits of using LLMs, but selling it as AI and trying to force it into everything is a gimmick.

Eccitaze@yiffit.net · edit-2 5 months ago

I hate it because it’s a gigantic waste of time and resources. Big tech has poured in hundreds of billions of dollars, caused double digit percentage increases in data center emissions, and fed almost the entire collective output of humanity into these models.

And what did we get for it? We got a toy that is at best mildly amusing, but isn’t really all that actually useful for anything; the output provided by generative AI is too unreliable to trust outright and needs to be reviewed and tweaked by hand, so at best you’re getting a minor productivity gain, and more likely you’re seeing a neutral or negative impact on your productivity (or producing low-quality crap faster and calling it “good enough”). At worst, it’s put a massive force multiplier in the hands of the bad actors using disinformation to tear apart modern society for their personal gain. Goldman Sachs released a report in late June where they pointed this out: if tech companies are planning on investing a trillion dollars into AI, what is the trillion dollar problem that AI is going to solve? And so far as I can tell, it seems that the answer to the question is either “it will eliminate millions of jobs and wipe out entire industries without any replacement or safety net, causing untold human suffering” or (more likely to be the case) “there is no trillion dollar problem AI can solve and the entire endeavor is pointless.”

Even ignoring the opportunity cost–the money spent could have literally solved the entire homelessness crisis, world hunger, lifted entire countries out of poverty, or otherwise funded solutions for real, intractable, pressing problems for all of humanity–even ignoring that generative AI has single-handedly erased years of progress in reducing our CO2 emissions and addressing the climate crisis, even ignoring the logistical difficulty of the scale of build-out being discussed requiring a bigger improvement in our power grid than has been done basically ever, even ignoring the concerns over IP theft and everything else, fundamentally generative AI just isn’t worth the hype. It’s the crypto craze and NFT craze and metaverse craze (remember Zuckerberg burning 36 billion to make a virtual meeting space featuring avatars without legs?) all over again, except instead of only impacting the suckers who bought into the hype, this time it’s getting shoved in everybody’s face even if they want nothing to do with it.

But hey, at least it gave us “I Glued My Balls To My Butthole Again.” That totally makes the hundred billion investment worth it, right?

Railcar8095@lemm.ee · 5 months ago

As someone who doesn’t hate AI, I hate a few things about how it’s happening:

If I want to make a book, and I want to use other books for reference, I need to obtain them legally. Purchase, rent, loan… Else I’m a pirate. Multimillion companies say for them it’s fine as long as somebody posted it on the internet. Their version of annas-archive is suddenly legal and moral, while I’m harming the authors if I use it.
They are stuffing everything with AI, which generally means internet connection and sending unknown data.
It’s an annoying marketing gimmick. While incredibly useful in some places, the insistence that it solves all the problems makes it seem like a failure.

Hamartiogonic@sopuli.xyz · edit-2 5 months ago

Here’s an analogy that can be used to test this idea.

Let’s say I want to write a book but I totally suck as an author and I have no idea how to write a good one. To get some guidelines and inspiration, I go to the library and read a bunch of books. Then, I’ll take those ideas and smash them together to produce a mediocre book that anyone would refuse to publish. Anyway, I could also buy those books, but the end result would still be the same, except that it would cost me a lot more. Either way, this sort of learning and writing procedure is entirely legal, and people have been doing this for ages. Even if my book looks and feels a lot like LOTR, it probably won’t be that easy to sue me unless I copy large parts of it word for word. Blatant plagiarism might result in a lawsuit, but I guess this isn’t what the AI training data debate is all about, now is it?

However, if I pirated those books, that could result in some trouble. However, someone would need to read my miserable book, find a suspicious passage, check my personal bookshelf and everything I have ever borrowed etc. That way, it might be possible to prove that I could not have come up with a specific line of text except by pirating some book. If an AI is trained on pirated data, that’s obviously something worth the debate.

wildncrazyguy138@fedia.io · 5 months ago

To expand on what you wrote, I’d equate the LLM output as similar to me reading a book. From here on out until I become senile, the book is part of memory. I may reference it, I may parrot some of its details that I can remember to a friend. My own conversational style and future works may even be impacted by it, perhaps even subconsciously.

In other words, it’s not as if a book enters my brain and then is completely gone once I’m finished reading it.

So I suppose then, that the question is moreso one of volume. How many works consumed are considered too many? At what point do we shift from the realm of research to that of profiteering?

There is a certain subset of people in the AI field who believe that our brains are biological forms of LLMs, and that, if we feed an electronic LLM enough data, it’ll essentially become sentient. That may be for better or worse to civilization, but I’m not one to get in the way of wonder building.

Hamartiogonic@sopuli.xyz · 5 months ago

A neural network (the machine learning technology) aims to imitate the function of normal neurons in a human brain. If you have lots of these neurons, all sorts of interesting phenomena begin to emerge, and consciousness might be one of them. If/when we get to that point, we’ll also have to address several legal and philosophical questions. It’s going to be a wild ride.

wewbull@feddit.uk · 5 months ago

You are equating training an LLM with a person learning, but an LLM is not a person. It is not given the same rights and privileges under the law. At best it is a computer program and you can certainly infringe copyright by writing a program.

Hamartiogonic@sopuli.xyz · 5 months ago

An LLM is not a legal entity, nor should it be. However, similar things happen in a human brain and the network of an LLM, so same laws could be applicable to some extent. Where do we draw the line? That’s a legal/political issue we haven’t figured out yet, but following these developments is going to be interesting.

wewbull@feddit.uk · 5 months ago

Agreed it hasn’t been settled legally yet.

I also agree that an LLM isn’t and shouldn’t be a legal entity. Therefore an LLM is something that can be owned, sold, and a profit made from.

It is my opinion that the original author of the works should receive compensation when their work is used to make profit i.e. to make the LLM. I’d also say that the original intent of copyright law was to give authors protection from others making money from their work without permission.

Maybe current copyright law isn’t up to the job here, but benefiting off the back of others’ creative works is not socially acceptable in my opinion.

Hamartiogonic@sopuli.xyz · edit-2 5 months ago

I think of an LLM as a tool, just like a drill or a hammer. If you buy or rent these tools, you pay the tool company. If you use the tools to build something, your client pays you for that work.

Similarly, OpenAI can charge me for extensive use of ChatGPT. I can use that tool to write a book, but it’s not 100% AI work. I need to spend several hours prompt crafting, structuring, reading and editing the book in order to make something acceptable. I don’t really act as a writer in this workflow, but more like an editor or a publisher. When I publish and sell my book, I’m entitled to some compensation for the time and effort that I put into it. Does that sound fair to you?

wewbull@feddit.uk · edit-2 5 months ago

Yes of course you are.

…but do you agree that if you use an AI in that way that you are benefitting from another author’s work? You may even, unknowingly, violate the copyright of the original author. You can’t be held liable for that infringement because you did it unwittingly. OpenAI, or whoever, must bear responsibility for that possible outcome through the use of their tool.

Hamartiogonic@sopuli.xyz · 5 months ago

Yes, it’s true that countless authors contributed to the development of this LLM, but they were not compensated for it in any way. Doesn’t sound fair.

Can we compare this to some other situation where the legal status has already been determined?

wewbull@feddit.uk · 5 months ago

I was thinking about money laundering when I wrote my response, but I’m not sure it’s a good analogy. It still feels to me like constructing a generative model is a form of “Copyright washing”.

Fact is, the law has yet to be written.

ChaoticNeutralCzech@feddit.org · 5 months ago

I agree with some other comments that this is a question of public domain vs. copyright. However, even copyright has exceptions, notably fair use in the US.

One of the chief AI critics, Sarah Andersen, made a claim 9 months ago that when AI generated the following output for “Sarah Andersen comic”, it clearly imitated her style, and if any AI company is to be believed, it’s going to get more accurate with later models, possibly creating a believable comic including text.

Regardless of how accurately the AI can draw the comics (as long as they aren’t effectively identical to a single specific comic of hers), shouldn’t this just qualify as fair use? I can imitate SA’s style too and make a parody comic, or even just go the lazy way and change some text like alt-right “memers” did. As long as the content is distributed as “homage”, “parody”, “criticism” etc., doesn’t directly harm Sarah Andersen’s financial interests, and makes it clear that the author is not her, I think there should be no issue even if it features the likeness of trademarked characters, phrases and concepts.

Makes me ashamed there is a book by her in my house (my sister received it as a gift).

ImplyingImplications@lemmy.ca · 5 months ago

This argument is more along the lines of what is actually being argued by AI companies in court. Style cannot be copyrighted. They argue AI is simply recreating a style.

The problem with this is that, in order to recreate a style, AI needs to be trained on that content. So if an AI starts reproducing art in the same style as a popular artist, it must have inherently been fed a whole bunch of that artist’s work. Artists claim this is a violation of copyright since they never agreed for their art to be used in that way. The AI companies argue fair use also allows use of copyrighted works for teaching or training. An art class can use a popular artist’s work as examples of how to recreate a certain style. Of course, training AI is different than training a group of students. Whether it is different enough that fair use doesn’t apply is the question being decided in court.

Boomkop3@reddthat.com · 5 months ago

“Statistiac” of course. And yes I would

fartsparkles@sh.itjust.works · 5 months ago

dowlo’t

LarmyOfLone@lemm.ee · 5 months ago

Huh I read your headline in a sarcastic tone so was totally ready to argue with you. But I agree. Not sure if it’s an unpopular opinion though.

Waldowal@lemmy.world · 5 months ago

Agree for these reasons:

Legally: It’s always been legal (in the US at least) to relay the ideas in a copyrighted work. AI might need to get better at providing a bibliography, but that’s likely a courtesy more than a legal requirement.
Culturally: Access to knowledge should be free. It’s one of the reasons public libraries exist. If AI can help people gain knowledge more quickly and completely, it’s just the next evolution of the same idea.
Also Culturally: Think about what’s out on the internet. Millions of recipes, no doubt copied from someone else, with pages of bullshit about how the author “grew up on a farm that produced Mojitos”. For decades now, “content creators” have gotten paid for millions of low quality bullshit click bait articles. There’s that. Most of the real “knowledge” on the internet is freely accessible technical / product documentation, forum posts like StackOverflow, and scientific studies. All of it is stuff the authors would probably love to have out there and freely accessible. Sure, some accidental copyright infringement might happen here and there, but I think it’s a tiny problem in relation to the value that AI might bring society.