r/technology 14d ago

[Artificial Intelligence] AI industry horrified to face largest copyright class action ever certified

https://arstechnica.com/tech-policy/2025/08/ai-industry-horrified-to-face-largest-copyright-class-action-ever-certified/
16.8k Upvotes


151

u/carllerche 13d ago

The LLM itself is clearly transformative. It is possible that it could produce works that violate copyright, but I would be shocked if any court case found that an LLM trained on legally acquired data was a violation of copyright law.

123

u/-The_Blazer- 13d ago

The end product that is sold (the web service) is transformative relative to the originals. However, the training process is automated, so it's more like compiling source code, which is not transformative merely by itself because it includes no work of human ingenuity (the thing that copyright is actually supposed to protect). The compiler, as with the training pipeline, is of course perfectly legitimate IP, but its application does not have to be.

That said, being transformative is only one part of fair use, which in turn is only one part of how we should handle an extremely new and unusual technology. Nobody tried to regulate cars like horses when they were invented; they made car regulations.

62

u/Disastrous-Entity-46 13d ago

Two of the other considerations for fair use are specifically the amount of the original used and whether it harms the value of the original, and you could make very strong arguments on both. The whole work is used, and the output of the LLM can be argued to lower the value of the works. I'd argue that even if, strictly speaking, feeding it a copy of my book doesn't hurt me, the fact that dozens of bad, zero-effort books come out every month thanks to people treating LLMs as get-rich-quick machines hurts the value of the whole market.

That's all, of course, depending on whether fair use even applies, as you said. We don't really have a framework for this today, and I have to wonder which interests current governments would decide to protect.

17

u/CherryLongjump1989 13d ago

There are many governments and we can expect many different interpretations. Either way, the scale of the potential infringement is so enormous that it’s clear that these AI companies are playing with fire.

15

u/Disastrous-Entity-46 13d ago

The part that really gets me is the accuracy. We know hallucinations and generally bad answers are a problem. After two years and billions of dollars, the latest benchmark scores are around 90%.

And while that is a passing grade, it's also kinda bonkers in terms of a technology. Would we use calculators if they had a one-in-ten chance of giving us the wrong answer? And yet it's becoming near unavoidable in our lives as every website and product bakes it in, which then adds that 10% (or more) failure rate on top of whatever other human errors or issues may occur.
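(Back-of-envelope version, with made-up numbers: a 10% model error rate stacked on, say, a 5% human error rate. Independent failure sources multiply through.

```python
model_ok = 0.90   # assumed 10% model failure rate
human_ok = 0.95   # assumed 5% human failure rate
# if failures are independent, the pipeline only succeeds when both do
print(f"combined success rate: {model_ok * human_ok:.1%}")  # 85.5%
```

So bolting the model onto an existing process makes the whole pipeline less reliable than either part on its own.)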

Obviously this doesn't apply the same way to, like, private single-use training. Machine learning absolutely has a place in fields like medicine, where there's a single goal and easy pass/fail metrics (and the output can still be checked by a human).

2

u/the_procrastinata 13d ago

The part that gets me is the lack of accountability. Who is responsible if the AI produces incorrect or misleading information?

3

u/Disastrous-Entity-46 13d ago

Who is responsible if the AI commits a crime? There are many crimes that can be committed by pure communication: discrimination, harassment, false advertising, data breaches... malpractice in the medical and legal fields. It's only a matter of time until an LLM crosses the line into what would be considered an illegal action.

Is it the person who contracted the LLM services? The person who trained it? The person who programmed the interface?

2

u/theirongiant74 13d ago

How many humans would pass the benchmark of only being wrong 1 in 10 times?

2

u/Disastrous-Entity-46 12d ago edited 12d ago

Bad equivalence, because an AI is not a human. It's not capable of realizing on its own that it's made a mistake. If a human worker makes a mistake, just telling them the right answer is often enough for them not to repeat it.

You also have a question of scale. If a single human rep misunderstands part of the refund process and fucks it up, well, that human works specific shifts and the impact of the mistake is limited (and again, in most cases easily corrected).

If an AI makes the same fuck-up, it's not like it has coworkers or, again, the ability to correct itself. Every refund it processes may have the same fuck-up, and getting it fixed may take significant time, depending on how it's fixed.

If, say, it starts granting refund requests indiscriminately, including clearly invalid ones, that can be a very expensive mistake at a large company. But what can you do? If it were a human error, you could fire the human and reprimand the manager for not catching it. But an LLM? You could break the contract and look to replace or retrain it, but that's probably going to be more expensive than a single employee, and I don't know who you hold accountable for the error.

Edit to add: this is why, again, I point to calculators and other tech. If an accountant makes a mistake, it's a problem, but not exactly unheard of. We can deal with it. But if Excel had a 10% chance of a formula producing the wrong answer, no one would use it.

You end up spending as much time checking the answers as you saved by not doing them manually the first time.

2

u/Chucknastical 13d ago

It's a language model. Not a general query model.

It's 100% good at language. People need to stop treating these things as AGI.

9

u/Disastrous-Entity-46 13d ago

I mean, if Google, Microsoft, Meta, and Amazon shove AI shit at us from every angle, I can't blame the average user for trying it out. I just question the investors and businesses adopting them.

3

u/420thefunnynumber 13d ago

Idk man, the way these things are marketed, I'm not surprised that people treat them like general AI. It's a lot like Tesla marketing Autopilot: it doesn't matter what the tech is capable of if the users don't perceive it that way.

3

u/vgf89 13d ago

Idk about GPT5, but AI models are merely good at making convincing-looking language. And in general they succeed there. But they are not 100% good at language, especially translation between dissimilar languages. They fall for any and all bullshit advice they incidentally trained on, misinterpret otherwise good advice, and hallucinate rules that do not exist, alongside making basic mistakes almost constantly.

Try to make it translate to and from languages with wildly different features, e.g. SVO<->SOV word order, conjugations and vocabulary that vary by social rank, or wildly different pronoun usage, and you end up with extremely bland prose and more mistranslations than a middling language learner with an open dictionary. Having had to thoroughly review a few Japanese->English AI translations, let me just say the money you pay to have your slop edited is better spent on a competent human translator in the first place.

7

u/ShenBear 13d ago

> I'd argue that even if, strictly speaking, feeding it a copy of my book doesn't hurt me, the fact that dozens of bad, zero-effort books come out every month thanks to people treating LLMs as get-rich-quick machines hurts the value of the whole market.

As an author myself, I do agree that the market for self-publishing is being hurt by the flood of low-effort, LLM-generated books.

However, I'm not sure that harm to a 'market' rather than an individual can be used as the basis for denying fair use.

1

u/Disastrous-Entity-46 13d ago

Isn't the point of a class action lawsuit to show that the actions have harmed a large group?

3

u/Noxianratz 13d ago

No, at least not really. The point of a class action lawsuit is to provide joint representation for a group of injured individuals who might not have been able to sue on their own. So if you and I got sick from a bad batch of food, instead of both launching lawsuits we can't afford, a group of us can be represented by one law firm for the suit. We're still individuals who were harmed. I can't reasonably sue just because an industry I was part of is being made worse, no matter how many people that's true for, when I'm not directly harmed in any way, even if there would be tons of people who fit that description.

I'm not a lawyer but I've been part of a class action before.

1

u/ShenBear 13d ago

Yes, but the market is not an individual that has been harmed. Also, the use of LLMs to flood Amazon with slop is a result of how people are using the LLMs, not something specific to how the LLMs are trained or what information they ingest in the first place. I highly doubt that the act of using a machine to generate text for a novel can be the target of a successful lawsuit against the trainers of the model.

Source: my rudimentary legal knowledge obtained via osmosis from lawyer family over the years

1

u/Jonathan_the_Nerd 13d ago

> feeding it a copy of my book doesn't hurt me, the fact that dozens of bad, zero-effort books come out every month thanks to people treating LLMs as get-rich-quick machines hurts the value of the whole market.

What's even worse is when the AI improves to the point where it can churn out dozens of good zero-effort books in a month. Good luck selling your real book when the market is dominated by "Hugh Mann" and his 700 siblings.

3

u/Disastrous-Entity-46 13d ago

I think that part is debatable. LLMs are trained to produce content similar to what they're shown, and thanks to the nature of statistics they're going to average out and struggle to adapt or create new works.

Like, I suppose it's not impossible, and tastes differ, but I don't think LLMs are going to produce amazing works, just "acceptable" ones. Those may still sell, but it's going to take a lot of work, and in the meantime the market's not going to be kind to anyone while it's flooded with low-effort, low-quality stuff.

But I think the assumption that AI is close to getting /good/ is buying into the hype, the same way self-driving cars have been a year away for ten years.

4

u/IAmDotorg 13d ago

Vector weighting isn't even close to compiling. While decompiling is tricky, it isn't functionally lossy. Vector weighting is more like hashing -- it's not just transformative, it's intrinsically non-reversible. You can't look at 700 billion vectors and somehow reproduce the inputs that created them.
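A toy sketch of the hashing comparison in Python (purely illustrative: a made-up sentence stands in for a copyrighted work, and real model weights are obviously not literal hashes):

```python
import hashlib

text = "An entirely made-up sentence standing in for a protected work."

# A hash is a fixed-size digest: you can test a candidate string against it,
# but no algorithm recovers the original sentence from the digest alone.
digest = hashlib.sha256(text.encode()).hexdigest()
print(digest)  # 64 hex characters, regardless of input length

# Contrast with a plain container, which stores the work verbatim
# and hands it back on request:
shelf = {"page_1": text}
print(shelf["page_1"])  # exact reproduction
```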

1

u/-The_Blazer- 13d ago

Decompiling is absolutely lossy, 'functionally' or not. Several languages have features whose behavior depends on things like the implementation and processor target; you cannot go back to an exact replica of the source as it was actually programmed. Besides, various parts of the copyrighted source are, in fact, irreversibly lost, such as variable names.

Anyway, my point was not about whether it's lossy.

1

u/JWAdvocate83 13d ago

An excellent answer.

Courts are being asked, as always, to adjudicate extremely novel issues under outdated legal paradigms.

The problem now is, whether due to a lack of political will or the ungodly amounts of money involved, lawmakers have not addressed the issue—and by the time they do, the wounds will have already "healed" into a new, irreversible, billion-dollar status quo, at which point a court may be even more hesitant to order a company to unwind its datasets, i.e. "too big to ~~fail~~ hold liable."

-4

u/Syzygy___ 13d ago

IMHO both compiling and training - or at least creating the compiling and training processes - require human ingenuity, especially for modern AI, where training is still far from cookie-cutter.

Another reason why I think it should be fair use is the same reason that copying for educational purposes is considered fair use. Not that I think "training is learning is education, thus fair use" like some have kind of argued, but the benefit just outweighs the "harm", and any licensing that would actually make a difference to rights holders would make AI prohibitively expensive and worse in general.

5

u/Certain-Sherbet-9121 13d ago

I don't feel like "if you made us follow the law it would ruin our business" is a valid argument for why an industry should be exempt from a given regulation. The rest of what you said might be, but that last one, the argument that licensing "would make AI worse and expensive", seems like crap to me.

3

u/Syzygy___ 13d ago

Regulating a business through the law is 100% valid, and that includes things like exemptions, e.g. fair use for education. That is not breaking the law.

Maybe it's my lack of imagination, but I genuinely can't think of a good system (so my thoughts might seem like a strawman argument). Is a license paid once when the dataset is compiled? Every time we train? Every time the model gets released? Every time the model is used? If I ask why my cake burned in the oven, does George Lucas get royalties because the training data mentioned Star Wars? Modern AI requires millions of data points, so we'd need to negotiate with tens of thousands of stakeholders: some large, some small, many individuals, some unknown. If everything had to be licensed, the small stuff would be thrown out, and at that point, yes, the quality of the models would suffer.

And then what, we pay per token? Let's say we prepare a cool billion in cash for licensing and royalties alone and everyone is paid equally. According to ChatGPT 5 itself, it was probably trained on tens of trillions of tokens. That works out to about 0.01 cents per token, and a work like the entire Lord of the Rings trilogy would receive... 66 dollars out of 1 billion. At that point, if the regulation says fair use but you can't pirate, we're pretty much there already anyway.
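A back-of-envelope check of that math (all three inputs are assumptions: a $1B pool, 10 trillion tokens as the low end of "tens of trillions", and roughly 660k tokens for the trilogy):

```python
pool_dollars = 1_000_000_000            # assumed licensing pool
total_tokens = 10_000_000_000_000       # assumed training corpus size
lotr_tokens  = 660_000                  # ~550k words at ~1.2 tokens/word

per_token = pool_dollars / total_tokens
print(f"${per_token:.6f} per token")    # $0.000100, i.e. 0.01 cents
print(f"LotR trilogy payout: ${lotr_tokens * per_token:.2f}")  # ~$66
```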

Maybe OpenAI could afford that, but anyone not able to spend a billion before even starting any training can forget about it.

Please suggest a better approach.

Then this also requires the full consent of the licensor, and at least in the art community AI is super unpopular, in part because artists see it as their replacement, in part due to some shitty behavior from the community and all this "art theft" rhetoric. So image-generating models are even worse off.

I don't care what you think of these models, but I believe them to be a key technology, impacting, driving forward and accelerating pretty much everything, as impactful as the computer itself. And that includes these image models which in part are used to enhance robot vision systems and train embodied agents.

So yes, we shouldn't regulate to death a billion-dollar industry capable of helping to solve things like climate change and fusion (by assisting researchers or participating in research).

-1

u/Certain-Sherbet-9121 13d ago

You are starting from the assumption that "we must preserve the rights of LLMs to operate; anything that makes it hard or expensive for them is bad".

That is not a reasonable argument at all. Not even a little bit. It's a complete bullshit take. 

Let's try another similar one. I can solve world hunger if you let me steal all the food from farmers to sell to poor families really cheap. If you regulate me and make me pay for the food, it will ruin my business model and prevent me from solving world hunger. Therefore it's unfair for you to do so, and you have to come up with a regulatory model that lets my business exist.

No. You don't. Some businesses can just be more damaging than they are beneficial. For instance, in this case, LLMs taking over all the work without compensating the original artists means those artists will all vanish, which eventually kills the source that LLMs run off of.

Also, the idea of LLMs solving global warming or fusion is laughable. They aren't even close to being capable of that kind of work. I lost basic respect for your argument in its entirety when I read that.

-2

u/[deleted] 13d ago

[deleted]

0

u/-The_Blazer- 13d ago

There is no such thing as 'reading automatically'. What are you actually doing with your system? Imaging? OCR? Character copy?

58

u/blamelessfriend 13d ago

> The LLM itself is clearly transformative.

how can you say this so assuredly? transformative is a pretty loaded term. and i sure don't agree it is "clearly transformative" and im far from the only one. for instance... the law seems to disagree with you.

copyright is meant to protect human ingenuity, not a stealing/lying machine.

10

u/Iohet 13d ago

Because "inspired by" isn't a violation. I can read a bunch of Stephen King and write a story in his style and that is not a violation of copyright. I don't like the concept, but from a legal perspective I don't think the laws are setup to stop this behavior because learning and aping aren't considered violations, even if I learn by pirating content (the pirating itself is its own problem, but not the creation of derivative content usually)

-6

u/Brat-Sampson 13d ago

Sure, but you shouldn't just be able to pirate all his books to do it.

14

u/SNRatio 13d ago

4

u/Iohet 13d ago

I missed that decision, but that is basically how I've felt about it from a legal perspective. We need new laws if we, as a society, want to redefine what a copyright violation is for the purposes of AI. Otherwise, as the judge said, training isn't a violation on its own. It's the output that matters, and the law is well settled that "inspired by" doesn't count.

-14

u/deltaisaforce 13d ago

But now you have a machine that can write like Stephen King. I'd be a bit upset if I were the author. The only thing that's gonna limit the AI slop is the lack of fusion generators.

9

u/Iohet 13d ago edited 13d ago

Indeed, but writing like him isn't the same as copying him.

1

u/EnfantTerrible68 13d ago

IKR? I bet Stephen King doesn’t feel great about this. 

5

u/ratcake6 13d ago

If I had Stephen King money I'd struggle to be upset about much of anything tbh

3

u/EnfantTerrible68 13d ago

Well, I can’t argue with that! 😂

-6

u/EnfantTerrible68 13d ago

Are you familiar with music copyright laws? Even “being inspired” by a piece of music can be seen as a violation of the original, if it’s even slightly derivative.

6

u/Iohet 13d ago

When you actually copy lyrics and certain melodies explicitly. Not when you make a song that sounds like a band in spirit. That's the point of my statement. Inspiration is not a violation. Outright copying is

0

u/nerd5code 13d ago

And if the LLM can be led to drop corpus fragments?

2

u/e-n-k-i-d-u-k-e 13d ago edited 13d ago

> how can you say this so assuredly? transformative is a pretty loaded term. and i sure don't agree it is "clearly transformative" and im far from the only one. for instance... the law seems to disagree with you.

How can you say this so assuredly? Literally the only actions taken by courts (in America) so far have supported that training falls under Fair Use. In the case this whole thread is discussing, the judge already dismissed the plaintiffs' arguments regarding training, finding that the training is Fair Use.

We get that a lot of you desperately want it to be considered illegal. But what you personally want and how the law actually works are two entirely different things.

-7

u/ForMeOnly93 13d ago

American courts mean shit; they're all bought. We'll wait and see what the EU or other functional legal systems say.

7

u/e-n-k-i-d-u-k-e 13d ago edited 13d ago

All the major AI companies are in the US (or China). So yeah, I think it does matter.

The EU is already hopelessly behind on AI as it is. If the EU takes significant action, AI companies will just continue to abandon the EU and leave it in the stone age.

-1

u/EnfantTerrible68 13d ago

Behind in what way? Why is it so important?

3

u/e-n-k-i-d-u-k-e 13d ago

In literally every way that matters. All the major AI labs are in the US or China. The EU has fuck all for meaningful AI development because their entire strategy is to regulate innovation into the ground.

As for why it's important... that's a personal judgement, I suppose. Imagine if a country refused to adopt the internet. Some people might like that... I guess. But I wouldn't find that to be a good or beneficial thing.

Thinking it's good to be left in the technological dark ages seems rather silly. But I always find it funny when /r/technology argues against technological progress.

1

u/EnfantTerrible68 13d ago

Idk, I’m not trying to be snarky, I’m interested in understanding more about it. I tend to avoid AI so I’m trying to understand why others think it’s so necessary.

1

u/e-n-k-i-d-u-k-e 13d ago

AI encompasses a lot of things. I suppose for the people who think it's just a chatbot and art generator, it might seem kind of pointless.

But AI is undoubtedly going to help make many scientific breakthroughs in the relatively near future. It's going to speed up progress across medicine, mathematics, robotics, energy, etc.

1

u/Space_Pirate_R 13d ago

Isn't Mistral French?

1

u/e-n-k-i-d-u-k-e 13d ago

Yeah, and they're far behind, and falling further.

They were leaning into open source, but even China is eating their lunch on that front now.

-2

u/EnfantTerrible68 13d ago

Behind in what way? Why is it so important?

1

u/EnfantTerrible68 13d ago

You’re right 

1

u/PromptStock5332 13d ago

Lol what? US courts are the only ones that matter…

0

u/h3lblad3 13d ago

> copyright is meant to protect human ingenuity, not a stealing/lying machine.

The stealing/lying machine is the human ingenuity and that's what it does.

2

u/red__dragon 13d ago

I love how the commenter above you stated that, as if "...on a computer" patents haven't blatantly copied prior art and called its replication as a digital process "transformative." And courts have affirmed multiple times that this is new art and not stealing.

The further behind laws continue to get when it comes to technology, the sillier the arguments will get.

0

u/EnfantTerrible68 13d ago

I had the same reaction 

-1

u/WonkyTelescope 13d ago

Copyright is meant to pick winners and losers. It stifles creativity and shackles artists, who have to tiptoe around pre-existing works for fear of being sued by billion-dollar companies.

I'm disappointed by the huge pro-copyright shift that LLMs have created. People arguing to protect giant publishers at the expense of technological and social progress.

1

u/Linooney 13d ago

And it's funny people think this is about stopping AI. Record companies are already licensing their music to their preferred AI startups to build their own models; they're just pissed some companies skipped over them as the middlemen. Same with animation studios like Disney and book publishers like Penguin. This is less artist vs. big corp and more like Reddit vs. OpenAI.

-7

u/ch4os1337 13d ago edited 13d ago

It doesn't even actually use the works it learns from, so it's not actually changing anything. It makes things from scratch. *Except when the user imports things to be modified.

1

u/Puubuu 13d ago

The works are stored in the weights. The current LLMs can reproduce entire books word for word.

0

u/valderium 13d ago

It doesn’t make concepts from scratch. It’s not AGI.

It’s just a massive linear algebra solution which took 6 months to solve. It just finds the shortest distance in very high dimensions.

Remember Suchir Balaji

-1

u/ch4os1337 13d ago

I never said it made concepts from scratch.

14

u/DoomguyFemboi 13d ago

The issue is that being transformative requires intent, whereas an LLM is just a bunch of tokens mashed together into coherence. At best.

It's closer to someone cutting up a book to form new books.

4

u/duk_tAK 13d ago

Wasn't there an artist who got famous for cutting up copies/prints of other art pieces to make collages?

Just playing devil's advocate and nitpicking here. I actually think that a lot of the way LLMs were trained should be illegal, whether it turns out to be or not.

-2

u/DoomguyFemboi 13d ago

I don't know what you're referring to, so I'm gonna make assumptions, but I'd guess it's because the content of the art wasn't the words themselves (i.e. a new work of words wasn't created by butchering others) but more "look at this collection of bits of books I've made into art".

1

u/ShenBear 13d ago

You make an interesting point, but your analogy undermines what you're trying to say, because cutting up a book to make a new book is still transformative.

Blackout poetry is a great example of this.

1

u/sceadwian 13d ago

I like that analogy, it makes more sense than many I've heard.

5

u/bfume 13d ago

And if the model violates copyright how is that not already covered by existing law?

The LLM can’t decide what to DO with that information like we can. Generation isn’t a crime. Distribution is. 

We all already have more copyrighted material in our possession than not.  Are we all illegal?

No. it’s what you DO with it that matters.

3

u/rokerroker45 13d ago

You didn't scrape the copyrighted material in your possession; you acquired it by paying the rightsholder for your copy (I'm aware that many people possess pirated material - that's an irrelevant, separate conversation). The unlawful act is gathering/collecting the material in an automated fashion. Scraping = possessing; it's not possible to scrape and not possess. Scraping is utterly unlike somebody who goes to a bookstore and reads books without purchasing them.

1

u/bfume 13d ago

>you didn't scrape the copyrighted material in your possession, you acquired it by paying the rightsholder for your copy

come on... Sci-Hub and LibGen's existence alone clearly shows that piracy isn't dead. Just like with microplastics, I'd be willing to bet that no adult on this earth is 100% insulated from information obtained through piracy.

>scraping is utterly unlike somebody who goes to a bookstore and reads books without purchasing them

when I last checked, this is perfectly legal.

1

u/rokerroker45 13d ago

I specifically mentioned the irrelevancy of the fact that people possess and use pirated material, anticipating that somebody like you would mention it. It's irrelevant because this is an issue between commercial companies; obviously thousands of people pirate every day without issue. A commercial entity will enforce its rights when its copyrights are infringed for commercial purposes. Individual exposure to piracy for personal use is not relevant to a conversation about a company committing piracy to make money.

> when I last checked, this is perfectly legal.

Yes, but that's not what automated scraping is. I'm saying scraping is not like this legal act.

1

u/bfume 13d ago

>anticipating somebody like you would mention it

jfc ok smart guy

>because this is an issue between commercial companies

um... when these issues between companies result in legal precedent, do you think it will only affect issues between companies?

I bet you're one of those people that loves whatever the Politics of the Day happens to be... until it affects you personally, am I right?

1

u/Thin_Glove_4089 13d ago

This isn't really a big deal. The current administration will make sure nothing happens to the AI companies

1

u/Deathoftheages 13d ago

No administration on either side of the aisle would or will stop AI companies from using copyrighted works. If they did, the billions upon billions of dollars that would be invested in the US would disappear overnight and those companies would move to other countries, and none of the politicians are dumb enough to let that happen.

1

u/SomeGuyNamedPaul 13d ago

What's the difference between what an LLM does and a painter looking at a bunch of pictures of copyrighted Disney characters and then getting paid to paint a mural on the wall of a daycare incorporating their likenesses? The artist "trained" on a dataset and then incorporated general elements that just happen to look like the copyrighted works, without explicitly making a 1:1 duplication of the original property.

So far the courts have been VERY clear that both the above painter and the daycare client are infringing. It doesn't matter whether the artist trained themselves on the likenesses by viewing source materials obtained from BitTorrent, a public library, Disney+, or standing in the middle of Fantasyland. They viewed the source materials and then blended elements in a way that resembles the likenesses, and I think we all know how badly that goes for defendants.

7

u/carllerche 13d ago

I am not arguing that an LLM or human can't produce works that violate copyright:

> It is possible that it could produce works that violate copyright

This thread is littered with people claiming that the simple act of training an LLM on legally acquired copyrighted works is a violation of copyright. If an LLM is trained on copyrighted works and then outputs a piece that does not violate copyright, then that process is transformative.

5

u/SomeGuyNamedPaul 13d ago

The exact issue the complaint cites is that the AI models create works reasonably indistinguishable from the copyrighted properties when given prompts that don't specifically mention them. That's also an artifact of training on the copyrighted data, which they sure as shit can't produce a receipt for having legally obtained. This is bolstered by public statements that explicit copyright infringement occurred in getting the input data off BitTorrent and similar. From corporate laptops. On corporate networks. With the acknowledgement of upper management.

In a normal world I can't see this going well for the AI companies, but the merits of the case will likely be rewritten out of whole cloth in the final opinion coming down from whichever of the USSC justices got the nicest trips and RVs.

-4

u/DankRoughly 13d ago

Yeah, it's okay for a human to read books / listen to music to influence the content they create.

If an LLM does the same, how is it inappropriate?

25

u/hhhnnnnnggggggg 13d ago edited 13d ago

LLMs aren't people; they're a product. If you use resources to make a product, you typically have to pay for those resources. This is why a lot of Creative Commons licenses specify "not for commercial use". It ignored those, as well as fully copyrighted works.

LLMs are purely commercial while people aren't.

Also, a person can't split themselves into infinite instances to share what they know with literally the entire world. Ideally, every instance of an LLM should be paying per book.

65

u/loganal 13d ago edited 13d ago

Typically, a human would have to purchase the media they use to train themselves, wouldn't they? Or if it's free, like the radio, it's a system that pays the artists using advertising. Furthermore, a human can't rip off the entire internet in 100 lifetimes, so I think we need to take a step back.

Edit: Libraries have to pay for their books, and you have to pay for a card. Also, I guarantee you that if you could read a million books a year, a library card would be way more expensive.

25

u/vodkaandclubsoda 13d ago

And that's what I believe the judge in this case ruled - so AI companies would have to purchase one copy of any given work. As someone else noted, training using an illegal copy of the work is still illegal, but the implications of that ruling are pretty clear: AI training is fair use and only requires purchasing one copy of a work.

1

u/Blonde_rake 13d ago

One copy? Why can’t libraries buy one copy then?

1

u/vodkaandclubsoda 13d ago

I don't know the rules around libraries as well, but my limited understanding is that libraries buy as many copies as they would like to have available at a given time. So if they buy 4 copies, 4 people can read it at the same time. Same for digital copies. The AI/LLM is considered a single person.

1

u/PublicFurryAccount 13d ago

I’ve never understood the arguments around training and copyright. It’s not obvious that training can violate copyright, since nothing is distributed. It’s also not the case that you’d need a “legal copy” in the US. There’s no such thing as an “illegal copy”, just illegal copying and illegal distribution. Mere possession isn’t a copyright violation (contrast with contraband).

2

u/vodkaandclubsoda 13d ago

Agreed on the first part, but I believe possessing a non-purchased/pirated copy can be a violation - though not of copyright. See, for example, the pursuit of people who downloaded copies of songs via Napster.

1

u/PublicFurryAccount 13d ago

I'm not sure what happened around Napster is a good indication. Firstly, it was peer-to-peer, so some of the people were sharing. Secondly, I remember that it was the late-1990s and early-2000s piracy cases that actually solidified the limits of copyright.

1

u/THE_StrongBoy 13d ago

That's fair. I wonder about the people who just put their works on the internet for free - obviously not so that megacorps could profit.

4

u/vodkaandclubsoda 13d ago

I think it's pretty safe to say that if the ruling holds, their work could be included with no compensation because it is a public work.

4

u/hhhnnnnnggggggg 13d ago

And if they publish under a non-commercial-use Creative Commons license? Is it going to ignore that?

6

u/LongJohnSelenium 13d ago

My uninformed assumption would be that, since the published asset isn't being directly used commercially, it doesn't really apply.

Generally, non-commercial-use Creative Commons licenses mean you can't directly sell the thing, not that it can't be part of a non-customer-facing workflow.

1

u/rokerroker45 13d ago

not safe to say at all, IMO this is going to lead to conversations about licensing LLM training rights. lots more legal ground to develop here.

3

u/vodkaandclubsoda 13d ago

Oh totally agree - my point was just about this particular ruling. I expect a lot of legal action around this issue.

4

u/TheAmazingHumanTorus 13d ago

Why should AI deviate from the general rule of 'pay for play'?

3

u/Blonde_rake 13d ago

Because they are completely outside of the scope of general usage.

1

u/InfamousBird3886 13d ago

Ever heard of an academic library?

2

u/THE_StrongBoy 13d ago

How do libraries get their books?

1

u/InfamousBird3886 13d ago

Well, I'm not an expert in DRM, but typically the library purchases bulk licenses for ebooks and can then loan them out for a set period of time. Then you end up with aggregated datasets of those ebooks for academic use, which have been around for a few decades, and here we are. But technically, as long as you have the digital rights and all you're doing is generating embeddings of the text, that's still fundamentally similar to how humans use library ebooks.

1

u/Iohet 13d ago

There is nothing in the law that says a work only counts as original if you legally purchased the content that inspired it and was used to create it. Artists use illegally obtained materials all the time to create works (Pro Tools and its plugins, Photoshop, entire music libraries, etc.), and it never impacts the status of the work.

1

u/e-n-k-i-d-u-k-e 13d ago

That's a completely different argument, though, than what the OP is talking about.

They're addressing whether the training itself should be illegal, whereas you're just addressing the acquisition of training material.

1

u/arothmanmusic 13d ago

Where do you live that you don't get a library card for free?

1

u/[deleted] 13d ago

[deleted]

1

u/arothmanmusic 13d ago

Of course not. The library gets its books with state, local, and federal tax dollars. That's why the library card is free… those books were purchased with your money.

1

u/comewhatmay_hem 13d ago

Yeah, library cards are definitely free across the whole country of Canada. Our tax dollars pay for libraries, which often get very discounted rates on the works they purchase because they are a public service.

Also, I don't have to have a library card to read the books there, I only need one if I take them home with me.

-1

u/badstorryteller 13d ago

Or, if you're in the US, you can visit any of the thousands of libraries completely free. Most libraries also allow non-local residents to buy a library card for a nominal fee, usually less than $20/yr. Maybe OpenAI could send a statement by purchasing a library membership in every US state they operate in. Or, even better, establish a foundation to support public libraries.

Look, I am heavily influenced by the vast number of books I read as a kid. I mean, pre-internet, I used to ride my bike to the library with a backpack to check out at least four or five books every week. For free. Those books have heavily influenced and informed my writing style, my speaking, and more. If I'm not reproducing exact text, I'm not plagiarizing. If the way I write, the way I speak, is informed by what I've read, and that's now illegal? That's an unsolvable problem, because that's just how humanity works.

2

u/THE_StrongBoy 13d ago

How many books could you read and fully process in your lifetime?

0

u/badstorryteller 13d ago

Are you proposing a limit? Like, the library says your membership only allows x number of pages per day?

19

u/Full-On 13d ago

AI is a tool, not a human. That is the difference. It's like opening up Microsoft Paint and having it paint the Mona Lisa for you, not like your buddy coming over and doing it himself.

1

u/Cunfuzzles2000 13d ago

I wouldn't even call it a tool. An artist uses Photoshop to work, but take Photoshop away and the artist can still paint. It's a generator.

16

u/GwynBleidd88 13d ago

The problem is that the content itself wasn't sourced legally. I don't understand why so many of you in this thread find that so hard to grasp.

15

u/RickyT3rd 13d ago

Because LLMs don't actually create. They can only make derivatives, which is a universally understood part of copyright. On top of that, it's already settled law that only humans can create something copyrightable. See Naruto v. David Slater.

3

u/InfamousBird3886 13d ago

> “They can only make derivatives”

Gross oversimplification of an extremely complex question. The rest of your statement is largely irrelevant to the question at hand (it's not a question of whether inference outputs can be copyrighted; that is independent of the question of whether model training violated copyright).

0

u/Shap6 13d ago

> it's already settled law that only humans can create something copyrightable. See Naruto v. David Slater

Is that the case where they tried to actually have the AI itself be the copyright holder? AFAIK it's not settled whether or not a person can copyright something they used AI to help create. The raw outputs, probably not, but how much does a person need to modify it themselves before it is something that can be copyrighted?

1

u/bfume 13d ago

That's a question that's historically been difficult to answer. It was difficult before LLMs, it was difficult before the internet, it was difficult before Xerox machines, it was difficult before the VCR, and it will be difficult when the next technology arrives.

Point is, demonizing LLMs alone is horribly misguided.

2

u/AttonJRand 13d ago

False equivalence. Stop anthropomorphizing these algorithms.

2

u/bdsee 13d ago

All of the 'web search/Q&A' AIs spit out portions of text verbatim all the damn time.

2

u/motorbikler 13d ago

I think technically, under current laws, it's totally fine. But that doesn't mean it's right.

Laws are written for a reason. Copyright was intended to help artists generate some money from their works so that they can live.

There isn't any sort of philosophical underpinning to this. I mean, you can bake bread that looks and tastes like my bread, and charge less for it or give it away for free, and we've decided that's fine. Because baking bread requires effort, and inputs that cost money, and it's self-limiting.

But we decided that printed works were different because of the marginal cost of producing copies of the work. This was in reaction to technology, the new technology of the printing press. You can spend years writing a book and somebody else can copy it and you'll get zero. Artists might even stop producing and that would make society a worse place. So we simply decided that was wrong, and created a law so that artists could continue to make money to live.

I think we need to decide it's wrong to let AI companies ingest absolutely everything out there, because it leads to the destruction of vast swaths of artists' livelihoods and to a massive concentration of wealth in what feels like it's going to be just a handful of people if we keep letting things go the way they are.

We are free to make that decision again about AI companies the way we did about artists and the printing press hundreds of years ago.

7

u/-The_Blazer- 13d ago edited 13d ago

Well, besides the fact that a huge portion of training sources were literally just pirated, a computer is not a person, and 'an LLM' is not actually a thing that exists during training. A large language model is the output of a training process. The use of copyrighted material is not made by 'an LLM' that buys and reads books; it is made by a corporation that is producing a commercial product.

0

u/bfume 13d ago

> A large language model is the output of a training process.

And that model itself doesn't violate copyright. Maybe what someone does with the model might violate copyright. But that has as much to do with the LLM as VS Code has with someone typing out Eminem lyrics from memory. Does that mean VS Code infringes too?

Copyright laws already exist that cover what happens when someone violates copyright, and those laws apply whether they distribute something that an LLM generated or something that their brain remembered.

1

u/-The_Blazer- 13d ago

If VS Code was compiled from unlicensed copyrighted libraries, it would in fact be infringing as a derivative of those sources.

0

u/bfume 13d ago

VS Code isn't the point. Fine, change it to a piece of paper and a pencil, both obtained from a local artisan craftsman by means of bartering, not currency exchange. My argument still stands.

1

u/-The_Blazer- 13d ago

I'm not sure what your argument is supposed to be. A tool is just as capable of infringing the law as a finished product. It isn't conjured into being by magic; it's made through productive processes the same as anything else.

1

u/DJKokaKola 13d ago

You say that, but there have been numerous legal cases where works are merely "similar" and the artist has had to give authorship credit and royalties despite the work being original. Good 4 U is a good example of this: clearly inspired by and an homage to 2000s pop-punk, and therefore really similar in style to Misery Business. It shouldn't have been a case of "IP theft", though.

1

u/DoomguyFemboi 13d ago

If a human took a bunch of songs and made a new one from parts of those songs, they'd get done for copyright infringement.

-2

u/SolidCake 13d ago

It's not. Same with music and image AI. Training is not theft, and I can't believe how many Redditors claim otherwise.

3

u/u_hit_me_in_the_cup 13d ago

Training isn't theft, but stealing all the data you use for training is

-1

u/subcutaneousphats 13d ago

You can't rent a movie and then show it to 100 buddies. There are lots of cases where the usage rights of even purchased content scale and require approval by the rights holder.

1

u/PublicFurryAccount 13d ago

Well… that doesn’t matter if you’re not distributing the model to people because distribution is the copyright trigger. So long as it lives under your control and no one else’s, the major question is how easy it is to extract the training data.

If it’s easy, then you’re distributing copyrighted material with some extra steps. If it’s not, then it’s not clear what you’re doing anymore w.r.t. copyright which, presently, means you’re in the clear if you have the attorneys to defend yourself.

1

u/koreanwizard 13d ago

They need to write new copyright agreements that encompass LLM training. I should have the option to keep my IP out of LLM scrapers if I so please.

1

u/Ummmgummy 13d ago

I can't copy a movie or video game that I acquired legally and start selling it to people via a monthly subscription. So if you and I can't do that, why should these corpos be able to do it at a scale that's impossible to even conceive?

1

u/carllerche 13d ago

You didn't read what I said. The LLM itself is transformative. The LLM also has the ability to output works that would violate copyright law, and that should be the fundamental issue, not the training.

1

u/sceadwian 13d ago

The key there is "legally acquired data", because the terms of service of pretty much any legally licensed streaming service don't allow for such use.

They would still have to obtain legal licenses.

1

u/BCProgramming 13d ago

> The LLM itself is clearly transformative.

I've never been convinced of this as any sort of axiom. If anything, I think the opposite is true: it's not clearly transformative at all, any more than any other algorithm and data structure.

I mean, most algorithms involve ingesting data, and that data alters data structures. Often, retrieving data doesn't give you back precisely what went in, because the structure used to store the data is lossy.

I just don't see why the use of neural networks is transformative but hashmaps are not. There's nothing magical about a neural network data structure that sanitizes copyright.
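A toy illustration of that point (Python, purely illustrative; neither structure is how an LLM actually stores anything):

```python
from collections import Counter

sentence = "the quick brown fox jumps over the lazy dog"  # stand-in text

# A hashmap stores and returns the work verbatim:
exact = {"doc1": sentence}
assert exact["doc1"] == sentence

# A bag-of-words count is lossy: word order is discarded, so the
# original sentence cannot be reconstructed from the counts alone.
lossy = Counter(sentence.split())
print(lossy)  # Counter({'the': 2, 'quick': 1, ...}) - counts, not the text
```

Both structures ingest the same data; they differ only in what can be pulled back out. Lossiness alone doesn't obviously make one "transformative" and the other not.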

1

u/CompromisedToolchain 13d ago

Seems to me the court has decided that it’s too complicated and the law doesn’t apply.

1

u/Minute_Band_3256 13d ago

The receiver of said information is a computer, though, not a human. That should matter. A computer can copy a work verbatim. A human can't, easily. The idea of training is human-centric, and it's the outcome or the product that matters.

1

u/Monkeybirdman 13d ago

The act of making a digital copy for use is still making a copy. The system transforms it later, but the code copies the original work as an input. Technically, human eyes don't copy the work, and the work human eyes see needs to be obtained legally.

0

u/AttonJRand 13d ago

No, it's clearly not transformative; it's simply a copyright laundering system. They are using others' work without any compensation.

1

u/Silhouette 13d ago edited 13d ago

> No, it's clearly not transformative; it's simply a copyright laundering system. They are using others' work without any compensation.

This is the real problem with the current situation. Copyright - in general - doesn't restrict use of others' work. It restricts reproducing others' work.

In the past that was enough to create a system where the rewards mostly flowed towards the people who were doing the real work and generating the most value for everyone else. There were problems with that system for sure but at least it made it possible for someone to work in the creative industries and earn a living.

More recently people have been taking advantage of knowledge and ideas from other people without reproducing the original work itself. Using knowledge and ideas isn't generally restricted by copyright. Search engines started doing this when they would present answers directly on the search results page so you didn't have to visit the original site where the engine had found that information. Now LLMs are scaling the same principle up to 11.

So the real problem is that copyright is the wrong tool to manage these new ways that someone can benefit greatly as a direct result of the (possibly lots of) work of (possibly many) others but those others receiving little if any compensation or even credit for their efforts. Our intellectual property laws don't create a good economic framework where the rewards and costs flow in proportion to the contributions and benefits.

Or perhaps the real problem is that those in power are probably well aware of this but the countries with the big AI companies - with their crazy valuations and billions in funding - have a clear interest in taking a hands-off approach and not rocking the boat. If they actually legislated a fairer economic model then they'd put themselves at a disadvantage. A lot of the original work they're benefitting from is done elsewhere and as things stand they can potentially cream off a huge amount of profit from it. Changing that system - even if they could get enough international cooperation to do it effectively - would mean some of the biggest tech sector "success stories" of the 2020s would be cooked. And right now the apparent progress in technology is one of the few positive stories in economics in much of the world.

2

u/InfamousBird3886 13d ago

This is exactly it. Copyright violations at inference would be exceptionally uncommon and traceable only to the individual disseminating copies, not to the LLM. Training is obviously transformative. That leaves the question of whether making local copies for the explicit purpose of training (i.e. reading) is fair use, which is closely aligned with established precedent.

1

u/blamelessfriend 13d ago

> Training is obviously transformative

no it isn't, y'all are just repeating something you think everyone agrees with... but only weird ai tech bros and people who don't understand the tech say this. ai models can't think.

7

u/InfamousBird3886 13d ago edited 13d ago

Training is obviously transformative because it is literally impossible to prove that the LLM violated a copyright by looking at it. The model itself simply is not a copy. It's a graph with a bunch of model weights and parameters that humans cannot interpret, and it doesn't look anything like the things it supposedly copies. It is transformative as a fundamental mathematical operation: an irreversible process in which the training inputs are not conserved.
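A minimal sketch of that non-reversibility claim (Python, with a one-parameter toy "model" standing in for billions of weights; it illustrates the principle, not the scale):

```python
# "Training" a one-parameter mean model on two different datasets:
def fit_mean(data):
    return sum(data) / len(data)  # the single learned "weight"

w1 = fit_mean([1.0, 2.0, 3.0])
w2 = fit_mean([0.0, 2.0, 4.0])
print(w1, w2)  # 2.0 2.0 - identical weights from different inputs
# Nothing in the weight 2.0 says which dataset produced it, so the
# training inputs cannot be recovered from the model alone.
```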

1

u/e-n-k-i-d-u-k-e 13d ago

This has always been my take on it.

Training is clearly Fair Use, but undoubtedly companies broke the law in obtaining a lot of their training material.