OpenAI now tries to hide that ChatGPT was trained on copyrighted books, including J.K. Rowling's Harry Potter series

L4sBot@lemmy.world · 3 years ago

Cyfuture AI@lemmy.world · 8 months ago

OpenAI has stated that its models were trained on publicly available and licensed data. There is no confirmed evidence that ChatGPT was specifically trained on copyrighted books like J.K. Rowling’s Harry Potter series. The company has not disclosed the full details of its training data.

rosenjcb@lemmy.world · edit-2 3 years ago

The powers that be have done a great job convincing the layperson that copyright is about protecting artists and not publishers. It’s historically inaccurate and you can discover that copyright law was pushed by publishers who did not want authors keeping second hand manuscripts of works they sold to publishing companies.

Additional reading: https://en.m.wikipedia.org/wiki/Statute_of_Anne

Thorny_Thicket@sopuli.xyz · 3 years ago

I don’t get why this is an issue. Assuming they purchased a legal copy that it was trained on then what’s the problem? Like really. What does it matter that it knows a certain book from cover to cover or is able to imitate art styles etc. That’s exactly what people do too. We’re just not quite as good at it.

Hildegarde@lemmy.world · 3 years ago

A copyright holder has the right to control who has the right to create derivative works based on their copyright. If you want to take someone’s copyright and use it to create something else, you need permission from the copyright holder.

The one major exception is Fair Use. It is unlikely that AI training is a fair use. However this point has not been adjudicated in a court as far as I am aware.

LordShrek@lemmy.world · 3 years ago

this is so fucking stupid though. almost everyone reads books and/or watches movies, and their speech is developed from that. the way we speak is modeled after characters and dialogue in books. the way we think is often from books. do we track down what percentage of each sentence comes from what book every time we think or talk?

SpiderShoeCult@sopuli.xyz · 3 years ago

Aye, but I’m thinking the whole notion of copyright is banking on the fact that human beings are inherently lazy and not everyone will start churning out books in the same universe or style. And if they do, it takes quite some time to get the finished product and they just get sued for it. It’s easy, because there’s a single target.

So there’s an extra deterrent to people writing and publishing a new harry potter novel, unaffiliated with the current owner of the copyright. Invest all that time and resources just to be sued? Nah…

Issue with generating stuff with 'puters is that you invest way less time, so the same issue pops up for the copyright owner, they’re just DDoS-ed on their possible attack routes. Will they really sue thousands or hundreds of thousands of internet randos generating harry potter erotica using an LLM? Would you even know who they are? People can hide money away in Switzerland from entire governments, I’m sure there are ways to hide your identity from a book publisher.

It was never about the content, it’s about the opportunities the technology provides to halt the gears of the system that works to enforce questionable laws. So they’re nipping it in the bud.

LordShrek@lemmy.world · 3 years ago

this brings up the question: what is a book? what is art? if an “AI” can now churn out the next harry potter sequel and people literally can’t tell that it’s not written by JK Rowling, then what does that mean for what people value in stories? what is a story? is this a sign that we humans should figure something new out, instead of reacting according to an outdated protocol?

yes, authors made money in the past before AI. now that we have AI and most people can get satisfied by a book written by AI, what will differentiate human authors from AI? will it become a niche thing, where some people can tell the difference and they prefer human authors? or will there be some small number of exceptional authors who can produce something that is obviously different from AI?

i see this as an opportunity for artists to compete with AI, rather than say “hey! no fair! he can think and write faster than me!”

FatCat@lemmy.world · 3 years ago

It is not a derivative work, it is a transformative work. Just like human artists “synthesise” art they see around them and make new art, so do LLMs.

BURN@lemmy.world · 3 years ago

LLMs don’t create anything new. They have limited access to what they can be based on, and all assumptions made by it are based on that data. They do not learn new things or present new ideas. Only ideas that have been already done and are present in their training.

Default_Defect@midwest.social · 3 years ago

They made it read Harry Potter? No wonder it’s gonna kill us all one day.

Uriel238 [all pronouns]@lemmy.blahaj.zone · edit-2 3 years ago

Training AI on copyrighted material is no more illegal or unethical than training human beings on copyrighted material (from library books or borrowed books, no less!). And trying to challenge the veracity of generative AI systems on the notion that it was trained on copyrighted material only raises the specter that IP law has lost its validity as a public good.

The only valid concern about generative AI is that it could displace human workers (or swap out skilled jobs for menial ones) which is a problem because our society recognizes the value of human beings only in their capacity to provide a compensation-worthy service to people with money.

The problem is this is a shitty, unethical way to determine who gets to survive and who doesn’t. All the current controversy about generative AI does is kick this can down the road a bit. But we’re going to have to address soon that our monied elites will be glad to dispose of the rest of us as soon as they can.

Also, amateur creators are as good as professionals, given the same resources. Maybe we should look at creating content by other means than for-profit companies.

Skanky@lemmy.world · 3 years ago

Vanilla Ice had it right all along. Nobody gives a shit about copyright until big money is involved.

uzay@infosec.pub · 3 years ago

I hope OpenAI and JK Rowling take each other down

Touching_Grass@lemmy.world · 3 years ago

What’s the issue against openAI?

BURN@lemmy.world · 3 years ago

They’re stealing a ridiculous amount of copyrighted works to use to train their model without the consent of the copyright holders.

This includes the single person operations creating art that’s being used to feed the models that will take their jobs.

OpenAI should not be allowed to train on copyrighted material without paying a licensing fee at minimum.

uzay@infosec.pub · 3 years ago

Also Sam Altman is a grifter who gives people in need small amounts of monopoly money to get their biometric data

LifeInMultipleChoice@lemmy.ml · 3 years ago

So hypothetical here. If Reddit did launch a system that made it so users could trade Karma in for real currency or some alternative, does that mean that all fan fiction and all other material created by fanboy accounts would become copyright infringement, as they are now making money off the original works?

Touching_Grass@lemmy.world · 3 years ago

If they purchased the data or the data is free, it’s theirs to do what they want with, short of violating the copyright by, say, reselling the original work as their own. Training off it should not violate any copyright if the work was available for free or purchased by at least one person involved. Capitalism should work both ways

BURN@lemmy.world · 3 years ago

But they don’t purchase the data. That’s the whole problem.

And copyright is absolutely violated by training off it. It’s being used to make money and no longer falls under even the widest interpretation of fair use.

GroggyGuava@lemmy.world · edit-2 3 years ago

You need to expand on how learning from something to make money is somehow using the original material to make money. Considering that’s how art works in general, I’m having a hard time taking the side of “learning from media to make your own is against copyright”. As long as they don’t reproduce the same thing as the original, I don’t see any issues with it. If they learned from Lord of the rings to then make “the Lord of the rings” then yes, that’d be infringement. But if they use that data to make a new IP with original ideas, then how is that bad for the world/ artists.

BURN@lemmy.world · 3 years ago

Creating an AI model is a commercial work. They’re made to make money. Now these models are dependent on other artists data to train on. The models would be useless if they weren’t able to train on anything.

I hold the stance that using copyrighted data as part of a training set is a violation of copyright. That still hasn’t been fully challenged in court, so there’s no specific legal definition yet.

Since the model requires copyrighted materials to function, I feel that they are using copyrighted works in order to build a commercial product.

Also AI doesn’t learn. LLMs build statistical models based on sentence structure of what they’ve seen before. There’s no level of understanding or inherent knowledge, and there’s nothing new being added.

Touching_Grass@lemmy.world · 3 years ago

How do they get the data if it’s not purchased or freely available?

BURN@lemmy.world · 3 years ago

It may be freely available for non-commercial works, e.g. photos on Photobucket, Internet Archive free book archives, etc.

Most everything is on the internet these days, copyrighted or not. I’m sure if I googled enough I could find the entire text of Harry Potter for free. I still haven’t purchased it, and technically it’s not legally freely available. But in training these models I guarantee they didn’t care where the data came from, just that it was data.

I’m against piracy as well for the record, but pretty much everything is available through torrenting and pirate sites at this point, copyright be damned.

Touching_Grass@lemmy.world · edit-2 3 years ago

Don’t care, it’s not my problem or these LLMs’ problem that they don’t secure their copyright. They shouldn’t come asking for others to pay for them not securing their data. I see it as a double edged sword.

I really hope this is a wake up call to all creative types to pack up and not use the internet like a street corner while they busk.

If they want to come online to contribute like everybody else, just have fun and post stuff, that’s great. But all of them are no different than any other greedy corporation. They all want more toll roads. When they do make it and earn millions and get our attention they exploit it with more ads. It swallows all the free good content. Sites gear towards these rich creators. They lawyer up and sue everybody and everything that looks or sounds like them. We lose all our good spaces to them.

I hope the LLM allows regular people to shit post in peace finally.

Corkyskog@sh.itjust.works · 3 years ago

They used to be a non-profit that immediately turned into a for-profit when their product was refined. They took a bunch of people’s effort, whether it be training materials or training monkeys using the product, and then slapped a huge price tag on it.

Blapoo@lemmy.ml · 3 years ago

We have to distinguish between LLMs

Trained on copyrighted material and
Outputting copyrighted material

They are not one and the same

TwilightVulpine@lemmy.world · 3 years ago

Should we distinguish it though? Why shouldn’t (and didn’t) artists have a say if their art is used to train LLMs? Just like publicly displayed art doesn’t provide a permission to copy it and use it in other unspecified purposes, it would be reasonable that the same would apply to AI training.

Tetsuo@jlai.lu · 3 years ago

Output from an AI has just been recently considered as not copyrightable.

I think it stemmed from the actors strikes recently.

It was stated that only work originating from a human can be copyrighted.

Anders429@lemmy.world · 3 years ago

“Output from an AI has just been recently considered as not copyrightable.”

Where can I read more about this? I’ve seen it mentioned a few times, but never with any links.

Even_Adder@lemmy.dbzer0.com · 3 years ago

They clearly only read the headline. If they’re talking about the ruling that came out this week, that whole thing was about trying to give an AI authorship of a work generated solely by a machine and having the copyright go to the owner of the machine through the work-for-hire doctrine. So an AI itself can’t be an author or hold a copyright, but humans using them can still be copyright holders of any qualifying works.

Even_Adder@lemmy.dbzer0.com · 3 years ago

Yeah, this headline is trying to make it seem like training on copyrighted material is or should be wrong.

scv@discuss.online · 3 years ago

Legally the output of the training could be considered a derived work. We treat brains differently here, that’s all.

I think the current intellectual property system makes no sense and AI is revealing that fact.

RadialMonster@lemmy.world · 3 years ago

what if they scraped a whole lot of the internet, and those excerpts were in random blogs and posts and quotes and memes etc etc all over the place? They didn’t ingest the material directly, or knowingly.

chemical_cutthroat@lemmy.world · 3 years ago

That’s why this whole argument is worthless, and why I think that, at its core, it is disingenuous. I would be willing to bet a steak dinner that a lot of these lawsuits are just fishing for money, and the rest are set up by competition trying to slow the market down because they are lagging behind. AI is an arms race, and it’s growing so fast that if you got in too late, you are just out of luck. So, companies that want in are trying to slow down the leaders, at best, and at worst they are trying to make them publish their training material so they can just copy it. AI training models should be considered IP, and should be protected as such. It’s like trying to get the Colonel’s secret recipe by saying that all the spices that were used have been used in other recipes before, so it should be fair game.

beetus@sh.itjust.works · 3 years ago

Not knowing something is a crime doesn’t stop you from being prosecuted for committing it.

It doesn’t matter if someone else is sharing copyrighted works and you don’t know it and use them in ways that infringe on that copyright.

“I didn’t know that was copyrighted” is not a valid defence.

ClamDrinker@lemmy.world · edit-2 3 years ago

This is just OpenAI covering their ass by attempting to block the most egregious and obvious outputs in legal gray areas, something they’ve been doing for a while, hence why their AI models are known to be massively censored. I wouldn’t call that ‘hiding’. It’s kind of hard to hide it was trained on copyrighted material, since that’s common knowledge, really.

paraphrand@lemmy.world · 3 years ago

Why are people defending a massive corporation that admits it is attempting to create something that will give them unparalleled power if they are successful?

Crozekiel@lemmy.zip · 3 years ago

AI is the new fan boy following since it became official that nfts are all fucking scams. They need a new technological God to push to feel superior to everyone else…

SCB@lemmy.world · 3 years ago

Leftists hating on AI while dreaming of post-scarcity will never not be funny

Whimsical@lemmy.world · 3 years ago

The dream would be that they manage to make their own glorious free & open source version, so that after a brief spike in corporate profit as they fire all their writers and artists, suddenly nobody needs those corps anymore because EVERYONE gets access to the same tools - if everyone has the ability to churn out massive content without hiring anyone, that theoretically favors those who never had the capital to hire people to begin with, far more than those who did the hiring.

Of course, this stance doesn’t really have an answer for any of the other problems involved in the tech, not the least of which is that there’s bigger issues at play than just “content”.

afraid_of_zombies@lemmy.world · 3 years ago

I am sure they have patched it by now but at one point I was able to get chatgpt to give me copyrighted text from books by asking for ever larger quotations. It seemed more willing to do this with books out of print.