Sam Altman indicated it's impossible to create ChatGPT without copyrighted material, but a new study claims 57% of the content on the internet is AI-generated and is subtly killing quality search results

ooli@lemmy.world · 3 months ago

DarkThoughts@fedia.io · 3 months ago

Are we maybe talking about 57% of newly created content? Because I also have a very hard time believing that LLM-generated content has already surpassed the last few decades of accumulated content on the internet.

Ephera@lemmy.ml · edit-2 3 months ago

I’m too dumb to understand the paper, but it seems quite likely that the 57% headline is a misinterpretation.

What I’ve figured out:

They’re exclusively looking at text.
Translations are an important factor. Lots of English content is taken and (badly) machine-translated into other languages to grift ad money.

What I can’t quite figure out:

Do they only look at translated content?
Is their dataset actually representative of the whole web?

The actual quote from the paper is:

Of the 6.38B sentences in our 2.19B translation tuples, 3.63B (57.1%) are in multi-way parallel (3+ languages) tuples

And “multi-way parallel” means translated into multiple languages:

The more languages a sentence has been translated into (“Multi-way Parallelism”)

But yeah, no idea what their “translation tuples” actually contain. They seem to do some deduplication of sentences, too. In general, it very much feels like quoting that 57.1% without any of the context is a massive oversimplification.
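
To make the quoted statistic a bit more concrete, here is a minimal Python sketch of how such a share could be computed. Everything in it is an assumption for illustration: the toy sentences are invented, the idea that a "translation tuple" maps languages to sentences is a guess at the structure, and `multiway_share` is a hypothetical helper, not anything from the paper.

```python
# Toy data, invented for illustration only.
# ASSUMPTION: a "translation tuple" groups versions of one sentence by language.
toy_tuples = [
    {"en": "Hello.", "de": "Hallo."},
    {"en": "Good morning.", "fr": "Bonjour.", "es": "Buenos días."},
    {"en": "Thanks.", "de": "Danke.", "fr": "Merci.", "it": "Grazie."},
    {"en": "Goodbye.", "es": "Adiós."},
]


def multiway_share(tuples, min_languages=3):
    """Fraction of all sentences that sit in tuples with at least
    `min_languages` languages (hypothetical helper)."""
    total = sum(len(t) for t in tuples)
    if total == 0:
        return 0.0
    multiway = sum(len(t) for t in tuples if len(t) >= min_languages)
    return multiway / total


share = multiway_share(toy_tuples)
print(f"{share:.1%} of sentences are in multi-way parallel (3+ languages) tuples")
```

Note that the denominator here is only the sentences that ended up in translation tuples at all, which is a very different thing from "all content on the internet".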
