There isn’t any open source solution possible if AI models are beholden to copyright laws.
This is advocating for a world where only a handful of companies would be able to train AI models, and the rest of us would become their pets as we move towards an AI-driven society.
The artists and writers already lost; there is no going back. Now we see if we all win together or if only Google, OpenAI, Shutterstock, Adobe, Stack Overflow, GitHub and Reddit win, since they are the only ones with the data or able to pay for it.
The way I see it, the software can be open source, but you’d have to train it yourself.
Kind of like how you’re free to reverse engineer a console and write an open source emulator, but you can’t supply the firmware itself (e.g. scph1000.bin for the PS1) or ROMs of commercial games.
The pretrained part is just someone running the software on their dataset for you. You are free to do the same yourself, and getting the data for the training set legally is an exercise for you. Is it affordable for most people? Not really, because you need gargantuan amounts of data and compute power. But the software itself is yours to modify and run. I see that as a sign that the technology is a dead end in the long run: the models are not getting much better, but they are becoming much larger and much less feasible to train.
I’ve thought about this regarding code as well. An AI is nothing without a training data set, so if someone uses licensed code to train an AI, they should definitely be bound by the license. For example, if an AI is trained using copyleft-licensed code, the resulting model should also be regarded as copyleft bound. As of now, I suspect this is to a very large degree being ignored.
Sure, but that particular horse has left the barn. There will be cases where identification is easy(-ier) but as shown in Oracle v Google, there are only so many ways to express ideas in code.
For example, I just asked Claude 2 “Write a program in C to count from 1 to some arbitrary number specified on the command line.” Can you tell me the origin of this line from the result?
for(int i=1; i<=n; i++) {
I mean, if it’s from a copyrighted work, I certainly don’t want to use it in an open-source project!
EDIT: Guessing there’s a bug in HTML entity handling.
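For context, here is a minimal sketch of the kind of program that prompt describes. It is not the output the model actually produced; the names `parse_limit` and `run` are made up for illustration, and the input checking is one choice among many. The point is that the counting loop comes out nearly identical to the quoted line no matter who writes it.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse a positive decimal integer.
   Returns 0 on success, -1 on malformed or non-positive input. */
static int parse_limit(const char *text, long *limit)
{
    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0' || value < 1) {
        return -1;
    }
    *limit = value;
    return 0;
}

/* Hypothetical program body: counts from 1 to argv[1], one number per line.
   A real main() would simply return run(argc, argv, stdout). */
int run(int argc, char *argv[], FILE *out)
{
    long n;
    if (argc != 2 || parse_limit(argv[1], &n) != 0) {
        fprintf(stderr, "usage: %s <positive integer>\n",
                argc > 0 ? argv[0] : "count");
        return 1;
    }
    /* The task leaves very little room to write this loop differently. */
    for (long i = 1; i <= n; i++) {
        fprintf(out, "%ld\n", i);
    }
    return 0;
}
```

Wiring it up takes a one-line `main` that forwards its arguments to `run`.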
Of course, once the AI is trained, you can’t look at some arbitrary output and determine whether that specific output resulted from some specific training data set. In principle, if some of your training data is found to violate copyrights, you either have to compensate the copyright holder or re-train the model without that data set.
Finding out whether a copyrighted work is part of the training data is a matter of going through it, and should be the responsibility of the people training the model. I would like to see a case where it has been shown that a copyrighted dataset has been used to train a model, and those violating the copyright by doing so are held responsible.
I agree that under the current system of “idea ownership” someone needs to be held responsible, but in my opinion it’s ultimately a futile action. The moment that arbitrary individuals are allowed to download these models and use them independently (Hugging Face, et al.), all control of whatever is in the model is lost. Shutting down OpenAI or Anthropic doesn’t remove the models from people’s computers, and doesn’t eliminate the knowledge of how to train them.
I have a gut feeling this is going to change the face of copyright, and it’s going to be painful. We collectively weren’t ready.
It’s not over and done with. Pass regulation saying every AI accessible w/in the country has to have a publicly available dataset. That way people can see if their works have been stolen or not. When we inevitably see works recreated wholesale without proper copyright, the AI creators can be sued or fined.
Couple of things here - what do you do with the open source models already published? There’s terabytes of data encapsulated in those. Some have published corpora, some don’t. How do you plan to determine that a work comes from an unregistered AI?
Also, with respect to “within the country” - VPNs exist. Tor exists. SD cards exist. What’s your plan to control the flow of trained models without violating civil rights?
This is a teflon slope covered in oil. (IMO)
If they don’t publish what their training data is, they should be considered violating copyright. The world governments can block sites if they want. It’s hard to swat down all of the random wikis and such but major AI competitors wouldn’t be a big problem.
“Innocent until proven guilty” is a rather important foundation for most justice systems. You’re proposing the exact opposite.
“That way people can see if their works have been stolen or not.”
Firstly, nothing at all is being “stolen.” The words you’re looking for are “copyright violation.”
Secondly, it does not currently appear that training an AI model on published material is a copyright violation. You’re going to have to point to some actual law indicating that. Currently that sort of thing is generally covered by fair use.
See:
GitHub Copilot controversy
I’m absolutely on the side of the artists here, but I do wonder if the AI company’s defense will be that the software is no different than another artist drawing inspiration from earlier works. Every art student studies the masters and has assignments to produce works in their style, and current artists have absolutely been influenced by contemporaries. No one evolves their creative style in a vacuum: that’s impossible, short of living on a deserted island.
But this is a fundamentally different problem since the AI can produce millions of tailored works quickly, replacing vast numbers of creatives, threatening their livelihood. That’s not as much of a concern with one-off artists creating things similar to something they saw earlier (although the individual concept may be the same).
This is going to be a really interesting legal case.
What AI does lands more on the “tracing” side than the “referencing” side, though.
My experience with image AI gave me almost the exact opposite feeling, more like it somehow pinpoints important aspects of a certain style or artist and then it can just jam with that limitlessly (DALL-E in this case). How did you find it closer to tracing? Did you play around with any of the image AIs?
That’s not true at all. AI uses latent noise as a medium to draw images; there’s nothing left of the original image in its dataset.
Legitimately, it’s like these people have no understanding of the actual technology.
The other response you’ve received talked about a very small subset of overtrained images, and it makes sense that those can be replicated. Anyone who trained on creating a specific image a million times would be able to replicate that image easily. Even then it takes a lot of luck and effort to accurately replicate the exact image to any degree.
If you are not specifically trying to recreate an overly popular image, then there is practically no element left from any particular image that you can consider represented to any thieving extent.
Considering that it is effectively acting on a pareidolia interpretation of static represented by countless possible prompt and setting combinations, the copyright issue should only really be relevant when people use the tool specifically trying to recreate a particular work. Literally any other paint program would be more effective for that style of theft.
As an artist, in regards to the pareidolia aspect, I do virtually the same thing when illustrating an image. Disney/Warner can already afford as many peasants to learn or recreate whatever styles they want. I can’t afford a team of lackeys. I can however use an open source diffusion model to create entirely unique and personally tailored and designed illustrations that suit my artistic objective.
The existing concept of copyright does not work for this scenario, and if people should argue anything, it should be that wealthy businesses specifically should face much more restriction and responsibility in their use of these tools and in their excessive control of the artistic market.
I’m personally excited for a future where peasant artists can also create complex beautiful works using these tools.
Think about ending up with a holodeck level of personal creative freedom, and being able to create things in that experience that you can share with others.
The current system already robs and suppresses actual art.
Just like every other aggressive reaction to AI, the focus is misdirected and not actually helpful for anyone in any way.
There’s usually nothing left of the original image. But sometimes a specific image pops up in the dataset more often and gets overtrained, which is why you can get a pretty close copy of the Starry Night from vanilla SD. But yeah, it’s not tracing.
Those instances are considered a flaw and trainers work hard to prevent them. When they do occur you have to know they’re in there in order to dredge them back out.
Yes, at best the AI works would still be infringing derivative works. If a human made that art and tried to make money off it, courts would almost assuredly say it lacked “sufficient transformative creative effort” to allow it to be copyrighted itself or protect it from being considered an infringement. There’s a big difference between “inspired by” and “trying to copy”.
Further, if all these works were being used for non-commercial purposes, like, just to print and hang up in their homes or something, it would still suck for artists (because they would lose the individual end-sale market) but it wouldn’t be nearly as harmful. The big problem is that people and corporations are currently trying to use AI art to sidestep paying creatives for their work and then using that AI generation for commercial purposes or to loophole the art out of things like Patreon. It’s a deliberate attempt to deprive hardworking creatives of the money they are due for their work.
I would argue that it is not the work produced by the AI, but the trained model itself, which infringes on copyright.
The model cannot be regarded as an artist, but as a product, commercial or otherwise, that has been created by stealing copyrighted work.
Looking at it a bit simplified, ask the AI to produce a number of pictures and videos in the style of Disney and you and the AI builder will get slammed by a lawsuit. Copyright still matters if you’re big enough.
I’m sure they can also develop an AI to analyze the similarities between works and pay a small amount of royalties to the author(s) based on the ratio of that similarity above a certain cutoff, but before that happens someone big enough needs to sue first.
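To make the royalty idea concrete, here is a minimal sketch in C. Everything in it is an assumption for illustration: the function name `split_royalties`, the premise that some separate system has already produced one similarity score per source work, and the rule that only works scoring above the cutoff share a fixed pool in proportion to their scores. It does not describe how any existing system works.

```c
#include <stddef.h>

/* Hypothetical royalty split.
   similarity[i] is an assumed similarity score for source work i.
   Works with a score above `cutoff` share `pool` in proportion to
   their scores; everything else is paid nothing.
   Results are written to payout[0..n-1]. */
void split_royalties(const double *similarity, size_t n,
                     double cutoff, double pool, double *payout)
{
    double qualifying_total = 0.0;

    /* First pass: add up the scores that clear the cutoff. */
    for (size_t i = 0; i < n; i++) {
        if (similarity[i] > cutoff) {
            qualifying_total += similarity[i];
        }
    }

    /* Second pass: pay each qualifying work its share of the pool. */
    for (size_t i = 0; i < n; i++) {
        if (similarity[i] > cutoff && qualifying_total > 0.0) {
            payout[i] = pool * similarity[i] / qualifying_total;
        } else {
            payout[i] = 0.0;
        }
    }
}
```

With scores of 0.5, 0.25 and 0.1, a cutoff of 0.2 and a pool of 3.0, the first two works would split the pool two to one and the third would get nothing. Where the scores come from and who sets the cutoff are the hard parts, and this sketch says nothing about either.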
Style cannot be copyrighted.