Copyright in the Era of Generative Artificial Intelligence
The first in a series of articles about machine learning and content monopolies
Hello again! I’ve been super busy working on two new projects that are gaining momentum. I plan to write about them in the next month or so. In the meantime, I want to share with you what I’ve learned in the past two months about copyright and artificial intelligence. The collision between the generative AI companies and the copyright industry provides a lens that brings several complex issues into focus. In this series of articles, I will share with you what I’ve learned about the history of copyright and how it pertains to this transformative moment that will shape our collective future. Ten articles will follow this first one. Please let me know if you find this information useful. Thanks! — RT
Generative AI Companies Are On A Collision Course with Copyright Owners
Generative artificial intelligence (GAI) systems perform what seems like a magic trick: in seconds, these systems manage to conjure up fresh paragraphs, images, and even music or animation, guided only by a few lines of text written by human users.
But this trick is not magic. To some people, it’s more like strip mining without a permit.
The large language models (LLMs) and other large models that provide the foundation for GAI systems have been “pre-trained” by processing the work of hundreds of millions of human artists, authors, and creators. Typically, this training occurs without credit, compensation, or consent from the original human authors.
To put this into perspective, let’s note that it is not unusual for a company to build an entirely new business by making use of someone else’s intellectual property. For decades, for example, TV networks were formed by aggregating TV shows that were owned by many other companies and presenting them to the public as a new bundle of entertainment value. This kind of creative repackaging and reuse requires permission from the owner of the original work, typically in the form of a licensing agreement that shares revenue with the owner.
No normal company would make use of copyrighted intellectual property without permission or payment.
In this respect, generative AI companies differ sharply from broadcasting and, frankly, all other businesses that bundle or repackage media.
What’s unusual about generative AI is that the original content of millions of creators was used without permission to establish a new for-profit business that shares no revenue with the authors of the original work.
The current generation of LLMs and similar models was trained on data without even an attempt to license the source content or obtain permission to use it. Some AI companies are, unsurprisingly, reluctant to reveal how their systems were trained, or which sources were used. Content authors must play a guessing game to discover whether their work was used to pre-train the LLM.
The result is a generative AI that is capable of producing new works that may compete with, summarize, or even make obsolete the original works that it was trained on. The authors of the original works are unable to stop it; some may not even be aware that this is happening.
To many observers, this seems like a kind of theft, but the matter might not be so simple. The US Court of Appeals for the Ninth Circuit ruled years ago that scraping publicly available web data is legal, and the court reaffirmed this ruling last year.
The prevailing opinion in the technology industry is that there is nothing wrong with training AI systems on publicly available data, even if that data consists of the work of human authors and artists. As executives at generative AI startups routinely quip, “There’s no law against reading a book or looking at artwork.” In a subsequent article, we will investigate the logic that supports their view.
This seems unfair to creators of artistic works. More than 8,000 authors, including Suzanne Collins, Margaret Atwood, and James Patterson, have signed a letter from the Authors Guild that urges the AI companies to seek and secure permission before using copyrighted works to train AI systems. The Authors Guild recommends a new collective licensing arrangement that would compensate authors when their original works are used to train large language models.
The technology firms point out that it may be difficult to attribute an economic value to the use of the original works in this new way, especially because hundreds of millions of works are used in the training process. But there may be reason to believe that this matter can be resolved with a deal instead of a lawsuit: some AI companies have recently begun to strike licensing deals with major content providers, a signal that they are not entirely numb to the complaints from the media industry and content owners.
The intense controversy surrounding GAI systems raises questions about copyright, authorship, ownership, attribution, and fair use.
Today, the best (but far from perfect) mechanism available to mediate such disputes is the body of law that governs intellectual property. That’s why generative artificial intelligence has become a lightning rod for copyright controversy. Several infringement lawsuits have already been filed. The battle between copyright owners and AI firms will be fought in a courtroom.
Even if some of the early lawsuits get dismissed, more lawsuits are inevitable. In the US, some of these lawsuits will claim infringement under the Copyright Act of 1976, others may invoke the Digital Millennium Copyright Act of 1998 (DMCA), and still others might cite the Computer Fraud and Abuse Act (CFAA).
In recent weeks, I’ve had discussions with more than a dozen attorneys about generative AI; the consensus is that these suits represent just the first salvo in what will soon become a grinding series of courtroom battles to define precisely how copyright applies to GAI.
If you are planning to incorporate generative artificial intelligence in your digital media projects, then you will certainly benefit from cultivating an awareness of how copyright works and how the use of intellectual property is governed. That’s the purpose of this series of reports.
The Privacy / Data debate
Let’s put this controversy into the broader context of our current political-economic moment. In the past decade there has been a surge in public debate about free speech, censorship, tech monopolies, open-source software, surveillance capitalism and the collection of personal data to build online information empires.
The questions “Whose data is it?” and “Is it data or content?” and “Who has the right to use it?” lurk behind these debates.
I think that most people would say that your personal data is something that naturally belongs to you; you should therefore have the right to control your data, and you should decide which companies have your permission to use it. But the trend in digital media seems to move in the opposite direction. You don’t own your digital data. Thousands of companies mine your Internet data without notifying you or seeking your permission.
Rules and regulations, like the European General Data Protection Regulation (GDPR), have not been entirely effective at deterring companies from hoovering up your data. The 50-page Terms of Use clickthrough agreements that govern access to large web sites have been revised continuously to expand the scope of rights granted to Internet platforms. It’s not just web sites, either; now connected cars, smart appliances, and even toys are capable of eavesdropping and amassing a record of your behavior, location and connections.
An individual person’s data is just one tiny building block in someone else’s globe-spanning information empire. At the scale of billions of users, however, those tiny blocks of personal data add up to an immensely valuable strategic asset that can be leveraged to generate vast wealth; and it presents a formidable barrier to rivals and latecomers who wish to compete.
Now the digital information empires have expanded their definition of “publicly available data” to include copyrighted material. The visionary leaders of the biggest tech companies do not see artwork, music, books, and films as the valuable product of the creative labor of millions of artists: what they see instead is underutilized “data” ripe for exploitation by machines.
Semantics make a material difference. Are we talking about “content” or “original works of art” or mere “data”? The term “data” is an abstraction that obscures the intellectual property issues; more precise terminology might reinforce the perception that the works in question are valuable private property.
Data-scraping by AI companies illustrates just how difficult it is to strike the perfect balance in copyright law: does the benefit to the public of training AI systems outweigh the rights of the owners of intellectual property? Which confers more benefit on society?
Copyright law seems to touch every controversial subject in digital media today. But it is not clear that copyright will provide the kind of protection that some authors and artists demand.
News events, such as Elon Musk’s acquisition of Twitter and the soaring valuations of Internet giants, have amplified and intensified this debate. When does my content become someone else’s data? In US politics, hot topics like ownership of data assets, surveillance capitalism, IP rights and free speech have become weapons in the culture wars that shape contemporary political discourse.
Governments have stepped in to mediate the controversies surrounding AI. The US Congress is considering legislation that will regulate the use of artificial intelligence. The European Union has drafted AI legislation, and the Chinese government has already enacted such laws. Japanese and Israeli ministers have issued guidance that pertains to generative AI and copyright.
This activity is taking place in the context of increased government scrutiny of the technology sector. The US Federal Trade Commission has awakened from its Trump-era slumber with a renewed willingness to bring antitrust charges against tech giants (though we note that FTC Chairperson Lina Khan has so far failed to notch a big win in her quest against Big Tech monopolies). The European Union has imposed a series of massive fines on Internet companies that have violated EU laws that govern privacy and data. In China, homegrown tech operates under strict government oversight, and foreign Internet platforms are forbidden.
US tech industry leaders have appealed to Congress and reached out to legislators to discuss regulation of machine intelligence. Whatever action the US federal government takes to control AI will likely be contested in court and at the ballot box.
It therefore seems likely that these topics will remain at the forefront of debate about digital media and in American politics for the foreseeable future.
None of this is new
Given the hoopla about AI in the news, one might get the impression that these debates are relatively recent, or perhaps they are somehow uniquely calibrated to this moment in history.
But that’s far from the truth.
A glance at the historical record reveals that every successive wave of tech innovation has raised similar issues. Social turmoil and intense conflict have occurred during every century of technological innovation, from the invention of moveable type, to the advent of mechanical automation, to the rise of electronic broadcast media, to the introduction of Internet file sharing and the emergence of social media.
The future is the past on rewind. Or something like that. In this series of articles, we will examine the historical precedents that brought us to this juncture.
This report is intended to address concepts of copyright, authorship, and fair use within the current context of generative artificial intelligence. But first we’re going to investigate the origins of copyright to understand how we arrived at this point.
We will begin with the broad historical narrative: why do we have copyright laws anyway? What is copyright for? Where did it come from? What problem does it solve? Why do governments carve out rights for public use of private intellectual property? What is the public domain? What is fair use?
Then we will address topics that pertain to the present moment, particularly fair use in AI training, AI-generated derivative works, and infringement by generative AI systems.
Acknowledgement: While compiling this report, I solicited the perspectives of several legal professionals, including Neville Johnson and Daniel Lifschitz of Johnson & Johnson LLP, and Josh Lawler of ZuberLawler LLP. These attorneys kindly provided answers to my questions and guided me to relevant case law. I’ve also enjoyed lively conversations with Che Chang, General Counsel of OpenAI, and Amir Ghavi of Fried, Frank, Harris, Shriver, and Jacobson LLP. I met Che, Amir, and other lawyers when we participated in the “Disruption” conferences hosted by Creative Commons in Los Angeles. These events were focused on the topics of generative AI and intellectual property law.
I am grateful to these legal experts for their professional courtesy and for sharing their knowledge with me. However, I must point out that none of them have reviewed or written the information below; the errors and inaccuracies contained in this document are my responsibility alone. I welcome suggestions, corrections, clarifications, and constructive criticism from any reader.
Disclaimer: It’s necessary to emphasize that this document is not intended to provide legal advice: I am not an attorney-at-law. If you are contending with a copyright issue or have a question about the fair use of training data, then you should seek professional legal advice as necessary and appropriate. For complex subject matter like copyright and artificial intelligence, it is always best to consult an attorney.
Subsequent articles in this series will address these topics:
Part One: The historical struggles that shaped modern copyright law
1. Why do we need copyright laws?
2. How copyright emerged from the violence at the end of the Middle Ages.
3. How the church preserved order by maintaining ideological purity.
4. How the printing press expanded the information sphere and broke the church’s grip.
5. How new book formats gave expression to new ideas.
6. How information markets emerged and accelerated cultural production.
7. The backlash included censorship and suppression.
8. How Martin Luther used the printing press to advance the Reformation.
9. How propaganda turned ideas into weapons in an ideological conflict.
10. How the proliferation of new ideas divided the populace.
11. How the first scientists exchanged ideas via printed work.
12. How standardization enabled the birth of the modern era.
13. How printing became a dangerous profession in the 1500s.
14. How secular governments displaced the Church as guardians of ideological purity.
15. Piracy and economic risk.
16. Forerunners of Copyright: monopoly, patents, privilege, cartels, and censorship.
17. How royal letters of patent led to copyright in England.
18. 1710: the birth of modern copyright.
19. The History of Copyright in America.
20. Until 1891, The USA was the World’s Greatest Pirate Nation.
21. The Campaign by US Media Companies to Restore the Perpetual Monopoly.
22. Abuse of Copyright Law today.
Part Two: Modern definitions of US copyright
23. What is infringement?
24. What is derivative work?
25. What is the public domain?
26. What is fair use?
27. What is transformative use?
Part Three: Q&A about Copyright and Generative Artificial Intelligence
28. Will I infringe if I make a copy of someone else’s work?
29. Will I infringe if I study another creator’s work and get inspired to make my own original work?
30. Will I infringe if I am inspired to make new work in the style of another artist?
31. Will I infringe if I study the entire catalog of works by a particular artist and then make a series of paintings that resemble that artist’s work?
32. How does this pertain to generative AI?
33. If I train an AI on the work created by a particular artist, and then that AI generates new work in that artist’s style, have I infringed?
34. What if the AI system generates work “in the style” of a particular artist?
35. What about the company that developed the large language model and the generative AI app? Are they guilty of “contributory infringement” or “vicarious infringement”?
36. If the large language model creates a copy of a work during the training phase, isn’t that infringement?
37. What about the scale of learning? The LLMs are trained on vast numbers of original works by writers, artists, and other creative people. Does scale make a difference?
38. Who is the author of a copyrighted work? Who really owns the copyright?
39. Can a work generated by artificial intelligence be copyrighted?
Okay, that’s it for the first article in the series “Copyright in the Era of Artificial Intelligence.” The next article will focus on what the world was like before we had copyright laws, and how the printing press created a wave of social and cultural transformation that can teach us something about what to expect in the era of machine intelligence. If you missed my series of articles about AI and the WGA strike, you can find them here.