The Truth about AI and copyright that nobody will say out loud

The stories we tell about copyright won’t survive contact with national interest

"You wouldn't train an LLM" meme callback to the “You wouldn’t steal a car” music industry TV ad of the 2000s


A work of analysis

This isn’t a neutral piece — it’s an analytical one. But the analysis lands hard on some deeply held assumptions in the content world. It takes a position — emphatically not against creators, but against a set of myths about copyright and control. What follows is an argument grounded in facts, not feelings.

"AS IS" Dealer Warranty Disclaimer

The argument over AI training and its relationship to copyright is in an uneasy period, before any major court decisions or new laws have arrived. Artists and writers feel extracted from, not partnered with. On Bluesky you’ll find the conversation saturated with hostility toward AI and the community around it, as if the tech itself were immoral. This has more than a bit to do with the reflex of online communities to sort themselves quickly over new national issues, even those with no natural political orientation.

We need to revisit the legal, philosophical, and economic foundations that led to our current system of copyright and intellectual property. There has never before been a technology that required the combined knowledge of humanity to create it. In that sense AI is completely unique in human history, and presuming answers to the AI copyright question from the assumptions of our current legal framework doesn’t work.

This piece isn’t a defense of Big AI or an attack on creators. It’s a walk through how our current copyright balance was struck, how powerful interests use rhetorical sleight-of-hand to shape the discourse, and what will break as we move toward the next one.

Let’s begin by looking at what the fight is over: How AI actually makes use of copyrighted data.

Some basic facts about AI model training

AI models are trained by exposing them to massive datasets and giving them feedback on their outputs that nudges them toward better ones. Ten years ago, this kind of training required datasets to be “labeled”: every training sample annotated with the desired “answer”. This is very limiting, in that the cost of generating labels (via human annotation) quickly becomes infeasible for increasingly large datasets. And so the research community developed “unsupervised” training methods that don’t require labels.

LLMs are trained based on such an unsupervised method: next-token1 prediction (think of it as word prediction). You may have heard of this in a dismissive way: that “all LLMs are doing” is predicting the next token, that LLMs are just “stochastic parrots”. Don’t believe it. Emergent behavior is a thing, in the growth of snowflakes and in AI models.

The point is that finding a way to train LLMs without labels led to the ability to use much larger datasets — datasets that contain, roughly, the entire web.
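To make the mechanism concrete, here is a toy sketch in plain Python of how next-token prediction turns raw, unlabeled text into supervised training pairs. The whitespace “tokenizer” and the word-level tokens are simplifications for illustration; real systems use subword tokenizers and learned models.

```python
def make_training_pairs(text: str):
    """Turn raw text into (prefix, next-token) training pairs.

    No human labeling is needed: at every position, the text itself
    supplies the "answer" (the token that actually comes next).
    """
    tokens = text.split()  # toy tokenization: one token per word
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Every word in the corpus becomes a free training label:
pairs = make_training_pairs("the cat sat on the mat")
for prefix, target in pairs:
    print(prefix, "->", target)
# e.g. ['the'] -> cat, then ['the', 'cat'] -> sat, and so on
```

Because the labels come for free from the text itself, the training corpus can scale to roughly the entire web, which is exactly what made modern LLMs possible.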

“AIs just memorize all their training data”

AI models are sometimes prone to "memorization", a phenomenon in which a model learns its training data by rote without generalizing to new data. This is considered very bad from a model performance perspective, and researchers have worked hard, and largely successfully, to prevent it.

The goal of AI model training is emphatically not to memorize the training data; it is to learn generalized representations of the knowledge that data contains. You could also call those “generalized representations” concepts.
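The contrast can be sketched in a few lines. This is purely illustrative: a real model’s memorization is statistical overfitting, not a literal lookup table, and its generalization is a learned statistical representation, not a hand-written rule. The “pluralize” task here is a stand-in.

```python
# Toy contrast between memorization and generalization.
training_data = {"cat": "cats", "dog": "dogs", "bird": "birds"}

def memorizer(word: str) -> str:
    # Pure memorization: perfect on the training data,
    # useless on anything it has never seen.
    return training_data.get(word, "???")

def generalizer(word: str) -> str:
    # A crude "concept" extracted from the data: add -s to pluralize.
    # It handles unseen words the memorizer cannot.
    return word + "s"

print(memorizer("cat"), generalizer("cat"))    # both handle seen data
print(memorizer("tree"), generalizer("tree"))  # only the generalizer copes
```

The generalizer has not stored any of its training examples, yet it reproduces them, because it captured the underlying pattern rather than the data itself. That is the intended outcome of model training.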

When people learn, this is also the goal. Education once placed a heavy emphasis on memorization, but as technology supplemented our ability to "recall" information about the world, via the printing press and later the Internet, we’ve steadily deemphasized rote memorization as a goal of learning (thankfully).

From early on, the notion of copyright was defined to protect the expression of an idea, but not the idea itself. A U.S. Supreme Court case, Baker v. Selden, 101 U.S. 99 (1879), established this "idea-expression dichotomy". “Fair use” then further narrows the scope of protection for the expression.

Reading, listening or watching a work, no matter how many times or how intently, as a person does during study or recreation, doesn't infringe an author's copyright. Even if by studying one filmmaker’s peasants, you’re inspired to create another expression of them as droids.

Kurosawa’s peasants and Lucas’ droids

The process of training an AI model on a copyrighted work is clearly a form of study intended to learn concepts. As a simple factual statement, successful model training does not enable the model to produce a copy of the expression found in its training data.

Natural law

It's easy to understand why humans developed a prohibition on the theft of physical items, from first principles. Taking an item from its holder is a crime because it harms the holder through the loss of value and use of the item. Taking an item is a clearly defined action. In this sense, the development of the idea of property is natural.

The taking of an idea is not so well defined. The creator of the idea still possesses the idea and, assuming it's a good idea, society will usually benefit in the immediate sense from more people having knowledge of it. “He who receives an idea from me, receives instruction himself without lessening mine; as he who lights his taper at mine, receives light without darkening me.”2

But there’s a cost to creating ideas, and if the creators of ideas have no opportunity to enjoy the benefits of their work, they may stop creating them: a classic public-goods problem.

But ideas are so different in the consequences of their taking that calling the taking of an idea theft is a category error. The very existence of the public domain shows this: under our law, copying an expression on a Tuesday before its copyright expires is an infringement, while the same copying on the following Friday is not even cause for a civil action. This is a very unusual kind of offense. The resulting harms (or not), and benefits (or not), to society are much less clear-cut.

The original rationale for copyrights and patents

The U.S. Constitution's Copyright Clause provides a rationale for Congress’ power to grant copyright and patent rights for a limited period: "To promote the Progress of Science and useful Arts".

The framers weren’t the originators of the idea of intellectual property. The Statute of Monopolies (1624, England) established a statutory framework for limited patents. Readers of the statute today might be surprised by its formulation. It reads as an anti-corruption and anti-monopoly act, focused on banning a large set of monopolies already granted as gifts by the crown through letters patent:

all Monopolies, and all Commissions, Grants, Licences, Charters and Letters Patents heretofore made or granted, or hereafter to be made or granted [...] are altogether contrary to the Laws of this Realm, and so are and shall be utterly void and of none Effect, and in no wise to be put in Use or Execution.

The statute then defines some very narrow exceptions to the ban just described, limiting grants of patents only to "true and first inventors", for a maximum of 14 years.

Copyright statute followed with the Statute of Anne (1710, England), which put the point right in the title: “An Act for the Encouragement of Learning, by vesting the Copies of Printed Books in the Authors or purchasers of such Copies, during the Times therein mentioned”.

“For the Encouragement of Learning”… “To promote the Progress of Science and useful Arts.”

Intellectual property since the internet

The last 25 years of history between tech and the content industries have not been helpful to AI developers. From the release of the file-sharing app Napster3, tech became enemy #1 for the content industry, with the threat from Napster and successive technologies passing step by step from audio, to text, to video.

It might surprise zoomers, but in 1999 when Napster was released, the content industries were incredibly powerful in the U.S., with a record of getting what they wanted from Congress. And tech was very much not a powerful constituency. Google’s hiring of its first lobbyist in 2006 was big news.4

The content industry tried very hard to position every use of computer or Internet technology for media distribution as theft, whether or not there was any intellectual property violation. Their framing was that the mere capability to share media over the Internet was criminal, with unsigned artists and public-domain material deemed irrelevant.

The industry’s goal was to preserve their cartel position as distributors of music and other content. The message transformed into permanent self-satire following the 2004 "You Wouldn't Steal a Car" TV ad, spawning memes skewering IP lawsuits and overreach to this day.

Meme with the text "You wouldn't reimplement an API"

Meme inspired by “You wouldn’t steal a car” during the Google-Oracle case over Java APIs

The text "You wouldn't screenshot an NFT" superimposed on police car lights

Meme inspired by “You wouldn’t steal a car” during the initial NFT crypto craze over owning GIFs

But the distribution cartels faced an impossible task. They kept their content rights, but they couldn’t keep their distribution monopoly-driven control.

The distinction between rights over content and control over content is one the content industry always tried to muddy. By controlling distribution, they enforced scarcity and commoditized their supply (the artists). But Internet distribution made it possible for any creator to publish their own content, fracturing the artificial scarcity and freeing creators from the cartel distributors.

Last gasps of the content distribution cartels

The content business' war on tech has always been about trying to contain a new distribution mechanism that opened a new channel to market for creators, rather than about recovering actual damages from the distribution of their content libraries.

No other industry enjoys statutory penalties so vastly disproportionate to actual damages as copyright law does. In Capitol Records, Inc. v. Thomas-Rasset, damages of $1.92 million were awarded for making 24 songs available on a file sharing network.

Creators inside the publisher system didn’t see much benefit in opening distribution to mere amateurs. Andrew Keen wrote in his 2007 book The Cult of the Amateur: How Today's Internet Is Killing Our Culture that:

“By stealing away our eyeballs, the blogs and wikis are decimating the publishing, music and news-gathering industries that created the original content those Web sites 'aggregate'. Our culture is essentially cannibalizing its young, destroying the very sources of the content they crave.”

It's difficult today to overstate just how wrong this idea was: The breaking of the distribution monopolies unleashed enormous benefits to both artists and consumers.

Creators like Ben Thompson, Curt Jaimungal, and Marques Brownlee entered the field as “amateurs”. Each produces better material than ever came out of the gatekeeper-controlled pre-Internet media — and makes much more money doing it — by being masters of their own distribution rather than being “signed”. The Internet multiplied the amount of content available, at the absolute top and bottom of the quality scale, and put Talent in control.

The fight was never about piracy. It was about preserving scarcity in a world where scarcity was dying.

Oughts and ises

David Hume first showed that you can’t derive an ought from an is. The arguments put forth by content distributors over intellectual property have always played fast and loose with claims that silently transform ises into oughts. So it’s important that we trace our reasoning and don’t fall prey to muddy arguments when examining our ideas about intellectual property.

The balance struck between the interests of creators and those of society worked. It may have been unfair in certain respects, but on the whole it worked. And where it got things wrong, society's loss has been of the kind that doesn't rise to impacting the fates of countries or deciding people’s lifespans. But that balance was struck in an era that’s nearly gone, and in our new era the balance we end up with will impact things like geopolitics and lifespans.

The argument over how to decide the fate of companies training AI models on copyrighted material can be had inside the existing bounds of copyright law (the ises), or outside the existing bounds (the oughts). When you’re forming your own views, recognize that claims of oughts that point to the existing structure of intellectual property are playing a rhetorical trick. And so you need to examine the oughts that led to the current system, and consider how you would strike a new balance with the benefits, costs, and requirements of AI factored in.

The part that nobody will say out loud

There’s just one problem: none of that matters.

Copyright will not be allowed to block the development of AI.

Because the ultimate decision won’t be made by courts, or creators, or companies. It will be made by states, with stakes that go beyond royalties on Disney films.

As I said earlier, AI is the first technology that needs the combined knowledge of humanity as an input to make it work. Critics will say that models created from that knowledge are just extractive playthings, but this is dangerously wrong.

What AI actually is, is a technological substrate, the way chemistry or electricity was. As we’ve discussed in past issues, a direct way to internalize what this means is to see that AI will make intelligence zero marginal cost, and plentiful.

The balance between society and many parochial interests, not just the content industry, will be turned upside down. And no interest group will be granted the right to elevate its interests above society’s future. Whatever courts may decide, whatever laws may be (temporarily) passed, the national interest is now so firmly bound to AI that it can’t be any other way.

The fact is that the potential of AI can’t be realized without the use of datasets containing the aggregate knowledge of humanity. And the potential of AGI is the potential of the human species.

AGI is now a matter of national interest — and national interest doesn’t ask permission.


1  If you’re unfamiliar with tokens, think of them as words optimized for the AI model.

2  This passage has come to be known as “Jefferson’s Taper”. It comes from Thomas Jefferson’s 1813 letter to Isaac McPherson.

3  Napster was the first successful music filesharing app.

4  "They are brilliant engineers," said Lauren Maddox, a principal in the bipartisan lobbying firm Podesta Mattoon that was hired by Google last year. "They are not politicians."
New York Times: Google Joins the Lobbying Herd (archive link: here)