Maybe you don’t mind if GitHub Copi­lot used your open-source code with­out ask­ing.
But how will you feel if Copi­lot erases your open-source com­mu­nity?

Hello. This is Matthew Butterick. I’m a writer, designer, pro­gram­mer, and law­yer. I’ve writ­ten two books on typog­ra­phy—Prac­ti­cal Typog­ra­phy and Typog­ra­phy for Lawyers—and designed the fonts in the MB Type library, includ­ing Equity, Con­course, and Trip­li­cate.

As a pro­gram­mer, I’ve been pro­fes­sion­ally involved with open-source soft­ware since 1998, includ­ing two years at Red Hat. More recently I’ve been a con­trib­u­tor to Racket. I wrote the Lisp-advo­cacy essay Why Racket? Why Lisp? and Beau­ti­ful Racket, a book about mak­ing pro­gram­ming lan­guages. I’ve released plenty of open-source soft­ware, includ­ing Pollen, which I use to pub­lish my online books, and even AI soft­ware that I use in my work.

In June 2022, I wrote about the legal prob­lems with GitHub Copi­lot, in par­tic­u­lar its mis­han­dling of open-source licenses. Recently, I took the next step: I reac­ti­vated my Cal­i­for­nia bar mem­ber­ship to team up with the amaz­ingly excel­lent class-action lit­i­ga­tors Joseph Saveri, Cadio Zir­poli, and Travis Man­fredi at the Joseph Saveri Law Firm on a new project—

We’re inves­ti­gat­ing a poten­tial law­suit against GitHub Copi­lot for vio­lat­ing its legal duties to open-source authors and end users.
We want to hear from you. Click here to help with the inves­ti­ga­tion.
Or read on.

This web page is infor­ma­tional. Gen­eral prin­ci­ples of law are dis­cussed. But nei­ther Matthew Butterick nor any­one at the Joseph Saveri Law Firm is your law­yer, and noth­ing here is offered as legal advice. Ref­er­ences to copy­right per­tain to US law. This page will be updated as new infor­ma­tion becomes avail­able.

What is GitHub Copi­lot?

GitHub Copi­lot is a prod­uct released by Microsoft in June 2022 after a year­long tech­ni­cal pre­view. Copi­lot is a plu­gin for Visual Stu­dio and other IDEs that pro­duces what Microsoft calls “sug­ges­tions” based on what you type into the edi­tor.

What makes Copi­lot dif­fer­ent from tra­di­tional auto­com­plete? Copi­lot is pow­ered by Codex, an AI sys­tem cre­ated by OpenAI and licensed to Microsoft. (Though Microsoft has also been called “the unof­fi­cial owner of OpenAI”.) Copi­lot offers sug­ges­tions based on text prompts typed by the user. Copi­lot can be used for small sug­ges­tions—say, to the end of a line—but Microsoft has empha­sized Copi­lot’s abil­ity to sug­gest larger blocks of code, like the entire body of a func­tion. (I demon­strated Copi­lot in an ear­lier piece called This copi­lot is stu­pid and wants to kill me.)

But how was Codex, the under­ly­ing AI sys­tem, trained? Accord­ing to OpenAI, Codex was trained on “tens of mil­lions of pub­lic repos­i­to­ries” includ­ing code on GitHub. Microsoft itself has vaguely described the train­ing mate­r­ial as “bil­lions of lines of pub­lic code”. But Copi­lot researcher Eddie Aftandil­ian con­firmed in a recent pod­cast (@ 36:40) that Copi­lot is “train[ed] on pub­lic repos on GitHub”.

What’s wrong with Copi­lot?

What we know about Copi­lot raises legal ques­tions relat­ing to both the train­ing of the sys­tem and the use of the sys­tem.

On the train­ing of the sys­tem

The vast major­ity of open-source soft­ware pack­ages are released under licenses that grant users cer­tain rights and impose cer­tain oblig­a­tions (e.g., pre­serv­ing accu­rate attri­bu­tion of the source code). These licenses are made pos­si­ble legally by soft­ware authors assert­ing their copy­right in their code.

Thus, those who wish to use open-source soft­ware have a choice. They must either:

  1. com­ply with the oblig­a­tions imposed by the license, or

  2. use the code sub­ject to a license excep­tion—e.g., fair use under copy­right law.

Microsoft and OpenAI have con­ceded that Copi­lot & Codex are trained on open-source soft­ware in pub­lic repos on GitHub. So which choice did they make?

If Microsoft and OpenAI chose to use these repos sub­ject to their respec­tive open-source licenses, Microsoft and OpenAI would’ve needed to pub­lish a lot of attri­bu­tions, because this is a min­i­mal require­ment of pretty much every open-source license. Yet no attri­bu­tions are appar­ent.

There­fore, Microsoft and OpenAI must be rely­ing on a fair-use argu­ment. In fact we know this is so, because for­mer GitHub CEO Nat Fried­man claimed dur­ing the Copi­lot tech­ni­cal pre­view that “train­ing [machine-learn­ing] sys­tems on pub­lic data is fair use”.

Well—is it? The answer isn’t a mat­ter of opin­ion; it’s a mat­ter of law. Nat­u­rally, Microsoft, OpenAI, and other researchers have been pro­mot­ing the fair-use argu­ment. Nat Fried­man fur­ther asserted that there is “jurispru­dence” on fair use that is “broadly relied upon by the machine[-]learn­ing com­mu­nity”. But Soft­ware Free­dom Con­ser­vancy dis­agreed, and pressed Microsoft for evi­dence to sup­port its posi­tion. Accord­ing to SFC direc­tor Bradley Kuhn—

[W]e inquired pri­vately with Fried­man and other Microsoft and GitHub rep­re­sen­ta­tives in June 2021, ask­ing for solid legal ref­er­ences for GitHub’s pub­lic legal posi­tions … They pro­vided none.

Why couldn’t Microsoft pro­duce any legal author­ity for its posi­tion? Because SFC is cor­rect: there isn’t any. Though some courts have con­sid­ered related issues, there is no US case squarely resolv­ing the fair-use ram­i­fi­ca­tions of AI train­ing.

Furthermore, cases that turn on fair use balance multiple factors. Even if a court ultimately rules that certain kinds of AI training are fair use, which seems possible, it may also rule out others. As of today, we have no idea where Copilot or Codex sits on that spectrum. Neither do Microsoft and OpenAI.

On the use of the sys­tem

We can’t yet say how fair use will end up being applied to AI train­ing. But we know that find­ing won’t affect Copi­lot users at all. Why? Because they’re just using Copi­lot to emit code. So what’s the copy­right and licens­ing sta­tus of that emit­ted code?

Here again we find Microsoft get­ting hand­wavy. In 2021, Nat Fried­man claimed that Copi­lot’s “out­put belongs to the oper­a­tor, just like with a com­piler.” But this is a mis­chie­vous anal­ogy, because Copi­lot lays new traps for the unwary.

Microsoft characterizes the output of Copilot as a series of code “suggestions”. Microsoft “does not claim any rights” in these suggestions. But neither does Microsoft make any guarantees about the correctness, security, or attendant intellectual-property entanglements of the code so produced. Once you accept a Copilot suggestion, all that becomes your problem:

“You are respon­si­ble for ensur­ing the secu­rity and qual­ity of your code. We rec­om­mend you take the same pre­cau­tions when using code gen­er­ated by GitHub Copi­lot that you would when using any code you didn’t write your­self. These pre­cau­tions include rig­or­ous test­ing, IP [(= intel­lec­tual prop­erty)] scan­ning, and track­ing for secu­rity vul­ner­a­bil­i­ties.”

What entan­gle­ments might arise? Copi­lot users—here’s one exam­ple, and another—have shown that Copi­lot can be induced to emit ver­ba­tim code from iden­ti­fi­able repos­i­to­ries. Just this week, Texas A&M pro­fes­sor Tim Davis gave numer­ous exam­ples of large chunks of his code being copied ver­ba­tim by Copi­lot, includ­ing when he prompted Copi­lot with the com­ment /* sparse matrix trans­pose in the style of Tim Davis */.

Use of this code plainly cre­ates an oblig­a­tion to com­ply with its license. But as a side effect of Copi­lot’s design, infor­ma­tion about the code’s ori­gin—author, license, etc.—is stripped away. How can Copi­lot users com­ply with the license if they don’t even know it exists?

Copi­lot’s whizzy code-retrieval meth­ods are a smoke­screen intended to con­ceal a grubby truth: Copi­lot is merely a con­ve­nient alter­na­tive inter­face to a large cor­pus of open-source code. There­fore, Copi­lot users may incur licens­ing oblig­a­tions to the authors of the under­ly­ing code. Against that back­drop, Nat Fried­man’s claim that Copi­lot oper­ates “just like … a com­piler” is rather dubi­ous—com­pil­ers change the form of code, but they don’t inject new intel­lec­tual-prop­erty entan­gle­ments. To be fair, Microsoft doesn’t really dis­pute this. They just bury it in the fine print.

What does Copi­lot mean for open-source com­mu­ni­ties?

By offer­ing Copi­lot as an alter­na­tive inter­face to a large body of open-source code, Microsoft is doing more than sev­er­ing the legal rela­tion­ship between open-source authors and users. Arguably, Microsoft is cre­at­ing a new walled gar­den that will inhibit pro­gram­mers from dis­cov­er­ing tra­di­tional open-source com­mu­ni­ties. Or at the very least, remove any incen­tive to do so. Over time, this process will starve these com­mu­ni­ties. User atten­tion and engage­ment will be shifted into the walled gar­den of Copi­lot and away from the open-source projects them­selves—away from their source repos, their issue track­ers, their mail­ing lists, their dis­cus­sion boards. This shift in energy will be a painful, per­ma­nent loss to open source.

Don’t take my word for it. Microsoft cloud-com­put­ing exec­u­tive Scott Guthrie recently admit­ted that despite Microsoft CEO Satya Nadella’s rosy pledge at the time of the GitHub acqui­si­tion that “GitHub will remain an open plat­form”, Microsoft has been nudg­ing more GitHub ser­vices—includ­ing Copi­lot—onto its Azure cloud plat­form.

Obvi­ously, open-source devel­op­ers—me included—don’t do it for the money, because no money changes hands. But we don’t do it for noth­ing, either. A big ben­e­fit of releas­ing open-source soft­ware is the peo­ple: the com­mu­nity of users, testers, and con­trib­u­tors that coa­lesces around our work. Our com­mu­ni­ties help us make our soft­ware bet­ter in ways we couldn’t on our own. This makes the work fun and col­lab­o­ra­tive in ways it wouldn’t be oth­er­wise.

Copi­lot intro­duces what we might call a more self­ish inter­face to open-source soft­ware: just give me what I want! With Copi­lot, open-source users never have to know who made their soft­ware. They never have to inter­act with a com­mu­nity. They never have to con­tribute.

Mean­while, we open-source authors have to watch as our work is stashed in a big code library in the sky called Copi­lot. The user feed­back & con­tri­bu­tions we were get­ting? Soon, all gone. Like Neo plugged into the Matrix, or a cow on a farm, Copi­lot wants to con­vert us into noth­ing more than pro­duc­ers of a resource to be extracted. (Well, until we can be dis­posed of entirely.)

And for what? Even the cows get food & shel­ter out of the deal. Copi­lot con­tributes noth­ing to our indi­vid­ual projects. And noth­ing to open source broadly.

The walled gar­den of Copi­lot is anti­thet­i­cal—and poi­so­nous—to open source. It’s there­fore also a betrayal of every­thing GitHub stood for before being acquired by Microsoft. If you were born before 2005, you remem­ber that GitHub built its rep­u­ta­tion on its good­ies for open-source devel­op­ers and fos­ter­ing that com­mu­nity. Copi­lot, by con­trast, is the Mul­ti­verse-of-Mad­ness inver­sion of this idea.

“Dude, it’s cool. I took SFC’s advice and moved my code off GitHub.” So did I. Guess what? It doesn’t matter. By claiming that AI training is fair use, Microsoft is constructing a justification for training on public code anywhere on the internet, not just GitHub. If we take this idea to its natural endpoint, we can predict that for end users, Copilot will become a substitute not just for open-source code on GitHub, but for open-source code everywhere.

On the other hand, maybe you’re a fan of Copi­lot who thinks that AI is the future and I’m just yelling at clouds. First, the objec­tion here is not to AI-assisted cod­ing tools gen­er­ally, but to Microsoft’s spe­cific choices with Copi­lot. We can eas­ily imag­ine a ver­sion of Copi­lot that’s friend­lier to open-source devel­op­ers—for instance, where par­tic­i­pa­tion is vol­un­tary, or where coders are paid to con­tribute to the train­ing cor­pus. Despite its pro­fessed love for open source, Microsoft chose none of these options. Sec­ond, if you find Copi­lot valu­able, it’s largely because of the qual­ity of the under­ly­ing open-source train­ing data. As Copi­lot sucks the life from open-source projects, the prox­i­mate effect will be to make Copi­lot ever worse—a spi­ral­ing ouroboros of garbage code.

When I first wrote about Copi­lot, I said “I’m not wor­ried about its effects on open source.” In the short term, I’m still not wor­ried. But as I reflected on my own jour­ney through open source—nearly 25 years—I real­ized that I was miss­ing the big­ger pic­ture. After all, open source isn’t a fixed group of peo­ple. It’s an ever-grow­ing, ever-chang­ing col­lec­tive intel­li­gence, con­tin­u­ally being renewed by fresh minds. We set new stan­dards and chal­lenges for each other, and thereby raise our expec­ta­tions for what we can accom­plish.

Amidst this grand alchemy, Copilot interlopes. Its goal is to arrogate the energy of open source to itself. We needn’t delve into Microsoft’s very checkered history with open source to see Copilot for what it is: a parasite.

The legal­ity of Copi­lot must be tested before the dam­age to open source becomes irrepara­ble. That’s why I’m suit­ing up.

Help us inves­ti­gate.

I’m cur­rently work­ing with the Joseph Saveri Law Firm to inves­ti­gate a poten­tial law­suit against GitHub Copi­lot. We’d like to talk to you if—

Any infor­ma­tion pro­vided will be kept in the strictest con­fi­dence as pro­vided by law.

We look forward to hearing from you. You can contact me directly at mb@buttericklaw.com or use the form on the Joseph Saveri Law Firm website to reach the investigation team.

On November 3, 2022, we filed our initial complaint challenging GitHub Copilot. Please follow the progress of the case at githubcopilotlitigation.com.