Github's Copilot Committed Software Piracy at an Unprecedented Level

by Legal PDF, September 6th, 2023

DOE v. Github (original complaint) Court Filing, retrieved on November 3, 2022, is part of HackerNoon’s Legal PDF Series. This is part 2 of 37.


1. Plaintiffs and the Class are owners of copyright interests in materials made available publicly on GitHub that are subject to various licenses containing conditions for use of those works (the “Licensed Materials”). All the licenses at issue here (the “Licenses”) contain certain common terms (the “License Terms”).

2. “Artificial Intelligence” is referred to herein as “AI.” AI is defined for the purposes of this Complaint as a computer program that algorithmically simulates human reasoning or inference, often using statistical methods. Machine Learning (“ML”) is a subset of AI in which the behavior of the program is derived from studying a corpus of material called training data.

3. GitHub is a company founded in 2008 by a team of open-source enthusiasts. At the time, GitHub’s stated goal was to support open-source development, especially by hosting open-source source code on its website. Over the next 10 years, GitHub, based on these representations, succeeded wildly, attracting nearly 25 million developers.

4. Developers published Licensed Materials on GitHub pursuant to written Licenses. In particular, the most popular ones share a common term: use of the Licensed Materials requires some form of attribution, usually by, among other things, including a copy of the license along with the name and copyright notice of the original author.

5. On October 26, 2018, Microsoft acquired GitHub for $7.5 billion. Though some members of the open-source community were skeptical of this union, Microsoft repeated one mantra throughout: “Microsoft Loves Open Source”. For the first few years, Microsoft’s representations seemed credible.

6. Microsoft invested $1 billion in OpenAI LP in July 2019 at a $20 billion valuation. In 2020, Microsoft became exclusive licensee of OpenAI’s GPT-3 language model—despite OpenAI’s continued claims its products are meant to benefit “humanity” at large. In 2021, Microsoft began offering GPT-3 through its Azure cloud-computing platform. On October 20, 2022, it was reported that OpenAI “is in advanced talks to raise more funding from Microsoft” at that same $20 billion valuation. Copilot runs on Microsoft’s Azure platform. Microsoft has used Copilot to promote Azure’s processing power, particularly regarding AI.

7. On information and belief, Microsoft obtained a partial ownership interest in OpenAI in exchange for its $1 billion investment. As OpenAI’s largest investor and largest service provider—specifically in connection with Microsoft’s Azure product—Microsoft exerts considerable control over OpenAI.

8. In June 2021, GitHub and OpenAI launched Copilot, an AI-based product that promises to assist software coders by providing or filling in blocks of code using AI. GitHub charges Copilot users $10 per month or $100 per year for this service. Copilot ignores, violates, and removes the Licenses offered by thousands—possibly millions—of software developers, thereby accomplishing software piracy on an unprecedented scale. Copilot outputs text derived from Plaintiffs’ and the Class’s Licensed Materials without adhering to the applicable License Terms and applicable laws. Copilot’s output is referred to herein as “Output.”

9. On August 10, 2021, OpenAI debuted its Codex product, which converts natural language into code and is integrated into Copilot. (Copilot and Codex can be called either AIs or MLs. Herein they will be referred to as AIs unless a distinction is required.)

10. Though Defendants have been cagey about what data was used to train the AI,[2] they have conceded that the training data includes data in vast numbers of publicly accessible repositories on GitHub,[3] which include and are limited by Licenses.

11. Among other things, Defendants stripped Plaintiffs’ and the Class’s attribution, copyright notice, and license terms from their code in violation of the Licenses and Plaintiffs’ and the Class’s rights. Defendants used Copilot to distribute the now-anonymized code to Copilot users as if it were created by Copilot.

12. Copilot is run entirely on Microsoft’s Azure cloud-computing platform.

13. Copilot often simply reproduces code that can be traced back to open-source repositories or open-source licensees. Contrary to and in violation of the Licenses, code reproduced by Copilot never includes attributions to the underlying authors.

14. GitHub and OpenAI have offered shifting accounts of the source and amount of the code or other data used to train and operate Copilot. They have also offered shifting justifications for why a commercial AI product like Copilot should be exempt from these license requirements, often citing “fair use.”

15. It is not fair, permitted, or justified. On the contrary, Copilot’s goal is to replace a huge swath of open source by taking it and keeping it inside a GitHub-controlled paywall. It violates the licenses that open-source programmers chose and monetizes their code despite GitHub’s pledge never to do so.

[2] “Training” an AI, as described in greater detail below, means feeding it large amounts of data that it interprets using given criteria. Feedback is then given to it to fine-tune its Output until it can provide Output with minimal errors.

[3] Repositories are containers for individual coding projects. They are where GitHub users upload their code and where other users can find it. Most GitHub users have multiple repositories.

About HackerNoon Legal PDF Series: We bring you the most important technical and insightful public domain court case filings.

This court case, 3:22-cv-06823-KAW, retrieved on September 5, 2023, from Storage.Courtlistener, is part of the public domain. The court-created documents are works of the federal government and, under copyright law, are automatically placed in the public domain and may be shared without legal restriction.