DOE v. Github (original complaint) Court Filing, retrieved on November 3, 2022, is part of HackerNoon’s Legal PDF Series. This is part 18 of 37.

VII. FACTUAL ALLEGATIONS

E. Copilot Was Launched Despite Its Propensity for Producing Unlawful Outputs

82. GitHub and OpenAI have not provided much detail regarding what data Codex and Copilot were trained on. Plaintiffs know for certain from GitHub and OpenAI’s statements that both systems were trained on publicly available GitHub repositories, with Copilot having been trained on all available public GitHub repositories. Thus, if Licensed Materials have been posted to a GitHub public repository, Plaintiffs and the Class can be reasonably certain it was ingested by Copilot and is sometimes returned to users as Output.

83. According to OpenAI, Codex was trained on “billions of lines of source code from publicly available sources, including code in public GitHub repositories”. Similarly, GitHub has described[13] Copilot’s training material as “billions of lines of public code.” GitHub researcher Eddie Aftandilian confirmed in a recent podcast[14] that Copilot is “train[ed] on public repos on GitHub.”

84. In a recent customer-support message, GitHub’s support department clarified certain facts about training Copilot. First, GitHub said that “training for Codex (the model used by Copilot) is done by OpenAI, not GitHub.” Second, in its support message, GitHub put forward a more detailed justification for its use of copyrighted code as training data:

Training machine learning models on publicly available data is considered fair use across the machine learning community . . . OpenAI’s training of Codex is done in accordance with global copyright laws which permit the use of publicly accessible materials for computational analysis and training of machine learning models, and do not require consent of the owner of such materials.
Such laws are intended to benefit society by enabling machines to learn and understand using copyrighted works, much as humans have done throughout history, and to ensure public benefit, these rights cannot generally be restricted by owners who have chosen to make their materials publicly accessible.

The claim that training ML models on publicly available code is widely accepted as fair use is not true. And regardless of this concept’s level of acceptance in “the machine learning community,” under Federal law, it is illegal.

85. Former GitHub CEO Nat Friedman said in June 2021, when Copilot was released to a limited number of customers, that “training ML systems on public data is fair use.”[15] Friedman’s statement is pure speculation; no Court has considered the question of whether “training ML systems on public data is fair use.” The Fair Use affirmative defense is only applicable to Section 501 copyright infringement. It is not a defense to violations of the DMCA, Breach of Contract, nor any other claim alleged herein. It cannot be used to avoid liability here. At the same time, Friedman asserted “the output [of Copilot] belongs to the operator.”

86. Other open-source stakeholders have made this point already. For example, in June 2021, Software Freedom Conservancy (“SFC”), a prominent open-source advocacy organization, asked Microsoft and GitHub to provide “legal references for GitHub’s public legal positions.” No references were provided by any of the Defendants.[16]

87. Beyond the examples above, Copilot regularly Outputs verbatim copies of Licensed Materials. For example, Copilot reproduced verbatim well-known code from the game Quake III, use of which is governed by one of the Suggested Licenses: GPL-2.[17]

88. Copilot also reproduced code that had been released under a license that allowed its use only for free games and required attribution by including a copy of the license.
Copilot did not mention nor include the underlying license when providing a copy of this code as Output.[18]

89. Texas A&M computer-science professor Tim Davis has provided numerous examples of Copilot reproducing code belonging to him without its license or attribution.[19]

90. GitHub concedes that in ordinary use, Copilot will reproduce passages of code verbatim: “Our latest internal research shows that about 1% of the time, a suggestion [Output] may contain some code snippets longer than ~150 characters that matches” code from the training data. This standard is more limited than is necessary for copyright infringement. But even using GitHub’s own metric and the most conservative possible criteria, Copilot has violated the DMCA at least tens of thousands of times.

91. In June 2022, Copilot had 1,200,000 users. If only 1% of users have ever received Output based on Licensed Materials and only once each, Defendants have “only” breached Plaintiffs’ and the Class’s Licenses 12,000 times. However, each time Copilot outputs Licensed Materials without attribution, the copyright notice, or the License Terms it violates the DMCA three times. Thus, even using this extreme underestimate, Copilot has “only” violated the DMCA 36,000 times.[20] Because Copilot constantly Outputs code as a user writes, and because nearly all of Copilot’s training data was Licensed Material, this number is most likely exponentially lower than the true number of breaches and DMCA violations.

[13] https://github.blog/2021-06-30-github-copilot-research-recitation/.
[14] https://www.se-radio.net/2022/10/episode-533-eddie-aftandilian-on-github-copilot/.
[15] https://twitter.com/natfriedman/status/1409914420579344385/.
[16] https://sfconservancy.org/blog/2022/feb/03/github-copilot-copyleft-gpl/.
[17] https://twitter.com/stefankarpinski/status/1410971061181681674/.
[18] https://twitter.com/ChrisGr93091552/status/1539731632931803137/.
[19] https://twitter.com/DocSparse/status/1581461734665367554/.
[20] These violations of Section 1202 of the DMCA each incur statutory damages of “not less than $2,500 or more than $25,000.” 17 U.S.C. § 1203(c)(3)(B). This extremely conservative estimate of Defendants’ number of direct violations translates to $90 million to $900 million in statutory damages.

About HackerNoon Legal PDF Series: We bring you the most important technical and insightful public domain court case filings.

This court case 3:22-cv-06823-KAW retrieved on September 5, 2023, from Storage.Courtlistener is part of the public domain. The court-created documents are works of the federal government, and under copyright law, are automatically placed in the public domain and may be shared without legal restriction.
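
The arithmetic behind paragraph 91 and footnote 20 can be reproduced with a short calculation. The sketch below is illustrative only and is not part of the filing: the function and variable names are hypothetical, and every input (1,200,000 users, a 1% share receiving such Output once each, three alleged DMCA violations per Output, and the statutory range of $2,500 to $25,000 per violation) is taken from the complaint's own figures rather than independently verified.

```python
# Inputs as alleged in the complaint (paragraph 91 and footnote 20).
# All names here are hypothetical and chosen for illustration.
USERS = 1_200_000            # Copilot users in June 2022
SHARE_PERCENT = 1            # assumed share of users receiving such Output, once each
VIOLATIONS_PER_OUTPUT = 3    # missing attribution, copyright notice, and License Terms
MIN_DAMAGES = 2_500          # per violation, 17 U.S.C. § 1203(c)(3)(B)
MAX_DAMAGES = 25_000         # per violation, 17 U.S.C. § 1203(c)(3)(B)


def estimate_exposure(users=USERS, share_percent=SHARE_PERCENT,
                      per_output=VIOLATIONS_PER_OUTPUT,
                      min_damages=MIN_DAMAGES, max_damages=MAX_DAMAGES):
    """Return (license breaches, DMCA violations, low damages, high damages)."""
    # Integer arithmetic avoids floating-point rounding on the percentage.
    breaches = users * share_percent // 100
    violations = breaches * per_output
    return (breaches, violations,
            violations * min_damages, violations * max_damages)


if __name__ == "__main__":
    breaches, violations, low, high = estimate_exposure()
    print(f"License breaches: {breaches:,}")          # 12,000
    print(f"DMCA violations:  {violations:,}")        # 36,000
    print(f"Statutory range:  ${low:,} to ${high:,}")  # $90,000,000 to $900,000,000
```

Changing `share_percent` or `per_output` shows how sensitive the alleged damages range is to the complaint's assumptions.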

