DOE vs. GitHub: Plaintiffs Claim Codex & Copilot Were Trained With Copyrighted Material

by Legal PDF, September 3rd, 2023

Too Long; Didn't Read

The emergence of AI-driven programming tools like Codex and Copilot has revolutionized the way code is written and reused. Unlike human programmers, these systems lack the ability to understand legal concepts like copyright, attribution, and licensing. This excerpt explores how these AI models are trained on copyrighted data, their probabilistic approach to problem-solving, and the resulting challenges in upholding copyright laws and ethical programming practices. The deliberate choice to prioritize expedited releases over legality raises questions about the responsibility of developers in ensuring lawful output. The AI's statistical pattern recognition, while efficient, stands in stark contrast to human reasoning and decision-making. The excerpt emphasizes the need for a nuanced approach to copyright compliance in the realm of AI-generated content.

DOE vs. GitHub (amended complaint) Court Filing (Redacted), June 8, 2023, is part of HackerNoon’s Legal PDF Series. You can jump to any part in this filing here. This is part 16 of 38.

VII. FACTUAL ALLEGATIONS

D. Codex and Copilot Were Trained on Copyrighted Materials Offered Under Licenses

82. Codex is an AI system. Another way to describe it is a “model.” Without Codex, Copilot, or another AI-code-lookup-tool, code is written both by originating code from the writer’s own knowledge of how to write code as well as by finding pre-written portions of code that—under the terms of the applicable license—may be incorporated into the coding project.


83. Unlike a human programmer that has learned how code works and notices when code it is copying has attached license terms, a copyright notice, and/or attribution, Codex and Copilot were developed by feeding a corpus of material, called “training data,” into them. These AI programs ingest all the data and, through a complex probabilistic process, predict what the most likely solution to a given prompt a user would input is. Though more complicated in practice, essentially Copilot returns the solution it has found in the most projects when those projects are somehow weighted to adjust for whatever variables Codex or Copilot have identified as relevant.


84. Codex and Copilot were not programmed to treat attribution, copyright notices, and license terms as legally essential. Defendants made a deliberate choice to expedite the release of Copilot rather than ensure it would not provide unlawful Output.


85. The words “study” and “training” and “learning” in connection with AI describe algorithmic processes that are not analogous to human reasoning. AI models cannot “learn” as humans do, nor can it “understand” semantics and context the way humans do. Rather, it detects statistically significant patterns in its training data and provides Output derived from its training data when statistically appropriate. A “brute force” approach like this would not be efficient nor even possible for humans. A human could not memorize, statistically analyze, and easily access thousands of gigabytes of existing code, a task now possible for powerful computers like those that make up Microsoft’s Azure cloud platform. To accomplish the same task, a human may search for Licensed Materials that serve their purpose if they believe such materials exist. And if that human finds such materials, they will probably abide by its License Terms rather than risk infringing its owners’ rights. At the very least, if they incorporate those Licensed Materials into their own project without following its terms they will be doing so knowingly.





About HackerNoon Legal PDF Series: We bring you the most important technical and insightful public domain court case filings.


This court case, 4:22-cv-06823-JST, retrieved on August 26, 2023, from Storage CourtListener, is part of the public domain. The court-created documents are works of the federal government and, under copyright law, are automatically placed in the public domain and may be shared without legal restriction.