
Open AI Codex and Github Copilot Were Allegedly Trained on Copyrighted Materials

by Legal PDF: Tech Court Cases, September 7th, 2023

Too Long; Didn't Read

Codex and Copilot, two AI code generation tools, are at the center of an ethical debate. Unlike humans, these AI systems lack an understanding of licenses, copyrights, and attribution, raising concerns about code output that may infringe upon copyrighted materials. Explore how these AI models were trained and why their code generation process poses unique challenges in respecting intellectual property rights.

DOE v. GitHub (original complaint) Court Filing, retrieved on November 3, 2022, is part of HackerNoon’s Legal PDF Series. You can jump to any part in this filing here. This is part 17 of 37.

VII. FACTUAL ALLEGATIONS

D. Codex and Copilot Were Trained on Copyrighted Materials Offered Under Licenses


78. Codex is an AI system. Another way to describe it is a “model.” Without Codex, Copilot, or another AI code-lookup tool, code is written both by originating it from the writer’s own knowledge of how to write code and by finding pre-written portions of code that, under the terms of the applicable license, may be incorporated into the coding project.


79. Unlike a human programmer, who learns how code works and notices when code they are copying carries license terms, a copyright notice, and/or attribution, Codex and Copilot were developed by feeding a corpus of material, called “training data,” into them. These AI programs ingest all of that data and, through a complex probabilistic process, predict the most likely solution to a given user prompt. Though more complicated in practice, Copilot essentially returns the solution that appears in the most projects, with those projects weighted to adjust for whatever variables Codex or Copilot have identified as relevant.


80. Codex and Copilot were not programmed to treat attribution, copyright notices, and license terms as legally essential. Defendants made a deliberate choice to expedite the release of Copilot rather than ensure it would not provide unlawful Output.


81. The words “study,” “training,” and “learning” in connection with AI describe algorithmic processes that are not analogous to human reasoning. An AI model cannot “learn” as humans do, nor can it “understand” semantics and context the way humans do. Rather, it detects statistically significant patterns in its training data and provides Output derived from that data when statistically appropriate. A “brute force” approach like this would be neither efficient nor even possible for humans. A human could not memorize, statistically analyze, and easily access thousands of gigabytes of existing code, a task now possible for powerful computers like those that make up Microsoft’s Azure cloud platform. To accomplish the same task, a human may search for Licensed Materials that serve their purpose if they believe such materials exist. And if that human finds such materials, they will probably abide by their License Terms rather than risk infringing the owners’ rights. At the very least, if they incorporate those Licensed Materials into their own project without following their terms, they will be doing so knowingly.
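The “complex probabilistic process” the complaint describes in ¶¶79 and 81 can be sketched as a highly simplified toy model. The corpus, prompts, and `predict` function below are all invented for illustration and bear no resemblance to how Codex is actually implemented; the sketch only shows the general idea of returning the statistically most frequent completion, with no license or attribution information attached to the result.

```python
from collections import Counter

# Toy "training corpus" of (prompt, completion) pairs, standing in for the
# large volume of code the filing says Codex ingested. All data here is
# invented for illustration.
corpus = [
    ("def add(a, b):", "return a + b"),
    ("def add(a, b):", "return a + b"),
    ("def add(a, b):", "return a - b"),  # a rarer, buggy variant
    ("def greet(name):", "print(f'Hello, {name}')"),
]

def predict(prompt: str) -> str:
    """Return the completion seen most often for this exact prompt.

    A real model generalizes over statistical patterns rather than looking
    up exact prompts, but the effect described in the filing is similar:
    the most frequent solution in the training data wins, and no license
    terms, copyright notice, or attribution travel with it.
    """
    completions = Counter(c for p, c in corpus if p == prompt)
    if not completions:
        raise KeyError(f"no training examples for prompt: {prompt!r}")
    return completions.most_common(1)[0][0]

print(predict("def add(a, b):"))  # prints the majority completion
```

Note that the toy model happily returns the majority answer even when minority variants exist in the corpus, and nothing in the output records where the completion came from — which is the gap the complaint alleges.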



Continue Reading Here.


About HackerNoon Legal PDF Series: We bring you the most important technical and insightful public domain court case filings.


This court case, 3:22-cv-06823-KAW, retrieved on September 5, 2023, from Storage.Courtlistener, is part of the public domain. The court-created documents are works of the federal government and, under copyright law, are automatically placed in the public domain and may be shared without legal restriction.