GitHub Copilot Allegations: Reproducing Code Without Attribution

by Legal PDF: Tech Court Cases, September 4th, 2023

DOE vs. Github (amended complaint) Court Filing (Redacted), June 8, 2023, is part of HackerNoon’s Legal PDF Series. You can jump to any part in this filing here. This is part 22 of 38.

VII. FACTUAL ALLEGATIONS

F. Copilot Reproduces the Code of the Named Plaintiffs Without Attribution

4. Example: Copilot Outputs Code of Doe 5 Essentially Verbatim


121. The fourth example also demonstrates Copilot suggesting multiple modified copies of code written by Doe 5 in response to a sequence of prompts, which is a common way of using Copilot. To protect Doe 5’s identity, the paragraphs describing the code will be redacted.


122. (Redacted) subject to the MIT License. (Redacted) The first three tests from the original source file are shown below: (Redacted)


123. When Copilot is prompted with the first section of Doe 5’s code, comprising the first complete test and the name of the second: (Redacted)


The first suggestion from Copilot offers to complete the second test with a verbatim copy of Doe 5’s original code: (Redacted)


124. When Copilot’s suggestion is accepted and the name of Doe 5’s third test is appended, the next prompt to Copilot looks like this: (Redacted)


125. Once again, the first suggestion from Copilot offers to complete the third test with a verbatim copy of Doe 5’s code (except for small cosmetic variations in line breaks): (Redacted)


126. Because Copilot is (repeatedly) reproducing Doe 5’s code essentially verbatim, the Copilot suggestions need to follow the requirements of Doe 5’s license (the MIT License) for that code, including providing attribution. They do not. Copilot also did not reproduce Doe 5’s license.


127. These are only a few examples of Plaintiffs’ code being reproduced by Copilot. It follows that many if not all prompts entered into Copilot will readily cause it to emit verbatim, near-verbatim or modified copies of Licensed Material that violate the licenses under which the source code is published. Multiplied across the many users of Copilot and the many times Copilot is prompted, each day these violations must be accruing with astonishing frequency. It is therefore likely if not certain that verbatim, near-verbatim or modified copies of each Plaintiff’s code have already been emitted by Copilot.


128. Additionally, even though Plaintiffs have been able to generate these examples, Plaintiffs remain at a great evidentiary disadvantage relative to Defendants, because Defendants control all the information about the training dataset. In particular, only Defendants know when the Licensed Materials of Plaintiffs and the Class were scraped. As is typical in open source, many of the Licensed Materials are regularly updated. As such, it is difficult to determine which iterations of code may have been trained on and would be subject to emission by Copilot.



About HackerNoon Legal PDF Series: We bring you the most important technical and insightful public domain court case filings.


This court case 4:22-cv-06823-JST retrieved on August 26, 2023, from Storage Courtlistener is part of the public domain. The court-created documents are works of the federal government, and under copyright law, are automatically placed in the public domain and may be shared without legal restriction.