ChatGPT's Predecessor Was Trained on 'Filtered' Internet Data from an Open Source Repository
by @legalpdf



Too Long; Didn't Read

OpenAI wants portions of the NYT's lawsuit against the company to be dismissed, arguing that the paper presented misleading evidence to the court.

The New York Times Company v. OpenAI Update Court Filing, retrieved on February 26, 2024, is part of HackerNoon’s Legal PDF Series. You can jump to any part in this filing here. This part is 3 of 15.

B. The Key to Generalist Language Models: Scale

So OpenAI’s research took the logical next step: “scaling up [] model size” by increasing both the “size” and “diversity” of the training data. See GPT-3 Paper at 6. To build its next generation of models, OpenAI’s researchers gathered a more robust set of data in part by “expand[ing]” the WebText database into a new version called “WebText2,” which included material shared over a longer period of time. Id. at 8–9. They also used a filtered version of Common Crawl, a repository of data collected by a non-profit research organization representing “a copy of the Internet.” Compl. ¶ 88. OpenAI disclosed all of this no later than July 22, 2020. GPT-3 Paper at 8. At the time, it was common knowledge that WebText2 and Common Crawl included numerous articles published by the Times.[19]


The result of this simple act of “scaling up” the training data was, as the Times reported at the time, “mind blowing.”[20] The new GPT-3 model was “by far the most powerful ‘language model’ ever created.” Manjoo, supra note 20. It could conduct “on-the-fly reasoning” and “unscrambl[e] words, perform[] arithmetic, and us[e] novel words in a sentence after seeing them defined only once.” GPT-3 Paper at 5. Increasing the scale of training led to a surprising jump in ability. See, e.g., id. at 22 (Figure 3.10). According to the Times’s reporting, GPT-3 showed that “[m]achines are gaining the ability to write.” Manjoo, supra note 20. Within days, developers began to use it to build unprecedented tools. Id. (“service that responds to email on your behalf”).


The “key advance,” as the Times reported, was “GPT-3’s flexibility.” Id. And the reason it was flexible was OpenAI’s decision to “scal[e] up” the “size” of the training data. GPT-3 Paper at 6. The amount of data needed was staggering. Compl. ¶ 85. But it was that “unprecedented scale” that allowed the model to internalize not only a “map of human language,” but achieve a level of adaptability—and “emergent” intelligence—that “no one thought possible.”[21]





[19] Compl. ¶ 88 (citing a “2019 snapshot of Common Crawl”); GPT-2 Model Card (noting prevalence of Times content in WebText); GPT-3 Paper at 8 (noting WebText2 is an “expanded version of the WebText dataset”).


[20] Farhad Manjoo, How Do You Know a Human Wrote This?, N.Y. Times (July 29, 2020), https://www.nytimes.com/2020/07/29/opinion/gpt-3-ai-automation.html. The Court may take judicial notice of this and other news articles cited in this Motion. In re UBS Auction Rate Sec. Litig., No. 08-cv-2967, 2010 WL 2541166, at *10-12 (S.D.N.Y. June 10, 2010) (judicial notice of “several news items” including Wall Street Journal articles).


[21] Cade Metz, Meet GPT-3. It Has Learned to Code (and Blog and Argue), N.Y. Times (Nov. 24, 2020), https://www.nytimes.com/2020/11/24/science/artificial-intelligence-ai-gpt3.html.



About HackerNoon Legal PDF Series: We bring you the most important technical and insightful public domain court case filings.


This court case, retrieved on February 26, 2024, from fingfx.thomsonreuters.com, is part of the public domain. Court-created documents are works of the federal government and, under copyright law, are automatically placed in the public domain and may be shared without legal restriction.