The New York Times Company v. OpenAI Update Court Filing, retrieved on February 26, 2024, is part of HackerNoon’s Legal PDF Series.
To support its narrative, the Times claims OpenAI’s tools can “closely summarize[]” the facts it reports in its pages and “mimic[] its expressive style.” Compl. ¶ 4. But the law does not prohibit reusing facts or styles.[26] If it did, the Times would owe countless billions to other journalists who “invest[] [] enormous amount[s] of time, money, expertise, and talent” in reporting stories, Compl. ¶ 32, only to have the Times summarize them in its pages, see supra note 13.
To avoid that problem, the Times focuses its allegations on two uncommon and unintended phenomena: (1) training data regurgitation and (2) model hallucination. The first occurs when a language model “generat[es] a sample that closely resembles [its] training data.”[27] This most often happens “[w]hen the training data set contains a number of highly similar observations, such as duplicates” of a particular work. Burg Paper at 2. Put simply, a model trained on the same block of text multiple times will be more likely to complete that text verbatim when prompted to do so— in the same way that any American who hears the words “I pledge allegiance” might reflexively respond with the words “to the flag of the United States of America.” Training data regurgitation—sometimes referred to as unintended “memorization” or “overfitting”—is a problem that researchers at OpenAI and elsewhere work hard to address, including by making sure that their datasets are sufficiently diverse. See id. (memorization occurs when “the algorithm has not seen sufficient observations to enable generalization”); contra Compl. ¶ 93 (alleging that “the GPT models [were] programmed to accurately mimic The Times’s content and writers”).
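As technical background (not part of the filing), the deduplication the passage alludes to, keeping a training set "sufficiently diverse" by removing repeated copies of the same text, can be illustrated with a minimal sketch. Everything below, including the corpus, the normalize helper, and the deduplicate helper, is hypothetical and invented for this example.

```python
# Minimal, hypothetical sketch of corpus deduplication, one way to make a
# training set "sufficiently diverse" so a model is less likely to memorize
# and regurgitate any one passage verbatim.
import hashlib

def normalize(passage: str) -> str:
    """Lowercase and collapse whitespace so near-identical copies hash alike."""
    return " ".join(passage.lower().split())

def deduplicate(passages: list[str]) -> list[str]:
    """Keep only the first occurrence of each normalized passage."""
    seen: set[str] = set()
    unique: list[str] = []
    for passage in passages:
        digest = hashlib.sha256(normalize(passage).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(passage)
    return unique

corpus = [
    "I pledge allegiance to the flag of the United States of America.",
    "I pledge allegiance to the flag of the United States of America.",  # duplicate
    "An unrelated sentence that appears only once.",
]
print(len(deduplicate(corpus)))  # prints 2: the duplicate is dropped before training
```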
The second phenomenon—hallucination—occurs when a model generates “seemingly realistic” answers that turn out to be wrong.[28] Hallucinations occur because language models are not databases of information, but statistical engines that “predict[] words that are likely to follow” a given prompt. Compl. ¶ 75. Like all probabilistic processes, they are not always 100% correct. An ongoing challenge of AI development is minimizing and (eventually) eliminating hallucination, including by using more complete training datasets to improve the accuracy of the models’ predictions. See GPT-4 Paper at 46 (surveying techniques used to “reduce [GPT-4]’s tendency to hallucinate” by between 19% and 29%). In the meantime, OpenAI warns users that, because models “‘hallucinate’ facts,” “[g]reat care should be taken” when using them. Id. at 10.
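Again as background rather than anything drawn from the filing, the "statistical engine" description can be pictured with a toy next-word sampler: the model assigns probabilities to candidate continuations of a prompt and samples one, so a fluent-sounding but unsupported continuation can still come out. All names and probability values in this sketch are invented for illustration.

```python
# Toy illustration (not OpenAI's implementation) of next-word prediction:
# the model assigns probabilities to candidate continuations of a prompt and
# samples one, so a plausible-sounding but unsupported continuation -- a
# "hallucination" -- can still be generated.  All values here are invented.
import random

# Hypothetical probabilities for continuations of "The Times reported that ..."
next_word_probs = {
    "city officials": 0.45,
    "federal regulators": 0.35,
    "a study that does not exist": 0.20,  # plausible-sounding but wrong
}

def sample_next(probs: dict[str, float]) -> str:
    candidates = list(probs)
    weights = list(probs.values())
    return random.choices(candidates, weights=weights, k=1)[0]

random.seed(0)
print(sample_next(next_word_probs))  # sampling sometimes picks the unsupported continuation
```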
In an attempt to frame these undesirable phenomena as typical model behavior, the Complaint features a number of examples of training data regurgitation and model hallucination generated by the Times after what appears to have been prolonged and extensive efforts to hack OpenAI’s models. Notably, the Times does not allege in the Complaint that it made any attempt to share these results with OpenAI (despite being asked to do so), or otherwise collaborate with OpenAI’s ongoing efforts to prevent these kinds of outputs. Rather, the Times kept these results to itself, apparently to set up this lawsuit. The Times’s examples fall into two categories: (1) outputs generated by OpenAI’s models using its developer tools and (2) ChatGPT outputs.
1. Outputs from Developer Tools
Exhibit J features GPT-4 outputs the Times generated by prompting OpenAI’s API to complete 100 Times articles. Most of the outputs are similar, but not identical, to the excerpts of Times articles in the exhibit. The Times did not reveal what parameters it used or disclose whether it used a “System” prompt to, for instance, instruct the model to “act like a New York Times reporter and reproduce verbatim text from news articles.” See supra 9. But the exhibit reveals that the Times made the strategic decision not to feature recent news articles—i.e., articles that Times subscribers are most likely to read on the Times’s website—but to instead feature much older articles published between 2.5 and 12 years before the filing of the Complaint.[29]
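For context on what a "System" prompt and request parameters look like in practice (general background only, not a reconstruction of the prompts behind Exhibit J), a chat-style request to OpenAI's API is typically structured as in the sketch below; the prompt text and parameter values shown are hypothetical.

```python
# Illustrative only: how a "System" prompt and sampling parameters are passed
# to OpenAI's chat completions API (Python SDK v1).  The prompt text and
# parameter values below are hypothetical, not those used to produce Exhibit J.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        # The system message steers the model's behavior for the whole exchange.
        {"role": "system", "content": "You are a helpful assistant."},
        # The user message carries the actual request.
        {"role": "user", "content": "Continue the following passage: ..."},
    ],
    temperature=0.0,  # lower values make the output more deterministic
    max_tokens=250,   # caps the length of the completion
)

print(response.choices[0].message.content)
```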
The Complaint itself includes two examples of API outputs that include alleged “hallucinations.” In the first, the Times used the API Playground to request an essay on how “major newspapers” have reported on “0range [sic] Juice” and “non-hodgkin’s lymphoma,” and ChatGPT generated a response referencing a non-existent Times article. See Compl. ¶ 140. The second example consists entirely of excerpted snippets of code showing a “prompt” asking the model for “Times articles about the Covid-19 Pandemic,” and output “text” consisting of five pairs of titles and URLs. Id. The Times claims this output “mislead[s] users” and “tarnish[es]” its marks. Id. ¶¶ 142, 202. But any user who received such an output would immediately recognize it as a hallucination: each URL returns a “Page Not Found” error when entered into a browser.
2. ChatGPT Outputs
ChatGPT. The Complaint includes two examples of ChatGPT allegedly regurgitating training data consisting of Times articles. Compl. ¶¶ 104–07. In both, the Times asked ChatGPT questions about popular Times articles, including by requesting quotes. See, e.g., id. ¶ 106 (requesting “opening paragraphs,” then “the next sentence,” then “the next sentence,” etc.). Each time, ChatGPT provided scattered and out-of-order quotes from the articles in question.[30]
In its Complaint, the Times reordered those outputs (and used ellipses to obscure their original location) to create the false impression that ChatGPT regurgitated sequential and uninterrupted snippets of the articles. Compare id. ¶ 107, with supra note 30. In any case, the regurgitated text represents only a fraction of the articles, see, e.g., Compl. ¶ 104 (105 words from 16,000+ word article), all of which the public can already access for free on third-party websites.[31]
Browse with Bing. The Complaint also includes two examples of interactions with “Browse with Bing” created using the same methods. Compl. ¶¶ 118–22. In both, ChatGPT returned short snippets of Times articles. See id. ¶ 118 (reproducing first two paragraphs before refusing subsequent request for more); id. ¶ 121 (reproducing snippets from first, fourth, and fifth paragraphs). The Complaint suggests that ChatGPT obtained this text from third-party websites.[32]
Wirecutter. Finally, the Complaint cites two examples of the Times’s attempts to probe ChatGPT about “Wirecutter,” a section of the Times that recommends products in exchange for a “commission” from manufacturers. Compl. ¶ 128. In both, the Times asked ChatGPT about a specific Wirecutter recommendation, see id. ¶ 134, and ChatGPT responded by directing the user to Wirecutter itself and providing a short, non-verbatim summary of the recommendation. Id. ¶ 130 (including hyperlink); id. ¶ 134 (urging user to “check [Wirecutter’s] latest reviews”).
[26] Hoehling, 618 F.2d at 978 (“[T]here cannot be any such thing as copyright in the order of presentation of the facts, nor, indeed, in their selection.” (quoting Judge Learned Hand)); McDonald v. West, 138 F. Supp. 3d 448, 455 (S.D.N.Y. 2015) (“[C]opyright does not protect styles,” and “[f]or the same reason” it “does not protect ideas”).
[27] Gerrit J.J. van den Burg & Christopher K.I. Williams, On Memorization in Probabilistic Deep Generative Models at 2 (2021), https://proceedings.neurips.cc/paper/2021/file/eae15aabaa768ae4a5993a8a4f4fa6e4-Paper.pdf (“Burg Paper”); see also Compl. ¶ 80 n.10 (citing and quoting this article).
[28] Compl. ¶ 137; see also OpenAI, GPT-4 Technical Report at 46 (2023), https://cdn.openai.com/papers/gpt-4.pdf (“GPT-4 Paper”); Compl. ¶ 59 n.3 (quoting this source).
[29] See Ex. J. at 2–5 (articles published in 2012 and 2019), id. at 6–126 (articles published in 2020 and 2021).
[30] See Compl. ¶ 104 (providing article’s first two sentences in response to request for “first paragraph;” ignoring request for “next paragraph” and instead providing quote beginning with article’s fifth paragraph); id. ¶ 106 (in response to request for “opening paragraphs” and four requests for “the next sentence,” providing snippets of text from first, second, 26th, 27th, eighth, and ninth paragraphs, in that order).
[31] See, e.g., George Getschow, The Best American Newspaper Narratives of 2012, Project Muse, https://muse.jhu.edu/pub/172/edited_volume/chapter/1142918 (last visited Feb. 11, 2024); Raphael Brion, In Which Guy Fieri Answers Pete Wells’ Many Questions, Eater (Nov. 14, 2012), https://www.eater.com/2012/11/14/6522571/in-which-guy-fieri-answers-pete-wells-many-questions; see UBS Auction Rate, 2010 WL 2541166, at *10-12, 15 (courts take judicial notice of articles when not used “for the[ir] truth”).
[32] See Compl. ¶ 121 (ChatGPT linking to “dnyuz.com”). The regurgitated text in paragraph 118 includes a dateline (“NEW YORK”) that does not appear on the Times’s website, but does appear on other third-party sites on which the article in question is available for free. See id. ¶ 118; compare Hurubie Meko, The Precarious, Terrifying Hours After a Woman Was Shoved Into a Train, N.Y. Times (May 25, 2023), https://www.nytimes.com/2023/05/25/nyregion/subway-attack-woman-shoved-manhattan.html?smid=url-share, with Hurubie Meko, The precarious, terrifying hours after a woman was shoved into a train, Seattle Times (May 27, 2023), https://www.seattletimes.com/nation-world/nation/the-precarious-terrifying-hours-after-a-woman-was-shoved-into-a-train/
About HackerNoon Legal PDF Series: We bring you the most important technical and insightful public domain court case filings.
This court case, retrieved on February 26, 2024, from fingfx.thomsonreuters.com, is part of the public domain. The court-created documents are works of the federal government and, under copyright law, are automatically placed in the public domain and may be shared without legal restriction.