The Times v. Microsoft/OpenAI: Models Exhibit a Behavior Called “Memorization.” (9)


by Legal PDF, January 2nd, 2024

Too Long; Didn't Read

An LLM works by predicting words that are likely to follow a given string of text based on the potentially billions of examples used to train it.

The New York Times Company v. Microsoft Corporation court filing, dated December 27, 2023, is part of HackerNoon’s Legal PDF Series. You can jump to any part in this filing here. This is part 9 of 27.

IV. FACTUAL ALLEGATIONS

B. Defendants’ GenAI Products

2. How GenAI Models Work


75. At the heart of Defendants’ GenAI products is a computer program called a “large language model,” or “LLM.” The different versions of GPT are examples of LLMs. An LLM works by predicting words that are likely to follow a given string of text based on the potentially billions of examples used to train it.


76. Appending the output of an LLM to its input and feeding it back into the model produces sentences and paragraphs word by word. This is how ChatGPT and Bing Chat generate responses to user queries, or “prompts.”
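The next-word prediction and the append-and-feed-back loop described in paragraphs 75–76 can be sketched as follows. This is a minimal illustration, not Defendants’ actual software: the `predict_next_word` function is a hypothetical stand-in for a real LLM, faked here with a small lookup table.

```python
# Illustrative sketch of autoregressive generation: predict a likely
# next word, append it to the input, and feed the result back in.
# `predict_next_word` is a hypothetical stand-in for a real LLM.

def predict_next_word(words):
    # A real LLM scores every word in its vocabulary based on its
    # training data; this toy table maps a context to one likely word.
    table = {
        "the cat sat on": "the",
        "the cat sat on the": "mat",
    }
    return table.get(" ".join(words), "<end>")

def generate(prompt, max_words=10):
    words = prompt.split()
    for _ in range(max_words):
        next_word = predict_next_word(words)
        if next_word == "<end>":
            break
        words.append(next_word)  # append the output to the input and loop
    return " ".join(words)

print(generate("the cat sat on"))  # "the cat sat on the mat"
```

Each pass through the loop produces one more word, which is how a chatbot builds up sentences and paragraphs from a prompt.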


77. LLMs encode the information from the training corpus that they use to make these predictions as numbers called “parameters.” There are approximately 1.76 trillion parameters in the GPT-4 LLM.


78. The process of setting the values for an LLM’s parameters is called “training.” It involves storing encoded copies of the training works in computer memory, repeatedly passing them through the model with words masked out, and adjusting the parameters to minimize the difference between the masked-out words and the words that the model predicts to fill them in.
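The training procedure in paragraph 78 — predict a masked-out target, measure the error, adjust the parameters to shrink it, repeat — can be illustrated with a deliberately tiny model. This is a sketch under simplifying assumptions: a real LLM has billions of parameters and a neural network, while this toy has a single parameter `w`.

```python
# Toy illustration of the training loop described above: repeatedly
# predict a held-out target and nudge the parameters to reduce the
# prediction error. A real LLM does this at vastly larger scale.

def train(examples, lr=0.01, epochs=200):
    w = 0.0  # single parameter, initialized arbitrarily
    for _ in range(epochs):
        for x, target in examples:   # `target` plays the role of the masked-out word
            pred = w * x
            error = pred - target
            w -= lr * error * x      # gradient step that shrinks the squared error
    return w

w = train([(1.0, 2.0), (2.0, 4.0)])
# w converges toward 2.0, the value that best reproduces the training data
```

The key point mirrored here is that after training, the parameter values themselves are what encode the training examples.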


79. After being trained on a general corpus, models may be further subject to “finetuning” by, for example, performing additional rounds of training using specific types of works to better mimic their content or style, or providing them with human feedback to reinforce desired or suppress undesired behaviors.


80. Models trained in this way are known to exhibit a behavior called “memorization.”[10] That is, given the right prompt, they will repeat large portions of materials they were trained on. This phenomenon shows that LLM parameters encode retrievable copies of many of those training works.
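The memorization behavior alleged in paragraph 80 can be sketched in the extreme case. This is a hypothetical illustration only: here the “parameters” are literally a stored string, whereas a real LLM encodes training text diffusely across billions of weights — the point is that, given the right prompt, the training text comes back verbatim.

```python
# Toy illustration of "memorization": given a prompt that matches the
# start of a memorized passage, the model continues it word for word.
# The stored string is a stand-in for what an LLM's parameters encode.

TRAINING_TEXT = "It was the best of times, it was the worst of times."

def complete(prompt):
    # If the prompt is a prefix of a memorized passage, emit the rest
    # of that passage, as a memorizing model would.
    if TRAINING_TEXT.startswith(prompt):
        return TRAINING_TEXT[len(prompt):]
    return "<no memorized continuation>"

print(complete("It was the best of times,"))
# continues with " it was the worst of times."
```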


81. Once trained, LLMs may be provided with information specific to a use case or subject matter in order to “ground” their outputs. For example, an LLM may be asked to generate a text output based on specific external data, such as a document, provided as context. Using this method, Defendants’ synthetic search applications: (1) receive an input, such as a question; (2) retrieve relevant documents related to the input prior to generating a response; (3) combine the original input with the retrieved documents in order to provide context; and (4) provide the combined data to an LLM, which generates a natural-language response.[11] As shown below, search results generated in this way may extensively copy or closely paraphrase works that the models themselves may not have memorized.
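The four-step grounding pipeline enumerated in paragraph 81 can be sketched as below. Everything here is a hypothetical placeholder — the document list, the keyword-overlap `retrieve` function, and the stubbed `call_llm` — standing in for a real search index and LLM API.

```python
# Sketch of the four-step grounded ("retrieval-augmented") pipeline:
# (1) receive an input, (2) retrieve relevant documents, (3) combine
# input and documents into one prompt, (4) have an LLM generate from it.

DOCUMENTS = [
    "Article A: The city council voted to expand the park.",
    "Article B: The new bridge opened to traffic on Monday.",
]

STOPWORDS = {"the", "a", "to", "on", "did", "when", "what", "new"}

def retrieve(query):
    # Step 2: retrieve documents sharing content words with the input.
    q_words = set(query.lower().rstrip("?").split()) - STOPWORDS
    return [d for d in DOCUMENTS if q_words & set(d.lower().split())]

def call_llm(prompt):
    # Placeholder for step 4; a real LLM would generate a
    # natural-language answer conditioned on the combined prompt.
    return f"[response grounded in {prompt.count('Article')} document(s)]"

def grounded_answer(question):            # step 1: the input
    docs = retrieve(question)             # step 2: retrieval
    prompt = "\n".join(docs) + "\nQuestion: " + question  # step 3: combine
    return call_llm(prompt)               # step 4: generate

print(grounded_answer("When did the bridge open?"))
```

Because the retrieved documents are placed directly into the prompt, the generated response can reproduce or closely paraphrase their text even when the model never memorized them.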


Continue Reading Here.


[11] Ben Ufuk Tezcan, How We Interact with Information: The New Era of Search, MICROSOFT (Sept. 19, 2023), https://azure.microsoft.com/en-us/blog/how-we-interact-with-information-the-new-era-of-search/.



About HackerNoon Legal PDF Series: We bring you the most important technical and insightful public domain court case filings.


This court case, 1:23-cv-11195, retrieved on December 29, 2023, from nycto-assets.nytimes.com, is part of the public domain. The court-created documents are works of the federal government and, under copyright law, are automatically placed in the public domain and may be shared without legal restriction.