3.2 GPT4All-Snoozy: the Emergence of the GPT4All Ecosystem
GPT4All-Snoozy was developed using roughly the same procedure as the previous GPT4All models, but with a few key modifications. First, GPT4All-Snoozy used the LLaMA-13B base model due to its superior base metrics compared to GPT-J. Next, GPT4All-Snoozy incorporated Dolly's training data into its training mix. After data curation and deduplication with Atlas, this yielded a training set of 739,259 total prompt-response pairs. We dubbed the model that resulted from training on this improved dataset GPT4All-Snoozy. As shown in Figure 1, GPT4All-Snoozy had the best average score on our evaluation benchmark of any model in the ecosystem at the time of its release.
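The curation step can be approximated by an exact-match filter over normalized prompt-response pairs. The sketch below is illustrative only: the actual curation and deduplication was performed with Atlas, and the file name and field names (`prompt`, `response`) used here are assumptions, not the paper's schema.

```python
import hashlib
import json


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies hash identically.
    return " ".join(text.lower().split())


def dedupe_pairs(pairs):
    """Drop exact-duplicate (prompt, response) pairs after normalization.

    `pairs` is an iterable of dicts with 'prompt' and 'response' keys
    (hypothetical field names; the paper's pipeline used Atlas instead).
    """
    seen = set()
    kept = []
    for pair in pairs:
        key = hashlib.sha256(
            (normalize(pair["prompt"]) + "\x1f" + normalize(pair["response"])).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(pair)
    return kept


if __name__ == "__main__":
    # "train_mix.jsonl" is a hypothetical file holding the combined GPT4All and Dolly pairs.
    with open("train_mix.jsonl") as f:
        pairs = [json.loads(line) for line in f]
    deduped = dedupe_pairs(pairs)
    print(f"kept {len(deduped)} of {len(pairs)} pairs")
```

A hash-based exact-match pass like this only removes literal duplicates; Atlas additionally supports embedding-based inspection, which is why the curated set can differ from what a filter this simple would produce.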
Concurrently with the development of GPT4All, several organizations such as LMSys, Stability AI, BAIR, and Databricks built and deployed open source language models. We heard increasingly from the community that they wanted quantized versions of these models for local use. As we realized that organizations with ever more resources were developing open source language models, we decided to pivot our effort away from training increasingly capable models and towards providing easy access to the plethora of models being produced by the open source community. Practically, this meant spending our time compressing open source models for use on commodity hardware, providing stable and simple high-level model APIs, and supporting a GUI for no-code model experimentation.
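As an illustration of the kind of simple, high-level API this pivot produced, the snippet below uses the gpt4all Python bindings to load a quantized model and run generation entirely on local hardware. The specific model filename is an example and varies by release; treat it as an assumption rather than a fixed identifier.

```python
from gpt4all import GPT4All

# Download (on first use) and load a quantized model that runs on commodity CPU hardware.
# The model name below is an example; available model names differ between gpt4all releases.
model = GPT4All("gpt4all-13b-snoozy-q4_0.gguf")

# Generate a short completion locally, with no calls to an external API.
response = model.generate("Explain what a quantized language model is.", max_tokens=128)
print(response)
```

The same model file also backs the GUI application, so the bindings and the no-code chat interface share one local model format.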
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Yuvanesh Anand, Nomic AI, [email protected];
(2) Zach Nussbaum, Nomic AI, [email protected];
(3) Adam Treat, Nomic AI, [email protected];
(4) Aaron Miller, Nomic AI, [email protected];
(5) Richard Guo, Nomic AI, [email protected];
(6) Ben Schmidt, Nomic AI, [email protected];
(7) GPT4All Community, Planet Earth;
(8) Brandon Duderstadt, Nomic AI, [email protected] with Shared Senior Authorship;
(9) Andriy Mulyar, Nomic AI, [email protected] with Shared Senior Authorship.