
GPT4All-J: Repository Growth and the Implications of the LLaMA License


Abstract and 1. Introduction

2 The Original GPT4All Model

2.1 Data Collection and Curation

2.2 Model Training, 2.3 Model Access and 2.4 Model Evaluation

3 From a Model to an Ecosystem

3.1 GPT4All-J: Repository Growth and the implications of the LLaMA License

3.2 GPT4All-Snoozy: the Emergence of the GPT4All Ecosystem

3.3 The Current State of GPT4All

4 The Future of GPT4All

Limitations and References

3 From a Model to an Ecosystem

3.1 GPT4All-J: Repository Growth and the implications of the LLaMA License

The GPT4All repository grew rapidly after its release, gaining over 20,000 GitHub stars in just one week, as shown in Figure 2. This growth was supported by an in-person hackathon hosted in New York City three days after the model release, which attracted several hundred participants. As the Nomic Discord, the home of online discussion about GPT4All, ballooned to over 10,000 people, one thing became very clear: there was massive demand for a model that could be used commercially.


The LLaMA model that GPT4All was based on was licensed for research only, which severely limited the set of domains in which GPT4All could be applied. In response, the Nomic team repeated the training procedure of the original GPT4All model, this time starting from GPT-J (Wang and Komatsuzaki, 2021), a model that was already open source and licensed for commercial use. GPT4All-J also had an augmented training set, which contained multi-turn QA examples and creative writing such as poetry, rap, and short stories. The creative writing prompts were generated by filling in schemas such as "Write a [CREATIVE STORY TYPE] about [NOUN] in the style of [PERSON]." We again employed Atlas to curate the prompt-response pairs in this data set.
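The schema-filling step described above can be illustrated with a minimal Python sketch. It assumes that prompts are produced by substituting every combination of slot values into the quoted schema; the slot values, the names `SCHEMA` and `fill_schema`, and the exhaustive-combination strategy are hypothetical and chosen only for illustration. The paper does not say which values were used or how they were sampled.

```python
from itertools import product

# The schema quoted in the text, rewritten with named placeholders.
SCHEMA = "Write a {story_type} about {noun} in the style of {person}."

# Hypothetical slot values, for illustration only (not the authors' lists).
STORY_TYPES = ["poem", "rap", "short story"]
NOUNS = ["the ocean", "a lost robot"]
PERSONS = ["Emily Dickinson", "Dr. Seuss"]


def fill_schema(schema, story_types, nouns, persons):
    """Return one prompt for each combination of slot values."""
    return [
        schema.format(story_type=t, noun=n, person=p)
        for t, n, p in product(story_types, nouns, persons)
    ]


prompts = fill_schema(SCHEMA, STORY_TYPES, NOUNS, PERSONS)
for prompt in prompts[:3]:
    print(prompt)
```

With three story types, two nouns, and two persons, this yields twelve prompts. In a pipeline like the one described, each prompt would then be paired with a model response, and the resulting pairs would go through curation.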


Our evaluation methodology also evolved as the project grew. In particular, we began evaluating GPT4All models on the suite of seven reasoning tasks that was used to evaluate the Databricks Dolly model (Conover et al., 2023b), which was released on April 12, 2023. Unfortunately, GPT4All-J did not outperform other prominent open source models on this evaluation. As a result, we endeavoured to create a model that did.


This paper is available on arXiv under the CC BY 4.0 DEED license.

Authors:

(1) Yuvanesh Anand, Nomic AI;

(2) Zach Nussbaum, Nomic AI;

(3) Adam Treat, Nomic AI;

(4) Aaron Miller, Nomic AI;

(5) Richard Guo, Nomic AI;

(6) Ben Schmidt, Nomic AI;

(7) GPT4All Community, Planet Earth;

(8) Brandon Duderstadt, Nomic AI, with Shared Senior Authorship;

(9) Andriy Mulyar, Nomic AI, with Shared Senior Authorship.