2.1 Data Collection and Curation
2.2 Model Training
2.3 Model Access
2.4 Model Evaluation
3 From a Model to an Ecosystem
3.1 GPT4All-J: Repository Growth and the Implications of the LLaMA License
3.2 GPT4All-Snoozy: the Emergence of the GPT4All Ecosystem
3.3 The Current State of GPT4All
By enabling access to large language models, the GPT4All project also inherits many of the ethical concerns associated with generative models. Principal among these is the concern that unfiltered language models like GPT4All enable malicious users to generate content that could be harmful and dangerous (e.g., instructions on building bioweapons). While we recognize this risk, we also acknowledge the risk of concentrating this technology in the hands of a limited number of increasingly secretive research groups. We believe that the benefits of widely accessible language model technology significantly outweigh the risk of misuse, and hence we prefer to make the technology as widely available as possible.
Finally, we recognize the challenge of assigning credit for large-scale open source initiatives. We make a first attempt at fair credit assignment by explicitly including the GPT4All open source developers as authors on this work, but recognize that this is insufficient to fully characterize everyone involved in the GPT4All effort. Furthermore, we acknowledge the difficulty of citing open source works that do not necessarily have standardized citations, and do our best in this paper to provide URLs to projects whenever possible. We encourage further research in the area of open source credit assignment, and hope to be able to support some of this research ourselves in the future.
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Yuvanesh Anand, Nomic AI, [email protected];
(2) Zach Nussbaum, Nomic AI, [email protected];
(3) Adam Treat, Nomic AI, [email protected];
(4) Aaron Miller, Nomic AI, [email protected];
(5) Richard Guo, Nomic AI, [email protected];
(6) Ben Schmidt, Nomic AI, [email protected];
(7) GPT4All Community, Planet Earth;
(8) Brandon Duderstadt, Nomic AI, [email protected], with Shared Senior Authorship;
(9) Andriy Mulyar, Nomic AI, [email protected], with Shared Senior Authorship.