Imagine a future where AI isn't locked away in corporate vaults, but built in the open, brick by brick, by a global community of innovators. Where collaboration, not competition, fuels advancements, and ethical considerations hold equal weight with raw performance. This isn't science fiction, it's the open-source revolution brewing in the heart of AI development. But Big Tech has its own agenda, masking restricted models as open source while attempting to reap the benefits of a truly open community.
Let's peel back the layers of code and unveil the truth behind these efforts. This exploration of the future of open-source AI will dissect the “pretenders” and champion the “real ones” in AI development to uncover the innovation engine that is open-source software humming beneath it all. The bottom line is that open-source AI will beget an open-source data stack.
The Need
A recent article by Matteo Wong in The Atlantic, ‘There Was Never Such a Thing as ‘Open’ AI,’ describes a growing trend in academia and the software community toward truly open-source AI: “The idea is to create relatively transparent models that the public can more easily and cheaply use, study, and reproduce, attempting to democratize a highly concentrated technology that may have the potential to transform work, police, leisure and even religion.” That same Atlantic article suggests that Big Tech companies like Meta are trying to fill this need in the market by ‘open-washing’ their products: assuming the qualities and positive reputation of the open-source community without truly open-sourcing their product. But there is no substitute for the real thing, because true open-source software drives innovation and collaboration, two qualities that are desperately needed to move forward with AI responsibly.
The Pretenders
LLaMA 2 is a large language model created by Meta that is free to use for both research and commercial purposes, leading some to suggest that LLaMA 2 is open source. However, Meta has placed some severe restrictions on the use of its model. For example, LLaMA 2 cannot be used to improve any other large language model. This position goes against the traditional private-collective innovation model of open software, which promotes the free and open revelation of innovation for the benefit of everyone in the software community.
Meta further crippled the use of its model by requiring a separate license from Meta for products that have more than 700 million monthly active users, and by not disclosing what data the model was trained on or the code used to build it. By withholding this information, Meta opens itself to questions of inherent bias and accidental discrimination. A model trained on discriminatory data will serve up discriminatory responses. Unless the software community at large can view the code used to build the model, to see whether any safeguards have been built in, or the data used to train it, we are left in the dark on these moral questions. At a time when published research on AI is more concerned with performance than with justice and respect, this obfuscation is particularly disturbing.
The Real Ones
Mistral AI has gained recognition for its open-source large language models, notably Mistral 7B and Mixtral 8x7B. The company strives to ensure broad accessibility to its AI models, encouraging review, modification, and reuse by the open software community.
vLLM is an open-source library specifically designed to speed up and optimize inference and serving for large language models (LLMs). It is a powerful tool that can significantly improve the throughput and usability of LLMs, which makes it a valuable asset for developers working on a variety of AI applications, from chatbots and virtual assistants to content creation and code generation. So much so that Mistral recommends using vLLM as the inference server for the 7B and 8x7B models.
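For readers who want to see what using vLLM as an inference server can look like, here is a minimal sketch in Python using only the standard library. It assumes a vLLM server exposing its OpenAI-compatible API is already running at http://localhost:8000 and serving mistralai/Mistral-7B-v0.1. The URL, port, model name, and the helper names build_completion_request and complete are illustrative assumptions, not something prescribed by Mistral or vLLM.

```python
import json
import urllib.request

# Assumptions for illustration: a vLLM OpenAI-compatible server is already
# running locally on port 8000 and is serving this model.
BASE_URL = "http://localhost:8000/v1"
MODEL = "mistralai/Mistral-7B-v0.1"


def build_completion_request(prompt, max_tokens=64, temperature=0.0):
    """Build (but do not send) a POST request for the /completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a server running, calling complete("Open-source AI matters because") would return the model's continuation as a string. Because the API follows the OpenAI request shape, the same client code should work against other servers that implement that interface.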
EleutherAI has grown from a Discord server for discussing GPT-3 into a leading non-profit AI research lab. The group is known for training and releasing open-source large language models, for promoting open science norms in Natural Language Processing, and for research related to AI alignment and interpretability. Their LM Evaluation Harness project is probably the leading open-source evaluation tool for language models.
Phi-2 is Microsoft's LLM that punches above its weight. Trained on a blend of synthetic texts and filtered websites, this small but powerful model excels at tasks like question-answering, summarizing, and translation. What truly sets Phi-2 apart is its focus on reasoning and language understanding, leading to impressive performance even without advanced alignment techniques.
Many competent open-source embedding models are strengthening the overall open-source generative AI space. The current state of the art for open source includes UAE-Large-V1 and multilingual-e5-large.
There are many more in this ever-growing field. This limited list is just a start.
Open Source Drives Innovation
Embracing a philosophy of extreme open innovation, companies that truly participate in open-source software development challenge traditional notions of competitive advantage by acknowledging that not all good code or great ideas reside within their organization. This shift supports the argument that shared innovations within the open-source ecosystem lead to faster market growth, giving even smaller software firms with more limited R&D funds the opportunity to benefit from the R&D spillovers present in open-source software. In contrast to traditional outsourcing, open innovation enhances internal resources by leveraging the collective intelligence of the community without diminishing internal R&D efforts. This means that open-source software companies don’t have to sacrifice their budgets to pursue thought leadership and code outside their organization.
Additionally, open-source software companies strategically drive innovation by releasing code early and often, recognizing the cumulative nature of the innovation process in the software community. All of which is to say something many already recognize: open-source software drives innovation.
Open Source Fosters Collaboration
Through networking in the open-source software community, entrepreneurs are able to fulfill both short-term and long-term goals. Short-term profit goals build companies and long-term profit goals sustain them. At the same time, this networking effort perpetuates the network itself, growing it for the next entrepreneur. Open-source platforms provide access to the source code, enabling developers to create upgrades, plug-ins and other pieces of software and use them according to their requirements. This kind of collaboration boomed with the wide adoption of Kubernetes across the software community. Now more than ever, modern technologies work together with very little friction and can be deployed together in minutes almost anywhere.
Big Tech companies acknowledge this deep collaboration inherent to the open-source community when they freely release frameworks, libraries, and languages they created to maintain and develop internal tools. Doing so deepens the pool of developers capable of working on their products and starts to set the standard for how similar technologies should operate. That same Atlantic article quotes Meta founder Mark Zuckerberg as saying it has “been very valuable for us to provide that because now all of the best developers across the industry are using tools that we’re also using internally.”
Open Source Begets Open Source
These factors explain why we so often see synergies between open-source companies. Open-source AI and ML companies will naturally develop solutions with other open-source products, from foundational products like object storage all the way up the stack to visualization tools. When one open-source company steps forward, we all do. This cohesive and blended approach is probably our best bet for developing AI that takes a human-centered approach. The market need for open-source AI, combined with the innovation and collaboration inherent in open-source software, will drive the AI data stack toward open source.
Please join and contribute to this conversation and our community by emailing us at [email protected] or sending us a message on our Slack channel.