A couple of months ago, pundits declared that open source generative artificial intelligence (AI) models would dominate the field. They cited a leaked Google memo stating that the search giant has lost its competitive advantage in generative AI because of open source models.
The argument goes something like this:
- Open source machine learning algorithms have exceeded the capabilities of proprietary algorithms.
- When open source algorithms are used to train models on open source data sets, the performance of the resulting “foundational” models is quite good with respect to benchmarks.
- Using techniques like “fine-tuning” (the process of further training a pretrained model on your own data) to build a model obviates the need to use Big Tech’s proprietary data sets.
- Therefore, proprietary models are dead.
Then Google I/O 2023 happened. Google Bard, a generative AI search engine built on Google’s own proprietary data set, got rave reviews. Its most-cited feature is the ability to incorporate real-time data into its model.
Let’s take a look at why proprietary models will play a valuable role in the future with an analysis of the argument above:
- Have open source machine learning algorithms exceeded the capabilities of proprietary algorithms? Yes. Google’s internal memo discusses how the open source community’s pace of algorithmic innovation has eclipsed Google’s own pace of development.
- When open source algorithms are used to train models on open source data sets, is the performance of the “foundational” models good with respect to benchmarks? Beware of benchmarks. If the goal of the model is only to understand English, then using an open source corpus of data is fine. But what if your model can benefit from “real-time” data, just as users benefit from Bard’s real-time data search? Then the benchmark needs to measure both the ability to understand English and the ability to understand recent events in the world.
- Do techniques like “fine-tuning” obviate the need to use Big Tech’s proprietary data sets? Again, what do your users care about? Can your own proprietary data set bring ALL the real-time context you need?
- So are proprietary foundational models really dead? Not so fast …
The Cost of Generative AI Success
It turns out that getting access to real-time data to build models is expensive. Google spends billions of dollars to build infrastructure that indexes the web in real time to build its generative models, and you can bet it’s going to be proprietary.
Let’s take the example of two airline travel chatbots built on top of two different foundational models: one model is open source, and the other is proprietary and has access to real-time data. Each chatbot is “fine-tuned” with the same proprietary flight information data set to recommend which flights to take. In many cases, both chatbots will provide the same answer. However, if a large storm hits an airport, only the chatbot built on proprietary real-time data will steer users away from the flights affected by the storm. This is invaluable to users, and hence will be valuable to developers too.
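To make the difference concrete, here is a minimal Python sketch of the two chatbots' behavior. It does not model how a foundational model works internally; it only illustrates the effect of having real-time context. The `Flight` type, the `recommend` function, the flight numbers, and the routes are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Flight:
    number: str            # hypothetical flight number
    origin: str
    destination: str
    via: Optional[str]     # connecting airport, or None for a direct flight
    duration_hours: float


def recommend(flights, origin, destination, disrupted_airports=frozenset()):
    """Return the shortest matching flight that avoids disrupted airports.

    With an empty `disrupted_airports` set, this behaves like the chatbot
    that has no real-time context.
    """
    blocked = set(disrupted_airports)
    candidates = [
        f for f in flights
        if f.origin == origin
        and f.destination == destination
        and not ({f.origin, f.destination, f.via} & blocked)
    ]
    return min(candidates, key=lambda f: f.duration_hours, default=None)


# Hypothetical flight data that both chatbots were fine-tuned on.
FLIGHTS = [
    Flight("XX100", "SFO", "JFK", "ORD", 7.5),
    Flight("XX200", "SFO", "JFK", "DFW", 8.5),
]

# Chatbot without real-time data: picks the shortest itinerary.
static_choice = recommend(FLIGHTS, "SFO", "JFK")

# Chatbot with real-time data: knows a storm has hit ORD and routes around it.
realtime_choice = recommend(FLIGHTS, "SFO", "JFK", disrupted_airports={"ORD"})

print(static_choice.number, realtime_choice.number)
```

On a normal day the two calls return the same flight; they diverge only when the real-time signal (here, the set of disrupted airports) is non-empty, which is the case the storm example describes.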
The Future of Foundational AI Models
So does this mean that every generative AI use case needs a foundational model built from proprietary real-time data? No, but there are other reasons why a proprietary foundational model will be needed:
- Proprietary first-party data sets. Consider this example: Google Bard leverages the entirety of YouTube to create its foundational model. If your generative AI use case can benefit from the vast amount of information and knowledge that is uploaded to YouTube, then you might want to use a foundational model from Google.
- Personalization data sets. When a foundational model is trained with personalized data, the model (aka the neural network) will have aspects of personal information in it. Using these models to do inference can be done in a way that doesn’t leak personal information, but if the entire model is exported, it is possible to extract personal information on particular users by looking at the parameters of the model. Despite the advances in federated learning, there isn’t a foolproof way to enable the model to be exported without jeopardizing privacy.
So what do future foundational models look like? Probably something like this:
- Algorithms will be open source.
- Data sets will be proprietary in some cases, due to the cost of maintaining a real-time data set and personalization, and open source in others.
Assuming this is the prevailing architecture, what are the secondary effects?
- Enterprises looking to build generative AI will likely need to rely on foundational models from large companies that have the checkbook to maintain their own real-time data infrastructure, and on open source foundational models for other use cases.
- The proprietary data sets that enterprises rely on will increasingly be real time too. Expect that data to reside in NoSQL real-time databases like Apache Cassandra, streamed into feature stores using technologies like Apache Pulsar.
- For practical purposes, model inference will likely happen at data centers owned by the foundational model providers such as AWS, Microsoft and Google. This means the hyperscalers will likely increase in importance in the age of AI. Model inference based on foundational open source models may be performed in customers’ data centers.
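The real-time data path described in the second point above (events streamed into a feature store) can be sketched in a few lines of Python. This is an illustration only, under loud assumptions: a standard-library `queue.Queue` stands in for a streaming topic and a plain `dict` stands in for the feature store. No real Apache Pulsar or Apache Cassandra client calls are used, and the event fields (`entity_id`, `feature`, `value`, `ts`) are hypothetical.

```python
import queue


def consume(topic: queue.Queue, feature_store: dict) -> int:
    """Drain pending events, keeping the newest value per (entity, feature).

    Returns the number of events that were applied to the store.
    """
    applied = 0
    while True:
        try:
            event = topic.get_nowait()
        except queue.Empty:
            return applied
        key = (event["entity_id"], event["feature"])
        current = feature_store.get(key)
        # Skip late-arriving events that are older than the stored value.
        if current is None or event["ts"] >= current["ts"]:
            feature_store[key] = {"value": event["value"], "ts": event["ts"]}
            applied += 1


# Stand-in for a streaming topic (assumption: not a real Pulsar topic).
topic = queue.Queue()
topic.put({"entity_id": "ORD", "feature": "weather_delay_minutes", "value": 15, "ts": 100})
topic.put({"entity_id": "ORD", "feature": "weather_delay_minutes", "value": 90, "ts": 200})
# An out-of-order event that arrives late and should not overwrite newer data.
topic.put({"entity_id": "ORD", "feature": "weather_delay_minutes", "value": 30, "ts": 150})

# Stand-in for a feature store table (assumption: not a real Cassandra table).
store = {}
applied = consume(topic, store)
print(applied, store[("ORD", "weather_delay_minutes")])
```

Keeping only the newest value per key is one simple design choice for serving fresh features at inference time; a production pipeline would also have to handle delivery guarantees, schemas, and retention, which this sketch leaves out.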
The secondary effects for DataStax (my employer) are significant too. As a data management provider, our investment in providing services in the cloud through DataStax Astra DB, which resides on the major clouds of AWS, Microsoft, and Google, is likely to grow as generative AI becomes more prevalent in the enterprise.
While we encourage and support the use of open source foundational models from companies like Hugging Face, we are also forming strong AI partnerships with the big three cloud providers. Most importantly, we are using the community contribution process to upstream features such as vector search to Cassandra, to ensure that companies can create their own real-time data sets for real-time AI.
By Alan Ho, DataStax