A couple of months back, there was a declaration that open source generative artificial intelligence models will dominate the field. Pundits cite a leaked Google memo that states the search giant has lost its competitive advantage in the field of generative AI because of open source models.

The argument goes something like this:

Open source machine learning algorithms have exceeded the capabilities of proprietary algorithms.

When using open source algorithms to train models on open source data sets, the performance of the “foundational” models is quite good with respect to benchmarks.

Using techniques like “fine-tuning” (the process of combining your data with the open source data) to build a model obviates the need to use Big Tech’s proprietary data sets.

Therefore, proprietary models are dead.

Then Google I/O 2023 happened. Google Bard, a generative AI search engine built on its own proprietary dataset, got rave reviews. The most-cited feature is its ability to incorporate real-time data into its model.

Let’s take a look at why proprietary models will play a valuable role in the future with an analysis of the argument above:

Have open source machine learning algorithms exceeded the capabilities of proprietary algorithms?
Yes. Google’s internal memo discusses how the performance and innovation of algorithms by the open source community has eclipsed its own pace of development.

When using open source algorithms to train models on open source data sets, is the performance of “foundational” models good with respect to benchmarks?
Beware of benchmarks. If the goal of the model is only to understand English, then using an open source corpus of data is fine. But what if your model can benefit from “real-time” data, just like how users benefit from Bard’s real-time data search? Then the benchmark will need to be the ability to understand English and understand recent events in the world.

Do techniques like “fine-tuning” to build a model obviate the need to use Big Tech’s proprietary data sets?
Again, what do your users care about? Can your proprietary dataset bring ALL the real-time context you need?

So are proprietary foundational models really dead?
Not so fast …

The Cost of Generative AI Success

It turns out that getting access to real-time data to build models is expensive. Google spends billions of dollars to build infrastructure to index the web in real time to build their generative models, and you can bet it’s going to be proprietary.

Let’s take the example of two airline travel chatbots built on top of two different foundational models; one foundational model is open source and the other is proprietary with real-time data. Each travel chatbot is “fine-tuned” with a proprietary flight information data set to recommend which flights to take. In many cases, both chatbots will provide the same answer. However, if a large storm hits an airport, the chatbot built with proprietary real-time data will provide flight information that avoids flights that are affected by the storm. This is invaluable to users, and hence will be valuable to developers too.

The Future of Foundational AI Models

So does this mean that every generative AI use case needs a foundational model built from proprietary real-time data? No, but there are other reasons why a proprietary foundational model will be needed:

Proprietary first-party data sets

Consider this example: Google Bard leverages the entirety of YouTube to create its foundational model. If your generative AI use case can benefit from the vast amount of information and knowledge that is uploaded to YouTube, then you might want to use a foundational model from Google.

Personalization data sets

When a foundational model is trained with personalized data, the model (aka the neural network) will have aspects of personal information in it.
Inference with these models can be done in a way that doesn’t leak personal information, but if the entire model is exported, it is possible to extract personal information on particular users by looking at the parameters of the model. Despite the advances in federated learning, there isn’t a foolproof way to enable the model to be exported without jeopardizing privacy.

So what do future foundational models look like? Probably something like this:

Algorithms will be open source.

Data sets will be proprietary in some cases, due to the cost of maintaining a real-time data set and personalization, and open source in others.

Assuming this is the prevailing architecture, what are the secondary effects?

Enterprises looking to build generative AI will likely need to rely on foundational models from large companies that have the checkbook to maintain their own real-time data infrastructure, and on open source foundation models for other use cases.

The proprietary data sets that enterprises rely on will also increasingly be real time. Expect that data to reside in NoSQL real-time databases like Apache Cassandra, streamed into the feature stores using technologies like Apache Pulsar.

For practical purposes, model inference will likely happen at data centers owned by the foundational model providers such as AWS, Microsoft and Google. This means the hyperscalers will likely increase in importance in the age of AI. Model inference based on foundational open source models may be performed in customers’ data centers.

The secondary effects for DataStax (my employer) are significant too. As a data management provider, our investment in providing services in the cloud through DataStax Astra DB, which resides on the major clouds of AWS, Microsoft, and Google, is likely to grow as generative AI becomes more prevalent in the enterprise.

While we encourage and support the use of open source foundational models from companies like HuggingFace, we are also forming strong AI partnerships with the big three cloud providers. Most importantly, we are using the community contribution process to upstream features to Cassandra such as vector search to ensure that companies can create their own real data sets for real-time AI.

By Alan Ho, DataStax

Also published here.
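
The airline chatbot scenario described earlier can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Bard or any real chatbot works: the `Flight` type, the `recommend` function, the flight numbers, and the prices are all hypothetical, and the "real-time data" is reduced to a set of airports currently affected by a storm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flight:
    # Hypothetical record standing in for a proprietary flight data set.
    number: str
    origin: str
    destination: str
    price: float


def recommend(flights, disrupted_airports=frozenset()):
    """Return the cheapest flight that avoids every disrupted airport, or None."""
    candidates = [
        f for f in flights
        if f.origin not in disrupted_airports
        and f.destination not in disrupted_airports
    ]
    return min(candidates, key=lambda f: f.price, default=None)


# Made-up flights for illustration only.
flights = [
    Flight("HX100", "SFO", "JFK", 320.0),
    Flight("HX200", "SFO", "EWR", 350.0),
]

# Without real-time context, the cheapest flight wins.
static_pick = recommend(flights)

# With a real-time signal that a storm has hit JFK, that flight is skipped.
realtime_pick = recommend(flights, disrupted_airports={"JFK"})

print(static_pick.number, realtime_pick.number)
```

On a normal day both calls would return the same flight, which mirrors the point that the two chatbots usually agree. The answers diverge only when the real-time signal is present, and that signal is the part that is expensive to maintain.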

This is an opinion piece based on the author’s POV and does not necessarily reflect the views of HackerNoon.

Proprietary AI Models Are Dead -- or Are They?
