TLDR 1: Dataset is no longer a hard requirement
The idea that you need tons of data to make text-based AI is now partially outdated.
Even an extremely small dataset can produce valuable results (1 sentence ~ 1GB).
Something which nearly all startups are either sitting on or are able to produce by hand.
TLDR 2: It has very broad use cases
You can think of GPT 3.5 as a remote & paid “college intern on demand”
You can never trust interns to be 100% right, but they can provide useful value with proper supervision, guidance, and planning.
A value that can now grow and scale seamlessly(unlike real human interns)
TLDR 3: Cost and Limitations
Classic unit economics model for SaaS and online services will be a challenge due to the extremely high running cost of most LLM
OpenAI also has a (temporary) effective monopoly, for the infrastructure in this space, and can be both a partner and unintentional future competitor at the same time
If you were on holiday, and have no idea what ChatGPT or LLM (large language models) are - you can find the briefing about the technology here:
So what do I mean in longer form?
Grab a cup of coffee and settle in for a bit…
Note: This article is discussing only data requirements for text-based models and does not apply to image models in many cases (e.g., computer vision).
In the past, building an AI model typically involved gathering a large enough dataset for a specific use case. Different AI model training techniques and tools existed to optimize the training process or its computational costs.
However, the size and quality of the dataset were still one of the predominant factors in training an AI.
This would result in a graph something like the following:
All numbers are large approximations and will change drastically depending on the use case, AI model, and data quality. The scales on the axes are intentionally vague and inaccurate because they are subjective. For technical numbers, read the papers in the citations for specific use cases.
The general idea remains, however, where small-scale datasets produce effectively random data before slowly becoming better and finally being somewhat useful at approximately the ~500GB part.
This led to the dataset arms race, occurring in nearly every field of specialization related to AI (especially anything computer vision-related).
Which played out across multiple startups from 2009 onwards until today, with a few notable startups being acquired in part for their valuable datasets (and usually talent).
Because the arms race was constantly escalating, it became increasingly difficult for new startups to compete with their AI models (small datasets) against incumbents (larger datasets).
For example, at uilicious.com (a low-code UI test automation startup), we used our limited dataset to train an AI model (called TAMI v0.1). We found that it was spewing garbage half the time, leading us to set the model aside and grow the company without AI while building up our dataset.
In a very overgeneralized and non-technical fashion, AI trained in this way will be referred to as "Specialised Models" from here onwards.
In the quest for a truly universal or general-purpose AI, especially in the field of human speech interactions (as humans make for the most random of all variables), efforts arose to train a new type of text-based AI for extremely large datasets of public data (think Wikipedia, Quora, StackOverflow, and half of the internet text).
Because these new models pushed the boundaries of dataset sizes and model size (think brain size), the methods used to build them diverged from specialized models (which tend to have more emphasis on accuracy and efficiency).
Text AI models trained in this way are now collectively called "Large Language Models" (LLMs).
The downside of this approach was huge, preventing its early application in many use cases:
LLM was dominated by Google (who already had the data and had the motive to do so for their AI assistant) and OpenAI initially. Later, Microsoft, Amazon, Apple, Salesforce, Meta, and a few others joined the race. However, due to the sheer size and cost to train such a model, it's generally reserved for major tech companies with deep pockets.
Though the first few generations of LLMs may have produced disappointing results, as they lost out to nearly every specialized model in every task, things changed as years passed and they scaled up in both dataset size and model size.
Their benefits became more visible:
This made drastic changes to the curves:
It also sparked a new wave of AI-backed business models in 2019. Startups could now train new AI models on top of existing LLMs with datasets within their reach, from chatbots to presentation slide generators, code copilots, copywriting, and even D&D gamemasters.
No longer were these AI models in the exclusive domain of large tech companies. For a small fee for training and running the AI on their infrastructure, OpenAI and Google started opening up their models for others to build upon as a means to profit from the LLMs.
This was beneficial to startups, too, as they no longer needed millions of dollars in R&D investment required to build these large models internally, allowing them to reach the market faster with their proofs-of-concept.
Data was still, in a way, king. It may have moved the curve to be more accessible, but it still required a team to build up large datasets and tune the model.
As a result, while many of the 2019 startups found it significantly easier to build their prototypes, many found it difficult to cross the "useful" threshold, as they needed to scale up their dataset with diminishing returns.
This is consistent with early internal trials of uilicious.com TAMI v0.2 AI models - while GPT was a huge improvement over garbage, it was still between interesting and "meh".
But that's when things start to really change in 2022....
The release of GPT3.5 (or chatGPT as it's known on the internet) last month (December 2022) had huge implications for the way AI models are trained and fine-tuned.
It has unlocked the possibility of creating usable AI with extremely small datasets - something that most startups have access to or can create by hand. This is a fundamental shift in how we think about AI training.
At uilicious.com, we were stunned when we discovered that a small sample dataset smaller than 1GB, of our larger ~100GB dataset, when converted and optimized with the new training techniques, had exceeded the "useful threshold" - the point at which the AI could be utilized by our users and outperform everything we had previously.
While subsequent experiments with larger datasets showed diminishing returns. The key takeaway was how little data was needed to "make a useful product". It took us less time to build a proof-of-concept AI fine-tuned for our proprietary use case than it took to write this article.
Using GPT3.5 as a basic building block, it is now possible to build useful AI applications for various use cases without needing a specialized team or individual.
Depending on the use case, the dataset can be as small as a single sentence or paragraph, or up to 100MB or 1GB - a size that is achievable for many startups.
Alternatively, if you can get chatGPT to act in a way that you find useful and of value to your startup, you can build it as a dedicated AI service.
The drastic reduction in the required dataset made it possible for us to build a "useful" AI model with only a small fraction of our full dataset - something that had previously been "useless" or "impossible" at our scale.
In many ways, data is no longer king, finding and building useful applications are the true kingmakers with this new AI. Where ideas can be built and tested in weeks (not years).
The screenshot is a demo of our AI in action, generating full UI testing scripts, from a prompt, as an example integration. (one of many possibilities)
Something which is now doable in a week, with the right team, and the right data.
Do reach out to us if you are interested in the future of AI and UI testing.
GPT-3.5 and all other large language models have limits on their accuracy and reliability. This is partly due to their nature of dreaming out answers.
While in theory (yet to be proven), given a large enough specialized dataset, its accuracy can be drastically improved for specialized use cases. However, if you have such a large dataset, chances are you could already build "specialized models".
Ultimately still, this limit on accuracy and reliability is only a critical issue in sensitive industries (e.g., medicine). For most cases, it is merely a distraction in finding use cases. Once the AI crosses the “good enough” threshold (which it does).
A healthier and more realistic approach to the problem is to imagine GPT-3.5 as a remote and paid “smart college intern on demand”
Because realistically, the AI has all the same limitations in such use cases, in addition to being remote and online:
The only real benefit the AI model has over real-life interns is:
The downside the AI has over humans is that they cannot get you coffee in person.
Once you frame it in these terms, it becomes significantly easier to figure out how to integrate AI into existing business processes or products.
On a slightly more technical level:
This is the biggest weakness of this newer approach in AI, which is based on bigger and better models. Unfortunately, there is no free lunch.
While it is cheaper and easier to train for specific tasks in terms of dataset size, it is significantly more expensive to run compared to more traditional AI models.
It is not cheap; the cost per prompt and answer ranges anywhere between one cent and fifty cents, depending on how much data was needed to train or use in the process. This is drastically higher than a typical API server, which can handle a million requests per dollar.
To put it in simpler terms, it costs more in server hardware to handle one AI query for one user within a given second than it does to serve a million user requests for a typical medium-sized Shopify website.
This is not due to OpenAI or Azure solely trying to make a profit; it is down to the pure server hardware infrastructure required to run such large models.
As a result, as powerful as it is, incorporating such a Large Language Model AI has a hefty price tag and may not be applicable for all use cases due to this limitation alone.
The end result is that, while many use cases can benefit from the use of such an AI, not all use cases can afford it; and this should be an important consideration for any implementation.
For a support service desk, a typical support staff can handle ten customers per hour, with each customer having an average of fifteen back-and-forth prompts. If it is five cents per prompt - this adds up to $7.50 s an hour if an AI was used to mimic a single support staff.
This is not only cheaper than the typical US call center staff median salary of $15 per hour, but it is also much more flexible (no staffing overhead, can scale up and down instantly).
Likewise, it is also possible to take the same "intern" approach, where this support AI serves only as L1 support, allowing the humans to handle the more complicated cases. In this scenario, it makes sense when done and scaled appropriately on a per-prompt (or per-hour) basis.
The average office worker replies to approximately forty emails per workday or approximately 880 emails per month. Even at five cents per email, that would be $44 per month on average per user just to handle the email replies.
What makes it worse is that it's reasonable to expect that, with such a service, an office worker would be able to reply to more emails on average. It would not be unreasonable for the average to double to two thousand emails, or one hundred dollars a month, in pure AI costs alone.
In this case, if the SaaS startup were to do a simple, let's say ten-dollar-per-month pricing, they can get into heavy potential losses over time.
This pricing cost and business model runs counter to the typical, per-user pricing model that is common in SaaS. This is why it is common for such integrations to have a "credits" system as a means to cap the usage, and a means of billing, of such an AI.
It is expected that over time, with better fine-tuning, competition, and cost optimization, the price per prompt can go down. Another notable method is to use the original more expensive AI first at launch while collecting more data, which is then used to train a more specialized and cheaper model. However, all of these methods go into deep technical details, which may be unique for each use case, and generally require significant time and effort.
And even then, while it can be a drastic ten times in savings, it is fundamentally more expensive than traditional SaaS API services.
While there are existing open-source Large Language Models, in very frank terms, they are either comparable to GPT2 or somewhere in between that and GPT3.5.
For some simple cases, once they start building up a reasonable dataset, these smaller (and cheaper) models may be useful to migrate into as a means of cost-cutting.
However, for other complex cases, such a move may not be possible due to the complexity of their AI, giving OpenAI an effective monopoly with no incentive for lower pricing.
It is believed, however, that over the course of the next one or two years, the open-source community will catch up and, in doing so, perhaps allow prices to improve due to better alternative infrastructure providers.
However, because this is an uncertain future, it is worth highlighting.
While not intentionally, it is critical that startups within this space build feature sets that are defensible beyond their prompt-to-text bots.
For example, there were several smaller startups that built GPT3-based bots or SaaS around specific prompt-to-text use cases, like name generators or idea generators with a simple interface.
Literally overnight, with the launch of chatGPT, these small single-use-case text-to-text SaaS were made redundant, as everyday folks can now get the same functionality via chatGPT for free.
While it may not have been OpenAI's intention to compete with the very partners that build on them, it may be its inevitable nature to do so, as they keep improving their model and chatGPT.
As such, to ensure this does not repeat itself, it is critical for any business model around this technology to figure out what additional value they provide beyond just prompt-to-text, perhaps better user experience or integrations with existing toolings, etc.
A reminder of the intern model, do not expect to use this for curing cancer tomorrow. So please do not go stuffing AI into every product and startup on Earth when it makes no benefit to the end user to do so.
~ Until next time 🖖 live long and prosper
Eugene Cheah: CTO of uilicious.com
This article was originally posted on the author’s substack
swyx also does an excellent job in trying to consolidate various information, in this rapidly chaotic and growing space, which is very worth reading (recommended!!!)
Currently, BLOOM is the main opensource contender for GPT3 (not 3.5): https://towardsdatascience.com/run-bloom-the-largest-open-access-ai-model-on-your-desktop-computer-f48e1e2a9a32
In the AI / ML industry, the ability of an LLM to quickly learn new concepts and apply them is quantified and measured by tests called "Zero-Shot", "One-Shot" and "Few-Shot" learnings.
Generally the better the AI does in these tests, the less data you need to train it for your use case.
In my opinion: This makes sense in hindsight - who would have thought that a Neural Network modeled after humans, would act like humans? And benefit from the T-Shape education model. Where a large breadth of generalized knowledge, helps improve one’s ability to learn specialized knowledge in one are of expertise. (this statement is not backed by any data)