
Building Production-Ready Generative AI: Five Things Not to Do

by Miranda Hartley, November 2nd, 2024


Introduction

Building generative AI requires working with a variety of developing machine learning and AI technologies. It also requires buckets of patience. These two requirements are not unrelated.


Through working alongside extremely talented machine learning experts from different firms building generative AI applications, it’s clear that there are a number of ‘yikes’ aspects to developing fully realized generative AI tools. In this article, I’ll briefly summarise five pitfalls to avoid when developing a generative AI application.

1. Don’t Assume the Outputs of LLMs Will Be Accurate

This may seem obvious, but I was surprised at how unreliable the output of commercial LLMs could be. When testing major LLMs (in particular, ChatGPT, Anthropic’s Claude, and models available through Amazon Bedrock) on complex financial datasets, I noted that they can generate errors or hallucinations at a rate of one per page. LLMs often generate information that sounds plausible but is false.


For example, when testing financial data, I found that LLMs produced financial terms like ‘Non-operating income and expense’. These sound legitimate but don’t appear anywhere in the source document, so in that context they are hollow.


Though effective prompt engineering can mitigate some of these hallucinations, many of them occur for reasons that are not initially explainable (which is why every LLM recommends manually reviewing its output somewhere in its user interface and documentation).


OCR (Optical Character Recognition), the technology that converts images to machine-readable text, also generates a high rate of errors. It is challenging to develop a mechanism that corrects OCR errors for each model. AI-based adaptive learning from user-flagged errors can be a helpful strategy but requires huge amounts of manual intervention.


Overall, the lack of a calibrated confidence score in commercial LLMs is a huge limitation. Application developers must develop their own methods for calculating confidence and validating errors.
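
One simple way to start validating output is a grounding check: confirm that every label and value the model returns actually appears in the source document. The sketch below is a minimal illustration under stated assumptions, not a calibrated confidence score. The function names, the sample document, and the sample LLM output are all hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores layout differences."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def grounding_report(source_text: str, extracted: dict) -> dict:
    """For each extracted field, record whether its label and its value
    appear (after normalization) as substrings of the source document."""
    haystack = normalize(source_text)
    report = {}
    for label, value in extracted.items():
        report[label] = {
            "label_found": normalize(label) in haystack,
            "value_found": normalize(value) in haystack,
        }
    return report


def grounding_score(report: dict) -> float:
    """Fraction of checks that passed: a crude stand-in for confidence."""
    checks = [ok for field in report.values() for ok in field.values()]
    return sum(checks) / len(checks) if checks else 0.0


# Hypothetical source document and hypothetical LLM extraction.
source = "Revenue  1,200\nOperating income 300\nNet income 210"
llm_output = {
    "Revenue": "1,200",
    "Operating income": "300",
    "Non-operating income and expense": "45",  # not in the source
}

report = grounding_report(source, llm_output)
score = grounding_score(report)
print(report["Non-operating income and expense"])
print(f"grounding score: {score:.2f}")
```

Substring matching is deliberately naive (a value of "30" would match "300"), so a real system would need stricter matching and a review queue for anything that fails the check.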

2. Don’t Be Complacent With Your Production Credits

Any generative AI application that depends on paid credits or metered compute (such as time on Google’s Tensor Processing Units (TPUs)) needs technical spending limits so that it cannot spend more than budgeted, which could otherwise bankrupt the company or developer.


Bankruptcy through overenthusiastic spending of production credits might seem like an urban myth - certainly, there are no noted cases of production credits running amok - but better safe than sorry, right?


There are several ways to stretch out your production credits to their fullest potential:


  • Take advantage of short-lived, discounted, preemptible VMs from your cloud provider (but only if your application is designed to handle interruptions and then resume from checkpoints).


  • Use spot instances for non-critical workloads to significantly reduce costs, and automatically shut down idle instances. Cloud provider tools can be very helpful for this.


  • Where appropriate, you could experiment with open-source frameworks and models that don’t require production credits. I’m very pro the democratic approach of open-source models and their potential for collaborative innovation. And, of course, open-source models can be accessed for free, although you still pay for the compute needed to run them.
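
Alongside the cost-saving options above, a hard spending cap in the application itself is one way to regulate usage. The following is a minimal sketch, assuming a flat price per 1,000 tokens; the `SpendGuard` class, the budget, and the price are hypothetical and not tied to any provider’s billing API.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a request would push spending past the configured budget."""


class SpendGuard:
    """Tracks estimated spend and refuses requests that would exceed the budget."""

    def __init__(self, budget_usd: float, usd_per_1k_tokens: float):
        self.budget_usd = budget_usd
        self.usd_per_1k_tokens = usd_per_1k_tokens  # assumed flat rate
        self.spent_usd = 0.0

    def estimate(self, tokens: int) -> float:
        """Estimated cost of a request of the given size."""
        return tokens / 1000 * self.usd_per_1k_tokens

    def charge(self, tokens: int) -> float:
        """Record the cost of a request, or raise before any money is spent.
        Returns the remaining budget."""
        cost = self.estimate(tokens)
        if self.spent_usd + cost > self.budget_usd:
            raise BudgetExceeded(
                f"request costs ${cost:.2f}, "
                f"only ${self.budget_usd - self.spent_usd:.2f} left"
            )
        self.spent_usd += cost
        return self.budget_usd - self.spent_usd


# Hypothetical numbers: a $1.00 budget at $0.50 per 1,000 tokens.
guard = SpendGuard(budget_usd=1.00, usd_per_1k_tokens=0.50)
guard.charge(1000)  # spends $0.50
guard.charge(500)   # spends $0.25
try:
    guard.charge(1000)  # would bring the total to $1.25
    blocked = False
except BudgetExceeded as err:
    blocked = True
    print("blocked:", err)
```

Calling `charge` before each model request means the check happens before the spend, not after. Provider-side budget alerts are still worth configuring as a second line of defence.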

3. Don’t Forget to Test User Experience (UX)

Again, it may seem obvious, but when stuck looking at a profusion of code, it’s easy to forget that a real human being will be using the product. In a production environment, consider the pathways the user will take, even in complex and code-heavy environments. Don’t be like Google Bard, whose generative AI model couldn’t answer simple user questions about space upon release or, later, what to put on pizza (hint: it’s not glue).


Many traditional UX testing tools, such as card sorting tools, can be adapted for generative AI products. Another effective way to test the technology is with a human-in-the-loop system, like beta testers or reviewers. At the moment, the company I’m with is using human beta testers to test its new generative AI technology. The testers use the tool for free, and in return the company can collect their usage patterns.

4. Don’t Forget the Importance of Training Data

Clean, normalize, and potentially enrich your data to improve the training process. Techniques like tokenization and feature engineering might be helpful.
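
As a rough illustration of the clean, normalize, and tokenize steps, here is a minimal sketch using only the Python standard library. The function names and sample documents are hypothetical, and the word-level tokenizer is a simplification: LLM pipelines typically use subword tokenizers.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize Unicode, drop non-printable characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(docs: list) -> list:
    """Remove empty documents and case-insensitive exact duplicates,
    keeping the first occurrence of each."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def tokenize(text: str) -> list:
    """Very simple word-level tokenization."""
    return re.findall(r"\w+", text.lower())


# Hypothetical raw documents with typical extraction noise.
raw_docs = [
    "Net  income:\t210\u00a0USD",   # extra spaces, tab, non-breaking space
    "net income: 210 USD",          # duplicate after cleaning
    "",                             # empty document
    "Revenue: 1,200 USD\x00",       # stray control character
]

cleaned = deduplicate([clean_text(doc) for doc in raw_docs])
tokens = [tokenize(doc) for doc in cleaned]
print(cleaned)
print(tokens)
```

Exact-match deduplication only catches identical documents after cleaning; near-duplicate detection would need a fuzzier comparison.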


Generally, the more data, the better (the AI algorithms I used drew on a document store of 25 million documents, for example), but very large datasets can create computational bottlenecks, and redundant or low-quality data can still hurt how well the model generalizes.


Slightly off-topic, but in the future, it’s possible that wrangling training data may not be an issue. Promising advancements like AutoML can automate parts of data preparation and assess the model’s performance using a number of techniques like meta-learning, Bayesian optimization, and neural architecture search. For SMEs with limited coding resources, AutoML might be a promising innovation.

5. Don’t Choose the Wrong Model Scaling Strategy

There are many ways to scale a model. For example, you might split the model itself across multiple machines (model parallelism) or replicate the whole model on multiple machines and give each replica a different slice of the data (data parallelism). Models too large or complex for a single device will likely benefit from model parallelism’s distribution across devices; larger datasets or models with smaller architectures may benefit from data parallelism’s increased throughput.
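
To make data parallelism concrete, the toy sketch below simulates it in plain Python: each "worker" holds a copy of the same one-parameter model, computes a gradient on its own shard of the data, and the gradients are averaged. It assumes equal-sized shards and a hypothetical model y = w * x; real frameworks perform the same averaging across devices.

```python
def gradient(w: float, xs: list, ys: list) -> float:
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def shard(items: list, num_workers: int) -> list:
    """Split items into num_workers equal contiguous shards.
    Assumes len(items) is divisible by num_workers."""
    size = len(items) // num_workers
    return [items[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_gradient(w: float, xs: list, ys: list, num_workers: int) -> float:
    """Each simulated worker computes a gradient on its shard with the same
    weights; the per-worker gradients are then averaged."""
    x_shards = shard(xs, num_workers)
    y_shards = shard(ys, num_workers)
    local = [gradient(w, xs_i, ys_i) for xs_i, ys_i in zip(x_shards, y_shards)]
    return sum(local) / num_workers


# Hypothetical data generated by y = 2x, with the model starting at w = 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 1.0

full = gradient(w, xs, ys)
parallel = data_parallel_gradient(w, xs, ys, num_workers=2)
print(full, parallel)
```

With equal shard sizes the averaged gradient matches the full-batch gradient, which is why data parallelism raises throughput without changing the result of each training step. Model parallelism, by contrast, splits the parameters themselves and is not shown here.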


Another consideration is whether to upgrade compute resources on a single machine, such as by increasing GPU memory (vertical scaling), or to add more machines or GPUs (horizontal scaling). Don’t forget to containerize your generative AI application to ensure consistent behavior across different environments.


Of course, rigorous testing and validation after model scaling is a must. To ensure the scaled model can handle the increased load, consider combining standard functional testing with load and stress testing.
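
A load test can start very small. The sketch below fires concurrent requests at a stand-in function and reports 95th-percentile latency and throughput. `fake_generate` is a hypothetical placeholder that only sleeps; in practice you would swap in a call to your own endpoint, and the request counts here are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a model endpoint; sleeps to simulate latency."""
    time.sleep(0.01)
    return prompt.upper()


def timed_call(prompt: str) -> float:
    """Return the latency in seconds of a single request."""
    start = time.perf_counter()
    fake_generate(prompt)
    return time.perf_counter() - start


def load_test(num_requests: int, concurrency: int) -> dict:
    """Send num_requests requests using `concurrency` threads and summarize."""
    prompts = [f"request {i}" for i in range(num_requests)]
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, prompts))
    wall_seconds = time.perf_counter() - wall_start
    # Nearest-rank style index for the 95th percentile.
    p95_index = max(0, int(round(0.95 * len(latencies))) - 1)
    return {
        "requests": len(latencies),
        "p95_seconds": latencies[p95_index],
        "throughput_rps": len(latencies) / wall_seconds,
    }


result = load_test(num_requests=40, concurrency=8)
print(result)
```

Raising `concurrency` until latency degrades or errors appear turns the same harness into a basic stress test.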

Summary

It’s unrealistic to expect to build a generative AI product without encountering at least one of these challenges. Each one presents a fork in decision-making, and each choice you make at these junctures gets you one step closer to building a mature, production-ready model.


Working with the latest advances in academic and industrial machine learning will help counter some of the typical frustrations, as machine learning’s competitive landscape is constantly pushing out new innovations. I organize the London Machine Learning Meetup (the largest community of AI experts in Europe), which is a free community that hosts events that unpack the latest technical advances in machine learning.


Above all, accuracy and cost-effectiveness don’t have to be mutually exclusive. With controlled and strategic experimentation, building a generative AI product can be more rewarding and less frustrating than you might think.


Good luck!