The Challenges, Costs, and Considerations of Building or Fine-Tuning an LLM

Written by edouarddarchimbaud | Published 2023/09/01

TL;DR: The road to building or fine-tuning an LLM for your company can be a complex one. Your team needs a guide to start.

It’s not news anymore that LLMs have changed and will continue to change the game for AI and businesses across industries. A study by IBM found that CEOs are increasingly embracing generative AI to improve productivity and gain a competitive edge. The study, which surveyed 3,000 CEOs from around the world, found that 75% of CEOs believe that the organization with the most advanced generative AI will have a competitive advantage.

If you’ve been wondering about the best way to use LLMs and whether to build or fine-tune them, we would like to share this guide on what you must know to do it effectively and efficiently.

First, we’ll look into how LLMs are being applied and what industries they can have the most impact on so you have a better idea of what you can achieve with an LLM. Then, we’ll talk about why we build or fine-tune an LLM. And finally, we’ll share critical considerations, strategies, and best practices, so your team can have a sharper outlook once you’ve decided.

Now, let’s dive deeper into how LLMs are projected to change industries and businesses:

How LLMs are changing industries

Information Industry

A recent working paper from researchers at OpenAI, OpenResearch, and the University of Pennsylvania found that LLMs could affect a wide swath of the U.S. labor market, with the information industry likely to be the most impacted, for the following reasons:

  • The information industry relies heavily on the core capabilities of LLMs, like writing, data analysis, information retrieval, and content creation. These align closely with the strengths of generative AI models.
  • Tasks like drafting articles, reports, and other content can potentially see significant productivity gains from LLMs autonomously generating drafts and revisions.
  • Other everyday information industry tasks like search, data processing, content tagging/metadata generation, and customer support can also be significantly assisted by LLMs.
  • Information businesses often have large volumes of text data that can be used to train and fine-tune LLMs to be highly effective for their specific use cases.
  • The low physicality of work in this industry makes it very amenable to automation through software.

Banking Industry

Meanwhile, a study from McKinsey cites banking as one of the industries that can see the most significant impact thanks to LLMs. If the use cases were fully implemented, they anticipate the technology providing an additional annual value of $200 billion to $340 billion.

  • LLMs can improve customer service by automating responses to customer inquiries, reducing the volume of human-serviced contacts.
  • They can also assist in risk assessment by analyzing large volumes of data to identify patterns and trends that might indicate potential risks.
  • Furthermore, LLMs can help banks comply with regulatory requirements by keeping track of regulation changes and automatically updating compliance procedures accordingly.

Pharmaceuticals and Medical Products Industry

Interestingly, the same study from McKinsey also claims that pharmaceuticals and medical products could be boosted thanks to foundation models and generative AI. The study states that the impact could be from $60 to $110 billion annually. In particular, they see potential in drug discovery. Here’s how:

  • In preliminary drug development, foundation models automate the selection of potential chemicals, replacing manual inspection of cell cultures. Foundation models cluster similar experimental images accurately, aiding in selecting promising chemicals for further analysis.
  • In indication finding, potential drug indications are identified based on patient groups’ clinical histories and medical records. These models help quantify clinical events, establish relationships, and measure similarity between patient cohorts and evidence-backed indications. The result is a list of indications with higher success probability in clinical trials due to precise patient group matching.
  • Companies using this approach have seen high trial success rates, speeding up progression to Phase 3 trials.

The Challenges and Limitations of LLMs

But while the hype around foundation models and LLMs can drum up ideas of magical fixes, the reality is that LLMs and foundation models aren’t perfect. Anyone who has used ChatGPT or other LLMs extensively has seen that, especially for domain-specific tasks, using LLMs straight out of the box can fall flat. Here are some challenges and limitations of LLMs:

Hallucinations

Perhaps the most common issue we encounter is hallucinations: a phenomenon where LLMs return incorrect statements that nonetheless sound plausible. There are two types of hallucinations: intrinsic and extrinsic. Intrinsic hallucinations occur when the model’s output directly contradicts the user’s input or source content, while extrinsic hallucinations occur when the output cannot be verified against the source at all.

Memory Limitations

Foundation models like GPT have a limited context window and can only take a fixed number of tokens as input, which may not be ideal for specific applications. And much like humans, as the length of the input grows, so does the difficulty for the model to process it accurately.
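
For illustration, here is a small sketch of guarding against that limit by counting tokens before sending a prompt, using the tiktoken library; the 4,096-token budget is an assumption that varies by model.

```python
# Truncate text to a model's token budget before sending it as a prompt.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by newer OpenAI models

def truncate_to_budget(text: str, max_tokens: int = 4096) -> str:
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])  # keep only the first max_tokens
```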

Catastrophic Forgetting

Catastrophic forgetting is a phenomenon that occurs when an artificial neural network is trained on multiple tasks sequentially: the weights in the network that are crucial for Task A are modified to fit the objectives of Task B. This can cause the network to forget Task A, even though it previously performed it well.
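
To make this concrete, here is a toy PyTorch sketch with synthetic data of our own invention: a small network masters Task A, is then trained on Task B, and its Task A accuracy collapses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Task A: label depends on feature 0; Task B: label depends on feature 1
xa = torch.randn(512, 2); ya = (xa[:, 0] > 0).long()
xb = torch.randn(512, 2); yb = (xb[:, 1] > 0).long()

def train(x, y, steps=300):
    for _ in range(steps):
        optimizer.zero_grad()
        loss_fn(net(x), y).backward()
        optimizer.step()

def accuracy(x, y):
    return (net(x).argmax(dim=1) == y).float().mean().item()

train(xa, ya)
print("Task A accuracy after training on A:", accuracy(xa, ya))  # near 1.0
train(xb, yb)
print("Task A accuracy after training on B:", accuracy(xa, ya))  # drops sharply
```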

Data Quality Issues

LLMs, which are primarily trained on unlabeled data, can be prone to bias because the data used in training may not accurately represent the real-world data distribution; whatever biases the training data contains can surface in the model’s output.

These limitations lead companies and enterprises to think strategically about how they want to work with LLMs. LLMs have massive potential to change how companies work and deliver value, but these challenges must be addressed first. This is where the question of building an LLM or fine-tuning an existing one arises.

When should you build an LLM or fine-tune an existing one?

Building your Large Language Model (LLM) from scratch

When does it make sense to build an LLM from scratch? Building your own LLM makes the most sense if you have a very unique use case that existing general-purpose LLMs cannot serve, or if the LLM will become a core product or primary business driver for your company. It also makes sense if you have access to a large volume of valuable proprietary data and want to use it to your advantage while maintaining data privacy and security.

Fine-tuning an LLM

You have two choices when fine-tuning an LLM: starting from an existing open-source LLM or using the API of a commercial LLM. A commercial LLM may make more sense if your team has less technical expertise, while an open-source model will give you more control. That said, there are risks associated with fine-tuning an LLM: you have to keep an eye out for potential biases, toxicity, and data security exposures. Additionally, commercial APIs can limit how much you can customize the model and training data, as well as your control over the product itself.

Key considerations in building or fine-tuning an existing LLM

Resource Allocation

Whether you are fine-tuning an LLM or building one from scratch, know that you have to be willing to allocate significant resources to reach your objective. Building an LLM from scratch requires massive computational power on top of dedicated time and budget, as well as the right expertise.

For example, training OpenAI’s GPT-3, a large language model with around 175 billion parameters, is estimated to have cost over $4.6 million. Microsoft also built a supercomputer for OpenAI with more than 285,000 CPU cores and 10,000 GPUs. Bloomberg’s smaller BloombergGPT model, with 50 billion parameters, is estimated to have cost about $2.7 million to train. And these estimates do not cover the many iterations required to make such LLMs effective.

To effectively handle the massive size of LLMs, it’s necessary to ensure that the data is adequately cleaned, labeled, organized, and stored efficiently. Remember that managing and processing data can be costly, especially considering the required infrastructure, tools, and data engineers.

Fine-tuning an LLM, of course, may be more practical, as it costs less than building your organization’s own. The cost of fine-tuning an LLM depends on several factors, including the model’s size, the amount of data used to fine-tune it, and the computing resources used.

The cost can also be affected by the specific fine-tuning algorithm used, as some algorithms are more computationally expensive than others. OpenAI’s fine-tuning models cost from $0.0004 to $0.0300 per 1,000 tokens, depending on the base model you train. After that, there is an additional usage cost of around $0.0016 to $0.1200 per 1,000 tokens.
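
As a back-of-the-envelope illustration, the arithmetic is simple to sketch in Python; the rates below are picked from the ranges above, and the token counts are made-up assumptions to replace with your own numbers.

```python
# Rough fine-tuning cost estimate. Rates are illustrative per-1,000-token
# prices from the ranges cited above; substitute your provider's current ones.
TRAIN_RATE = 0.0300   # $ per 1K tokens to fine-tune (larger base model)
USAGE_RATE = 0.1200   # $ per 1K tokens for inference afterwards

train_tokens = 10_000_000          # dataset tokens x training epochs (assumed)
monthly_usage_tokens = 5_000_000   # expected monthly inference volume (assumed)

training_cost = train_tokens / 1_000 * TRAIN_RATE
monthly_usage_cost = monthly_usage_tokens / 1_000 * USAGE_RATE
print(f"one-off training: ${training_cost:,.2f}")        # $300.00
print(f"ongoing usage: ${monthly_usage_cost:,.2f}/mo")   # $600.00
```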

That said, advances in fine-tuning techniques have made it possible to reduce the financial and computational costs. One example is Hugging Face’s PEFT (Parameter-Efficient Fine-Tuning) library. By combining parameter-efficient methods such as LoRA with the bitsandbytes quantization library, Hugging Face has demonstrated that it is possible to fine-tune an LLM significantly larger than the available GPU memory. We’ll discuss a few more methods in this article.
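
For a sense of what this looks like in practice, here is a minimal sketch using the peft, bitsandbytes, and transformers libraries; the model name, LoRA hyperparameters, and target modules are illustrative assumptions rather than recommendations.

```python
# A minimal parameter-efficient fine-tuning setup: load the base model in
# 8-bit (bitsandbytes), then attach small trainable LoRA adapters (peft).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "bigscience/bloom-7b1"  # illustrative; any causal LM on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # 8-bit quantization via bitsandbytes
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                # rank of the low-rank adapters
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # attention projections in BLOOM
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

From here, the wrapped model trains like any other transformers model, but only the small adapter weights receive gradients, which is what keeps memory and cost down.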

Technical Expertise

It should come as no surprise that building and fine-tuning an LLM require advanced technical expertise. A senior machine learning engineer will come equipped with the knowledge needed to fine-tune an LLM; to build your own LLM from scratch, however, you will need a team of genuinely exceptional machine learning engineers to lead the effort. Take a look at the profiles of the leading scientists, researchers, and architects at AI companies such as OpenAI, Meta, and Google AI to get a better idea of the kind of engineers you need on your team. You’ll also need to ensure that these researchers have strong domain knowledge of your industry and business.

Data Size, Quality, and Ethical Considerations

It’s well known that building your own LLM from scratch requires a large amount of data. For example, LLaMA was trained on a dataset of 1.4 trillion tokens, about 4.6 terabytes of text; smaller versions of LLaMA were trained on 1 trillion tokens. By comparison, GPT-3 was trained on roughly 300 billion tokens.

For fine-tuning an LLM, depending on your method and resources, you need far less. An example is Google’s Med-PaLM 2, which was fine-tuned from the PaLM 2 LLM. According to the paper, the team used instruction fine-tuning with 193,000 examples, roughly 19 million to 39 million tokens. Med-PaLM 2 set a new state of the art over its predecessor, scoring 86.5% on MedQA, the dataset also used to measure the performance of Med-PaLM.

But more than data size, companies must also ensure they’ve taken the proper data quality measures as “garbage in, garbage out” still applies even to large language models that use massive amounts of training data.

When using proprietary data, organizations will have to figure out the following tasks to ensure quality and ethics are held to a high standard for more effective models:

  1. Data Deduplication and Cleansing: This process involves identifying and removing duplicate entries in the dataset to ensure that the data is accurate and reliable. It also includes correcting errors, inconsistencies, and inaccuracies, such as missing values or incorrect formatting. This step is crucial to avoid training the model on incorrect or redundant data, which can lead to poor model performance (see the deduplication sketch after this list).
  2. Data Moderation: Organizations need to establish a process for moderating the data used to train their models. This could involve manually reviewing the data, using automated tools, or ideally a combination of both, to filter out inappropriate or harmful content. Effective data moderation helps ensure that the model does not learn to generate such content itself.
  3. Data Privacy: Organizations must ensure that their data respects privacy standards and complies with relevant regulations, such as GDPR or CCPA. This could involve anonymizing personal data, obtaining necessary permissions for data use, and implementing secure data storage and handling practices. This is crucial not only for ethical reasons but also to avoid legal issues.
  4. Data Biases: Data used to train models can often contain biases, which the model can then learn and perpetuate. Organizations need to implement processes for identifying and mitigating these biases in the data. This could involve diverse data collection strategies, bias auditing tools, and approaches for debiasing the data. Addressing data bias is essential for building fair models that produce reliable results across different populations.
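
As a small illustration of the deduplication step above, here is a sketch of exact-duplicate removal via content hashing; production pipelines usually add fuzzy near-duplicate detection (e.g., MinHash), which this deliberately omits.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies match
    return " ".join(text.lower().split())

def deduplicate(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["Hello   world", "hello world", "Goodbye"]
print(deduplicate(corpus))  # ['Hello   world', 'Goodbye']
```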

Model Performance

Depending on the organization’s goals, fine-tuning an LLM might actually be enough. There are different ways to adapt an LLM to suit your organization’s needs and improve its performance; we recommend working backward through the usual training pipeline to determine the optimal approach. This also helps keep costs down, since each step you take “backward” through the pipeline is more expensive than the last.

Let’s break this down in the simplest way.

Usually, training a foundation model or LLM starts with unsupervised pre-training, followed by supervised training to fine-tune the LLM, and finally prompting and prompt tuning to obtain the required output. There are other steps in between, of course, but we will stick to these three.

Our recommended path for adapting an LLM runs backward through those steps. You start with prompt tuning, which includes prompt engineering and prompt debugging; this uses the least resources. If the LLM stagnates or does not perform as you’d like, you move on to supervised/instruction fine-tuning, which may include methods such as RLHF, retrieval-augmented generation, or transfer learning.
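
To make that first, cheapest step concrete, here is a minimal prompt-engineering sketch; call_llm is a hypothetical stand-in for whichever model client your team uses.

```python
# Iterating on a prompt template is often enough before any fine-tuning.
# `call_llm` is a hypothetical placeholder for your model/API client.
def build_prompt(question: str, context: str) -> str:
    return (
        "You are a careful analyst. Using ONLY the context below, answer the "
        "question. If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# response = call_llm(build_prompt(question, retrieved_context))
```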

As an example of that supervised step, we increased the accuracy of DINOv2 from 83% to 95% on a specific defect-detection classification use case through transfer learning, using only 200 labeled examples.
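
A minimal sketch of that kind of transfer learning, assuming a DINOv2 backbone from torch.hub and labeled image batches you already have: freeze the pretrained backbone and train only a small classification head. The class count and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

# Load a pretrained DINOv2 ViT-S/14 backbone and freeze it
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
for p in backbone.parameters():
    p.requires_grad = False  # keep the pretrained features intact

head = nn.Linear(384, 2)  # ViT-S/14 embedding dim -> defect / no-defect
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    # images: (batch, 3, H, W) with H and W multiples of 14
    with torch.no_grad():
        features = backbone(images)      # (batch, 384) CLS embeddings
    loss = criterion(head(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```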

Finally, if everything else fails, the next step would be going for unsupervised methods and ensuring we have a suitable dataset for pre-training the model.

Maintenance and Updates

Like most models deployed in dynamic environments, LLMs, whether built or fine-tuned, require reiteration to stay effective: refitting the model to new data or refreshed objectives. Companies must create robust processes for routinely reiterating models on fresh datasets, a typically iterative cycle of building or fine-tuning, testing, and deploying into production.

Companies that have built successful LLMs, such as OpenAI, continuously push out new versions of their models. While ChatGPT’s training data cut-off is September 2021, OpenAI uses fresh data from user interactions to improve the model’s predictive abilities, reduce bias, and mitigate harm.

Business Strategy Alignment

Your business objective is also a critical factor in whether building or fine-tuning an LLM makes sense. Consider how the capabilities of LLMs dovetail with the broader strategic vision of the company; this way, you leverage these powerful tools to their full potential while staying focused on your core business objectives. Some examples were cited for the industries mentioned at the top of this article, but now let’s discuss some of the tasks LLMs excel at and their strategic implications for different industries:

  1. Summarization: LLMs can extract key points and summaries from lengthy documents or reports. This can be of enormous value in industries that handle large volumes of textual data. For example, market research companies, which produce and analyze extensive reports, can use LLMs to generate concise summaries, making it easier to extract insights and share findings with clients.
  2. Text Completion: LLMs can predict and generate text based on given prompts, which can be used to assist in creating content. Industries engaged in content creation or communication services can leverage this to enhance productivity and creativity.
  3. Question & Answering (Q&A): LLMs can answer queries based on a provided context, making them useful in customer service across numerous industries. Banks, for instance, can employ LLM-powered chatbots to handle customer queries around the clock, providing quick, accurate information to their clients and reducing the workload of customer service staff.
  4. Chatbots: Beyond Q&A, LLMs can power more interactive and conversational chatbots, capable of understanding context and maintaining a conversation over multiple turns. This can enhance the customer experience in industries like retail, telecommunications, and hospitality, where consumer interaction is vital.
  5. Translation: LLMs can also translate text between different languages, potentially breaking down communication barriers in global businesses. This capability can benefit industries like tourism, international trade, and global tech firms that operate across different linguistic regions.

Evaluating Large Language Models

Whether you are building or fine-tuning an LLM, selecting the right LLM is often the first step. Yes, even for building an LLM, it is common to look at a previously existing model’s architecture and start from there. Whatever your choice, teams must experiment with and evaluate several models to find their best starting point.

The evaluation of LLMs comes with its own set of challenges. It is still an ongoing area of research, after all, so there is no strict standardization or systematization for evaluating these models.

There are, of course, leaderboards run by communities like Hugging Face, which can give you a general idea of how well a model performs. But what performs well on a leaderboard may not translate easily to your specific use case. LLMs are often evaluated on benchmark datasets, and their performance there may not accurately reflect how they will perform in real-world scenarios, which can be far more diverse and unpredictable.

There are two approaches to evaluating large language models: a quantitative and a qualitative approach. Both come with their caveats.

Quantitative evaluation

Quantitative evaluation often involves machine learning metrics such as perplexity, BLEU, and cross-entropy loss. Tools like OpenAI’s evals library, EleutherAI’s lm-eval Python package, and the Holistic Evaluation of Language Models (HELM) let users evaluate their models quantitatively on numerous benchmarks.
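
As an illustration of one such metric, here is a minimal perplexity computation for a causal LM using the transformers library; GPT-2 and the sample text are stand-ins for your model and held-out data.

```python
# Perplexity = exp(mean cross-entropy) of the model on held-out text.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; use the model you are evaluating
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "Your held-out evaluation text goes here."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean
    # cross-entropy loss over the (shifted) next-token predictions.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(f"perplexity = {math.exp(loss.item()):.2f}")
```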

While quantitative evaluation is straightforward, these metrics may not help you select the right model for a specific task. Users often need a model that excels at their particular job, such as drafting a legal document or analyzing a company’s financial report, rather than the model that predicts the next token most accurately.

Qualitative evaluation

Qualitative evaluation involves assessing LLMs on specific tasks using criteria such as coherence, bias, creativity, and reliability. This manual evaluation is relatively slow, as it may be bottlenecked by the speed and availability of human evaluators. However, it is possible to speed up the process by applying the same automation techniques used in data labeling: active learning, programmatic QA, auto QA, etc.

A combination of quantitative and qualitative evaluation on your company’s specific dataset is advisable for finding the best LLM to fine-tune or to build upon.

Best practices

Understand scaling laws

Scaling laws in machine learning used to focus on increasing models’ size over dataset size as compute budgets expanded. This was under the assumption that larger models could extract more insights and perform better even if the amount of data remained constant.

But in 2022, DeepMind challenged this approach, showing that models are generally undertrained and that dataset size should therefore grow alongside model size. According to DeepMind’s findings, model and dataset sizes should each increase approximately threefold for every tenfold increase in computing power. This implies that many current models sit below the optimal data/compute curve, and simply making them larger won’t yield better results without an accompanying increase in data.

These new scaling laws recommend determining the maximum high-quality dataset size first, based on available information. Then, the optimal model size for that dataset can be chosen based on the available compute, using DeepMind’s data-optimal scaling laws. It is essential not to make models arbitrarily large but to find the right balance based on the data. Moreover, larger datasets require expertise to collect and diversity to improve generalization, which adds another layer of complexity to the process.
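
A commonly cited approximation of these data-optimal laws is about 20 training tokens per model parameter; treat the constant as a rule of thumb rather than a law, but it makes the back-of-the-envelope sizing easy to sketch.

```python
# Rough Chinchilla-style sizing: ~20 training tokens per parameter.
# The constant is a widely quoted approximation, not an exact law.
def data_optimal_params(dataset_tokens: float, tokens_per_param: float = 20.0) -> float:
    return dataset_tokens / tokens_per_param

dataset_tokens = 1.4e12  # e.g., a LLaMA-scale corpus of 1.4T tokens
print(f"~{data_optimal_params(dataset_tokens) / 1e9:.0f}B parameters")  # ~70B
```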

Data quality

Whether you are fine-tuning or building an LLM, the model will only be as good as the data it is given. Therefore, your data must be representative and extensively preprocessed. Diversity of data sources is beneficial even for domain-specific LLMs.

For example, Google’s Med-PaLM 2 required hundreds to thousands of examples from several health and medical QA datasets for the model to be properly fine-tuned. And to build BloombergGPT, the dataset mix used was 51% financial data and 49% general domain data for topical diversity. In both cases, researchers did extensive data curation to ensure high-quality and relevant data was used to build effective models.

Pre-training and experimenting in incremental steps

There is no direct path when it comes to building or fine-tuning an LLM. Like most machine learning or AI endeavors, it’s always wise to start small. Beginning with a smaller model on a smaller dataset makes experimentation easier, and iterating with incremental changes to the model architecture (width, depth, sparsity, etc.) makes it easier for your team to evaluate their impact. You can begin with existing, older models and adjust them to your needs. Once your smaller model works well, gradually increase the model and dataset size.

Remember to save snapshots of your model as a backup in case a fallback is needed. Setbacks will be inevitable whether you are building or fine-tuning, so it’s critical to anticipate issues.

Mitigating instability

This leads us to the critical time-saving practice of mitigating instability. The bigger the model, the more difficult it is to maintain. Beyond overfitting and underfitting, your model can undergo issues such as vanishing or exploding gradients, mode collapse, loss spikes, catastrophic forgetting, and hardware constraints.

We’ve already discussed catastrophic forgetting, which occurs when a model does poorly on a previous task after being trained on a new one. Vanishing or exploding gradients are common issues in training deep neural networks, where gradients become too small or too large, slowing down or destabilizing learning. Mode collapse occurs in generative models when the model produces the same outputs despite different inputs. Loss spikes are sudden jumps in the training loss that can destabilize or even derail a run. And finally, working with LLMs is demanding on hardware and can result in failures.

There are several ways you can prepare or, at the very least, mitigate instabilities. These aren’t groundbreaking, but are standard practices you may also see in different deep learning applications:

Batch size — It’s generally advised to use the largest batch size that can fit your GPU memory. Larger batch sizes offer advantages in terms of computational efficiency, memory utilization, and potentially more accurate gradient estimation. They can better use the parallel processing capabilities of GPUs, resulting in faster training times.
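
When the ideal batch does not fit in GPU memory, gradient accumulation is a standard workaround: accumulate gradients over several small batches before each optimizer step. A minimal PyTorch sketch with synthetic stand-in data:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Synthetic stand-in batches; in practice this is your DataLoader
batches = [(torch.randn(4, 16), torch.randint(0, 2, (4,))) for _ in range(32)]

accum_steps = 8  # effective batch size = 4 * 8 = 32
optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = loss_fn(model(x), y) / accum_steps  # scale so gradients average
    loss.backward()                            # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```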

Use Regularization techniques — Regularization techniques such as dropout and weight decay can help prevent overfitting or model instability.
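
A minimal PyTorch sketch of both, with illustrative values:

```python
import torch
import torch.nn as nn

# Dropout regularizes inside the architecture; weight decay is applied
# through the optimizer (AdamW implements it in decoupled form).
model = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(), nn.Dropout(p=0.1), nn.Linear(512, 2)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
```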

Batch normalization — Batch normalization can help reduce internal covariate shift, allowing for faster and more stable convergence during training. It also helps alleviate the vanishing gradient problem by ensuring that the gradients flow more smoothly through the network.
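
In PyTorch, this is a one-line addition between layers (sizes illustrative):

```python
import torch.nn as nn

# BatchNorm normalizes each layer's activations across the batch
model = nn.Sequential(
    nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Linear(512, 2)
)
```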

Choosing the right weight initialization — Weight initialization plays a critical role in mitigating training instability and ensuring effective learning. One common approach to weight initialization is to use small Gaussian noise. This involves randomly initializing the weights from a Gaussian distribution with a mean of zero and a small standard deviation. By adding random noise, the weights are given initial diversity, enabling the model to explore different solutions during training.
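
A small PyTorch sketch of that scheme; the 0.02 standard deviation follows GPT-style conventions but should be treated as an illustrative default:

```python
import torch.nn as nn

def init_weights(module: nn.Module, std: float = 0.02) -> None:
    # Small Gaussian noise: mean 0, small standard deviation
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=std)
    if isinstance(module, nn.Linear) and module.bias is not None:
        nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))
model.apply(init_weights)  # applies recursively to every submodule
```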

Data Augmentation — If your model is struggling to generalize and is prone to overfitting, data augmentation can help mitigate this by introducing variations in the training data and improving the robustness of the model.
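
A typical image-side example with torchvision (assuming it is installed); for text, rough analogues include paraphrasing, back-translation, and token masking:

```python
from torchvision import transforms

# Each pass over the data sees a slightly different view of every image
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```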

Learning Rate Scheduling — Gradually decrease your learning rate over time to lower your losses and maintain model stability as much as possible. You can use step decay or exponential decay. Step decay is when you reduce the learning rate by a factor at regular intervals, while exponential decay reduces the learning rate exponentially.
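
Both schedules are available off the shelf in PyTorch; a minimal sketch with illustrative decay factors:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Step decay: halve the learning rate every 10 epochs...
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
# ...or exponential decay: multiply it by 0.95 every epoch.
# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(30):
    optimizer.step()   # stand-in for one epoch of training steps
    scheduler.step()   # decay the learning rate once per epoch
```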

Conclusion

As businesses continue to realize the impact of LLMs, it only makes sense that companies start asking how best to wield this valuable tool. Choosing the right LLM and fine-tuning it to suit your company’s needs may be the simpler option, but it still involves a variety of considerations to fine-tune the model effectively and as efficiently as possible.

First, your team must be able to evaluate a variety of LLMs correctly to choose the best starting point, and they should be comfortable experimenting with the model. To be efficient, they need high-quality data and these best practices in mind as they create a strategy around building or fine-tuning an LLM.

It’s a complex and ambitious project either way you choose, but as we’ve seen already, LLMs have impressive potential to bring value.

References:

[1] IBM, CEOs Embrace Generative AI as Productivity Jumps to the Top of their Agendas (2023), IBM Newsroom

[2] T. Eloundou, S. Manning, P. Mishkin, D. Rock, GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models (2023), arXiv:2303.10130v4 [econ.GN]

[3] McKinsey & Company, The economic potential of generative AI: The next productivity frontier (2023), McKinsey Digital Insights

[4] C. Li, Demystifying GPT-3 (2023), Lambda Labs Blog

[5] J. Langston, Microsoft announces new supercomputer, lays out vision for future AI work (2020), Microsoft News Source Features Innovation

[6] J. Sheikh, The ChatGPT of Finance is Here: Bloomberg is Combining AI and Fintech (2023), Forbes.com

[7] E. d’Archimbaud, 3 Ways to Adapt a Foundation Model to Fit Your Specific Needs (2023), Kili Technology.

[8] M. Heikkilä, How OpenAI is trying to make ChatGPT safer and less biased (2023), MIT Technology Review.

[9] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal, M. Schaekermann, A. Wang, M. Amin, S. Lachgar, P. Mansfield, S. Prakash, B. Green, E. Dominowska, B. Aguera y Arcas, N. Tomasev, Y. Liu, R. Wong, C. Semturs, S. Sara Mahdavi, J. Barral, D. Webster, G. S. Corrado, Y. Matias, S. Azizi, A. Karthikesalingam, and V. Natarajan, Towards Expert-Level Medical Question Answering with Large Language Models (2023), arXiv:2305.09617v1 [cs.CL]

[10] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, G. Mann, BloombergGPT: A Large Language Model for Finance (2023), arXiv:2303.17564v2 [cs.LG]

[11] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, L. Sifre, Training Compute-Optimal Large Language Models (2023), arXiv:2203.15556v1 [cs.CL]

[12] A. Pai, Beginner’s Guide to Build Your Own Large Language Models from Scratch (2023), Analytics Vidhya.

[13] Weights & Biases, How to Train LLMs from Scratch (2023), Weights & Biases Whitepaper.


Written by edouarddarchimbaud | Co-founder & CTO of Kili, ex-Head of AI at BNP Paribas