Creating a Domain Expert LLM: A Guide to Fine-Tuning

Written by shanglun | Published 2023/08/16
Tech Story Tags: chatgpt | ai | llms | gpt-3 | gpt3 | natural-language-processing | openai | future-of-ai

TLDR: Fine-tuning is a powerful tool for developers and data scientists alike. In this article, we fine-tune a large language model to understand the plot of a Handel opera.

Introduction

With the release of ChatGPT, Large Language Models, or LLMs, have burst into the public consciousness. ChatGPT’s unique combination of creativity and coherence captured the public imagination, giving rise to many novel applications. There is now even a cottage industry of experts who specialize in Prompt Engineering, the practice of crafting prompts to elicit the desired behavior from popular LLMs - a skill that combines the analytical understanding of a software engineer and the linguistic intuition of a police interrogator. In a previous article, I demonstrated how prompt engineering can bring powerful AI capabilities to technology applications.

As powerful as prompt engineering can be, there are significant limitations when it comes to commercial applications:

  1. The context one can provide through prompt engineering is subject to the GPT model’s input limit. In the standard version of GPT-4, the limit is 8,192 tokens, with each token corresponding to roughly three-quarters of a word. This is quite a lot of text and should be more than enough for most chat-based applications, but it is not enough to give ChatGPT any expertise it doesn’t already have.
  2. Prompt engineering is mostly limited to natural language outputs. While it is easy to get replies in JSON or XML format with prompt engineering, the underlying reply is still a natural language response, and natural-language padding can often break the desired output format. This is fine if a human is reading the output, but if another script is running post-processing code on the reply, this can be a problem.
  3. Lastly, prompt engineering is, almost by definition, an inexact science. A prompt engineer must predict, through trial and error, how ChatGPT will react and zero in on the correct prompt. This can be time-consuming, unpredictable, and potentially very difficult if the desired outcome is even moderately complicated.

To solve these issues, we can turn to a less-known but nevertheless useful technique called fine-tuning. Fine-tuning allows us to use a much larger body of text, control the input and output format, and generally exert more control over the LLM model in question.

In this article, we will look at what Fine Tuning is, build a small fine-tuned model to test its capabilities, and finally build a more substantial model using a larger input dataset.

I hope that reading this article will give you some ideas about how to use fine-tuning to improve your business applications. Without further ado, let’s dive in.

What is Fine Tuning?

Fine-tuning is OpenAI’s terminology for an API that lets users train a GPT model using their own data. Using this API, a user can create a copy of OpenAI’s LLM model and feed it their own training data consisting of example questions and ideal answers. The LLM is able to not only learn the information but also understand the structure of the training data and cross-apply it to other situations. For example, OpenAI researchers have been able to use empathic questions and answers to create a model that is generally more empathic even when answering completely novel questions, and some commercial users have been able to create specialized systems that can look through case files for a lawyer and suggest new avenues of inquiry.

The API is quite easy to use. The user can simply create a JSONL (JSON Lines) file consisting of questions and answers and supply it to the OpenAI endpoint. OpenAI will then create a copy of the specified model and train it on the new data.
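
For illustration, a couple of records in this JSONL format might look like the lines below (these are illustrative examples only, not part of the dataset we will build in this article):

{"prompt": "Who composed the opera Agrippina?", "completion": "The opera Agrippina was composed by George Frideric Handel."}
{"prompt": "When did Agrippina premiere?", "completion": "Agrippina premiered in Venice in 1709."}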

In the next section, we will walk through some test models to familiarize ourselves with the API and explore some of its basic capabilities before moving on to a larger endeavor.

API Demo: Building a Small Model

The Pig Latin Model

Before we start tackling big problems, let's first train a simple model to familiarize ourselves with the API. For this example, let’s try to build a model that can turn words into Pig Latin. Pig Latin, for those who do not know, is a simple word game that turns English words into Latin-sounding words through a simple manipulation of their syllables: for example, "latin" becomes "atinlay" and "opera" becomes "operaway".

Generating Training Data

In order to train the model, we need to generate some example transformations to use as the training data. Therefore, we will need to define a function that turns a string into the Pig Latin version of the string. We will be using Python in this article, but you can use almost any major language to do the same:

def pig_latin(string):
    vowels = {'a', 'e', 'i', 'o', 'u'}
    # If the word starts with a vowel, just append "way"
    if string[0].lower() in vowels:
        return string + 'way'
    # Otherwise, move the leading consonants to the end and append "ay"
    for i in range(len(string)):
        if string[i].lower() in vowels:
            return string[i:] + string[:i] + 'ay'
    # No vowels at all: just append "ay"
    return string + 'ay'
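
As a quick sanity check, we can run the function on a few words before generating a full dataset:

print(pig_latin('latin'))   # atinlay
print(pig_latin('opera'))   # operaway
print(pig_latin('string'))  # ingstray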

Now that we have our function, we will want to generate the training data. To do this, we can simply copy a body of text from the internet, extract the words from it, and turn it into Pig Latin.

import re

passage = '''[passage from the internet]'''
toks = [t.lower() for t in re.split(r'\s', passage) if len(t) > 0]
pig_latin_traindata = [
    {'prompt': 'Turn the following word into Pig Latin: %s \n\n###\n\n' % t, 'completion': '%s [DONE]' % pig_latin(t)}
    for t in toks
]

Notice a few things about this code. First, the training data is labeled such that the input is named “prompt” and the output is named “completion.” Second, the input starts with an instruction and ends with the separator “\n\n###\n\n”. This separator is used to indicate to the model that it should begin answering after the marker. Lastly, the completion always ends with the phrase “[DONE].” This is called a “stop sequence” and is used to help the model know when the answer has stopped. These manipulations are necessary due to quirks in GPT’s design and are suggested in the OpenAI documentation.

The data file needs to be in JSONL format, which is simply a set of JSON objects delimited by new lines. Luckily, Pandas has a very simple shortcut for turning data frames into JSONL files, so we will simply rely on that today:

import pandas as pd

pd.DataFrame(pig_latin_traindata).to_json('pig_latin.jsonl', orient='records', lines=True)
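
Each line of the resulting pig_latin.jsonl file is a single JSON object. For the word "music", for example, the line should look something like this:

{"prompt":"Turn the following word into Pig Latin: music \n\n###\n\n","completion":"usicmay [DONE]"}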

Now that we have our training data saved as a JSONL file, we can begin training. Simply go to your terminal and run:

export OPENAI_API_KEY=[OPENAI_API_KEY]
openai api fine_tunes.create -t pig_latin.jsonl -m davinci --suffix pig_latin

Once the request is created, one simply has to check back later with the “fine_tunes.follow” command. The console output should give you the exact command for your particular training request, and you can run it from time to time to see if the training is done. The fine-tuning is done when you see something like this:

>> openai api fine_tunes.follow -i [finetune_id]
[2023-08-05 21:14:22] Created fine-tune: [finetune_id]
[2023-08-05 23:17:28] Fine-tune costs [cost]
[2023-08-05 23:17:28] Fine-tune enqueued. Queue number: 0
[2023-08-05 23:17:30] Fine-tune started
[2023-08-05 23:22:16] Completed epoch 1/4
[2023-08-05 23:24:09] Completed epoch 2/4
[2023-08-05 23:26:02] Completed epoch 3/4
[2023-08-05 23:27:55] Completed epoch 4/4
[2023-08-05 23:28:34] Uploaded model: [finetune_model_name]
[2023-08-05 23:28:35] Uploaded result file: [result_file_name]
[2023-08-05 23:28:36] Fine-tune succeeded
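
If you would rather poll from Python instead of the CLI, the same information is exposed by the openai package. This is a minimal sketch, assuming the legacy (pre-1.0) openai Python library that ships the CLI used above; substitute your own fine-tune ID:

import openai
# The library reads the OPENAI_API_KEY environment variable we exported earlier.

# Retrieve the fine-tune job and inspect its status ("pending", "running", "succeeded", ...)
job = openai.FineTune.retrieve('[finetune_id]')
print(job['status'])
print(job.get('fine_tuned_model'))  # populated once the job succeeds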

Testing

Grab the model name from the output file, and then you can simply test your model in Python like so:

import requests

model_name = '[finetune_model_name]'

res = requests.post('https://api.openai.com/v1/completions', headers={
    'Content-Type': 'application/json',
    'Authorization': 'Bearer [OPENAI_API_KEY]'
}, json={
    'prompt': 'Turn the following word into Pig Latin: Latin',
    'max_tokens': 500,
    'model': model_name,
    'stop': '[DONE]'
})

print(res.json()['choices'][0]['text'])

And you should see the output:

atinlay

And with that, we have trained a Pig Latin LLM and have familiarized ourselves with the API! Of course, this is a criminal underutilization of GPT-3’s capabilities, so in the next section, we will build something much more substantial.

Building a Domain Expert Model

Now that we are familiar with the fine-tuning API, let’s expand our imagination and think about what kinds of products we can build with fine-tuning. The possibilities are close to endless, but in my opinion, one of the most exciting applications of fine-tuning is the creation of a domain-expert LLM. This LLM would be trained on a large body of proprietary or private information and would be able to answer questions about the text and make inferences based on the training data.

Because this is a public tutorial, we will not be able to use any proprietary training data. Instead, we will use a body of text that is publicly available but not included in the training data for the base Davinci model. Specifically, we will teach the model the content of the Wikipedia synopsis of Handel’s opera Agrippina. This article is not present in the base Davinci model, which is the best OpenAI GPT-3 model commercially available for fine-tuning.

Verifying Base Model

Let’s first verify that the base model has no idea about the opera Agrippina. We can ask a basic question:

prompt = "Answer the following question about the Opera Agrippina: \n Who does Agrippina plot to secure the throne for? \n ### \n"
res = requests.post('https://api.openai.com/v1/completions', headers={
    'Content-Type': 'application/json',
    'Authorization': 'Bearer [OPENAI_API_KEY]'
}, json={
    'prompt': prompt,
    'max_tokens': 500,
    'model': 'davinci',
})

Print the result JSON, and you should see something like this:

{'id': 'cmpl-7kfyjMTDcxdYA3GjTwy3Xl6KNzoMz',
 'object': 'text_completion',
 'created': 1691358809,
 'model': 'davinci',
 'choices': [{'text': '\nUgo Marani in his groundbreaking 1988 monograph "La regina del mare: Agrippina minore e la storiografia" () criticized the usual view as myth,[15] stating that Agrippina and Nero both were under the illusion…',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 30, 'completion_tokens': 500, 'total_tokens': 530}}

The passage seems to refer to Nero and Agrippina but appears to pertain to the historical figure rather than the opera. Additionally, the model seems to refer to imaginary sources, which suggests the base model’s training data likely did not have very detailed information about Agrippina and Nero.

Now that we know the base Davinci model is unaware of the opera, let’s try to teach the content of the opera to our own Davinci model!

Obtaining and Cleaning the Training Data

We begin by downloading the text of the article from the Wikipedia API. Wikipedia has a well-tested and well-supported API that provides the wiki text in JSON format. We call the API like so:

import requests

res = requests.get('https://en.wikipedia.org/w/api.php', params={
    "action": "query",
    "format": "json",
    "prop": "revisions",
    "titles": "Agrippina_(opera)",
    "formatversion": "2",
    "rvprop": "content",
    "rvslots": "*"
})

rs_js = res.json()
print(rs_js['query']['pages'][0]['revisions'][0]['slots']['main']['content'])

Now that we have the latest text data let’s do some text manipulation to remove Wiki tags.

import re
…

def remove_tags(string, tag):
    toks = string.split(f'<{tag}')
    new_toks = []
    for tok in toks:
        new_toks.append(tok.split(f'</{tag}>')[-1])
    return ''.join(new_toks)

processed = re.sub(r'\[\[File:[^\n]+', '', rs_js['query']['pages'][0]['revisions'][0]['slots']['main']['content'])
processed = re.sub(r'\[\[([^|\]]+)\|([^\]]+)\]\]', r'\2', processed)
processed = remove_tags(processed, 'ref')
processed = remove_tags(processed, 'blockquote')
processed = processed.replace('[[', '').replace(']]', '')
processed = re.sub(r'\{\{[^\}]+\}\}', r'', processed)
processed = processed.split('== References ==')[0]
processed = re.sub(r'\'{2}', '', processed)

print(processed)

This doesn’t remove all of the tags and non-natural text elements, but it should strip enough of them that the result reads as natural text.

Next, we want to convert the text into a hierarchical representation based on the headers:

from collections import defaultdict

hierarchy_1 = 'Introduction'
hierarchy_2 = 'Main'
hierarchical_data = defaultdict(lambda: defaultdict(list))

for paragraph in processed.split('\n'):
    if paragraph == '':
        continue
    if paragraph.startswith('==='):
        hierarchy_2 = paragraph.split('===')[1]
    elif paragraph.startswith('=='):
        hierarchy_1 = paragraph.split('==')[1]
        hierarchy_2 = 'Main'
    else:
        print(hierarchy_1, hierarchy_2)
        hierarchical_data[hierarchy_1][hierarchy_2].append(paragraph)
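
Before moving on, it is worth spot-checking the structure we just built. A quick inspection might look like the sketch below (the exact section and subsection names depend on the revision of the article you downloaded):

for h1, subsections in hierarchical_data.items():
    for h2, paragraphs in subsections.items():
        print(h1, '/', h2, '->', len(paragraphs), 'paragraphs')

# We expect to see, among others, entries like "Synopsis / Act 1", "Synopsis / Act 2", and "Synopsis / Act 3"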

Constructing the Training Data

Now that we have our passage, we need to turn the passage into training data. While we can always read the passages and manually write the training data, for large bodies of text, it can quickly become prohibitively time-consuming. In order to have a scalable solution, we will want a more automated way to generate the training data.

An interesting way to generate appropriate training data from the passage is to supply sections of the passage to ChatGPT and ask it to generate the prompts and completions using prompt engineering. This may sound like circular training - why not just let ChatGPT analyze the passage if that’s the case? The answer to that question, of course, is scalability. Using this method, we can break up large bodies of text and generate training data piecemeal, allowing us to process bodies of text that can go beyond what can be given to ChatGPT as input.

In our model, for example, we will break up the synopsis into Act 1, Act 2, and Act 3. Then, by modifying the training data to provide additional context, we can help the model draw connections between the passages. With this method, we can scalably create training data from large input data, which will be the key to building domain-expert models that can solve problems in math, science, or finance.

We begin by generating two sets of prompts and completions for each act, one with lots of detail and one with simple questions and answers. We do this so the model can answer both simple, factual questions as well as long, complex questions.

To do so, we create two functions with slight differences in the prompt:

import openai
# The openai library picks up the OPENAI_API_KEY environment variable we exported earlier.

def generate_questions(h1, h2, passage):
    completion = openai.ChatCompletion.create(
              model="gpt-3.5-turbo",
              messages=[
                {"role": "user", "content": '''
            Consider the following passage from the Wikipedia article on Agrippina, %s, %s:
            ---
            %s
            ---
            Generate 20 prompt and completion pairs that would teach a davinci GPT3 model the content of this passage.
            Prompts should be complete questions.
            Completions should contain plenty of context so davinci can understand the flow of events, character motivations, and relationships.
            Prompts and completions should be long and detailed.
            Reply in JSONL format
                ''' % (h1, h2, passage)},
              ]
            )
    return completion

def generate_questions_basic(h1, h2, passage):
    completion = openai.ChatCompletion.create(
              model="gpt-3.5-turbo",
              messages=[
                {"role": "user", "content": '''
            Consider the following passage from the Wikipedia article on Agrippina, %s, %s:
            ---
            %s
            ---
            Generate 20 prompt and completion pairs that would teach a davinci GPT3 model the content of this passage.
            Reply in JSONL format
                ''' % (h1, h2, passage)},
              ]
            )
    return completion

Then we call the functions and collect the results into a data container:

questions = defaultdict(lambda: defaultdict(list))
for h_1, h1_data in hierarchical_data.items():
    if h_1 != 'Synopsis':
        continue
    for h_2, h2_data in h1_data.items():
        print('==========', h_1, h_2, '===========')
        passage = '\n\n'.join(h2_data)
        prompts_completion = generate_questions(h_1, h_2, passage)
        prompts_completion_basic = generate_questions_basic(h_1, h_2, passage)

        questions[h_1][h_2] = {
            'passage': passage,
            'prompts_completion': prompts_completion,
            'prompts_completion_basic': prompts_completion_basic
        }

And then, we can convert the generated questions from JSON into objects. We will need to add an error handling block because sometimes ChatGPT will generate outputs that are not JSON decodable. In this case, we will just flag and print the offending record to the console:

import json

all_questions = []
for h1, h1_data in questions.items():
    for h2, h2_data in h1_data.items():
        for key in ['prompts_completion', 'prompts_completion_basic']:
            for ob in h2_data[key].choices[0]['message']['content'].split('\n'):
                try:
                    js = json.loads(ob)
                    js['h1'] = h1
                    js['h2'] = h2
                    all_questions.append(js)
                except Exception:
                    print(ob)

df = pd.DataFrame(all_questions)
Because ChatGPT is not deterministic (that is, each time you query ChatGPT, you may get a different output even if your input is the same), your experience may vary from mine, but in my case, the questions were all parsed without issue. Now we have our training data in a data frame.
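
Before adding the finishing touches, it can be worth a quick sanity check on the generated data. The sketch below simply counts the pairs per section (we expect roughly 40 per act, since we made two calls asking for 20 each) and drops any exact duplicates ChatGPT may have produced:

# Count generated prompt/completion pairs per section and subsection
print(df.groupby(['h1', 'h2']).size())

# Drop any exact duplicates
df = df.drop_duplicates(subset=['prompt', 'completion']).reset_index(drop=True)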

We’re almost there! Let’s add a couple of finishing touches to the training data, including basic context, end markers to the prompts, and stop sequences to the completions.

df['prompt'] = df.apply(
    lambda row: 'Answer the following question about the Opera Agrippina, Section %s, subsection %s: \n %s \n ### \n'  % (
        row['h1'], row['h2'], row['prompt']
    ), axis=1)

df['completion'] = df['completion'].map(lambda x: f'{x} [DONE]')

Inspect the training data, and you should see a variety of training questions and answers. You may see short prompt-completion pairs such as:

Answer the following question about the Opera Agrippina, Section Synopsis, subsection Act 2: 
 What happens as Claudius enters? 
 ### 

All combine in a triumphal chorus. [DONE]

As well as long prompt-completion pairs like:

Answer the following question about the Opera Agrippina, Section Synopsis, subsection Act 3: 
 Describe the sequence of events when Nero arrives at Poppaea's place. 
 ### 

When Nero arrives, Poppaea tricks him into hiding in her bedroom. She then summons Claudius, informing him that he had misunderstood her earlier rejection. Poppaea convinces Claudius to pretend to leave, and once he does, she calls Nero out of hiding. Nero, thinking Claudius has left, resumes his passionate wooing of Poppaea. However, Claudius suddenly reappears and dismisses Nero in anger. [DONE]

Now we’re finally ready to commence training! Write the dataframe to file:

with open('agrippina_training.jsonl', 'w') as fp_agrippina:
    fp_agrippina.write(df[['prompt', 'completion']].to_json(orient='records', lines=True))

And call the fine-tuning API like so:

openai api fine_tunes.create -t agrippina_training.jsonl -m davinci --suffix agrippina

And now, we can begin our testing. The code used to generate the training data can be found here.

Testing

When the training is done, we can test the model by grabbing the model name and querying it as we did in the earlier section:

import requests

model = '[Your Model Name]'

def query_prompt(prompt):
    res = requests.post('https://api.openai.com/v1/completions', headers={
        'Content-Type': 'application/json',
        'Authorization': 'Bearer [OPENAI_API_KEY]'
    }, json={
        'prompt': prompt,
        'max_tokens': 500,
        'model': model,
        'stop': '[DONE]'
    })
    print('output')
    print(res.json()['choices'][0]['text'])

Let’s begin by querying a prompt that’s in the training dataset:

query_prompt("Answer the following question about the Opera Agrippina: \n Who does Agrippina plot to secure the throne for? \n ### \n")

You should see a result similar to the one below:

Agrippina plots to secure the throne for Nero, her son by a former marriage.

And now, let’s see if we can ask a question that is worded differently from the training data:

query_prompt("Answer the following question about the Opera Agrippina: \n What does Claudius promise Otho as a reward for saving his life? \n ### \n")

The model should give an answer like:

The throne

So it seems the model is able to recall simple facts in the training data. How will it perform with a differently worded question that requires a more detailed answer? Let’s see:

query_prompt("Answer the following question about the Opera Agrippina: \n What is Agrippina's state of mind after Pallas and Narcissus turn against her? \n ### \n")

The model should give an answer like:

After Pallas and Narcissus turn against her, Agrippina finds herself in a precarious situation. She realizes that she no longer has a strategic advantage over Claudius, as her trusted allies have betrayed her and her secret plan has been uncovered. Agrippina's state of mind is one of anxiety and despair, as her carefully crafted manipulation has unraveled and exposed her for the manipulative and deceitful individual that she is. She must now carefully reconsider her next move, weighing the potential repercussions of each option carefully. Agrippina is rattled by the betrayal of Pallas and Narcissus and faces an uncertain future.

The model is definitely making some inferences here and is not saying exactly what’s in the synopsis, but the answer is largely cogent.

As a final test, let’s see if it is able to answer a complex question that requires it to draw connections between the various passages:

query_prompt("Answer the following question about the Opera Agrippina: \n How does Agrippina's fortune change through the opera? \n ### \n")

You should see an answer like:

Agrippina's fortune changes drastically throughout the course of the opera. Initially, she is adamant about securing the throne for Nero as a means to protect her ambitions for power. However, after Poppaea's intervention, Agrippina is not certain about her control over the situation. She is worried about losing Nero's favor to Poppaea and fears that her plotting has been unraveled by Otho. Agrippina is aware that her authority is slowly slipping away and that her control over the situation is diminishing as time progresses. The seeds of doubt have been planted in her mind, and she strives to maintain her grip over Nero while trying to anticipate the next move of her rivals. Agrippina's fortune fluctuates as she attempts to maintain her influence over Nero and the empire, facing challenges from multiple fronts.

This is very promising! Based on this answer, we can see the model has digested quite a bit of information and is able to draw connections between the various bits of information we gave it during the training.

Of course, the model is by no means perfect. It is still prone to hallucinations and confused answers when the prompt is highly complex. Querying the model repeatedly with the same prompt can sometimes yield dramatically different results. However, remember that we used a relatively small body of training data, and we relied solely on ChatGPT to generate the prompts and completions. If we preprocessed the input data, crafted more detailed training data, and generated more sample prompts and completions, we would likely be able to improve the model’s performance further.

If you want to explore this topic further, please feel free to play with the API on your own. All of the code I used in this article can be found on my GitHub, and you can get in touch with me through my GitHub page as well.

Summary

Today, we explored OpenAI’s fine-tuning API and saw how we can use fine-tuning to give a GPT model new knowledge. Even though we used publicly available text data for our experiment, the same techniques can be adapted to proprietary datasets. There is almost unlimited potential for what fine-tuning can do with the right training data, and I hope this article inspired you to think about how you can use fine-tuning in your business or application.

If you want to discuss potential applications for LLM technologies, feel free to drop me a line through my Github page!


Written by shanglun | Quant, technologist, occasional economist, cat lover, and tango organizer.
Published by HackerNoon on 2023/08/16