
100 Days of AI, Day 13: How Instruction Finetuning Improves a Pre-trained LLM

by Nataraj, March 6th, 2024

Too Long; Didn't Read

Instruction finetuning is the finetuning technique used to convert a base model like GPT-3 into a ChatGPT-like product. In this post we will learn how instruction finetuning is used to improve base models.


Hey everyone! I’m Nataraj, and just like you, I’ve been fascinated with the recent progress of artificial intelligence. Realizing that I needed to stay abreast of all the developments happening, I decided to embark on a personal journey of learning, and thus 100 Days of AI was born! With this series, I will be learning about LLMs and sharing ideas, experiments, opinions, trends & learnings through my blog posts. You can follow along the journey on HackerNoon here or on my personal website here.


In one of the previous posts, we talked about finetuning and why it is important. In this post, we will take a look at a specific kind of finetuning called instruction finetuning.

Limitations of Pre-Trained Base Models:

Pre-trained base models like GPT-3 are trained on vast amounts of data. In the case of GPT-3, that is roughly all the data on the internet. We don’t know that for sure, but most of these models are trained on internet-scale data after considerable manual cleanup and formatting. As they are trained, base models learn how to predict the next token and get really good at token prediction. But pure token prediction is not as useful as you would think. If you ask a pre-trained base model “What is the capital of Mexico?”, it will not reply with an answer but might complete the input sentence with “What is the capital of Colombia?”. So even though a model like GPT-3 is powerful at token prediction, it will not work as a chatbot or a copilot. How do we convert a pre-trained model into a useful chatbot like ChatGPT? The answer is finetuning, specifically a type of finetuning called “instruction finetuning”.

What is instruction finetuning?

Instruction finetuning, also referred to as “instruction following”, is the process of teaching a pre-trained base model to behave like a chatbot.

Instruction Finetuning


Instruction finetuning needs datasets in the form of questions and answers. You can use public datasets or your company’s data if it is already in Q&A form. If it is not, you can convert it into Q&A using techniques like the one used to build the Alpaca dataset, or by prompting other LLMs to generate the pairs. Note that instruction finetuning gives the model a new behavior of answering questions not just on the data that you use in finetuning: the new behavior also applies to the knowledge the model already has, which is what makes finetuning such a powerful technique.
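To make the format concrete, here is what a single training example could look like, along with a sketch of the kind of prompt you could send to another LLM to generate such pairs from raw text. Both the record contents and the prompt wording are illustrative, not from any specific dataset or library.

## An illustrative Q&A record - the shape instruction finetuning expects
qa_example = {
    "question": "What does the reset_password endpoint do?",
    "answer": "It emails the user a one-time link that lets them choose a new password.",
}

## If your data is raw text, one option is to prompt another LLM with a template
## like this to generate Q&A pairs (the "custom prompts on other LLMs" idea)
qa_generation_prompt = """Read the passage below and write one question a user might
ask about it, followed by a concise answer.

Passage:
{passage}

Question:"""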

Instruction Finetuning Using Lamini:

Lamini is an AI company that lets developers work with language models easily, abstracting away the complexity of hosting, training, and other complicated aspects. Check out its full capabilities here. We will use Lamini to work with a small language model called Pythia, an open-source model created by EleutherAI, and do instruction finetuning on it using an instruction dataset called Alpaca.


Step 1: Initialize and load Instruction Finetuning dataset

In this step, let’s initialize the required modules and look at the Alpaca training dataset. Here’s the code.

import itertools
import jsonlines

from datasets import load_dataset
from pprint import pprint

from llama import BasicModelRunner  # Lamini's model runner (not used directly below)
from transformers import AutoTokenizer, AutoModelForCausalLM

## we are using the Alpaca dataset, an open-source instruction finetuning dataset
instruction_tuned_dataset = load_dataset("tatsu-lab/alpaca", split="train", streaming=True)
m = 5
print("Instruction-tuned dataset:")
top_m = list(itertools.islice(instruction_tuned_dataset, m))
for j in top_m:
  print(j)

This is what the instruction tuning dataset looks like. It contains data in the form of questions (instructions) and answers.

Instruction fine tuning data set
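Each record is a dictionary with an instruction, an optional input, the expected output, and a pre-formatted text field. The first record looks roughly like this (output abbreviated):

{
  "instruction": "Give three tips for staying healthy.",
  "input": "",
  "output": "1. Eat a balanced diet ... 2. Exercise regularly ... 3. Get enough sleep ...",
  "text": "Below is an instruction that describes a task. ..."
}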


Step 2: Hydrate the prompts

In this step, we take the data from the Alpaca set and put it into the prompt templates shown below.

prompt_template_with_input = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

prompt_template_without_input = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""

## hydrate prompts - meaning add data to the above prompts
processed_data = []
for j in top_m:
  if not j["input"]:
    processed_prompt = prompt_template_without_input.format(instruction=j["instruction"])
  else:
    processed_prompt = prompt_template_with_input.format(instruction=j["instruction"], input=j["input"])

  processed_data.append({"input": processed_prompt, "output": j["output"]})

After doing this, the dataset will look as follows.

Hydrated Dataset
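Concretely, each entry in processed_data now pairs the filled-in prompt template with the original answer, roughly like this (abbreviated for illustration):

{
  "input": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:",
  "output": "1. Eat a balanced diet ... 2. Exercise regularly ... 3. Get enough sleep ..."
}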


We are basically taking the raw Q&A data and converting it into a format that tells the LLM, when asked a question, what the response to that question should look like. We do this for each record and store the result in a JSONL file.

with jsonlines.open(f'alpaca_processed.jsonl', 'w') as writer:
    writer.write_all(processed_data)
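If you want to sanity-check the file, you can read it back with the same library:

## read a few processed prompts back to verify the file was written correctly
with jsonlines.open('alpaca_processed.jsonl') as reader:
    for i, obj in enumerate(reader):
        print(obj["input"][:80], "->", obj["output"][:80])
        if i == 2:
            break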

Step 3: Non-Finetuned Output

In steps 1 & 2 we loaded the raw data, hydrated it, and stored it in JSONL format. Lamini already has this hydrated data ready to go, so technically steps 1 & 2 are not necessary, but walking through them helps in understanding how instruction finetuning works. Let’s first see how a non-finetuned version of the Pythia model responds to a simple question.

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m") #70M parameter model that is not instruction tuned.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
  # Tokenize
  input_ids = tokenizer.encode(
          text,
          return_tensors="pt",
          truncation=True,
          max_length=max_input_tokens
  )

  # Generate
  device = model.device
  generated_tokens_with_prompt = model.generate(
    input_ids=input_ids.to(device),
    max_length=max_output_tokens
  )

  # Decode
  generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

  # Strip the prompt
  generated_text_answer = generated_text_with_prompt[0][len(text):]

  return generated_text_answer

## the 70M model is not instruction tuned and doesn't know any company-specific data;
## we will use the lamini_docs Q&A dataset hosted on Lamini to finetune it
# load the finetuning dataset
finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_path)
#print(finetuning_dataset)

test_sample = finetuning_dataset["test"][0]
print(test_sample)
print("untrained output sample")
print(inference(test_sample["question"], model, tokenizer))

This is the output I got. You will notice that the output is not helpful: the model is trying to do token completion and is not giving an actual answer.

Non Finetuned Output


Step 4: Instruction Finetuned Output

Once the Q&A data seen in the previous step is used to instruction finetune the model, the same model will start to behave like a chatbot and will provide more accurate answers to your questions, both on the finetuned data and on the data the model already contains. It’s almost like when a child learns a language for the first time: he or she can now express the feelings they already had, along with the new things they learned through the language training. Just like the pre-trained version of the model, the instruction finetuned model is also hosted on Lamini and can be inferenced with a command as shown below. (Yes, Lamini is great!)

## finetuned output
instruction_model = AutoModelForCausalLM.from_pretrained("lamini/lamini_docs_finetuned")
print("instruction finetuned output")
print(inference(test_sample["question"], instruction_model, tokenizer))
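The model loaded above was already instruction finetuned by Lamini. If you wanted to run that training step yourself on the same Pythia-70M base model, one common way to do it is the standard causal-LM training loop with Hugging Face’s Trainer. This is only a rough sketch: it is not necessarily what Lamini does internally, it assumes the train split exposes the same question/answer fields as the test sample above, and the hyperparameters are illustrative.

## A rough sketch of doing the instruction finetuning ourselves with Hugging Face's Trainer
from transformers import Trainer, TrainingArguments

tokenizer.pad_token = tokenizer.eos_token  # Pythia's tokenizer has no pad token by default

def tokenize(example):
    # Concatenate question and answer; for causal LM training the labels are the input ids.
    # (A more careful setup would mask the padding positions in the labels with -100.)
    text = example["question"] + example["answer"]
    tokens = tokenizer(text, truncation=True, max_length=512, padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_train = finetuning_dataset["train"].map(
    tokenize, remove_columns=finetuning_dataset["train"].column_names
)

training_args = TrainingArguments(
    output_dir="pythia-70m-instruction-finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=1e-5,
    logging_steps=50,
)

trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_train)
trainer.train()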

Here is what the output looks like. You will notice that instead of the gibberish we saw in the previous step, we get a more accurate output.

Instruction finetuned model’s output


The goal of this post is to give an intro to instruction finetuning and how it is used to turn base models into more usable versions. In future posts, I will dive deep into the actual process of doing instruction finetuning.


That’s it for Day 13 of 100 Days of AI.


I write a newsletter called Above Average where I talk about the second-order insights behind everything that is happening in big tech. If you are in tech and don’t want to be average, subscribe to it.


Follow me on Twitter and LinkedIn for the latest updates on 100 Days of AI. If you are in tech, you might be interested in joining my community of tech professionals here.


Also appears here.