It costs millions of dollars and months of computing time to train a large language model from the ground up. You most likely never need to do it. Fine-tuning lets you adapt pre-trained language models to your needs in hours or days, not months, with a fraction of the resources. This tutorial takes you from theory to practice: you'll learn the four core fine-tuning methods, build a complete training pipeline in Python, and pick up the practices that separate production-ready models from expensive experiments.

What Is LLM Fine-Tuning?

Fine-tuning trains an existing language model on your data to improve its performance on specific tasks. Pre-trained models are powerful generalists, but exposing them to focused examples can turn them into specialists for your use case.

Instead of building a model from scratch (which requires massive compute and data), you're giving an already-capable model a crash course in what matters to you, whether that's medical diagnosis, customer support automation, sentiment analysis, or any other specialized task.

How Does LLM Fine-Tuning Work?

Fine-tuning continues the training process on a pre-trained language model using your specific dataset. The model processes the examples you provide, compares its own outputs to the expected results, and updates its internal weights to minimize the loss.

The approach varies based on your goals, available data, and computational resources. Some projects require full fine-tuning, where you update all model parameters, while others work better with parameter-efficient methods like LoRA that modify only a small subset.

LLM Fine-Tuning Methods

Supervised Fine-Tuning

SFT teaches the model the patterns in correct question-answer pairs and adjusts its weights to match those answers exactly. You need a dataset of (Prompt, Ideal Response) pairs. Use it when you want consistent outputs, like making the model always respond in JSON format, follow your customer service script, or write emails in your company's tone.

Unsupervised Fine-Tuning

Unsupervised fine-tuning feeds the model large amounts of raw text (no questions or labeled data needed) so it learns the vocabulary and patterns of a particular domain. Technically this is a continuation of pre-training, known as Continued Pre-Training (CPT), performed after the initial pre-training phase. Use it first when your model needs to understand specialized content it wasn't originally trained on, like medical terminology, legal contracts, or a new language.

Direct Preference Optimization

DPO teaches the model to prefer better responses by showing it good and bad answers to the same question and adjusting it to favor the good ones. It needs (Prompt, Good Response, Bad Response) triplets. Use DPO after basic training to fix annoying behaviors, like stopping the model from making things up, being too wordy, or giving unsafe answers.

Reinforcement Fine-Tuning

In RLHF, you first train a reward model on prompts with multiple responses ranked by humans, teaching it to predict which responses people prefer. You then use reinforcement learning to fine-tune the model that generates responses, with the reward model judging its outputs, so it learns over time to produce higher-scoring answers. This process requires datasets in the format (Prompt, [Response A, Response B, ...], [Rankings]). It's best for tasks where judging quality is easier than creating perfect examples, like medical diagnoses, legal research, and other complex domain-specific reasoning.
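To make these data requirements concrete, here's a minimal sketch of what a single training example for each method might look like. The field names are illustrative assumptions, not a fixed schema; whichever training library you use (for example, TRL's SFTTrainer or DPOTrainer) will expect its own column names.

# Illustrative examples only; the field names are assumptions, not a required schema.

sft_example = {
    "prompt": "Summarize this support ticket: 'My order #123 arrived damaged.'",
    "response": '{"intent": "damaged_item", "order_id": "123"}',
}

dpo_example = {
    "prompt": "Explain what an API is.",
    "chosen": "An API is a set of rules that lets two programs exchange data and requests.",
    "rejected": "An API is basically computer magic that does stuff.",
}

rlhf_example = {
    "prompt": "Draft a polite refund email.",
    "responses": ["Dear customer, ...", "Hey, ...", "To whom it may concern, ..."],
    "rankings": [1, 3, 2],  # Human preference order, 1 = best.
}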
Step-by-Step Fine-Tuning LLMs Tutorial

We'll walk you through every step of fine-tuning a small pre-trained model to solve word-based math problems, something it struggles with out of the box. We'll use the Qwen 2.5 base model with 0.5B parameters, which already has natural language processing capabilities. The approach works for virtually any LLM fine-tuning use case: teaching a model specialized terminology, improving its performance on specific tasks, or adapting it to your domain.

Prerequisites

Install a few Python packages that we'll use throughout this tutorial. In a new project folder, create and activate a Python virtual environment, then install these libraries using pip or your preferred package manager:

pip install requests datasets transformers 'transformers[torch]'

1. Get & Load the Dataset

The fine-tuning process starts with choosing the dataset, which is arguably the most important decision. The dataset should directly reflect the task you want your model to perform.

Simple tasks like sentiment analysis need basic input-output pairs. Complex tasks like instruction following or question answering require richer datasets with context, examples, and varied formats. Data quality and size directly impact training time and your model's performance.

The easiest starting point is the Hugging Face dataset hub, which hosts thousands of open-source datasets for different domains and tasks. Need something specific and high-quality? Purchase specialized datasets or build your own by scraping publicly available data.

For example, if you want to build a sentiment analysis model for Amazon product reviews, you may want to collect data from real reviews using a web scraping tool. Here's a simple example that uses Oxylabs Web Scraper API:

import json
import requests

# Web Scraper API parameters.
payload = {
    "source": "amazon_product",
    # Query is the ASIN of a product.
    "query": "B0DZDBWM5B",
    "parse": True,
}

# Send a request to the API and get the response.
response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    # Visit https://dashboard.oxylabs.io to claim FREE API tokens.
    auth=("USERNAME", "PASSWORD"),
    json=payload,
)
print(response.text)

# Extract the reviews from the response.
reviews = response.json()["results"][0]["content"]["reviews"]
print(f"Found {len(reviews)} reviews")

# Save the reviews to a JSON file.
with open("reviews.json", "w") as f:
    json.dump(reviews, f, indent=2)

For this tutorial, let's keep it simple without building a custom data collection pipeline. Since we're teaching the base model to solve word-based math problems, we can use the openai/gsm8k dataset, a collection of grade-school math problems with step-by-step solutions. Load it in your Python file:

from datasets import load_dataset

dataset = load_dataset("openai/gsm8k", "main")
print(dataset["train"][0])
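If you do go the scraping route for a task like sentiment analysis, you'd still need to reshape the raw reviews into training pairs before tokenizing. Here's a minimal sketch of that step, assuming each parsed review includes rating and content fields; check your actual reviews.json, since the field names may differ:

import json

with open("reviews.json") as f:
    reviews = json.load(f)

pairs = []
for review in reviews:
    # "content" and "rating" are assumed field names; adjust them to your data.
    text = review.get("content", "")
    rating = float(review.get("rating", 0) or 0)
    if not text:
        continue
    label = "positive" if rating >= 4 else "negative"
    pairs.append({
        "question": f"Classify the sentiment of this review: {text}",
        "answer": label,
    })

# Reusable later via load_dataset("json", data_files="sentiment_pairs.json").
with open("sentiment_pairs.json", "w") as f:
    json.dump(pairs, f, indent=2)

Keeping the same question and answer column names as GSM8K means the tokenization code in the next step would work on this dataset without changes.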
2. Tokenize the Data for Processing

Models don't understand text directly; they work with numbers. Tokenization converts your text into tokens (numerical representations) that the model can process. Every model has its own tokenizer trained alongside it, so use the one that matches your base model:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer.pad_token = tokenizer.eos_token

How we tokenize the data shapes what the model learns. For math problems, we want the model to learn how to answer questions, not generate them. Here's the trick: tokenize questions and answers separately, then apply a masking technique.

Setting the question positions in the labels to -100 tells the training process to ignore them when calculating loss. The model only learns from the answers, making training more focused and efficient.

def tokenize_function(examples):
    input_ids_list = []
    labels_list = []

    for question, answer in zip(examples["question"], examples["answer"]):
        # Tokenize question and answer separately
        question_tokens = tokenizer(question, add_special_tokens=False)["input_ids"]
        answer_tokens = tokenizer(answer, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]

        # Combine question + answer for input
        input_ids = question_tokens + answer_tokens
        # Mask question tokens with -100 so loss is only computed on the answer
        labels = [-100] * len(question_tokens) + answer_tokens

        input_ids_list.append(input_ids)
        labels_list.append(labels)

    return {
        "input_ids": input_ids_list,
        "labels": labels_list,
    }
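Before mapping this over the whole dataset, it's worth sanity-checking the masking on a single example. This quick check isn't part of the pipeline; it just decodes the unmasked label tokens, which should give back the answer text followed by the end-of-sequence token:

# Optional sanity check on one example.
sample = dataset["train"][0]
tokenized = tokenize_function({"question": [sample["question"]], "answer": [sample["answer"]]})

input_ids = tokenized["input_ids"][0]
labels = tokenized["labels"][0]

# Only answer tokens should remain after dropping the -100 masks.
answer_only = [token for token in labels if token != -100]
print(f"Total tokens: {len(input_ids)}, masked (question) tokens: {labels.count(-100)}")
print(tokenizer.decode(answer_only))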
Apply this tokenization function to both the training and test datasets. We filter out examples longer than 512 tokens to keep memory usage manageable and ensure the model processes complete information without truncation. Shuffling the training data helps the model learn more effectively:

train_dataset = dataset["train"].map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
).filter(lambda x: len(x["input_ids"]) <= 512).shuffle(seed=42)

eval_dataset = dataset["test"].map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["test"].column_names,
).filter(lambda x: len(x["input_ids"]) <= 512)

print(f"Samples: {len(dataset['train'])} → {len(train_dataset)} (after filtering)")
print(f"Samples: {len(dataset['test'])} → {len(eval_dataset)} (after filtering)")

Optional: Want to test the entire pipeline quickly before committing to a full training run? You can train the model on a subset of the dataset. Instead of using the full ~8.5K examples, you can cut the total down to 3K, which makes the process much faster:

train_dataset = train_dataset.select(range(2000))
eval_dataset = eval_dataset.select(range(1000))

Keep in mind: smaller datasets increase overfitting risk, where the model memorizes training data rather than learning general patterns. For production, aim for at least 5K+ training samples and carefully tune your hyperparameters.

3. Initialize the Base Model

Next, load the pre-trained base model that we'll fine-tune to improve its math problem-solving abilities:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
model.config.pad_token_id = tokenizer.pad_token_id
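As a quick check that the model loaded correctly (and to see what "0.5B parameters" means in practice), you can count its parameters. This isn't part of the pipeline, just an optional inspection:

# Optional: inspect the model size before training.
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params / 1e6:.1f}M")
print(f"Trainable parameters: {trainable_params / 1e6:.1f}M")

With full fine-tuning, every parameter is trainable; with a parameter-efficient method like LoRA (covered in the best practices later), the trainable count drops to a small fraction of the total.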
4. Fine-Tune Using the Trainer Method

This is where the magic happens. TrainingArguments controls how your model learns (think of it as the recipe determining your final result's quality). These settings and hyperparameters can make or break your fine-tuning, so experiment with different values to find what works for your use case.

Key parameters explained:

● Epochs: More epochs mean more learning opportunities, but too many cause overfitting.
● Batch size: Affects memory usage and training speed. Adjust it based on your hardware.
● Learning rate: Controls how quickly the model adjusts. Too high and it might miss the optimal solution; too low and training takes forever.
● Weight decay: Helps prevent overfitting by deterring the model from leaning too heavily on any single pattern. If weight decay is too large, it can cause underfitting by preventing the model from learning the necessary patterns.

The configuration below is set up for CPU training (remove use_cpu=True if you have a GPU):

from transformers import TrainingArguments, Trainer, DataCollatorForSeq2Seq

training_args = TrainingArguments(
    output_dir="./qwen-math",  # Custom output directory for the fine-tuned model
    use_cpu=True,  # Set to False or remove to use a GPU if available

    # Training duration
    num_train_epochs=2,  # 3 may improve reasoning at the expense of overfitting

    # Batch size and memory management
    per_device_train_batch_size=5,  # Adjust based on your hardware
    per_device_eval_batch_size=5,  # Adjust based on your hardware
    gradient_accumulation_steps=4,  # Decreases memory usage, adjust if needed

    # Learning rate and regularization
    learning_rate=2e-5,  # Affects learning speed and overfitting
    weight_decay=0.01,  # Prevents overfitting by penalizing large weights
    max_grad_norm=1.0,  # Prevents exploding gradients
    warmup_ratio=0.1,  # Gradually increases learning rate to stabilize training
    lr_scheduler_type="cosine",  # Smoother decay than linear

    # Evaluation and checkpointing
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,  # Keep only the 3 most recent checkpoints
    load_best_model_at_end=True,  # Load the best checkpoint at the end of training
    metric_for_best_model="eval_loss",
    greater_is_better=False,

    # Logging
    logging_steps=25,
    logging_first_step=True,
)

# Data collator handles padding and batching
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)

# Fine-tune the base model
print("Fine-tuning started...")
trainer.train()

Once training completes, save your fine-tuned model:

trainer.save_model("./qwen-math/final")
tokenizer.save_pretrained("./qwen-math/final")
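Because checkpoints are saved every 100 steps, an interrupted run doesn't have to start from scratch. If training stops partway through, you can resume it from the latest checkpoint in the output directory:

# Resume an interrupted run from the most recent checkpoint in output_dir,
# or pass a specific path such as "./qwen-math/checkpoint-500" (example path).
trainer.train(resume_from_checkpoint=True)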
5. Evaluate the Model

After fine-tuning, measure how well your model performs using two common metrics:

● Loss: Measures how far off the model's predictions are from the target outputs; lower values indicate better performance.
● Perplexity (the exponential of loss): Shows the same information on a more intuitive scale; lower values mean the model is more confident in its predictions.

For production environments, consider adding metrics like BLEU or ROUGE to measure how closely generated responses match reference answers.

import math

eval_results = trainer.evaluate()
print(f"Final Evaluation Loss: {eval_results['eval_loss']:.4f}")
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

You can also include other metrics like F1, which balances precision and recall so you can see whether the model catches what matters while staying accurate. This Hugging Face lecture is a good starting point for learning the essentials of the transformers library.
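Loss and perplexity don't directly tell you how often the model gets the final answer right. GSM8K solutions end with the result after a "####" marker, so one practical addition (not part of the original pipeline) is a small exact-match check on a handful of test problems. The generation settings below are illustrative:

# Optional: rough exact-match accuracy on a few GSM8K test problems.
def extract_final_answer(text):
    # GSM8K answers end with "#### <number>"; fall back to the last line otherwise.
    if "####" in text:
        return text.split("####")[-1].strip()
    lines = text.strip().splitlines()
    return lines[-1].strip() if lines else ""

samples = dataset["test"].select(range(20))  # Keep it small; generation is slow on CPU.
correct = 0
for example in samples:
    inputs = tokenizer(example["question"], return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    generated = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    if extract_final_answer(generated) == extract_final_answer(example["answer"]):
        correct += 1

print(f"Exact-match accuracy on {len(samples)} problems: {correct / len(samples):.0%}")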
Complete fine-tuning code example

After these five steps, you should have the following code combined into a single Python file:

import math
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForSeq2Seq,
)

dataset = load_dataset("openai/gsm8k", "main")

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer.pad_token = tokenizer.eos_token

# Tokenization function adjusted for the specific dataset format
def tokenize_function(examples):
    input_ids_list = []
    labels_list = []

    for question, answer in zip(examples["question"], examples["answer"]):
        question_tokens = tokenizer(question, add_special_tokens=False)["input_ids"]
        answer_tokens = tokenizer(answer, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]

        input_ids = question_tokens + answer_tokens
        labels = [-100] * len(question_tokens) + answer_tokens

        input_ids_list.append(input_ids)
        labels_list.append(labels)

    return {
        "input_ids": input_ids_list,
        "labels": labels_list,
    }

# Tokenize the data
train_dataset = dataset["train"].map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
).filter(lambda x: len(x["input_ids"]) <= 512).shuffle(seed=42)

eval_dataset = dataset["test"].map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["test"].column_names,
).filter(lambda x: len(x["input_ids"]) <= 512)

print(f"Samples: {len(dataset['train'])} → {len(train_dataset)} (after filtering)")
print(f"Samples: {len(dataset['test'])} → {len(eval_dataset)} (after filtering)")

# Optional: Use a smaller subset for faster testing
# train_dataset = train_dataset.select(range(2000))
# eval_dataset = eval_dataset.select(range(1000))

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
model.config.pad_token_id = tokenizer.pad_token_id

# Configuration settings and hyperparameters for fine-tuning
training_args = TrainingArguments(
    output_dir="./qwen-math",
    use_cpu=True,

    # Training duration
    num_train_epochs=2,

    # Batch size and memory management
    per_device_train_batch_size=5,
    per_device_eval_batch_size=5,
    gradient_accumulation_steps=4,

    # Learning rate and regularization
    learning_rate=2e-5,
    weight_decay=0.01,
    max_grad_norm=1.0,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",

    # Evaluation and checkpointing
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,

    # Logging
    logging_steps=25,
    logging_first_step=True,
)

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)

# Fine-tune the base model
print("Fine-tuning started...")
trainer.train()

# Save the final model
trainer.save_model("./qwen-math/final")
tokenizer.save_pretrained("./qwen-math/final")

# Evaluate after fine-tuning
eval_results = trainer.evaluate()
print(f"Final Evaluation Loss: {eval_results['eval_loss']:.4f}")
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
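It also helps to know roughly how many optimizer steps this configuration produces, since eval_steps and save_steps are counted in optimizer steps. Here's a back-of-the-envelope estimate; the Trainer's exact count may differ slightly due to rounding:

# Rough estimate of total optimizer steps for the configuration above.
effective_batch = 5 * 4            # per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = len(train_dataset) // effective_batch
total_steps = steps_per_epoch * 2  # num_train_epochs
print(f"~{steps_per_epoch} steps per epoch, ~{total_steps} steps in total")

With the full training split, that comes out to a few hundred steps per epoch, so evaluating and checkpointing every 100 steps gives you several checkpoints per run.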
Before executing, take a moment to adjust your trainer configuration and hyperparameters based on what your machine can actually handle. To give you a real-world reference, here's what worked smoothly for us on a MacBook Air with the M4 chip and 16GB of RAM. With this setup, fine-tuning took around 6.5 hours to complete:

● Batch size for training: 7
● Batch size for eval: 7
● Gradient accumulation: 5

As your model trains, keep an eye on the evaluation loss. If it increases while the training loss drops, the model is overfitting. In that case, adjust the number of epochs, lower the learning rate, modify weight decay, or tune other hyperparameters. In our run, the results were healthy, with eval loss decreasing from 0.496 to 0.469 and a final perplexity of 1.60.

6. Test the Fine-Tuned Model

Now for the moment of truth: was our fine-tuning actually successful? You can manually test the fine-tuned model by prompting it with this Python code:

from transformers import pipeline

generator = pipeline(
    "text-generation",
    # Use "Qwen/Qwen2.5-0.5B" here to test the base model instead
    model="./qwen-math/final"
)

output = generator(
    "James has 5 apples. He buys 3 times as many. Then gives half away. How many does he have?",
    return_full_text=False
)
print(output[0]["generated_text"])

Comparing how the base and fine-tuned models respond to the same question (the correct answer is 10) shows the difference. With sampling enabled, both models occasionally get it right or wrong due to randomness. But setting do_sample=False in the generator() call reveals their true confidence, since the model then always picks its highest-probability answer: the base model confidently outputs -2 (wrong), while the fine-tuned model confidently outputs 10 (correct). That's fine-tuning at work.

Fine-Tuning Best Practices

Model Selection

● Choose the right base model: Domain-specific models and appropriate context windows save you from fighting against the model's existing knowledge.
● Understand the model architecture: Encoder-only models (like BERT) excel at classification tasks, decoder-only models (like GPT) at text generation, and encoder-decoder models (like T5) at transformation tasks like translation or summarization.
● Match your model's input format: If your base model was trained with specific prompt templates, use the same format in fine-tuning. Mismatched formats confuse the model and tank performance.

Data Preparation

● Prioritize data quality over quantity: Clean, accurate examples beat massive, noisy datasets every time.
● Split training and evaluation samples: Never let your model see evaluation data during training. This lets you catch overfitting before it ruins your model.
● Establish a "golden set" for evaluation: Automated metrics like perplexity don't tell you whether the model actually follows instructions or just predicts words statistically.

Training Strategy

● Start with a lower learning rate: You're making minor adjustments, not teaching the model from scratch, so aggressive rates may erase what it learned during pre-training.
● Use parameter-efficient fine-tuning (LoRA/PEFT): Train only around 1% of the parameters to get 90%+ of the performance while using far less memory and time (see the sketch after this list).
● Target all linear layers in LoRA: Targeting all projection layers (q_proj, k_proj, v_proj, o_proj, etc.) yields models that reason significantly better, not ones that just mimic style.
● Use NEFTune (noisy embedding fine-tuning): Random noise in the embeddings acts as regularization, which can prevent memorization and boost conversational quality by 35+ percentage points.
● Run DPO after SFT: Don't just stop after SFT. SFT teaches the model how to talk; DPO teaches it what is good by learning from preference pairs.
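Here's a minimal sketch of the LoRA advice above, using the peft library (pip install peft). The rank, alpha, and dropout values are illustrative rather than tuned recommendations, and the target module names match the Qwen 2.5 architecture; other model families may use different layer names:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

# Illustrative LoRA settings; tune r, lora_alpha, and lora_dropout for your task.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Target all linear projection layers, not just the attention queries and values.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Shows the small fraction of weights that will train.

The wrapped model drops into the same Trainer setup from the tutorial; only the adapter weights are updated during training.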
What Are the Limitations of LLM Fine-Tuning?

● Catastrophic forgetting: Fine-tuning overwrites existing neural patterns, which can erase valuable general knowledge the model learned during pre-training. Multi-task learning, where you train on your specialized task alongside general examples, can help preserve broader capabilities.
● Overfitting on small datasets: The model may memorize your training examples instead of learning patterns, causing it to fail on slightly different inputs.
● High computational cost: Fine-tuning billions of parameters requires expensive GPUs, significant memory, and hours to days or weeks of training time.
● Bias amplification: Pre-trained models already carry biases from their training data, and fine-tuning can intensify those biases if your dataset isn't carefully curated.
● Manual knowledge updates: Incorporating new or external knowledge may require retraining the model or implementing Retrieval-Augmented Generation (RAG), while repeated fine-tuning often degrades performance.

Conclusion

Fine-tuning works, but only if your data is clean and your hyperparameters are dialed in. Combine it with prompt engineering for the best results: fine-tuning handles task specialization, while prompt engineering guides the model's behavior at inference time.

Continue by grabbing a model from Hugging Face that fits your use case for domain-specific fine-tuning, scrape or build a quality dataset for your task, and run your first fine-tuning session on a small subset. Once you see promising results, scale up and experiment with LoRA, DPO, or NEFTune to squeeze out better performance. The gap between reading this tutorial and having a working specialized model is smaller than you think.