In this tutorial, we fine-tune a RoBERTa model for topic classification using the Hugging Face Transformers and Datasets libraries. By the end of this tutorial, you will have a powerful fine-tuned model for classifying topics and will have published it to Hugging Face 🤗 for others to use.
This article assumes you have a Hugging Face 🤗 account and a working knowledge of Python, NLP, and deep learning.
By meeting these prerequisites, you will be well-prepared to follow the tutorial and get the most out of it.
We start by installing the dependencies.
!pip install transformers datasets huggingface_hub tensorboard==2.11
!sudo apt-get install git-lfs --yes
We then import the needed modules.
import torch
from datasets import load_dataset
from transformers import (
    RobertaTokenizerFast,
    RobertaForSequenceClassification,
    TrainingArguments,
    Trainer,
    AutoConfig,
)
from huggingface_hub import HfFolder, notebook_login
We need to log in to Hugging Face by using a token.
notebook_login()
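If you want to double-check that the login worked, you can optionally query your account; this extra check is my addition and not part of the original walkthrough.
# Optional sanity check: confirm the stored token is valid
from huggingface_hub import whoami
print(whoami()["name"])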
Let's set some variables for easier configuration.
model_id = "roberta-base"
dataset_id = "ag_news"
# replace the value with your model, e.g. <hugging-face-user>/<model-name>
repository_id = "achimoraites/roberta-base_ag_news"
Next, we load our dataset and do some preprocessing.
# Load dataset
dataset = load_dataset(dataset_id)
# Training and testing datasets
train_dataset = dataset['train']
test_dataset = dataset["test"].shard(num_shards=2, index=0)
# Validation dataset
val_dataset = dataset['test'].shard(num_shards=2, index=1)
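Before preprocessing, it can be helpful to take a quick look at the splits and a raw example. This inspection step is optional and my own addition:
# Optional: inspect the split sizes and one raw example
print(train_dataset)
print(val_dataset)
print(train_dataset[0])  # a dict with a "text" string and an integer "label"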
# Preprocessing
tokenizer = RobertaTokenizerFast.from_pretrained(model_id)
# This function tokenizes the input text using the RoBERTa tokenizer.
# It pads each batch to the length of its longest sequence and truncates anything longer than 256 tokens, so all sequences in a batch end up the same length.
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=256)
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
val_dataset = val_dataset.map(tokenize, batched=True, batch_size=len(val_dataset))
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))
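As a quick optional check (my addition), you can verify that the tokenizer added the expected columns:
# Optional: the mapped datasets should now contain input_ids and attention_mask
print(train_dataset.column_names)  # e.g. ['text', 'label', 'input_ids', 'attention_mask']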
The set_format() function specifies the dataset format, making it compatible with PyTorch. The columns argument lists the columns that should be included in the formatted dataset; in this case, those are "input_ids", "attention_mask", and "label".
By setting the format and specifying the relevant columns, we prepare the datasets for use with the Hugging Face Trainer class, which requires PyTorch tensors as input.
# Set dataset format
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
val_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
To make our model easier to use, we will create an id2label mapping that maps the class ids to their labels. This makes it easier to interpret the model's output during inference.
# We will need this to directly output the class names when using the pipeline without mapping the labels later.
# Extract the number of classes and their names
num_labels = dataset['train'].features['label'].num_classes
class_names = dataset["train"].features["label"].names
print(f"number of labels: {num_labels}")
print(f"the labels: {class_names}")
# Create an id2label mapping
id2label = {i: label for i, label in enumerate(class_names)}
# Update the model's configuration with the id2label mapping
config = AutoConfig.from_pretrained(model_id)
config.update({"id2label": id2label})
# Load the pre-trained model with the updated configuration
model = RobertaForSequenceClassification.from_pretrained(model_id, config=config)
Now we will set up our training parameters, the Hugging Face 🤗 repository, and TensorBoard.
# TrainingArguments
training_args = TrainingArguments(
    output_dir=repository_id,
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=10,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=500,
    save_strategy="epoch",
    load_best_model_at_end=True,
    save_total_limit=2,
    report_to="tensorboard",
    push_to_hub=True,
    hub_strategy="every_save",
    hub_model_id=repository_id,
    hub_token=HfFolder.get_token(),
)
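As written, the Trainer will only report the loss during evaluation. If you also want accuracy, one option is to define a small compute_metrics function and pass it to the Trainer below via compute_metrics=compute_metrics. This sketch (including the use of NumPy) is my addition and not part of the original setup:
import numpy as np

# Optional: compute accuracy from the logits and labels returned at evaluation time
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}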
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
We can start the training process by running:
# Fine-tune the model
trainer.train()
Evaluate the model:
# Evaluate the model
trainer.evaluate()
We are ready to publish our model to Hugging Face 🤗
# Save our tokenizer and create a model card
tokenizer.save_pretrained(repository_id)
trainer.create_model_card()
# Push the results to the hub
trainer.push_to_hub()
At this point, our model should have been published and will be available for use. Let's test it!
# TEST MODEL
from transformers import pipeline
classifier = pipeline('text-classification', repository_id)
text = "Kederis proclaims innocence Olympic champion Kostas Kederis today left hospital ahead of his date with IOC inquisitors claiming his innocence and vowing: quot;After the crucifixion comes the resurrection. quot; .."
result = classifier(text)
predicted_label = result[0]["label"]
print(f"Predicted label: {predicted_label}")
You have fine-tuned and published a RoBERTa model for text classification using the Hugging Face 🤗 Transformers and Datasets libraries!
**For reference, here is my fine-tuned model on Hugging Face 🤗.**
**You can find the code here.**
Happy 🤖 learning 😀!
Feature image: Photo by Vlada Karpovich from Pexels: https://www.pexels.com/photo/crop-young-businesswoman-using-laptop-while-drinking-tea-at-home-4050347/