In this tutorial, we fine-tune a RoBERTa model for topic classification using the Hugging Face Transformers and Datasets libraries. By the end of this tutorial, you will have a powerful fine-tuned model for classifying topics and will have published it to Hugging Face 🤗 for others to use.
This article assumes you have a Hugging Face 🤗 account and a working knowledge of Python, NLP, and deep learning.
By meeting these prerequisites, you will be well-prepared to follow the tutorial and get the most out of it.
We start by installing the dependencies.
!pip install transformers datasets huggingface_hub tensorboard==2.11
!sudo apt-get install git-lfs --yes
We then import the needed modules.
import torch
from datasets import load_dataset
from transformers import (
    RobertaTokenizerFast,
    RobertaForSequenceClassification,
    TrainingArguments,
    Trainer,
    AutoConfig,
)
from huggingface_hub import HfFolder, notebook_login
We need to log in to Hugging Face by using a token.
notebook_login()
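If you want to double-check that the login worked, you can optionally query your account; this extra check is my addition and not part of the original walkthrough.
# Optional sanity check: confirm the stored token is valid
from huggingface_hub import whoami
print(whoami()["name"])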
Let's set some variables for easier configuration.
model_id = "roberta-base"
dataset_id = "ag_news"
# replace the value with your model, e.g. <hugging-face-user>/<model-name>
repository_id = "achimoraites/roberta-base_ag_news"
Next, we load our dataset and do some preprocessing.
# Load dataset
dataset = load_dataset(dataset_id)
# Training and testing datasets
train_dataset = dataset['train']
test_dataset = dataset["test"].shard(num_shards=2, index=0)
# Validation dataset
val_dataset = dataset['test'].shard(num_shards=2, index=1)
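Before preprocessing, it can be helpful to take a quick look at the splits and a raw example. This inspection step is optional and my own addition:
# Optional: inspect the split sizes and one raw example
print(train_dataset)
print(val_dataset)
print(train_dataset[0])  # a dict with a "text" string and an integer "label"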
# Preprocessing
tokenizer = RobertaTokenizerFast.from_pretrained(model_id)
# This function tokenizes the input text using the RoBERTa tokenizer.
# It pads each batch to the length of its longest sequence and truncates anything longer than 256 tokens, so all sequences in a batch end up the same length.
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=256)
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
val_dataset = val_dataset.map(tokenize, batched=True, batch_size=len(val_dataset))
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))
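As a quick optional check (my addition), you can verify that the tokenizer added the expected columns:
# Optional: the mapped datasets should now contain input_ids and attention_mask
print(train_dataset.column_names)  # e.g. ['text', 'label', 'input_ids', 'attention_mask']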
The set_format() function specifies the dataset format, making it compatible with PyTorch. The columns argument lists the columns that should be included in the formatted dataset; in this case, those are "input_ids", "attention_mask", and "label".
By setting the format and specifying the relevant columns, we prepare the datasets for use with the Hugging Face Trainer class, which requires PyTorch tensors as input.
# Set dataset format
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
val_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
To make our model easier to use, we will create an id2label mapping that maps the class ids to their labels. This makes it easier to interpret the model's output during inference.
# We will need this to directly output the class names when using the pipeline without mapping the labels later.
# Extract the number of classes and their names
num_labels = dataset['train'].features['label'].num_classes
class_names = dataset["train"].features["label"].names
print(f"number of labels: {num_labels}")
print(f"the labels: {class_names}")
# Create an id2label mapping
id2label = {i: label for i, label in enumerate(class_names)}
# Update the model's configuration with the id2label mapping
config = AutoConfig.from_pretrained(model_id)
config.update({"id2label": id2label})
# Load the pre-trained model with the updated configuration
model = RobertaForSequenceClassification.from_pretrained(model_id, config=config)
Now we will set up our training parameters, the Hugging Face 🤗 repository, and TensorBoard.
# TrainingArguments
training_args = TrainingArguments(
    output_dir=repository_id,
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=10,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=500,
    save_strategy="epoch",
    load_best_model_at_end=True,
    save_total_limit=2,
    report_to="tensorboard",
    push_to_hub=True,
    hub_strategy="every_save",
    hub_model_id=repository_id,
    hub_token=HfFolder.get_token(),
)
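As written, the Trainer will only report the loss during evaluation. If you also want accuracy, one option is to define a small compute_metrics function and pass it to the Trainer below via compute_metrics=compute_metrics. This sketch (including the use of NumPy) is my addition and not part of the original setup:
import numpy as np

# Optional: compute accuracy from the logits and labels returned at evaluation time
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}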
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
We can start the training process by running:
# Fine-tune the model
trainer.train()
Evaluate the model:
# Evaluate the model
trainer.evaluate()
We are ready to publish our model to Hugging Face 🤗
# Save our tokenizer and create a model card
tokenizer.save_pretrained(repository_id)
trainer.create_model_card()
# Push the results to the hub
trainer.push_to_hub()
At this point, our model should have been published and will be available for use. Let's test it!
# TEST MODEL
from transformers import pipeline
classifier = pipeline('text-classification', repository_id)
text = "Kederis proclaims innocence Olympic champion Kostas Kederis today left hospital ahead of his date with IOC inquisitors claiming his innocence and vowing: quot;After the crucifixion comes the resurrection. quot; .."
result = classifier(text)
predicted_label = result[0]["label"]
print(f"Predicted label: {predicted_label}")
You have fine-tuned and published a RoBERTa model for text classification using the Hugging Face 🤗 Transformers and Datasets libraries!
**For reference, here is my fine-tuned model on Hugging Face 🤗.**
**You can find the code here.**
Happy 🤖 learning 😀!
Feature image: Photo by Vlada Karpovich from Pexels: https://www.pexels.com/photo/crop-young-businesswoman-using-laptop-while-drinking-tea-at-home-4050347/