Photo by Mick De Paola on Unsplash

The “Maybe just a quick one” series title is inspired by my most common reply to “Fancy a drink?”, which may or may not end up in a long night. Likewise, these posts are intended to be short, but I get carried away sometimes, so apologies in advance.

About 🤗 Transformers

🤗 Transformers (Hugging Face transformers) is a collection of state-of-the-art NLU (Natural Language Understanding) and NLG (Natural Language Generation) models. They offer a wide variety of architectures to choose from (BERT, GPT-2, RoBERTa etc.) as well as a hub of pre-trained models uploaded by users and organisations.

Fine-tuning a model

One of the things that makes this library such a powerful tool is that we can use the models as a basis for transfer learning tasks. In other words, they can be a starting point to apply some fine-tuning using our own data. The library is designed to work easily with both TensorFlow and PyTorch.

🤗 Datasets (Hugging Face Datasets) is a wrapper library that provides some tools to load and process data in many commonly used formats (CSV, JSON etc.). It also makes sharing datasets and metrics for Natural Language Processing extremely easy.

🤗 Datasets originated from a fork of the awesome TensorFlow Datasets and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library.

Well, let’s write some code

In this example, we will start with a pre-trained BERT model (bert-base-cased) and fine-tune it on the Hate Speech and Offensive Language dataset. We will then test it on classifying tweets as hate speech, offensive language, or neither. All coding is done in Google Colab.

Please note: this dataset contains text that can be considered racist, sexist, homophobic, or generally offensive.

So let’s start by installing some necessary packages, importing them and loading the dataset. The dataset is stored in Google Drive, and the path to load it from is /content/drive/MyDrive/Data/labeled_data.csv. So if you code along, please make sure you change the path to point to your own dataset file.

We are using the load_dataset function to load it and then split it into train, validation, and test sets. The 3 sets are then gathered together to form a DatasetDict. This is a dictionary class that offers us many methods to process the data. We will then remove some of the columns we don’t need for our classification task.

# Install the necessary packages
!pip install transformers
!pip install datasets

from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
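The loading code calls train_test_split() twice: once on the full dataset, and once more on the held-out part. As a rough sketch of what that does to the proportions, the snippet below works out the share of rows that ends up in each final split. It assumes the hold-out fraction is 0.25 when train_test_split() is called with no arguments; the helper name nested_split_shares is made up for this illustration and is not part of the library.

```python
from fractions import Fraction

# Assumption: train_test_split() with no arguments holds out 25% of the rows.
DEFAULT_TEST_FRACTION = Fraction(1, 4)


def nested_split_shares(test_fraction=DEFAULT_TEST_FRACTION):
    """Share of the full dataset that lands in each final split."""
    train = 1 - test_fraction               # 'train' part of the first split
    held_out = test_fraction                # 'test' part of the first split, split again
    valid = held_out * (1 - test_fraction)  # 'train' part of the second split
    test = held_out * test_fraction         # 'test' part of the second split
    return {"train": train, "valid": valid, "test": test}


shares = nested_split_shares()
for name, share in shares.items():
    print(f"{name}: {float(share):.4f}")
# train: 0.7500
# valid: 0.1875
# test: 0.0625
```

Under that assumption, only about 6% of the rows are left for the final test set, which is worth keeping in mind when reading the test accuracy later on.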

# The dataset is located in our Google Drive data folder.

DATA_PATH = "/content/drive/MyDrive/Data/labeled_data.csv"

# Load the CSV as a single dataset, then split it twice: train vs. held-out, then held-out into valid vs. test
dataset = load_dataset('csv', data_files=DATA_PATH, split='train')
train_testvalid = dataset.train_test_split()
test_valid = train_testvalid['test'].train_test_split()
train_test_valid_dataset = DatasetDict({'train': train_testvalid['train'], 'test': test_valid['test'], 'valid': test_valid['train']})

# Drop the columns we don't need for the classification task
dataset = train_test_valid_dataset.remove_columns(['hate_speech', 'offensive_language', 'neither', 'Unnamed: 0', 'count'])

So now we need to preprocess the data. The tool responsible for this is a Tokenizer.

What do tokenizers do? Very simply put, they split the data into tokens (these can be characters, words, or parts of words, depending on the model) and convert them into tensors of numeric ids, which is the form that the model can read. For this task, we are using the tokenizer from the pre-trained model we selected (bert-base-cased). But let’s see how we achieve this:

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    # Pad every tweet to the model's maximum length and cut off anything longer
    return tokenizer(examples["tweet"], padding="max_length", truncation=True)
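To make the "split into tokens, convert to ids, pad and truncate" idea concrete, here is a toy sketch in plain Python. It is not how the BERT tokenizer works internally (BERT uses WordPiece sub-words and a vocabulary of tens of thousands of entries); the whitespace splitting, the tiny VOCAB, and the toy_tokenize helper are all made up for illustration. Only the output keys input_ids and attention_mask mirror names the real tokenizer returns.

```python
# Toy vocabulary with hypothetical ids; not BERT's real vocabulary.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "hello": 4, "world": 5, "again": 6}


def toy_tokenize(text, max_length=6):
    """Whitespace-tokenize, map to ids, then truncate/pad to max_length."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]

    # Truncation: keep at most max_length tokens, always ending with [SEP].
    if len(tokens) > max_length:
        tokens = tokens[: max_length - 1] + ["[SEP]"]

    # Unknown words fall back to the [UNK] id.
    input_ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]
    attention_mask = [1] * len(input_ids)

    # Padding: fill up to max_length; the mask marks padding with 0.
    pad = max_length - len(input_ids)
    input_ids += [VOCAB["[PAD]"]] * pad
    attention_mask += [0] * pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short = toy_tokenize("Hello world")
long = toy_tokenize("hello world again and again hello")
print(short)  # {'input_ids': [2, 4, 5, 3, 0, 0], 'attention_mask': [1, 1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [2, 4, 5, 6, 1, 3], 'attention_mask': [1, 1, 1, 1, 1, 1]}
```

Every example comes out with the same length, which is what lets the examples be stacked into fixed-size batches.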

# Tokenize every split in batches
tokenized_datasets = dataset.map(tokenize_function, batched=True)
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["valid"]
test_dataset = tokenized_datasets['test']

# Drop the raw text and expose the remaining columns as TensorFlow tensors
tf_train_dataset = train_dataset.remove_columns(["tweet"]).with_format("tensorflow")
tf_eval_dataset = eval_dataset.remove_columns(["tweet"]).with_format("tensorflow")
tf_test_dataset = test_dataset.remove_columns(["tweet"]).with_format("tensorflow")

# Build (features, label) datasets; only the training set is shuffled
train_features = {x: tf_train_dataset[x].to_tensor() for x in tokenizer.model_input_names}
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_train_dataset["class"]))
train_tf_dataset = train_tf_dataset.shuffle(len(tf_train_dataset)).batch(8)

eval_features = {x: tf_eval_dataset[x].to_tensor() for x in tokenizer.model_input_names}
eval_tf_dataset = tf.data.Dataset.from_tensor_slices((eval_features, tf_eval_dataset["class"]))
eval_tf_dataset = eval_tf_dataset.batch(8)

test_features = {x: tf_test_dataset[x].to_tensor() for x in tokenizer.model_input_names}
test_tf_dataset = tf.data.Dataset.from_tensor_slices((test_features, tf_test_dataset["class"]))
test_tf_dataset = test_tf_dataset.batch(8)

Notice how we used the dataset map function to apply our user-defined tokenize_function to all the elements of the dataset. After applying the tokenization, we created 3 Tensorflow Datasets to feed the model with.

We are now ready to train the model. Again, we are using the selected pre-trained model, transferring the “knowledge” it already has, but replacing its head with one that is suited to our task. We are using the TFAutoModelForSequenceClassification class, which represents a generic Tensorflow (hence the TF prefix) model with a sequence classification head. Also, notice the num_labels parameter, which is set to 3, as this is a multi-class task with 3 distinct labels. After the training is finished, we plot the Sparse Categorical Accuracy and the Loss of both the train and the validation dataset.

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    # The model outputs raw logits, so the loss applies the softmax itself
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)
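For readers wondering what a sparse categorical cross-entropy computed from logits actually measures, here is a small plain-Python sketch for a single example. It follows the textbook definition, -log(softmax(logits)[label]), rather than Keras internals, and the logits below are hypothetical values chosen for illustration.

```python
import math


def sparse_categorical_crossentropy_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]


# Hypothetical logits for a 3-class problem (label is an integer class index).
uniform = sparse_categorical_crossentropy_from_logits([0.0, 0.0, 0.0], 1)
confident_right = sparse_categorical_crossentropy_from_logits([0.0, 5.0, 0.0], 1)
confident_wrong = sparse_categorical_crossentropy_from_logits([0.0, 5.0, 0.0], 2)

print(f"uniform guess:   {uniform:.4f}")   # equals ln(3)
print(f"confident right: {confident_right:.4f}")
print(f"confident wrong: {confident_wrong:.4f}")
```

"Sparse" here only means the label is given as an integer class index (0, 1 or 2) instead of a one-hot vector, which matches the integer class column used as the label in this post.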

history = model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=2)

# Accuracy per epoch, train vs. validation
plt.plot(history.history['sparse_categorical_accuracy'])
plt.plot(history.history['val_sparse_categorical_accuracy'])
plt.title('model sparse categorical accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

# Loss per epoch, train vs. validation
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

We can now evaluate the model on the test dataset we created earlier:

test_loss, test_acc = model.evaluate(test_tf_dataset, verbose=2)
print('\nTest accuracy:', test_acc)

194/194 - 62s - loss: 0.2596 - sparse_categorical_accuracy: 0.9135

Test accuracy: 0.9134925603866577

model.save_pretrained("/content/drive/MyDrive/Data/hate-speech-bert")

Notice that we save the model with the save_pretrained function offered by Transformers. This action generates a directory with two files by default: a .json file that contains the model configuration and a .h5 file with the model weights. We can also push the model to the Hugging Face Models Hub, should we want to, in order to make it available to the public.

Does it work though?

Let’s see how our model does in classifying some unseen text. I will use some stereotypical racist/offensive/sexist texts posted on social media.

Warning: Due to the nature of this task, the language used here can be racist, sexist, and offensive. However, this is the only way to evaluate the model’s ability.

pred2label = {0: 'Hate Speech', 1: 'Offensive Language', 2: 'Neither'}
preds = model(tokenizer(["Jews are useless , I don't see why they even exist", "Gay people suck", "Women are dressed up like whores these days"], return_tensors="tf", padding=True, truncation=True))['logits']
print(preds)

# Pick the class with the highest logit for each text
class_preds = np.argmax(preds, axis=1)
for pred in class_preds:
    print(pred2label[pred])

tf.Tensor(
[[ 0.37532297  0.14053927 -0.8647832 ]
 [ 0.04699412 -0.17951615 -0.3738104 ]
 [ 0.2524849   2.586107   -2.8212454 ]], shape=(3, 3), dtype=float32)
Hate Speech
Hate Speech
Offensive Language

Looks like our model managed to correctly (well, that’s subjective, but generally speaking these look correct) classify the texts.

Further reading:

If the above seems interesting to you, there is a lot more that this library can do. I would start by checking their documentation, which is quite extensive, as well as some quick video courses they provide:

🤗 Transformers
Model Hub
🤗 Datasets
Crash course (with YouTube videos, we always like them)
Fine-tuning with custom datasets

Happy coding.
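A postscript on the prediction step: the tensor printed in the "Does it work though?" section holds raw logits, one row per text and one column per class. The sketch below replays the argmax step in plain Python on those printed values and also converts each row to probabilities with a softmax. The softmax step is an addition for illustration and is not part of the original code.

```python
import math

PRED2LABEL = {0: "Hate Speech", 1: "Offensive Language", 2: "Neither"}

# Logits copied from the tf.Tensor printed in the article.
LOGITS = [
    [0.37532297, 0.14053927, -0.8647832],
    [0.04699412, -0.17951615, -0.3738104],
    [0.2524849, 2.586107, -2.8212454],
]


def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)


labels = [PRED2LABEL[argmax(row)] for row in LOGITS]
probabilities = [softmax(row) for row in LOGITS]

for label, probs in zip(labels, probabilities):
    print(label, [round(p, 3) for p in probs])
```

The labels match the output shown in the article. Looking at the probabilities, the winning class in the first two rows gets less than half of the probability mass, while the third row is far more confident (roughly 0.9), so the first two predictions are closer calls than the labels alone suggest.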


How to Fine Tune a 🤗 (Hugging Face) Transformer Model


Akis Loumpourdis
