Since the seminal paper "Attention Is All You Need" by Vaswani et al., Transformer models have become by far the state of the art in NLP. With applications ranging from NER and text classification to question answering and text generation, the uses of this technology are practically limitless.

More specifically, BERT, which stands for Bidirectional Encoder Representations from Transformers, leverages the transformer architecture in a novel way: it looks at both sides of a randomly masked word in a sentence to predict it. In addition to predicting the masked token, BERT is trained on next-sentence prediction: a classification token [CLS] is added at the beginning of the first sentence and a separation token [SEP] between the two sentences, and the model tries to predict whether the second sentence follows the first one.

In this tutorial, I will show you how to fine-tune a BERT model to predict entities such as skills, diploma, diploma major, and experience in software job descriptions. Fine-tuning transformers requires a powerful GPU with parallel processing. For this, we use Google Colab, since it provides free servers with GPUs. We will use the newly released spaCy 3 library to fine-tune our transformer. Below is a step-by-step guide on how to fine-tune the BERT model with spaCy 3.

Data Labeling:

To fine-tune BERT using spaCy 3, we need to provide training and dev data in the spaCy 3 JSON format (see here), which will then be converted to a .spacy binary file. We will provide the data in IOB format contained in a TSV file, then convert it to the spaCy JSON format.

I have labeled only 120 job descriptions with entities such as skills, diploma, diploma major, and experience for the training dataset, and about 70 job descriptions for the dev dataset.

In this tutorial, I used the UBIAI annotation tool because it comes with extensive features such as:

- ML auto-annotation
- Dictionary, regex, and rule-based auto-annotation
- Team collaboration to share annotation tasks
- Direct annotation export to IOB format

Using the regular expression feature in UBIAI, I pre-annotated all the experience mentions that follow the pattern "\d.*\+.*", such as "5 + years of experience in C++". I then uploaded a CSV dictionary containing all the software languages and assigned them the entity SKILLS. The pre-annotation saves a lot of time and helps you minimize manual annotation.

For more information about the annotation tool, please visit the UBIAI documentation page and my previous post "Introducing UBIAI: Easy-to-Use Text Annotation for NLP Applications".

The exported annotation will look like this:

MS	B-DIPLOMA
in	O
electrical	B-DIPLOMA_MAJOR
engineering	I-DIPLOMA_MAJOR
or	O
computer	B-DIPLOMA_MAJOR
engineering	I-DIPLOMA_MAJOR
.	O
5	B-EXPERIENCE
+	I-EXPERIENCE
years	I-EXPERIENCE
of	I-EXPERIENCE
industry	I-EXPERIENCE
experience	I-EXPERIENCE
.	I-EXPERIENCE
Familiar	O
with	O
storage	B-SKILLS
server	I-SKILLS
architectures	I-SKILLS
with	O
HDD	B-SKILLS
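Before converting, it can be worth sanity-checking the exported file. The minimal sketch below simply counts how many tokens carry each IOB tag; it assumes the export is a two-column, tab-separated file and reuses the drive/MyDrive/train_set_bert.tsv path from the convert commands that follow:

from collections import Counter

# Quick sanity check of the exported IOB annotation before conversion.
# Assumes a two-column TSV (token <tab> tag), e.g. drive/MyDrive/train_set_bert.tsv.
tag_counts = Counter()
with open("drive/MyDrive/train_set_bert.tsv", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank lines (sentence separators) or malformed rows
        tag_counts[parts[1]] += 1

# How many tokens carry each tag (O, B-SKILLS, I-SKILLS, B-EXPERIENCE, ...)
print(tag_counts)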
In order to convert from IOB to JSON (see documentation here), we use the spaCy 3 convert command:

!python -m spacy convert drive/MyDrive/train_set_bert.tsv ./ -t json -n 1 -c iob
!python -m spacy convert drive/MyDrive/dev_set_bert.tsv ./ -t json -n 1 -c iob

After conversion to spaCy 3 JSON, we need to convert both the training and dev JSON files to .spacy binary files using this command (update the file paths with your own):

!python -m spacy convert drive/MyDrive/train_set_bert.json ./ -t spacy
!python -m spacy convert drive/MyDrive/dev_set_bert.json ./ -t spacy

Model Training:

Open a new Google Colab project and make sure to select GPU as the hardware accelerator in the notebook settings. In order to accelerate the training process, we need to run parallel processing on our GPU. To this end, we install the NVIDIA CUDA 9.2 library:

!wget https://developer.nvidia.com/compute/cuda/9.2/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64 -O cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!apt-key add /var/cuda-repo-9-2-local/7fa2af80.pub
!apt-get update
!apt-get install cuda-9.2

To check that the correct CUDA compiler is installed, run:

!nvcc --version

Install the spaCy library and the spaCy transformer pipeline:

!pip install -U spacy
!python -m spacy download en_core_web_trf

Next, we install the PyTorch machine learning library configured for CUDA 9.2:

!pip install torch==1.7.1+cu92 torchvision==0.8.2+cu92 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

After the PyTorch install, we need to install spacy-transformers tuned for CUDA 9.2 and change the CUDA_PATH and LD_LIBRARY_PATH as below. Finally, install the cupy library, which is the equivalent of the numpy library but for GPU:

!pip install -U spacy[cuda92,transformers]
!export CUDA_PATH="/usr/local/cuda-9.2"
!export LD_LIBRARY_PATH=$CUDA_PATH/lib64:$LD_LIBRARY_PATH
!pip install cupy

spaCy 3 uses a config file, config.cfg, that contains all the components needed to train the model. On the spaCy training page, you can select the language of the model (English in this tutorial), the component (NER), and the hardware (GPU) to use, and download the config file template.

The only thing we need to do is fill in the paths for the train and dev .spacy files. Once done, we upload the file to Google Colab.

Now we need to auto-fill the config file with the rest of the parameters that the BERT model will need; all you have to do is run this command:

!python -m spacy init fill-config drive/MyDrive/config.cfg drive/MyDrive/config_spacy.cfg

I suggest debugging the filled config file in case there is an error:

!python -m spacy debug data drive/MyDrive/config_spacy.cfg

We are finally ready to train the BERT model! Just run this command and the training should start:

!python -m spacy train drive/MyDrive/config_spacy.cfg --output ./ -g 0

P.S.: if you get the error cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_INVALID_PTX: a PTX JIT compilation failed, just uninstall cupy and install it again; that should fix the issue.

If everything went correctly, you should start seeing the model scores and losses being updated as training progresses.

At the end of the training, the model will be saved under the folder model-best.
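spaCy also writes the evaluation scores next to the saved pipeline, so you can print them programmatically. Below is a minimal sketch, assuming the --output ./ used above (i.e., the best pipeline sits in ./model-best):

import json

# Load the metadata spaCy writes alongside the saved pipeline.
with open("model-best/meta.json", encoding="utf-8") as f:
    meta = json.load(f)

performance = meta["performance"]
print("Overall F-score:", performance["ents_f"])

# Per-entity precision, recall, and F-score (DIPLOMA, SKILLS, ...).
for label, scores in performance["ents_per_type"].items():
    print(label, scores)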
In full, the model scores are located in the meta.json file inside the model-best folder:

"performance":{
  "ents_per_type":{
    "DIPLOMA":{
      "p":0.5584415584,
      "r":0.6417910448,
      "f":0.5972222222
    },
    "SKILLS":{
      "p":0.6796805679,
      "r":0.6742957746,
      "f":0.6769774635
    },
    "DIPLOMA_MAJOR":{
      "p":0.8666666667,
      "r":0.7844827586,
      "f":0.8235294118
    },
    "EXPERIENCE":{
      "p":0.4831460674,
      "r":0.3233082707,
      "f":0.3873873874
    }
  },
  "ents_f":0.661754386,
  "ents_p":0.6745350501,
  "ents_r":0.6494490358,
  "transformer_loss":1408.9692438675,
  "ner_loss":1269.1254348834
}

The scores are certainly well below production level because of the limited training dataset, but it is worth checking the model's performance on a sample job description.

Entity Extraction with Transformers

To test the model on a sample text, we need to load it and run it on our text:

import spacy

nlp = spacy.load("./model-best")

text = ['''Qualifications
- A thorough understanding of C# and .NET Core
- Knowledge of good database design and usage
- An understanding of NoSQL principles
- Excellent problem solving and critical thinking skills
- Curious about new technologies
- Experience building cloud hosted, scalable web services
- Azure experience is a plus

Requirements
- Bachelor's degree in Computer Science or related field
(Equivalent experience can substitute for earned educational qualifications)
- Minimum 4 years experience with C# and .NET
- Minimum 4 years overall experience in developing commercial software
''']

for doc in nlp.pipe(text, disable=["tagger", "parser"]):
    print([(ent.text, ent.label_) for ent in doc.ents])

Below are the entities extracted from our sample job description:

[("C", "SKILLS"),
 ("#", "SKILLS"),
 (".NET Core", "SKILLS"),
 ("database design", "SKILLS"),
 ("usage", "SKILLS"),
 ("NoSQL", "SKILLS"),
 ("problem solving", "SKILLS"),
 ("critical thinking", "SKILLS"),
 ("Azure", "SKILLS"),
 ("Bachelor", "DIPLOMA"),
 ("'s", "DIPLOMA"),
 ("Computer Science", "DIPLOMA_MAJOR"),
 ("4 years experience with C# and .NET\n-", "EXPERIENCE"),
 ("4 years overall experience in developing commercial software\n\n", "EXPERIENCE")]

Pretty impressive for only 120 training documents! We were able to extract most of the skills, diploma, diploma major, and experience mentions correctly. With more training data, the model would certainly improve further and yield higher scores.

Conclusion:

With only a few lines of code, we have successfully trained a functional NER transformer model thanks to the amazing spaCy 3 library. Go ahead and try it out on your use case, and please share your results. Note that you can use the UBIAI annotation tool to label your data; we offer a free 14-day trial.

As always, if you have any comments or questions, please leave a note below or email us at admin@ubiai.tools!

Also published at https://walidamamou.medium.com/how-to-fine-tune-bert-transformer-with-spacy-3-6a90bfe57647