BERT, open sourced by the Google Research team, achieved wide popularity amongst NLP enthusiasts for all the right reasons! It is one of the best pre-trained Natural Language Processing models, with superior NLP capabilities. It can be used for language classification, question answering, next-word prediction, tokenization, and more.

Our case study Question Answering System in Python using BERT NLP [1] and our BERT based Question and Answering system demo [2], developed in Python + Flask, became hugely popular, garnering hundreds of visitors per day. We received many appreciative emails praising the QnA demo, and along with that, a number of people asked how we created it. To this day, we keep getting requests on how to develop such a QnA system using the BERT pre-trained model open sourced by Google.

To start with, the readme file on the official GitHub repository of BERT provides a good amount of information about how to fine-tune the model on SQuAD 2.0, but we could see that developers were still facing issues. So we decided to publish a step-by-step tutorial on how to fine-tune the BERT pre-trained model on Colab using a TPU and generate answers from a given paragraph and questions.

In this tutorial we are not going to cover how to create a web-based interface using Python + Flask. We'll just cover the fine-tuning and inference on Colab using a TPU. You can create your own interface using Flask or Django.

Overview

In this tutorial we will see how to perform a fine-tuning task on SQuAD using Google Colab. For that we will use the BERT GitHub repository, which includes:

- TensorFlow code for the BERT model architecture.
- Pre-trained models for both the lowercase (uncased) and cased versions of BERT-Base and BERT-Large.

You can also refer to or copy our Colab file to follow the steps.

Steps to perform BERT fine-tuning on Google Colab

1) Change Runtime to TPU

On the main menu, click on Runtime and select Change runtime type. Set "TPU" as the hardware accelerator. The screenshot below shows how to change the runtime: after clicking on "Change runtime type", select TPU from the dropdown option as shown in the figure below.

2) Clone the BERT GitHub repository

BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. You can find the academic paper here: https://arxiv.org/abs/1810.04805

BERT has two stages: pre-training and fine-tuning.

Pre-training is fairly expensive (four days on 4 to 16 Cloud TPUs), but it is a one-time procedure. Google has released a number of pre-trained BERT models, so most NLP researchers will never need to pre-train their own model from scratch.

Fine-tuning is inexpensive. One can replicate all the results given in the paper in at most 1 hour on a single Cloud TPU, or a few hours on a GPU. For example, SQuAD can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%.

So our first step is to clone the BERT GitHub repository:

```
!git clone https://github.com/google-research/bert.git
```

Now get inside the bert repo using the "cd" command:

```
cd bert
```
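One practical note before downloading anything: the code in the BERT repository targets TensorFlow 1.x (run_squad.py relies on tf.contrib, which no longer exists in TensorFlow 2), so the Colab runtime needs to provide a 1.x build. A minimal sanity check, assuming nothing beyond the standard Colab environment:

```python
# The BERT repo targets TensorFlow 1.x (run_squad.py uses tf.contrib),
# so confirm the runtime's TensorFlow version before proceeding.
import tensorflow as tf

print(tf.__version__)
# Expect a 1.x version such as 1.15. If the runtime reports 2.x, switch to a
# TensorFlow 1.x build first (for example, pip install tensorflow==1.15).
```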
3) Download the BERT pre-trained model

BERT pre-trained model list:

- BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
- BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters
- BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
- BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
- BERT-Base, Multilingual Cased (New, recommended): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Multilingual Uncased (Orig, not recommended): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

BERT-Base and BERT-Large have each been released in uncased and cased versions. Uncased means that the text is converted to lowercase before WordPiece tokenization, e.g., "John Smith" becomes "john smith"; cased means that the true case and accent markers are preserved.

When using a cased model, make sure to pass --do_lower_case=False at training time.

You can download any model of your choice. We have used the BERT-Large, Uncased model.

```
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip

# Unzip the pre-trained model
!unzip uncased_L-24_H-1024_A-16.zip
```

4) Download the SQuAD 2.0 dataset

For the question answering task, we will be using the SQuAD 2.0 dataset. SQuAD (Stanford Question Answering Dataset) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

SQuAD 2.0 combines the 100,000+ questions in SQuAD 1.1 with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

You can download the dataset from the SQuAD site: https://rajpurkar.github.io/SQuAD-explorer/

```
# Download the SQuAD train and dev datasets
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
```

5) Set up your TPU environment

- Verify that you are connected to a TPU device
- Get your TPU address, which is used at fine-tuning time
- Perform Google authentication to access your bucket
- Upload your credentials to the TPU so it can access your GCS bucket

Using the code below you can do the four points mentioned above:

```python
import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is => ', TPU_ADDRESS)

from google.colab import auth
auth.authenticate_user()

with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.
```
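Before moving on to the output directory, it can help to take a quick look at the structure of the SQuAD files downloaded in step 4, since the custom input_file.json we create later follows the same schema. The sketch below is only illustrative; it assumes train-v2.0.json sits in the current working directory (which it does if you ran the wget commands above from inside the bert folder):

```python
import json

# Peek at the SQuAD 2.0 schema: data -> articles -> paragraphs -> qas.
with open('train-v2.0.json', 'r') as f:
    squad = json.load(f)

article = squad['data'][0]            # first Wikipedia article
paragraph = article['paragraphs'][0]  # first passage of that article
qa = paragraph['qas'][0]              # first question about the passage

print('Title:     ', article['title'])
print('Context:   ', paragraph['context'][:200], '...')
print('Question:  ', qa['question'])
print('Answers:   ', [a['text'] for a in qa['answers']])
print('Impossible:', qa['is_impossible'])
```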
6) Create an output directory

Prerequisite: You will need a GCP (Google Cloud Platform) account and a GCS (Google Cloud Storage) bucket to run the Colab file. Please follow the Google Cloud documentation on how to create a GCP account and GCS bucket. You get $300 of free credit to start with any GCP product; learn more about it at https://cloud.google.com/tpu/docs/setup-gcp-account.

You can create your GCS bucket here: http://console.cloud.google.com/storage

As we are using a Cloud TPU, we need to store the pre-trained model and the output directory in Google Cloud Storage. If you do not store them in a bucket, you may face the following error:

```
ERROR:tensorflow:Error recorded from training_loop: From /job:worker/replica:0/task:0:
Unsuccessful TensorSliceReader constructor: Failed to get matching files on
uncased_L-24_H-1024_A-16/bert_model.ckpt: Unimplemented: File system scheme '[local]' not implemented
(file: 'uncased_L-24_H-1024_A-16/bert_model.ckpt')
[[node checkpoint_initializer_14 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
```

You will get your fine-tuned model in the Google Cloud Storage bucket after training completes. For that, you need to provide your bucket name and an output directory name.

```python
BUCKET = 'bertnlpdemo' #@param {type:"string"}
assert BUCKET, '*** Must specify an existing GCS bucket name ***'
output_dir_name = 'bert_output' #@param {type:"string"}
BUCKET_NAME = 'gs://{}'.format(BUCKET)
OUTPUT_DIR = 'gs://{}/{}'.format(BUCKET, output_dir_name)
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))
```

7) Move the pre-trained model to the GCS bucket

We need to move the pre-trained model to the GCS (Google Cloud Storage) bucket, as the local file system is not supported on TPU; if you don't move your pre-trained model to the bucket, you may hit the error shown above. The gsutil mv command allows you to move data between your local file system and the cloud, move data within the cloud, and move data between cloud storage providers.

```
!gsutil mv /content/bert/uncased_L-24_H-1024_A-16 $BUCKET_NAME
```

8) Training

Below is the command to run the training. To run the training on a TPU you need to make sure of the hyperparameters below, that is, --use_tpu must be True and you must provide the TPU address found above:

--use_tpu=True
--tpu_name=YOUR_TPU_ADDRESS

```
!python run_squad.py \
  --vocab_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/vocab.txt \
  --bert_config_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=$BUCKET_NAME/uncased_L-24_H-1024_A-16/bert_model.ckpt \
  --do_train=True \
  --train_file=train-v2.0.json \
  --do_predict=True \
  --predict_file=dev-v2.0.json \
  --train_batch_size=24 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --use_tpu=True \
  --tpu_name=grpc://10.1.118.82:8470 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --version_2_with_negative=True \
  --output_dir=$OUTPUT_DIR
```

Replace grpc://10.1.118.82:8470 with the TPU address printed in step 5.

Create Testing File

We create input_file.json as a blank JSON file and then add the data to it in the SQuAD dataset format.

- touch is used to create a file
- %%writefile is used to write a file in Colab

You can pass your own questions and context in the file below.

```
!touch input_file.json
```

```
%%writefile input_file.json
{
  "version": "v2.0",
  "data": [
    {
      "title": "your_title",
      "paragraphs": [
        {
          "qas": [
            {
              "question": "Who is current CEO?",
              "id": "56ddde6b9a695914005b9628",
              "is_impossible": ""
            },
            {
              "question": "Who founded google?",
              "id": "56ddde6b9a695914005b9629",
              "is_impossible": ""
            },
            {
              "question": "when did IPO take place?",
              "id": "56ddde6b9a695914005b962a",
              "is_impossible": ""
            }
          ],
          "context": "Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock. They incorporated Google as a privately held company on September 4, 1998. An initial public offering (IPO) took place on August 19, 2004, and Google moved to its headquarters in Mountain View, California, nicknamed the Googleplex. In August 2015, Google announced plans to reorganize its various interests as a conglomerate called Alphabet Inc. Google is Alphabet's leading subsidiary and will continue to be the umbrella company for Alphabet's Internet interests. Sundar Pichai was appointed CEO of Google, replacing Larry Page who became the CEO of Alphabet."
        }
      ]
    }
  ]
}
```

Prediction

Below is the command to run your own custom prediction: change input_file.json by providing your own paragraph and questions, then execute the command below. Note that model.ckpt-10859 refers to the final checkpoint written by the training run above; adjust the step number to match the checkpoint in your output directory.

```
!python run_squad.py \
  --vocab_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/vocab.txt \
  --bert_config_file=$BUCKET_NAME/uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=$OUTPUT_DIR/model.ckpt-10859 \
  --do_train=False \
  --max_query_length=30 \
  --do_predict=True \
  --predict_file=input_file.json \
  --predict_batch_size=8 \
  --n_best_size=3 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=output/
```

To make it easier for you, we have already created a Colab file which you can copy to your Google Drive and execute: Question Answering System using BERT + SQuAD on Colab TPU.

If you have any further questions or doubts, please feel free to post them in the comments. We'll get back to you.

Previously published at https://www.pragnakalp.com/nlp-tutorial-setup-question-answering-system-bert-squad-colab-tpu/

References

[1] Our case study: Question Answering System in Python using BERT NLP
[2] BERT based Question and Answering system demo
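Once the prediction command finishes, run_squad.py writes its answers to predictions.json inside the directory passed as --output_dir (output/ here), keyed by the question ids from input_file.json. A short sketch to print them in the notebook:

```python
import json

# run_squad.py stores the best answer for each question in predictions.json
# under the --output_dir used above (output/ in this tutorial).
with open('output/predictions.json', 'r') as f:
    predictions = json.load(f)

# Keys are the question ids from input_file.json; values are the predicted answer text.
for question_id, answer in predictions.items():
    print(question_id, '=>', answer)
```

Because --n_best_size=3 was passed, output/nbest_predictions.json also lists the top candidate answers (with probabilities) for each question, which is handy for inspecting borderline cases.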