I was one of the early adopters of chatbot technology. I have been building chatbots for the past two years, consuming NLP platforms like Microsoft LUIS, Dialogflow, wit.ai, etc. When I wanted to understand the bare bones of it, Ashish Cherian came across Rasa, and we built a chatbot with it and conducted a workshop at PyDelhi Conference 2017.
This year I wanted to sharpen my ML skills, so I narrowed my focus to just NLP. After a round of tokenizing, POS tagging, topic modeling, and text classification, it was time to put it all together into a chatbot framework, but I had no idea how to go about it.
It was around the same time that Rasa sent me their newsletter with a call for first PRs. I forked the repo, set up the environment, and played around with it a bit.
About a week later, the ML expert Siraj Raval announced the 100DaysOfMLCode challenge, which boosted my enthusiasm to a different level and made me want to do it for sure. I decided to dedicate the whole 100 days to NLP, starting with reading and understanding Rasa NLU's code base.
This blog is for those who want to understand the ML concepts that drive chatbot technologies. If you are a complete newbie to chatbots & NLP, I strongly recommend going through the following links, understanding the basics, and building a chatbot using RasaNLU before diving deeper.
Disclaimer:
If you are planning to move forward without trying the links above, it is going to be super hard for you to understand the rest of the blog.
RasaNLU is built using Python. It would be better if you have basic Python knowledge to understand the code snippets.
The starting point of any repository can be found by looking at its documentation. It will be the first file you import from the package or the file you hit from the command line.
RasaNLU has two entry points — Train and Server.
1. The _train_ part (train.py) takes your training data and configuration and generates the ML models.
$ python -m rasa_nlu.train \
    --config sample_configs/config_spacy.yml \
    --data data/examples/rasa/demo-rasa.json \
    --path projects
2. The _server_ part is where the generated ML model is served via an API.
$ python -m rasa_nlu.server --path projects
Since _server.py_ needs the model generated by _train.py_, let's start with the training part.
In this blog, we are going to explore the training part. We feed in the training JSON file along with a few configuration details, and we get trained ML models at the end of the training.
In this step, the command line arguments fed to the train.py file are parsed and loaded into a configuration object cfg.
The training configuration is defined by config_spacy.yml. It contains two main pieces of information: the language of your bot and the NLP library to use.
> cat config_spacy.yml
language: "en"
pipeline: "spacy_sklearn"
The cfg object also holds the path to your training data and the path to store models after the training is complete.
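If you want to poke at this step yourself, the same configuration can be loaded programmatically. Here is a minimal sketch, assuming the rasa_nlu API of that era (rasa_nlu.config.load):

from rasa_nlu import config

# Load the same YAML file the command line consumes
cfg = config.load("sample_configs/config_spacy.yml")
print(cfg.language)   # "en"
print(cfg.pipeline)   # the configured pipeline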
With RasaNLU you can read the training data from your local machine or from an external API. This comes in handy when you want to fetch data from a pre-existing database: in that case, you write an API layer on top of your database that returns the data in a format consistent with Rasa's training data format.
The _load_data_ function reads the data from the respective paths and returns a TrainingData object.
{"text": "show me chinese restaurants","intent": "restaurant_search","entities": [{"start": 8,"end": 15,"value": "chinese","entity": "cuisine","extractor": "ner_crf","confidence": 0.854,"processors": []}]}
In this step, the loaded TrainingData is fed into an NLP pipeline and gets converted into an ML model. A spacy pipeline looks something like the one in the image.
The first step is to create a Trainer object which takes the configuration parameter cfg and builds a pipeline.
A pipeline is made up of components. Each component is responsible for a specific NLP operation.
The Trainer.train function iterates through the pipeline and performs the NLP task defined by each component. You can think of the train function as a controller which hands control over to the different components in the pipeline and updates the context with the output or info derived from each component.
context = {}
for i, component in enumerate(self.pipeline):
    updates = component.train(working_data, self.config, **context)
    if updates:
        context.update(updates)
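The same flow can be driven entirely from Python. A minimal sketch, following the programmatic training example from the rasa_nlu docs of that time:

from rasa_nlu import config
from rasa_nlu.training_data import load_data
from rasa_nlu.model import Trainer

cfg = config.load("sample_configs/config_spacy.yml")
training_data = load_data("data/examples/rasa/demo-rasa.json")

trainer = Trainer(cfg)                            # builds the pipeline from cfg
interpreter = trainer.train(training_data)        # runs train() on every component
model_directory = trainer.persist("./projects")   # writes the model files to disk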
Though there is a single pipeline, I am going to split it into three parts.
To use spacy, we need to create spacy's NLP object for the language provided in the configuration file. If spacy does not support that language, it throws an error.
>>> import spacy
>>> nlp = spacy.load('en')
This step takes each training sample from your training file and converts it into a list of tokens (words). At the end of this step, we have a bag of words.
>>> tokens = nlp("Suggest me a chinese food")
["Suggest", "me", "a", "chinese", "food"]
Now that we have the bag of words we can feed them into the ML algorithms. However, an ML algorithm understands only numerical data. It is the featurizer’s job to convert tokens into word vectors. At the end of this step, we will have a list of numbers which will make sense only for ML models. Spacy’s token comes with a vector attribute which makes this conversion easy.
>>> features = [token.vector for token in tokens]
[ 1.77235818e+00  2.89104319e+00  1.34855950e+00  4.57144260e-01
 -1.24784541e+00  3.25931263e+00 -6.40985250e-01 -1.46328235e+00
 -5.12969136e-01 -2.17798877e+00 -3.69897425e-01  4.26086336e-01
...
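As far as I can tell, the featurizer works with the sentence-level vector, which spacy computes as the average of the token vectors; that gives one fixed-length feature vector per training sample. A small sketch to check that claim:

import numpy as np
import spacy

nlp = spacy.load('en')
doc = nlp("Suggest me a chinese food")

sentence_vector = doc.vector                               # one vector for the whole text
token_average = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(sentence_vector, token_average))         # True when tokens carry vectors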
RasaNLU supports regexes in training samples, e.g., for capturing entities like zip codes, mobile numbers, etc. In such a case, RegexFeaturizer looks for regex patterns in the TrainingExamples and marks 1.0 if the token matches the pattern, else 0.0. This step does not involve spacy, as the functionality is specific to Rasa.
found = []
for i, exp in enumerate(self.known_patterns):
    match = re.search(exp["pattern"], message.text)
    if match:  # the actual code also checks whether the match falls inside the token
        found.append(1.0)
    else:
        found.append(0.0)
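Boiled down, the idea is one extra binary feature per known pattern. A toy sketch with hypothetical patterns (the real component reads them from the regex_features section of the training file and works per token):

import re

# Hypothetical patterns for illustration
known_patterns = [{"name": "zipcode", "pattern": r"\b\d{5}\b"},
                  {"name": "greet", "pattern": r"hey[^\s]*"}]

message = "deliver to 60601 please"
found = [1.0 if re.search(p["pattern"], message) else 0.0 for p in known_patterns]
print(found)   # [1.0, 0.0] gets appended to the message's features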
ner_crf is one of the well-known algorithms used to perform named entity extraction. NER stands for Named Entity Recognition, and CRF stands for Conditional Random Fields, the statistical model that drives the entity extraction.
"entities": [{"start": 8,"end": 15,"value": "chinese","entity": "cuisine","extractor": "ner_crf","confidence": 0.854,"processors": []}
The extractor parameter in the training data enables you to choose between the extractors supported by Rasa, and the processors parameter defines the list of operations applied to the extracted entity. RasaNLU's documentation explains the different extractors and their use cases.
The above entity example goes through a series of transformations before being fed to the CRF algorithm. At the end of the training, a CRF ML model trained with the entity samples is generated.
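Under the hood, ner_crf relies on the sklearn-crfsuite package: every sentence becomes a sequence of per-token feature dicts, every token gets a BIO-style label, and the CRF learns to predict label sequences. A minimal sketch with toy features of my own (Rasa derives a much richer feature set):

import sklearn_crfsuite

def token_features(tokens, i):
    # Toy per-token features; Rasa uses prefixes, suffixes, POS tags, etc.
    return {"word.lower": tokens[i].lower(),
            "is_title": tokens[i].istitle(),
            "prev_word": tokens[i - 1].lower() if i > 0 else "BOS"}

sentences = [["show", "me", "chinese", "restaurants"],
             ["find", "an", "italian", "place"]]
labels = [["O", "O", "B-cuisine", "O"],
          ["O", "O", "B-cuisine", "O"]]

X = [[token_features(sent, i) for i in range(len(sent))] for sent in sentences]
y = labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict([[token_features(["any", "mexican", "food"], i) for i in range(3)]]))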
NER is such a vast topic that it deserves a separate blog. For a deeper understanding of how this ML model is built, refer to the following resources:
EntitySynonymMapper generates a mapping between the entity and its synonyms provided by the training file. The chatbot you build should be able to understand every variation of the entity. This mapper handles different variations of a single entity.
# Input -> Training_data.json
"entity_synonyms": [{"value": "vegetarian","synonyms": ["veg", "vegg", "veggie"]}
# Output -> entity_synonym.json
{'veggie': 'vegetarian', 'vegg': 'vegetarian'}
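Applying the mapping later is just a dictionary lookup over the extracted entity values. A minimal sketch of that idea, not Rasa's exact code:

# The persisted mapping from entity_synonym.json
synonym_map = {"veggie": "vegetarian", "vegg": "vegetarian", "veg": "vegetarian"}

entity = {"value": "veggie", "entity": "cuisine"}
entity["value"] = synonym_map.get(entity["value"].lower(), entity["value"])
print(entity)   # {'value': 'vegetarian', 'entity': 'cuisine'}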
This classifier uses sklearn's SVC with GridSearch, taking the intent names (after a LabelEncoding) as labels and the text features generated by the featurizer as data, to generate an ML model.
# training data (features from the featurizer)
>>> X_train = [ 1.77235818e+00  2.89104319e+00  1.34855950e+00  4.57144260e-01
               -1.24784541e+00  3.25931263e+00 -6.40985250e-01 -1.46328235e+00
...

# labels (intent names)
>>> Y = ["greet", "bye", "restaurant_search", "greet", ...
>>> Y_train = LabelEncoder().fit_transform(Y)
>>> Y_train
[0, 1, 2, 0, ...

# training the ML model
>>> clf = GridSearchCV(SVC(...))
>>> clf.fit(X_train, Y_train)
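Putting those pieces together, here is a self-contained sketch of the same idea, with toy data and a small parameter grid of my own rather than Rasa's exact defaults:

import numpy as np
import spacy
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

nlp = spacy.load('en')

# Toy training data; in Rasa the features come from the featurizer step above
texts = ["hi there", "hello", "hey bot",
         "bye", "see you later", "goodbye",
         "show me chinese restaurants", "any good pizza place nearby", "i want indian food"]
intents = ["greet", "greet", "greet",
           "bye", "bye", "bye",
           "restaurant_search", "restaurant_search", "restaurant_search"]

X_train = np.array([nlp(text).vector for text in texts])   # word-vector features
le = LabelEncoder()
Y_train = le.fit_transform(intents)                        # intent names -> integers

param_grid = {"C": [1, 2, 5, 10, 20, 100], "kernel": ["linear"]}
clf = GridSearchCV(SVC(class_weight="balanced"), param_grid, cv=2)
clf.fit(X_train, Y_train)

pred = clf.predict(np.array([nlp("suggest me a chinese food").vector]))
print(le.inverse_transform(pred))   # ideally ['restaurant_search']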
Rasa enables you to store the trained model in cloud storage such as AWS, GCS, or Microsoft Azure, or on your local system. The persisted_path parameter defines that configuration for you and stores the trained model in the respective location. The final output after training through all the pipeline components is an Interpreter object which generates and saves the following files, which are later used during the serve step.
crf_model.pkl
entity_synonyms.json
intent_classifier_sklearn.pkl
regex_featurizer.json
training_data.json
model_metadata.json
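Once persisted, the model can be loaded back and queried directly from Python, which is essentially what the server does for you. A sketch assuming the rasa_nlu Interpreter API; the model directory name below is hypothetical and will differ on your machine:

from rasa_nlu.model import Interpreter

# Path produced by trainer.persist(); the timestamped directory name will differ
interpreter = Interpreter.load("./projects/default/model_20180801-120000")
print(interpreter.parse("show me chinese restaurants"))
# {'intent': {...}, 'entities': [...], 'text': 'show me chinese restaurants'}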
Was the post useful to you? Hold the clap button and give a shout-out to me on Twitter. ❤️
In the next part, I will cover how Rasa NLU enables you to consume these ML models via an API.