Chatbots & Me

I was one of the early adopters of chatbot technology. I have been building chatbots for the past two years, consuming NLP platforms like Microsoft LUIS, Dialogflow, wit.ai, etc. When I wanted to understand the bare bones of it, I came across Ashish Cherian and Rasa. We built a chatbot using it and conducted a workshop at PyDelhi Conference 2017.

NLP & Me

This year I wanted to sharpen my ML skills, and I narrowed my focus to just NLP. After a round of tokenizing, POS tagging, topic modeling, and text classification, it was time to put it all together into a chatbot framework, but I had no idea how to go about it. It was around the same time Rasa sent me their newsletter with a call for first PRs. I forked the repo, set up the environment, and played around with it a bit.

About a week later, the ML expert Siraj Raval announced the 100DaysOfMLCode challenge, and that boosted my enthusiasm to a different level, making me want to do it for sure. I decided to dedicate the whole 100 days to focus only on NLP, starting with reading and understanding Rasa NLU's code base.

Before you jump in

This blog is for the ones who want to understand the ML concepts that drive chatbot technologies. If you are a complete newbie to chatbots & NLP, I strongly recommend you go through the following links, understand the basics, and build a chatbot using RasaNLU before diving deeper.

1. What’s, why’s & How’s of Chatbot
2. Building a chatbot using RasaNLU

Disclaimer: If you are planning to move forward without trying the links above, it is going to be super hard for you to understand the rest of the blog. RasaNLU is built using Python, so it would be better if you have basic Python knowledge to understand the code snippets.

Reading code — The starting point

The starting point of any repository can be found by looking at its documentation. It will be the first file you import from the package or the file you hit from the command line. RasaNLU has two entry points — Train and Server.

1. The training part generates an ML model when you feed the training data — train.py.

$ python -m rasa_nlu.train \
    --config sample_configs/config_spacy.yml \
    --data data/examples/rasa/demo-rasa.json \
    --path projects

2. The server part is where the generated ML model is served via an API — server.py.

$ python -m rasa_nlu.server --path projects

Since server.py needs the model generated by train.py, let’s start with the training part.

Training

In this blog, we are going to explore the training part. We feed in the training JSON file along with a few configuration details, and we get trained ML models at the end of the training.

1. Configuration

In this step, the command-line arguments fed to train.py are parsed and loaded into a configuration object, cfg. The training configuration is defined by config_spacy.yml. It contains two main pieces of info: the language of your bot and the NLP library to use.

> cat config_spacy.yml
language: "en"
pipeline: "spacy_sklearn"

The cfg object also holds the path to your training data and the path to store models after the training is complete.

2. Loading the training data

With RasaNLU you can read the training data from your local machine or an external API. This comes in handy when you want to fetch data from a pre-existing database. In that case, you need to write an API that generates data in a format consistent with Rasa’s training data format, i.e., a layer on top of your database.

The load_data function reads the data from the respective paths and returns a TrainingData object. A single training example looks like this:

{
  "text": "show me chinese restaurants",
  "intent": "restaurant_search",
  "entities": [
    {
      "start": 8,
      "end": 15,
      "value": "chinese",
      "entity": "cuisine",
      "extractor": "ner_crf",
      "confidence": 0.854,
      "processors": []
    }
  ]
}
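If you want to poke around this step yourself, here is a minimal sketch based on the Rasa NLU 0.x API of the time (module paths and attribute names may differ in other versions):

from rasa_nlu.training_data import load_data

# Parse the demo training file into a TrainingData object
training_data = load_data("data/examples/rasa/demo-rasa.json")

# Each labelled sample becomes a training example carrying text, intent and entities
print(len(training_data.training_examples))
print(training_data.training_examples[0].text)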
3. Training the ML model

In this step, the loaded TrainingData is fed into an NLP pipeline and gets converted into an ML model. The spacy pipeline looks something like this: SpacyNLP → SpacyTokenizer → SpacyFeaturizer → RegexFeaturizer → NER_CRF → EntitySynonymMapper → SklearnIntentClassifier.

The first step is to create a Trainer object, which takes the configuration parameter cfg and builds a pipeline. A pipeline is made up of components. Each component is responsible for a specific NLP operation.

The Trainer.train function iterates through the pipeline and performs the NLP task defined by each component. You can think of the train function as a controller which hands control over to the different components in the pipeline and updates the context with the output or info derived from each component.

context = {}
for i, component in enumerate(self.pipeline):
    updates = component.train(working_data, self.config, **context)
    if updates:
        context.update(updates)

Though there is a single pipeline, I am going to split it into three parts:

1. Preprocessing — where the data is transformed to extract the required information
2. Entity Extractor & Intent Classifier — where the preprocessed data is used to create the ML models that perform intent classification and entity extraction
3. Persistence — storing the result

Preprocessing

3.1 SpacyNLP

To use spacy we need to create spacy’s NLP object, depending on the language provided in the configuration file. If spacy does not support the language provided, it throws an error.

>>> import spacy
>>> nlp = spacy.load('en')

3.2 SpacyTokenizer

This step takes each training sample from your training file and converts it into a list of tokens (words). At the end of this step, we have a bag of words.

>>> tokens = nlp("Suggest me a chinese food")
["Suggest", "me", "a", "chinese", "food"]

3.3 SpacyFeaturizer

Now that we have the bag of words, we can feed them into the ML algorithms. However, an ML algorithm understands only numerical data. It is the featurizer’s job to convert tokens into word vectors. At the end of this step, we will have a list of numbers which will make sense only to ML models. Spacy’s token comes with a vector attribute which makes this conversion easy.

>>> features = [token.vector for token in tokens]
[ 1.77235818e+00  2.89104319e+00  1.34855950e+00  4.57144260e-01
 -1.24784541e+00  3.25931263e+00 -6.40985250e-01 -1.46328235e+00
 -5.12969136e-01 -2.17798877e+00 -3.69897425e-01  4.26086336e-01
...

3.4 RegexFeaturizer

RasaNLU supports regex in training samples, e.g., for capturing entities like zip codes, mobile numbers, etc. In such a case, the RegexFeaturizer looks for regex patterns in the TrainingExamples and marks 1.0 if the token matches the pattern, else 0.0. This step does not involve spacy, as the functionality is particular to Rasa.

found = []
for i, exp in enumerate(self.known_patterns):
    match = re.search(exp["pattern"], message.text)
    if match is not None:  # simplified; Rasa also checks that the match overlaps the token
        found.append(1.0)
    else:
        found.append(0.0)
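To see the idea in isolation, here is a tiny self-contained sketch; the patterns and the regex_features helper are made up for illustration and are not Rasa’s actual code:

import re

# Hypothetical patterns; in Rasa these come from the training file
known_patterns = [
    {"name": "zipcode", "pattern": r"\b[0-9]{5}\b"},
    {"name": "greet", "pattern": r"hey[^\s]*"},
]

def regex_features(text):
    # One flag per pattern: 1.0 if it occurs anywhere in the message, else 0.0
    return [1.0 if re.search(p["pattern"], text) else 0.0 for p in known_patterns]

print(regex_features("deliver to 60601"))  # [1.0, 0.0]
print(regex_features("hey there"))         # [0.0, 1.0]

In the real pipeline, these per-pattern flags end up alongside the word vectors from the previous step as input for the downstream models.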
Entity Extraction

3.5 NER_CRF EntityExtractor

NER_CRF is one of the most famous algorithms used to perform named entity extraction. NER stands for Named Entity Recognition, and CRF is Conditional Random Fields, which drives the whole statistics behind the entity extraction.

"entities": [
  {
    "start": 8,
    "end": 15,
    "value": "chinese",
    "entity": "cuisine",
    "confidence": 0.854,
    "extractor": "ner_crf",
    "processors": []
  }
]

The extractor parameter in the training data enables you to choose between the extractors supported by Rasa, and the processors parameter defines the list of operations applied to the entity around NER. RasaNLU’s documentation explains the different extractors and their use cases.

The above entity example goes through a series of transformations before being fed to the CRF algorithm. At the end of the training, a CRF ML model trained with the entity samples is generated.

CRF NER is such a vast topic that it deserves a separate blog. To get a deeper understanding of how this ML model is built, refer to the following resources:

https://spacy.io/usage/training#annotations
https://nlu.rasa.com/entities.html?highlight=extractor

3.6 EntitySynonymMapper

EntitySynonymMapper generates a mapping between an entity and the synonyms provided by the training file. The chatbot you build should be able to understand every variation of an entity, and this mapper handles the different variations of a single entity.

# Input -> Training_data.json
"entity_synonyms": [
  {
    "value": "vegetarian",
    "synonyms": ["veg", "vegg", "veggie"]
  }
]

# Output -> entity_synonym.json
{'veggie': 'vegetarian', 'vegg': 'vegetarian'}

Intent Classification

3.7 SklearnIntentClassifier

This classifier uses sklearn’s SVC with GridSearch, taking intent_names as labels (after a LabelEncoding) and the text_features generated by the featurizer as data, to generate an ML model.

# training data
>>> X_train = [ 1.77235818e+00  2.89104319e+00  1.34855950e+00  4.57144260e-01
 -1.24784541e+00  3.25931263e+00 -6.40985250e-01 -1.46328235e+00
...  # features

>>> Y = ["greet", "bye", "restaurant_search", "greet", ...
>>> Y_train = LabelEncoder().fit_transform(Y)
>>> Y_train
[0, 1, 2, 0, ...

# training the ML model
>>> clf = GridSearchCV(SVC(...))
>>> clf.fit(X_train, Y_train)

4. Storing the model in a persisted path

Rasa enables you to store the trained model in cloud storage such as AWS, GCS, or Microsoft Azure, or in your local system. The persisted_path parameter defines that configuration for you and stores the trained model in the respective location. The final output after training through all the pipeline components is an Interpreter object, which generates and saves the following files; these are later used during the serve step.

crf_model.pkl
entity_synonyms.json
intent_classifier_sklearn.pkl
regex_featurizer.json
training_data.json
model_metadata.json

Was the post useful to you? Hold the clap button and give a shout out to me on Twitter. ❤️

In the next part, I will cover how Rasa NLU enables you to consume these ML models via an API.
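Until then, here is a rough preview of what consuming a persisted model from Python looks like. Treat it as a sketch: the model directory name below is hypothetical, and the exact Interpreter.load signature varies across Rasa NLU versions.

from rasa_nlu.model import Interpreter

# Load the files persisted in step 4 back into an Interpreter
# (hypothetical model directory produced by the training step above)
interpreter = Interpreter.load("projects/default/model_20180712-103221")

# Parse a message the same way the HTTP server would
print(interpreter.parse("show me chinese restaurants"))
# -> {"text": ..., "intent": {...}, "entities": [...]}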