How does RasaNLU perform entity extraction?

I started Demystifying Rasa NLU when I committed myself to the #100DaysOfMLCode Challenge by Siraj Raval. For the first 10 days, I backtracked through the code base to understand what happens when we train the chatbot. By the 25th day, I could crack what happens when we serve a trained model via REST APIs. With this overall understanding, I wanted to dig deeper and explore what happens during entity extraction.

In this blog we will drill down through the entity extraction step in RasaNLU. It concentrates on how the data is structured and how the ML model is built, rather than on the statistical models that drive it from underneath.

RasaNLU & EntityExtraction

RasaNLU supports multiple methods to extract entities:

- NER CRF
- NER Spacy
- NER Duckling
- NER MITIE

The use case of each of these extractors is explained in RasaNLU's documentation. Spacy and Duckling are commonly used for pre-trained entities like name, place, time, and date. For custom entities like a product, a cuisine, or types of pizza, we need to use NERCRF or MITIE. In this blog, we are going to stick with NERCRF since it is the default extractor provided by Rasa.

NERCRF: Named Entity Extraction using Conditional Random Field

A Conditional Random Field (CRF) is "a popular statistical model used in natural language processing" that "predicts sequences of labels for sequences of input samples." RasaNLU uses sklearn_crfsuite to perform entity extraction, and you can use it outside the context of RasaNLU by following this tutorial.

Training Data

    {'intent': 'restaurant_search',
     'entities': [{'start': 31, 'end': 37, 'value': 'centre', 'entity': 'location'},
                  {'start': 10, 'end': 17, 'value': 'mexican', 'entity': 'cuisine'}],
     'text': 'show me a mexican place in the centre'}

Training Sample

Preprocessing

To extract entities from the text, we need to tokenize it. One thing I noticed here is that, unlike other NLP techniques, we don't remove stop words for entity extraction.
NERCRF is a sequential statistical model, i.e. the features of every token depend on the tokens before and after it, so it is important to preserve the order of the words.

    'tokens': [show, me, a, mexican, place, in, the, centre],

Create EntityExtractor Object

The CRFEntityExtractor object is created with the default configuration settings provided by RasaNLU.

    >>> from rasa_nlu.extractors.crf_entity_extractor import CRFEntityExtractor
    >>> crf = CRFEntityExtractor()
    >>> crf.component_config  # Default config
    {'BILOU_flag': True,
     'features': [['low', 'title', 'upper', 'pos', 'pos2'],
                  ['bias', 'low', 'word3', 'word2', 'upper', 'title', 'digit', 'pos', 'pos2', 'pattern'],
                  ['low', 'title', 'upper', 'pos', 'pos2']],
     'max_iterations': 50,
     'L1_c': 1,
     'L2_c': 0.001}

Entries such as low and title are the features we will be extracting from the text. We will dig deeper into what each feature means later in this blog. L1_c and L2_c are the L1 and L2 regularization coefficients applied to the training loss.

Filter Training data

RasaNLU supports more than one entity extractor. A bot developer can choose one per entity using the extractor key in the training example. If no extractor is mentioned, NERCRF is taken as the default EntityExtractor. Hence, we keep only the entities that either have no extractor key or explicitly name NERCRF, and drop the rest.

    def filter_trainable_entities(self, entity_examples):
        for message in entity_examples:
            entities = []
            for ent in message.get("entities", []):
                extractor = ent.get("extractor")
                # keep entities with no extractor, or with this extractor's name
                if not extractor or extractor == self.name:
                    entities.append(ent)
            # ... each message is then rebuilt with only `entities`

Create Dataset

NERCRF uses a certain structure of training data. In this step we will transform RasaNLU's dataset into the one that is supported by NERCRF. This involves merging the entities with the training example they belong to.

Convert Entity

In the training data, entities are defined as dictionaries. In this step they are transformed into a list of tuples of (start_index, end_index, entity). This will enable us to tag the tokens easily in the next step.
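To make the filtering and conversion steps concrete outside RasaNLU, here is a minimal standalone sketch. It is not Rasa's code: the helper names (tokenize_with_offsets, trainable_entities, entity_token_spans) are made up for illustration, the tokenizer is a plain whitespace split, the default name "ner_crf" is an assumption about how the extractor is named, and the third entity in the example (tagged for another extractor) is invented to show the filter at work.

```python
def tokenize_with_offsets(text):
    """Whitespace tokenizer returning (token, start, end) triples."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append((word, start, end))
        pos = end
    return tokens


def trainable_entities(example, extractor_name="ner_crf"):
    """Keep entities with no 'extractor' key (or one naming this extractor)
    and return them as (start, end, label) tuples."""
    kept = []
    for ent in example.get("entities", []):
        extractor = ent.get("extractor")
        if not extractor or extractor == extractor_name:
            kept.append((ent["start"], ent["end"], ent["entity"]))
    return kept


def entity_token_spans(text, entities):
    """Map each (start, end, label) to (first_token, last_token, label)."""
    tokens = tokenize_with_offsets(text)
    starts = {start: i for i, (_, start, _) in enumerate(tokens)}
    ends = {end: i for i, (_, _, end) in enumerate(tokens)}
    spans = []
    for start, end, label in entities:
        first, last = starts.get(start), ends.get(end)
        # entities that do not line up with token boundaries are skipped
        if first is not None and last is not None:
            spans.append((first, last, label))
    return spans


example = {
    "intent": "restaurant_search",
    "text": "show me a mexican place in the centre",
    "entities": [
        {"start": 31, "end": 37, "value": "centre", "entity": "location"},
        {"start": 10, "end": 17, "value": "mexican", "entity": "cuisine"},
        # hypothetical entity meant for a different extractor
        {"start": 0, "end": 4, "value": "show", "entity": "verb",
         "extractor": "ner_spacy"},
    ],
}

entities = trainable_entities(example)
print(entities)
print(entity_token_spans(example["text"], entities))
```

Once the entities are plain offset tuples, finding the tokens they cover is just two dictionary lookups, which is why this format makes the tagging step simple.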
    entities = [(ent["start"], ent["end"], ent["entity"])
                for ent in example.get("entities", [])]

    [(31, 37, 'location'), (10, 17, 'cuisine')]

BILOU Tagging

BILOU (Begin, Inside, Last, Outside, Unit) is a text tagging format that enables entity extraction. In this step, each token that falls inside an entity's offsets gets the matching entity tag attached to it. Every other token gets the O tag.

    def _bilou_tags_from_offsets(tokens, entities, missing='O'):
        # Abridged: map character offsets to token indices
        starts = {token.offset: i for i, token in enumerate(tokens)}
        ends = {token.end: i for i, token in enumerate(tokens)}
        bilou = [missing for _ in tokens]
        for start_char, end_char, label in entities:
            start_token = starts.get(start_char)
            end_token = ends.get(end_char)
            if start_token is not None and end_token is not None:
                if start_token == end_token:
                    bilou[start_token] = 'U-%s' % label
                else:
                    bilou[start_token] = 'B-%s' % label
                    for i in range(start_token + 1, end_token):
                        bilou[i] = 'I-%s' % label
                    bilou[end_token] = 'L-%s' % label
        return bilou

With CRFEntityExtractor, RasaNLU extracts both single-word and multi-word entities. A single-word entity is tagged with U-Entity. A multi-word entity has its first word tagged with B-Entity, its last word with L-Entity, and all intermediate words with I-Entity.

    tokens :: [show, me, a, mexican, place, in, the, centre]
    BILOU  :: [O, O, O, 'U-cuisine', O, O, O, 'U-location']

    tokens :: [show, me, a, mexican, place, in, New, York, City]
    BILOU  :: [O, O, O, 'U-cuisine', O, O, 'B-location', 'I-location', 'L-location']

Text to CRF format

This is the final step of the dataset conversion, where the BILOU tags and the tokens are merged to arrive at the following format.

    [(token, POS_tag, entity, pattern)]

Each token in a training example becomes a tuple of its token text, POSTag, BILOUEntity, and, in the case of regex-based entities, RegexPattern.

    >>> dataset
    [('show', None, 'O', None), ('me', None, 'O', None),
     ('a', None, 'O', None), ('mexican', None, 'U-cuisine', None),
     ('place', None, 'O', None), ('in', None, 'O', None),
     ('the', None, 'O', None), ('centre', None, 'U-location', None)]

Create Training data

X_train

NERCRF comes with a bunch of default configurations which we will use to generate features. As we have seen before, the default configuration looks something like this.
    {'BILOU_flag': True,
     'features': [
         ['low', 'title', 'upper', 'pos', 'pos2'],    # Features to be extracted from token before
         ['bias', 'low', 'word3', 'word2', 'upper',
          'title', 'digit', 'pos', 'pos2', 'pattern'],  # Features to be extracted from current token
         ['low', 'title', 'upper', 'pos', 'pos2']],   # Features to be extracted from token after
     'max_iterations': 50,
     'L1_c': 1,
     'L2_c': 0.001}

If you look at the features list, you can see that there are 3 sets of features. For each token in a training_example, we construct features from the token itself and from the tokens before and after it.

    >>> X_train = [self._sentence_to_features(sent) for sent in dataset]
    >>> X_train[-1]
    [...,
     {'-1:low': 'the',       # The previous word
      '-1:title': False,     # is the prev word a title
      '-1:upper': False,     # is the prev word in uppercase
      '0:bias': 'bias',
      '0:low': 'centre',     # Current word
      '0:prefix5': 'centr',
      '0:prefix2': 'ce',
      '0:suffix5': 'entre',
      '0:suffix3': 'tre',
      '0:suffix2': 're',
      '0:upper': False,
      '0:title': False,      # is a title
      '0:digit': False,      # is a number
      '0:pattern': 'N/A',
      'EOS': True}]          # End of sentence

y_train

y_train holds the entity tags that we generated by BILOU tagging.

    >>> y_train = [self._sentence_to_labels(sent) for sent in dataset]
    >>> y_train[-1]
    ['O', 'O', 'O', 'U-cuisine', 'O', 'O', 'O', 'U-location']

Training

Now that we have the data ready, all we have to do is create the CRF object and fit it using the training data. This step happens during the training phase of RasaNLU, and the prediction (next part) happens when we deploy the model.

    import sklearn_crfsuite
    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=component_config['L1_c'], c2=component_config['L2_c'],
                               max_iterations=component_config['max_iterations'])  # pass only the CRF's own parameters
    crf.fit(X_train, y_train)

Prediction

During prediction, the text that the user enters is preprocessed into a format similar to X_train. This preprocessed format is then fed into the CRF classifier to make predictions.

Preprocessing

    test_example = {'text': 'show me chines restaurants in the north'}

    {'spacy_doc': show me chines restaurants in the north,
     'text': 'show me chines restaurants in the north'}

# 1.
Load into SpacyNLP

    {'spacy_doc': "...",
     'tokens': [show, me, chines, restaurants, in, the, north],
     'text': "..."}

# 2. Tokenize

The next step is to convert the text into CRF format. Unlike the training phase, the entity part of each tuple is filled with N/A, since we don't yet know whether a particular token is an entity or not.

    # 3. Convert text to CRF format
    >>> text_data = crf._from_text_to_crf(test_example)
    >>> text_data
    [('show', None, 'N/A', None),
     ('me', None, 'N/A', None),
     ('chines', None, 'N/A', None),
     ('restaurants', None, 'N/A', None),
     ('in', None, 'N/A', None),
     ('the', None, 'N/A', None),
     ('north', None, 'N/A', None)]

    # 4. Convert CRF format to features
    >>> features = crf._sentence_to_features(text_data)
    >>> features
    [{'BOS': True, '0:bias': 'bias', '0:low': 'show', '0:prefix5': 'show', '0:prefix2': 'sh', '0:suffix5': 'show', '0:suffix...

Predict

After prediction, we get a list with one entry per token, giving the probability of that token carrying each BILOU tag. Tokens whose most probable tag is O can be ignored, and the remaining tokens are used to construct a dictionary with only valid entity tags.

    # 1. Predict
    >>> ents = crf.ent_tagger.predict_marginals_single(features)
    >>> ents
    [{'O': 0.9881565662477786,
      'U-location': 0.0019499979138974536,
      'U-cuisine': 0.002866549564514255,
      'B-cusine': 0.00126939581550308,
      'L-cusine': 0.002386854107682877,
      'I-cusine': 0.0014001745853175994,
      'U-cusine': 0.0019704617653066633}...]

    # 2. Convert prediction result to readable format
    >>> entities = crf._from_crf_to_json(test_example, ents)
    >>> entities
    [{'start': 8, 'end': 14, 'value': 'chines', 'entity': 'cuisine', 'confidence': 0.848068806858813},
     {'start': 34, 'end': 39, 'value': 'north', 'entity': 'location', 'confidence': 0.9334140282928288}]

With this, we have demystified one major component of any NLP problem: entity extraction. In the next part, we will dig deeper and understand Intent Classification.
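P.S. For readers who want to see the last Predict step spelled out, here is a minimal sketch of how per-token tag probabilities could be turned back into entity dictionaries. This is my own simplified illustration, not the code behind _from_crf_to_json: decode_marginals is a hypothetical helper, the probabilities below are made up, and taking the lowest token probability as the confidence of a multi-word entity is an assumption.

```python
def decode_marginals(text, tokens, marginals):
    """Turn per-token tag probabilities into entity dictionaries.

    tokens:    list of (token_text, start, end) triples
    marginals: list of {tag: probability} dicts, one per token
    """
    entities, current = [], None
    for (_, start, end), probs in zip(tokens, marginals):
        tag = max(probs, key=probs.get)      # most probable BILOU tag
        conf = probs[tag]
        prefix, _, label = tag.partition("-")
        if prefix == "U":                    # single-token entity
            entities.append({"start": start, "end": end,
                             "value": text[start:end],
                             "entity": label, "confidence": conf})
            current = None
        elif prefix == "B":                  # open a multi-token entity
            current = {"start": start, "entity": label, "confidence": conf}
        elif prefix in ("I", "L") and current and current["entity"] == label:
            # assumption: an entity is only as confident as its weakest token
            current["confidence"] = min(current["confidence"], conf)
            if prefix == "L":                # close the entity
                current["end"] = end
                current["value"] = text[current["start"]:end]
                entities.append(current)
                current = None
        else:                                # 'O' or an inconsistent tag
            current = None
    return entities


text = "show me chines restaurants in new york"
tokens = [("show", 0, 4), ("me", 5, 7), ("chines", 8, 14),
          ("restaurants", 15, 26), ("in", 27, 29),
          ("new", 30, 33), ("york", 34, 38)]
# made-up probabilities, one dict per token
marginals = [
    {"O": 0.99, "U-cuisine": 0.01},
    {"O": 0.98, "U-cuisine": 0.02},
    {"O": 0.10, "U-cuisine": 0.90},
    {"O": 0.97, "U-location": 0.03},
    {"O": 0.95, "B-location": 0.05},
    {"O": 0.20, "B-location": 0.80},
    {"O": 0.30, "L-location": 0.70},
]

entities = decode_marginals(text, tokens, marginals)
for ent in entities:
    print(ent)
```

Tokens whose best tag is O never produce an entity, which is the "ignore the O tags" rule from the Predict step.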