Data Scientist | AI Practitioner | Software Developer. Giving talks, teaching, writing.
Model deployment is one of the most important skills you should have if you're going to work with NLP models.
Model deployment is the process of integrating your model into an existing production environment. The model will receive input and predict an output for decision-making for a specific use case.
“Only when a model is fully integrated with the business systems, we can extract real value from its predictions.” - Christopher Samiullah
There are different ways to deploy your NLP model into production: you can use Flask, Django, Bottle, etc. In this article, however, you will learn how to build and deploy your NLP model with FastAPI.
This series of articles has two parts: building an NLP model (part 1) and deploying it with FastAPI (part 2).
In part 1, we will focus on building an NLP model that can classify movie reviews into different sentiments. So let’s get started!
First, we need to build our NLP model. We are going to use the IMDB Movie dataset to build a simple model that can classify if the review about the movie is Positive or Negative. Here are the steps you should follow to do that.
First, we import the Python packages needed to load the data, clean it, create a machine learning model (classifier), and save the model for deployment.
# import important modules
import numpy as np
import pandas as pd

# sklearn modules
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB  # classifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# text preprocessing modules
import re  # regular expressions
from string import punctuation

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the NLTK resources
# ("stopwords" is required by stopwords.words(), "wordnet" by WordNetLemmatizer)
for dependency in (
    "brown",
    "names",
    "stopwords",
    "wordnet",
    "averaged_perceptron_tagger",
    "universal_tagset",
):
    nltk.download(dependency)

import warnings
warnings.filterwarnings("ignore")

# seeding
np.random.seed(123)
Load the dataset from the data folder.
# load data
data = pd.read_csv("../data/labeledTrainData.tsv", sep='\t')
Show a sample of the dataset.
# show top five rows of data
data.head()
Our dataset has 3 columns. The two we use below are review (the review text) and sentiment (the label).
Check the shape of the dataset.
# check the shape of the data
data.shape
The dataset has 25,000 reviews.
We need to check if the dataset has any missing values.
# check missing values in data
data.isnull().sum()
The output shows that our dataset does not have any missing values.
We can use the value_counts() method from the pandas package to check the class distribution of our dataset.
# evaluate review sentiment distribution
data.sentiment.value_counts()
Name: sentiment, dtype: int64
In this dataset, we have an equal number of positive and negative reviews.
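To make the balance check concrete, here is a minimal sketch on a tiny hand-made label column. The `toy_sentiment` values are hypothetical and only stand in for `data.sentiment`; passing `normalize=True` to value_counts() returns the share of each class instead of the raw count.

```python
import pandas as pd

# Hypothetical toy labels standing in for data.sentiment
# (1 = positive review, 0 = negative review)
toy_sentiment = pd.Series([1, 0, 1, 0, 1, 0], name="sentiment")

counts = toy_sentiment.value_counts()                # rows per class
shares = toy_sentiment.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)

# The classes are balanced when every class has the same share
is_balanced = shares.nunique() == 1
print("balanced:", is_balanced)
```

If the shares were far apart, accuracy alone would be a less informative metric later on.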
After analyzing the dataset, the next step is to preprocess the dataset into the right format before creating our machine learning model.
The reviews in this dataset contain a lot of unnecessary words and characters that we don't need when creating a machine learning model.
We will clean the reviews by removing stopwords, numbers, and punctuation. Then we will convert each word into its base form using the lemmatizer from the NLTK package.
The text_cleaning() function will handle all necessary steps to clean our dataset.
stop_words = stopwords.words('english')

def text_cleaning(text, remove_stop_words=True, lemmatize_words=True):
    # Clean the text, with the option to remove stop words and to lemmatize words

    # Replace every non-alphanumeric character with a space.
    # Because this runs first, the apostrophe and URL patterns below
    # only ever see text that has already lost its punctuation.
    text = re.sub(r"[^A-Za-z0-9]", " ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r'http\S+', ' link ', text)
    text = re.sub(r'\b\d+(?:\.\d+)?\s+', '', text)  # remove numbers

    # Remove punctuation from text
    text = ''.join([c for c in text if c not in punctuation])

    # Optionally, remove stop words
    if remove_stop_words:
        text = text.split()
        text = [w for w in text if w not in stop_words]
        text = " ".join(text)

    # Optionally, reduce words to their base form (lemma)
    if lemmatize_words:
        text = text.split()
        lemmatizer = WordNetLemmatizer()
        lemmatized_words = [lemmatizer.lemmatize(word) for word in text]
        text = " ".join(lemmatized_words)

    # Return the cleaned text as a single string
    return text
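The order of the regular-expression steps matters. The sketch below uses only the standard library and a made-up sample sentence to compare two orderings of the same four substitutions: the order used in text_cleaning() and a hypothetical alternative that replaces URLs first. The function names `regex_steps_as_written` and `regex_steps_url_first` are illustrative and not part of the article's code, and stop-word removal and lemmatization are left out on purpose.

```python
import re

sample = "Loved it! See http://example.com for the director's 2 other films."

def regex_steps_as_written(text):
    # Same order as text_cleaning(): strip non-alphanumerics first
    text = re.sub(r"[^A-Za-z0-9]", " ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"http\S+", " link ", text)
    text = re.sub(r"\b\d+(?:\.\d+)?\s+", "", text)  # remove numbers
    return text.split()

def regex_steps_url_first(text):
    # Hypothetical alternative: handle URLs and 's before stripping characters
    text = re.sub(r"http\S+", " link ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"[^A-Za-z0-9]", " ", text)
    text = re.sub(r"\b\d+(?:\.\d+)?\s+", "", text)  # remove numbers
    return text.split()

tokens_as_written = regex_steps_as_written(sample)
tokens_url_first = regex_steps_url_first(sample)

print(tokens_as_written)
print(tokens_url_first)
```

With the first ordering the URL is split into separate tokens such as http and example, while the second ordering collapses it into the single placeholder token link. Which behavior you want depends on your data; the results reported in this article come from the ordering in text_cleaning().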
Now we can clean our dataset by using the text_cleaning() function.
# clean the reviews
data["cleaned_review"] = data["review"].apply(text_cleaning)
Then split data into feature and target variables.
# split features and target from data
X = data["cleaned_review"]
y = data.sentiment.values
Our feature for training is the cleaned_review variable and the target is the sentiment variable.
We then split our dataset into training and validation data. The validation set is 15% of the entire dataset.
# split data into train and validate
X_train, X_valid, y_train, y_valid = train_test_split(
    X,
    y,
    test_size=0.15,
    random_state=42,
    shuffle=True,
    stratify=y,
)
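The stratify argument keeps the class proportions the same in both parts of the split. A small sketch on hypothetical toy data (20 made-up reviews, half positive and half negative) shows the effect; the names `X_toy` and `y_toy` are assumptions for illustration only.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 reviews, half positive (1) and half negative (0)
X_toy = [f"review {i}" for i in range(20)]
y_toy = [1] * 10 + [0] * 10

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,
    random_state=42,
    shuffle=True,
    stratify=y_toy,  # preserve the 50/50 class ratio in both parts
)

print("train:", Counter(y_tr))
print("valid:", Counter(y_va))
```

Both parts keep the same 50/50 ratio as the full toy dataset, which is what we want before comparing accuracy on the held-out part.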
We will train the Multinomial Naive Bayes algorithm to classify if a review is positive or negative. This is one of the most common algorithms used for text classification.
But before training the model, we need to transform our cleaned reviews into numerical values so that the model can understand the data. In this case, we will use the TfidfVectorizer class from scikit-learn, which converts a collection of text documents to a matrix of TF-IDF features.
To apply this series of steps (pre-processing and training), we will use the Pipeline class from scikit-learn, which sequentially applies a list of transforms and a final estimator.
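To get a feel for what a TF-IDF matrix contains, here is a plain-Python sketch that computes the weights by hand for three made-up documents. It follows the formula scikit-learn uses with its default settings (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row). The documents and the helper name `tfidf_row` are hypothetical, and the sketch uses a plain whitespace split rather than TfidfVectorizer's own tokenizer.

```python
import math
from collections import Counter

# Hypothetical cleaned reviews
docs = ["good movie", "bad movie", "good good plot"]
tokenized = [doc.split() for doc in docs]

vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = {w: sum(w in tokens for tokens in tokenized) for w in vocab}

# Smoothed inverse document frequency: rarer words get a larger weight
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector of one document."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

matrix = [tfidf_row(tokens) for tokens in tokenized]

print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```

In the second document, the word bad appears in only one document and therefore gets a higher weight than movie, which appears in two.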
# Create a classifier in pipeline
sentiment_classifier = Pipeline(steps=[
    ('pre_processing', TfidfVectorizer(lowercase=False)),
    ('naive_bayes', MultinomialNB())
])
Then we train our classifier.
# train the sentiment classifier
sentiment_classifier.fit(X_train, y_train)
We then make predictions on the validation set.
# test model performance on valid data
y_preds = sentiment_classifier.predict(X_valid)
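Because the vectorizer lives inside the pipeline, predict() accepts cleaned text directly and applies the TF-IDF transformation for us. The sketch below trains the same kind of pipeline on four made-up reviews (the `toy_*` names and sentences are hypothetical) and also calls predict_proba() to show the class probabilities behind each prediction.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training data (1 = positive, 0 = negative)
toy_reviews = [
    "great film loved it",
    "wonderful acting great plot",
    "terrible film hated it",
    "awful acting boring plot",
]
toy_labels = [1, 1, 0, 0]

toy_classifier = Pipeline(steps=[
    ("pre_processing", TfidfVectorizer(lowercase=False)),
    ("naive_bayes", MultinomialNB()),
])
toy_classifier.fit(toy_reviews, toy_labels)

new_reviews = ["great wonderful film", "boring terrible plot"]
preds = toy_classifier.predict(new_reviews)        # predicted class per review
probs = toy_classifier.predict_proba(new_reviews)  # columns follow classes_ order

print(toy_classifier.classes_)
print(preds)
print(probs.round(3))
```

The same two calls work on the real sentiment_classifier once it has been fitted.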
The model's performance will be evaluated using the accuracy_score metric. Accuracy is a reasonable choice here because the two classes in the sentiment variable are balanced.
The accuracy of our model is around 86.29%, which is good performance.
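Accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this on hypothetical toy labels, so its result is unrelated to the model in this article; with our variables the call would presumably be accuracy_score(y_valid, y_preds).

```python
from sklearn.metrics import accuracy_score

# Hypothetical toy labels, for illustration only
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy as reported by scikit-learn
acc = accuracy_score(y_true_toy, y_pred_toy)

# The same number computed by hand: matching predictions / all predictions
manual_acc = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)

print(f"accuracy: {acc:.2%}")
```

Six of the eight toy predictions are correct, so both computations give 0.75.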
The model pipeline will be saved in the models directory using the joblib Python package.
# save model
import joblib

joblib.dump(sentiment_classifier, '../models/sentiment_model_pipeline.pkl')
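A saved pipeline can be restored with joblib.load() and used right away, since the fitted vectorizer and classifier are stored together. Here is a minimal round-trip sketch that uses a hypothetical toy pipeline and a temporary directory instead of the real model and the ../models folder.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy pipeline standing in for sentiment_classifier
toy_pipeline = Pipeline(steps=[
    ("pre_processing", TfidfVectorizer(lowercase=False)),
    ("naive_bayes", MultinomialNB()),
])
toy_pipeline.fit(
    ["great film loved it", "terrible film hated it"],
    [1, 0],
)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model_pipeline.pkl")
    joblib.dump(toy_pipeline, model_path)
    restored_pipeline = joblib.load(model_path)

sample = ["loved it great"]
original_pred = toy_pipeline.predict(sample)
restored_pred = restored_pipeline.predict(sample)

print(original_pred, restored_pred)
```

The restored pipeline gives the same prediction as the original one. Keep in mind that a pickled model should be loaded with the same scikit-learn version that saved it, and only from sources you trust.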
Congratulations 👏👏, you have made it to the end of part 1. I hope you have learned something new about how to build an NLP model. In part 2 we will learn how to deploy our NLP model with FastAPI and run it in Python applications.
If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in part 2!
You can also find me on Twitter @Davis_McDavid.