Model deployment is one of the most important skills you should have if you're going to work with NLP models.
Model deployment is the process of integrating your model into an existing production environment, where it can receive input and return predictions that support decision-making for a specific use case.
“Only when a model is fully integrated with the business systems, we can extract real value from its predictions.” – Christopher Samiullah
There are different ways to deploy your NLP model into production: you can use Flask, Django, Bottle, etc. But in this article, you will learn how to build and deploy your NLP model with FastAPI.
In this series of articles, you will learn how to build an NLP model and how to deploy it with FastAPI so you can use it in Python applications.
In part 1, we will focus on building an NLP model that can classify movie reviews into different sentiments. So let’s get started!
First, we need to build our NLP model. We are going to use the IMDB movie review dataset to build a simple model that can classify whether a review is positive or negative. Here are the steps to follow.
We start by importing the Python packages we need to load the data, clean it, create a machine learning model (classifier), and save the model for deployment.
# import important modules
import numpy as np
import pandas as pd

# sklearn modules
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB  # classifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# text preprocessing modules
from string import punctuation
from nltk.tokenize import word_tokenize
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re  # regular expression

# download NLTK dependencies (stopwords and wordnet are required by the
# text-cleaning function below)
for dependency in (
    "brown",
    "names",
    "wordnet",
    "stopwords",
    "averaged_perceptron_tagger",
    "universal_tagset",
):
    nltk.download(dependency)

import warnings
warnings.filterwarnings("ignore")

# seeding
np.random.seed(123)
Load the dataset from the data folder.
# load data
data = pd.read_csv("../data/labeledTrainData.tsv", sep='\t')
Show the top five rows of the dataset.
# show top five rows of data
data.head()
Our dataset has 3 columns: id, sentiment, and review.
Check the shape of the dataset.
# check the shape of the data
data.shape
(25000, 3)
The dataset has 25,000 reviews.
We need to check if the dataset has any missing values.
# check missing values in data
data.isnull().sum()
id 0
sentiment 0
review 0
dtype: int64
The output shows that our dataset does not have any missing values.
We can use the value_counts() method from the pandas package to evaluate the class distribution of our dataset.
# evaluate the sentiment class distribution
data.sentiment.value_counts()
1 12500
0 12500
Name: sentiment, dtype: int64
In this dataset, we have an equal number of positive and negative reviews.
After analyzing the dataset, the next step is to preprocess it into the right format before creating our machine learning model.
The reviews in this dataset contain a lot of unnecessary words and characters that we don't need when creating a machine learning model.
We will clean the reviews by removing stopwords, numbers, and punctuation. Then we will convert each word into its base form by using the lemmatization process in the NLTK package.
The text_cleaning() function will handle all necessary steps to clean our dataset.
stop_words = stopwords.words('english')

def text_cleaning(text, remove_stop_words=True, lemmatize_words=True):
    # Clean the text, with the option to remove stop words and to lemmatize words

    # Clean the text
    text = re.sub(r"[^A-Za-z0-9]", " ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r'http\S+', ' link ', text)
    text = re.sub(r'\b\d+(?:\.\d+)?\s+', '', text)  # remove numbers

    # Remove punctuation from text
    text = ''.join([c for c in text if c not in punctuation])

    # Optionally, remove stop words
    if remove_stop_words:
        text = text.split()
        text = [w for w in text if not w in stop_words]
        text = " ".join(text)

    # Optionally, reduce words to their lemmas
    if lemmatize_words:
        text = text.split()
        lemmatizer = WordNetLemmatizer()
        lemmatized_words = [lemmatizer.lemmatize(word) for word in text]
        text = " ".join(lemmatized_words)

    # Return the cleaned text
    return text
Now we can clean our dataset by using the text_cleaning() function.
#clean the review
data["cleaned_review"] = data["review"].apply(text_cleaning)
Then we split the data into feature and target variables.
#split features and target from data
X = data["cleaned_review"]
y = data.sentiment.values
Our feature for training is the cleaned_review variable and the target is the sentiment variable.
We then split our dataset into train and test data. The test size is 15% of the entire dataset.
# split data into train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    X,
    y,
    test_size=0.15,
    random_state=42,
    shuffle=True,
    stratify=y,
)
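As a quick sanity check, you can print the sizes of the two splits.
# sanity check: sizes of the training and validation splits
print(X_train.shape, X_valid.shape)  # expect (21250,) and (3750,): roughly 85% / 15% of the 25,000 reviews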
We will train the Multinomial Naive Bayes algorithm to classify if a review is positive or negative. This is one of the most common algorithms used for text classification.
But before training the model, we need to transform our cleaned reviews into numerical values so that the model can understand the data. In this case, we will use the TfidfVectorizer method from scikit-learn. TfidfVectorizer will help us to convert a collection of text documents to a matrix of TF-IDF features.
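If TF-IDF is new to you, here is a minimal standalone sketch (separate from our pipeline, using the TfidfVectorizer we already imported and assuming a recent version of scikit-learn) of what it does to a tiny toy corpus: each document becomes a row in a sparse matrix, with one column per term in the learned vocabulary.
# toy example: turn three short documents into a TF-IDF matrix
toy_corpus = ["good movie", "bad movie", "good acting bad plot"]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)
print(toy_vectorizer.get_feature_names_out())  # the learned vocabulary
print(toy_matrix.shape)  # (3 documents, 5 unique terms)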
To apply this series of steps (pre-processing and training), we will use the Pipeline class from scikit-learn, which sequentially applies a list of transforms and a final estimator.
# Create a classifier in a pipeline
sentiment_classifier = Pipeline(steps=[
    ('pre_processing', TfidfVectorizer(lowercase=False)),
    ('naive_bayes', MultinomialNB())
])
Then we train our classifier.
# train the sentiment classifier
sentiment_classifier.fit(X_train,y_train)
We then make predictions on the validation set.
# test model performance on valid data
y_preds = sentiment_classifier.predict(X_valid)
The model's performance will be evaluated by using the accuracy_score evaluation metric. We use accuracy_score because we have an equal number of samples in each class of the sentiment variable.
accuracy_score(y_valid,y_preds)
0.8629333333333333
The accuracy of our model is around 86.29%, which is good performance.
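Since we also imported classification_report, you can optionally print per-class precision, recall, and F1 scores for a fuller picture of the model's performance.
# optional: per-class precision, recall, and F1 on the validation set
print(classification_report(y_valid, y_preds))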
We will save the model pipeline in the models directory by using the joblib Python package.
#save model
import joblib
joblib.dump(sentiment_classifier, '../models/sentiment_model_pipeline.pkl')
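To confirm that the saved pipeline works, you can load it back and classify a made-up review. This is just a small sanity check; in part 2 we will do the same thing inside a FastAPI app.
# quick sanity check: reload the saved pipeline and classify a new review
import joblib

loaded_classifier = joblib.load('../models/sentiment_model_pipeline.pkl')
sample_review = text_cleaning("I really enjoyed this movie, the acting was great!")
print(loaded_classifier.predict([sample_review]))  # prints the predicted label; in this dataset 1 means positive and 0 means negative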
Congratulations 👏👏, you have made it to the end of part 1. I hope you have learned something new about how to build an NLP model. In part 2, we will learn how to deploy our NLP model with FastAPI and run it in Python applications.
If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in part 2!
You can also find me on Twitter @Davis_McDavid.
And you can read more articles like this here.