
How To Build and Deploy an NLP Model with FastAPI: Part 1

by Davis David, June 8th, 2021

Too Long; Didn't Read

In this series of articles, you will learn how to build and deploy an NLP model with FastAPI. We will use the IMDB movie review dataset to train a Multinomial Naive Bayes classifier, one of the most common algorithms for text classification, so that it can label a movie review as positive or negative.


Model deployment is one of the most important skills you should have if you're going to work with NLP models.

Model deployment is the process of integrating your model into an existing production environment, where it receives input and returns predictions that support decision-making for a specific use case.

“Only when a model is fully integrated with the business systems, we can extract real value from its predictions.” - Christopher Samiullah

There are different ways to deploy your NLP model into production: you can use Flask, Django, Bottle, etc. But in today's article, you will learn how to build and deploy your NLP model with FastAPI.

In this series of articles, you will learn:

  • How to build an NLP model that classifies IMDB movie reviews into different sentiments.
  • What FastAPI is and how to install it.
  • How to deploy your model with FastAPI.
  • How to use your deployed NLP model in any Python application.

In part 1, we will focus on building an NLP model that can classify movie reviews into different sentiments. So let’s get started!

How to Build the NLP Model

First, we need to build our NLP model. We are going to use the IMDB movie review dataset to build a simple model that can classify whether a review is positive or negative. Here are the steps to follow.

Import Important Packages

First, we import the Python packages needed to load the data, clean it, create a machine learning model (classifier), and save the model for deployment.

# import important modules
import numpy as np
import pandas as pd

# sklearn modules
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB  # classifier

from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

# text preprocessing modules
from string import punctuation

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re  # regular expression

# download the NLTK data needed for preprocessing:
# stopwords for stop-word removal, wordnet (plus its multilingual
# extension omw-1.4, required by newer NLTK versions) for lemmatization
for dependency in (
    "stopwords",
    "wordnet",
    "omw-1.4",
):
    nltk.download(dependency)

import warnings
warnings.filterwarnings("ignore")

# seeding for reproducibility
np.random.seed(123)

Load the dataset from the data folder.

# load data
data = pd.read_csv("../data/labeledTrainData.tsv", sep='\t')

Show a sample of the dataset.

# show top five rows of data
data.head() 

Our dataset has 3 columns:

  • id - the unique ID of the review
  • sentiment - either positive (1) or negative (0)
  • review - the text of the movie review

Check the shape of the dataset.

# check the shape of the data
data.shape

(25000, 3)

The dataset has 25,000 reviews.

We need to check if the dataset has any missing values.

# check missing values in data
data.isnull().sum()

id           0
sentiment    0
review       0
dtype: int64

The output shows that our dataset does not have any missing values.

How to Evaluate Class Distribution

We can use the value_counts() method from pandas to evaluate the class distribution in our dataset.

# evaluate the sentiment class distribution
data.sentiment.value_counts()

1    12500
0    12500
Name: sentiment, dtype: int64

In this dataset, we have an equal number of positive and negative reviews.

How to Process the Data

After analyzing the dataset, the next step is to preprocess the dataset into the right format before creating our machine learning model.

The reviews in this dataset contain a lot of unnecessary words and characters that we don't need when creating a machine learning model.

We will clean the reviews by removing stopwords, numbers, and punctuation. Then we will convert each word into its base form by using the lemmatization process from the NLTK package.
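
As a quick illustration of lemmatization (a standalone snippet, not part of the article's pipeline), the WordNet lemmatizer maps a plural noun to its singular base form:

# quick lemmatization example
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("movies"))   # movie
print(lemmatizer.lemmatize("reviews"))  # review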

The text_cleaning() function will handle all necessary steps to clean our dataset.

stop_words = stopwords.words('english')

def text_cleaning(text, remove_stop_words=True, lemmatize_words=True):
    # Clean the text, with the option to remove stop words and to lemmatize words

    # Replace URLs first, before the symbols they contain are stripped out
    text = re.sub(r'http\S+', ' link ', text)
    # Remove possessive 's while apostrophes are still present
    text = re.sub(r"\'s", " ", text)
    # Replace every non-alphanumeric character with a space
    text = re.sub(r"[^A-Za-z0-9]", " ", text)
    # Remove numbers
    text = re.sub(r'\b\d+(?:\.\d+)?\s+', '', text)

    # Remove any remaining punctuation from text
    text = ''.join([c for c in text if c not in punctuation])

    # Optionally, remove stop words
    if remove_stop_words:
        text = text.split()
        text = [w for w in text if w not in stop_words]
        text = " ".join(text)

    # Optionally, convert words to their base form (lemmas)
    if lemmatize_words:
        text = text.split()
        lemmatizer = WordNetLemmatizer()
        lemmatized_words = [lemmatizer.lemmatize(word) for word in text]
        text = " ".join(lemmatized_words)

    # Return the cleaned text as a single string
    return text

Now we can clean our dataset by using the text_cleaning() function.

#clean the review
data["cleaned_review"] = data["review"].apply(text_cleaning)

Then split data into feature and target variables.

# split features and target from data
X = data["cleaned_review"]
y = data.sentiment.values

Our feature for training is the cleaned_review variable and the target is the sentiment variable.

We then split our dataset into train and validation data. The test size is 15% of the entire dataset, and we stratify by sentiment so that both splits keep the same class balance.

# split data into train and validate

X_train, X_valid, y_train, y_valid = train_test_split(
    X,
    y,
    test_size=0.15,
    random_state=42,
    shuffle=True,
    stratify=y,
)
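
Because we pass stratify=y, each split keeps the 50/50 class balance. A quick optional check:

# optional: confirm both splits keep the 50/50 class balance
print(np.bincount(y_train))  # counts of negative (0) and positive (1) labels
print(np.bincount(y_valid))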

How to Create Our NLP Model

We will train the Multinomial Naive Bayes algorithm to classify if a review is positive or negative. This is one of the most common algorithms used for text classification.

But before training the model, we need to transform the cleaned reviews into numerical values so that the model can understand the data. For this, we will use the TfidfVectorizer class from scikit-learn, which converts a collection of text documents into a matrix of TF-IDF features.
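
To get a feel for what TfidfVectorizer produces, here is a small standalone sketch on two toy sentences (illustrative only, separate from the training code):

# standalone TF-IDF illustration on toy sentences
docs = ["great movie great acting", "terrible movie"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix of TF-IDF weights

print(vectorizer.vocabulary_)  # term -> column index: {'acting': 0, 'great': 1, 'movie': 2, 'terrible': 3}
print(matrix.shape)            # (2 documents, 4 unique terms)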

To apply this series of steps (preprocessing and training), we will use the Pipeline class from scikit-learn, which sequentially applies a list of transforms followed by a final estimator.

# Create a classifier in pipeline
sentiment_classifier = Pipeline(steps=[
    ('pre_processing', TfidfVectorizer(lowercase=False)),
    ('naive_bayes', MultinomialNB()),
])

Then we train our classifier.

# train the sentiment classifier
sentiment_classifier.fit(X_train, y_train)

We then generate predictions on the validation set.

# test model performance on valid data 
y_preds = sentiment_classifier.predict(X_valid)

We evaluate the model's performance with the accuracy_score metric. Accuracy is an appropriate metric here because the two classes in the sentiment variable are balanced.

accuracy_score(y_valid, y_preds)

0.8629333333333333

The accuracy of our model is around 86.29%, which is good performance.
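
For a fuller picture than a single accuracy number, you can also print per-class precision, recall, and F1 scores with the classification_report function we imported earlier; a minimal sketch:

# per-class precision, recall and F1 on the validation set
print(classification_report(y_valid, y_preds, target_names=["negative", "positive"]))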

Save Model Pipeline

The model pipeline will be saved in the models directory by using the joblib Python package.

# save model
import joblib

joblib.dump(sentiment_classifier, '../models/sentiment_model_pipeline.pkl')
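
In part 2, the FastAPI app will load this saved pipeline. As a quick sanity check, you can already load it back and classify a new review (the review text here is made up for illustration):

# load the saved pipeline and classify a new review
model = joblib.load('../models/sentiment_model_pipeline.pkl')

review = "I enjoyed every minute of this film, the acting was superb."
cleaned_review = text_cleaning(review)

prediction = model.predict([cleaned_review])[0]  # 1 = positive, 0 = negative
print("positive" if prediction == 1 else "negative")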

Wrapping Up

Congratulations 👏👏, you have made it to the end of part 1. I hope you have learned something new about how to build an NLP model. In part 2, we will learn how to deploy our NLP model with FastAPI and run it in Python applications.

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in part 2!

You can also find me on Twitter @Davis_McDavid.

And you can read more articles like this here.

For more AI and machine learning guides, be sure to subscribe to our newsletter in the footer below.