Training Your Own Text Classification Model From Scratch With TensorFlow Is As Easy As ABC

In this article, you will learn how to train your own text classification model from scratch using TensorFlow in just a few lines of code.

A brief introduction to text classification

Text classification is an application of natural language processing that assigns a piece of text to one of several predefined categories based on its content, for instance labeling a news article as sports, business, or music.

What will you learn?

One hot encoding
Word embedding
Neural network with an embedding layer
Evaluating and testing the trained model

The concepts listed above are fundamentals of natural language processing with TensorFlow. You can apply them to many NLP projects, so I recommend reading this article to the end to get a good grasp of them.

Building Sentiment Analyzer

We will build a simple TensorFlow model that classifies user reviews as either positive or negative, based on what it learns from the training data.

ML libraries we need

Apart from TensorFlow itself, we also need a few other Python libraries to develop our model. This article assumes you have them installed on your machine.

1. NumPy

2. Matplotlib

3. TensorFlow

Quick Installation

If you don't have those libraries installed, here is a quick installation guide with pip:

pip install numpy
pip install tensorflow
pip install matplotlib

Once everything is installed, we are now ready to get our hands dirty and build our model.

Getting Started

First of all, we need to import the libraries we just installed:

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

Dataset

A dataset can come in various file formats (CSV, JSON, SQL), but in this article we are going to use just a 1D array of sample customer review messages, as shown below:

data_x = [
 'good',  'well done', 'nice', 'Excellent',
 'Bad', 'OOps I hate it deadly', 'embrassing', 
'A piece of shit']

label_x = np.array([1,1,1,1, 0,0,0,0])

The labels are a 1D NumPy array of 0s and 1s, where 1 stands for a positive review and 0 stands for a negative review. They are arranged in the same order as the training data (data_x), so label_x[i] is the label for data_x[i].

Data Engineering - One Hot Encoding

A machine only understands numbers, and that doesn't change when it comes to training on text. To train a model, we first need a numerical representation of our text dataset. That's where one-hot encoding comes into play.

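TensorFlow provides a built-in function for this, tf.keras.preprocessing.text.one_hot, which splits a text into words and maps each word to an integer index smaller than the size you pass in (50 here). You can learn more about it in the one_hot documentation. Here is how to put it into code: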

data_x = [
 'good',  'well done', 'nice', 'Excellent',
 'Bad', 'OOps I hate it deadly', 'embrassing', 
'A piece of shit']

label_x = np.array([1,1,1,1, 0,0,0,0])

one_hot_x = [tf.keras.preprocessing.text.one_hot(d, 50) for d in data_x]

print(one_hot_x)

Here is the output. The exact integers depend on Python's hash function, so they may differ on your machine:

[[21], [9, 34], [24], [20], [28], [41, 26, 9, 17, 26], [36], [9, 41]]

With a single list comprehension, we now have a numerical representation of our text dataset.
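
Despite its name, one_hot here does not return vectors of 0s and 1s; it maps each word to an integer index by hashing. The sketch below imitates that idea in plain Python. It is an illustration under stated assumptions, not the Keras source: the helper name hashed_indices is made up for this example, and it uses MD5 so that the result is repeatable, whereas the default hash can change between Python sessions.

```python
import hashlib

def hashed_indices(text, n):
    """Map each word of `text` to an integer in the range 1..n-1."""
    words = text.lower().split()
    indices = []
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is never produced, so it stays free for padding later.
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

reviews = ["good", "well done", "OOps I hate it deadly"]
encoded = [hashed_indices(review, 50) for review in reviews]
print(encoded)
```

Because every possible word has to share fewer than 50 indices, two different words can end up with the same number. A larger size makes such collisions less likely.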

Data Engineering - Padding

If you look carefully, you will notice that the encoded reviews have different lengths, because the reviews themselves contain different numbers of words.

That's a problem: the model expects every training example to have the same length. This is why we pad the sequences to a standard length.

Padding extends sequences shorter than the standard length by appending 0s, and truncates sequences that are longer. Note that pad_sequences truncates from the beginning by default, so the last words of a long review are the ones that are kept.

Given the nature of our dataset, let's set the standard length to four (4). In pad_sequences this is the maxlen parameter:


data_x = [
 'good',  'well done', 'nice', 'Excellent',
 'Bad', 'OOps I hate it deadly', 'embrassing', 
'A piece of shit']

label_x = np.array([1,1,1,1, 0,0,0,0])

# one hot encoding 

one_hot_x = [tf.keras.preprocessing.text.one_hot(d, 50) for d in data_x]

# padding 

padded_x = tf.keras.preprocessing.sequence.pad_sequences(one_hot_x, maxlen=4, padding = 'post')

print(padded_x)

Your output is going to look like this:

array([[21,  0,  0,  0], [ 9, 34,  0,  0], [24,  0,  0,  0], [20,  0,  0,  0],[28,  0,  0,  0], [26,  9, 17, 26], [36,  0,  0,  0],[ 9, 41,  0,  0]], dtype=int32)

Our training data is now engineered and ready for training.
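
To see what padding does to a single review, here is a minimal plain-Python sketch of the same behavior: zeros appended at the end, and extra items dropped from the front. The helper name pad_post is hypothetical and exists only for this illustration; it assumes maxlen is a positive number.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad at the end; if too long, keep only the last `maxlen` items."""
    trimmed = sequence[-maxlen:]  # drops items from the front when too long
    return trimmed + [value] * (maxlen - len(trimmed))

print(pad_post([21], 4))                 # [21, 0, 0, 0]
print(pad_post([41, 26, 9, 17, 26], 4))  # [26, 9, 17, 26]
```

Both results match the corresponding rows of the padded output above.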

Building a Model

I'm assuming you know the basics of TensorFlow and are familiar with sequential models. Everything here is standard, with the exception of the embedding layer.

Why Embedding Layer?

The data we have engineered is just arrays of integers, and it is hard to tell how similar two words are by comparing their numbers. An embedding layer turns those integers into something more meaningful: dense vectors of a fixed size, whose values are learned during training so that relations between words can be computed.

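The embedding layer takes three main parameters:

input_dim (size of the vocabulary, i.e. the largest word index plus one; 50 here, matching the value passed to one_hot)
output_dim (size of the dense vector produced for each word)
input_length (standard length of the input sequences; newer Keras releases may warn that this argument is deprecated, in which case it can be omitted)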

Here is an example:

sample_data = np.array([[1], [4]], dtype='int32')

emb_layer = tf.keras.layers.Embedding(50, 4, input_length=4)

print(emb_layer(sample_data))

Your output will look like this:

tf.Tensor(
[[[-0.04779602 -0.01631527  0.01087242  0.00247218]]
 [[-0.03402965  0.02020274  0.02596027 -0.00916996]]], shape=(2, 1, 4), dtype=float32)
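
Conceptually, an embedding layer is a lookup table with one row of numbers per word index. The sketch below shows that idea in plain Python, assuming small random starting values like the ones in the output above. The names table and embed are made up for this example, and unlike the real layer nothing here is trained.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

vocab_size, vector_size = 50, 4

# One row of small random numbers per possible word index.
table = [[random.uniform(-0.05, 0.05) for _ in range(vector_size)]
         for _ in range(vocab_size)]

def embed(indices):
    """Replace every integer index with its row from the table."""
    return [table[i] for i in indices]

vectors = embed([1, 4])
print(len(vectors), len(vectors[0]))  # 2 4
```

During training, the real layer adjusts the rows of its table so that words playing a similar role in the task end up with similar vectors.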

Now, instead of bare integer indices, we have a vector representation for each word. Let's put it into our project:


model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(50, 8, input_length=4),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid')
 ])

Above is the complete architecture of our text classification model. Flatten() reshapes the embedding output of shape (batch, 4, 8) into a 2D tensor of shape (batch, 32). The final Dense layer, with a single sigmoid unit, is the deciding node: it outputs a value between 0 and 1 that has the final say on whether a review is positive or negative.
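
What Flatten and the final Dense layer compute for one review can be sketched in plain Python. The input values and weights below are placeholders chosen for illustration, not values from a trained model, and the helper names are hypothetical.

```python
import math

def flatten(matrix):
    """Turn a list of rows into one flat list."""
    return [value for row in matrix for value in row]

def sigmoid(z):
    """Squash any real number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_sigmoid(features, weights, bias):
    """Weighted sum of the features followed by a sigmoid."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return sigmoid(z)

# Hypothetical input: 4 word positions, each an 8-number vector.
embedded_review = [[0.1] * 8 for _ in range(4)]
features = flatten(embedded_review)  # 32 numbers
weights = [0.0] * 32                 # untrained placeholder weights
score = dense_sigmoid(features, weights, bias=0.0)
print(len(features), score)          # 32 0.5
```

With all weights at zero the score is exactly 0.5, meaning the model has no opinion yet. Training moves the weights so that positive reviews score closer to 1 and negative ones closer to 0.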

Now that we have defined our model, we can finish its configuration by specifying the optimizer and the loss function to use:

model.compile(optimizer='adam', loss='binary_crossentropy', 
metrics=['accuracy'])

model.summary()

Output

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 4, 8)              400
_________________________________________________________________
flatten (Flatten)            (None, 32)                0
_________________________________________________________________
dense (Dense)                (None, 1)                 33
=================================================================
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
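
The parameter counts in the summary can be checked by hand. The short calculation below reproduces them from the three numbers used in the model definition; the variable names are chosen only for this illustration.

```python
vocab_size = 50   # input_dim of the embedding layer
vector_size = 8   # output_dim of the embedding layer
max_len = 4       # words per padded review

embedding_params = vocab_size * vector_size  # one vector per word index
flatten_params = 0                           # Flatten only reshapes
dense_inputs = max_len * vector_size         # 32 flattened features
dense_params = dense_inputs * 1 + 1          # 32 weights plus 1 bias

print(embedding_params, flatten_params, dense_params)    # 400 0 33
print(embedding_params + flatten_params + dense_params)  # 433
```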

Training Model

Once configuration is done, we can train our model. Since our dataset is small, we don't need many epochs to fit it. Still, let's train for 1000 epochs and visualize the learning curve:


import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

data_x = [
 'good',  'well done', 'nice', 'Excellent',
 'Bad', 'OOps I hate it deadly', 'embrassing', 
'A piece of shit']

label_x = np.array([1,1,1,1, 0,0,0,0])

# one hot encoding 

one_hot_x = [tf.keras.preprocessing.text.one_hot(d, 50) for d in data_x]

# padding 

padded_x = tf.keras.preprocessing.sequence.pad_sequences(one_hot_x, maxlen=4, padding = 'post')

# Architecting our Model 

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(50, 8, input_length=4),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid')
 ])

# specifying training params 
 
model.compile(optimizer='adam', loss='binary_crossentropy', 
metrics=['accuracy'])

history = model.fit(padded_x, label_x, epochs=1000, 
batch_size=2, verbose=0)

# plotting training graph

plt.plot(history.history['loss'])
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()

Output

The training graph shows the loss value at each epoch. We can see that training was able to minimize the loss effectively, and our model is ready for testing.
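
The loss being minimized is binary cross-entropy. As a rough sketch of what that number measures, here is a plain-Python version of the standard formula. The prediction values are invented for illustration and do not come from the model above.

```python
import math

def binary_crossentropy(labels, predictions, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1 - eps)  # keep p away from 0 and 1 to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [1, 1, 0, 0]
untrained = [0.5, 0.5, 0.5, 0.5]      # a model that has learned nothing
confident = [0.99, 0.98, 0.02, 0.01]  # hypothetical well-fitted model

print(binary_crossentropy(labels, untrained))  # about 0.693
print(binary_crossentropy(labels, confident))  # much smaller
```

A loss curve that starts near 0.69 and falls toward 0 therefore means the predicted scores are moving from "no idea" toward the correct labels.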

Model Evaluation

Let's create a simple function to classify new text using the model we have just trained. It won't be that smart, since our dataset is so small.


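def predict(word):
    # encode and pad the text exactly as we did for the training data
    one_hot_word = [tf.keras.preprocessing.text.one_hot(word, 50)]
    pad_word = tf.keras.preprocessing.sequence.pad_sequences(one_hot_word, maxlen=4, padding='post')
    # model.predict returns one sigmoid score per input, shape (1, 1)
    result = model.predict(pad_word)
    # 0.1 is the cutoff used here; 0.5 is the more common choice for a sigmoid output
    if result[0][0] > 0.1:
        print('you look positive')
    else:
        print("damn you're negative")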

Let's test it by calling the predict function with different inputs:

>>> predict('this tutorial is cool')
you look positive
>>> predict('This tutorial is bad as me ')
damn you're negative

Our model classified both the positive and the negative review correctly, which shows it really learned something!
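
The cutoff in predict decides how a score is turned into a label, so the same score can be read either way. The sketch below makes that explicit with a hypothetical helper, label_from_score, and an invented score of 0.3.

```python
def label_from_score(score, threshold=0.5):
    """Return 'positive' when the sigmoid score exceeds the threshold."""
    return "positive" if score > threshold else "negative"

score = 0.3  # hypothetical model output for some review
print(label_from_score(score, threshold=0.1))  # positive
print(label_from_score(score))                 # negative
```

A low cutoff such as 0.1 labels more reviews as positive. It is worth experimenting with this value once you have more data to evaluate on.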
