In this article, you will learn to train your own text classification model from scratch using TensorFlow in just a few lines of code.

A brief about text classification

Text classification is an application of natural language processing that focuses on grouping a paragraph into predefined categories based on its content, for instance classifying news articles as sports, business, music, and so on.

What will you learn?

- One-hot encoding
- Word embedding
- Neural network with an embedding layer
- Evaluating and testing a trained model

The concepts mentioned above are the fundamentals you are expected to understand about natural language processing with TensorFlow. Moreover, you can apply them to multiple NLP-based projects, so I recommend you read this to the end to grasp them.

Building a Sentiment Analyzer

We will build a simple TensorFlow model that classifies user reviews as either positive or negative by effectively generalizing from the training data.

ML libraries we need

Apart from TensorFlow itself, we also need a few other Python libraries and tools to develop our model, and this article assumes you have them installed on your machine:

1. Numpy
2. Matplotlib
3. Tensorflow

Quick Installation

If you don't have those libraries installed, here is a quick installation guide with pip:

```
pip install numpy
pip install tensorflow
pip install matplotlib
```

Once everything is installed, we are ready to get our hands dirty and build our model.

Getting Started

First of all, we need to import all the necessary libraries we just installed into our codebase:

```python
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
```

Dataset

A dataset can come in various file formats (CSV, JSON, SQL), but in this article we are going to use just a 1D array of sample customer review messages, as shown below:

```python
data_x = ['good', 'well done', 'nice', 'Excellent', 'Bad', 'OOps I hate it deadly', 'embrassing', 'A piece of shit']
```

We can have our labels as a 1D numpy array of 0s and 1s, whereby 1 stands for a positive review and 0 stands for a negative review, arranged to correspond to the training data (data_x), as shown below:

```python
data_x = ['good', 'well done', 'nice', 'Excellent', 'Bad', 'OOps I hate it deadly', 'embrassing', 'A piece of shit']
label_x = np.array([1, 1, 1, 1, 0, 0, 0, 0])
```

Data Engineering - One-Hot Encoding

The machine only understands numbers, and that doesn't change when it comes to training on textual data. Therefore, to be able to train on it, we need a way to obtain a numerical representation of our text dataset. That's where one-hot encoding comes into play. TensorFlow provides an inbuilt method to help you; you can learn more about it in the one hot encoding docs, and here is how to put it into code:

```python
data_x = ['good', 'well done', 'nice', 'Excellent', 'Bad', 'OOps I hate it deadly', 'embrassing', 'A piece of shit']
label_x = np.array([1, 1, 1, 1, 0, 0, 0, 0])

one_hot_x = [tf.keras.preprocessing.text.one_hot(d, 50) for d in data_x]
print(one_hot_x)
```

Here is the output:

```
[[21], [9, 34], [24], [20], [28], [41, 26, 9, 17, 26], [36], [9, 41]]
```

With just one line of list comprehension, we were able to get a numerical representation of our text dataset.
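One thing worth knowing (this aside is mine, not from the original article): despite its name, tf.keras.preprocessing.text.one_hot does not return one-hot vectors. It hashes each word to an integer between 1 and n - 1 (here n = 50), so the exact numbers you see will likely differ on your machine, and two different words can occasionally collide on the same index. A tiny check along these lines shows the behaviour:

```python
# The same word always hashes to the same index within a run...
print(tf.keras.preprocessing.text.one_hot('good', 50))       # e.g. [21]
# ...so repeating the word simply repeats that index.
print(tf.keras.preprocessing.text.one_hot('good good', 50))  # e.g. [21, 21]
```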
Data Engineering - Padding

If you look carefully, you will notice that the encoding produced arrays of different sizes. This is due to the varying lengths of the individual training samples.

That's not good; we need to ensure our training data items have an equal length before training. That's why we need padding to normalize everything to a certain standard length: padding extends arrays that are shorter than the standard length by appending 0s, and trims the extra elements from those that exceed it.

Given the nature of our dataset, let's set our standard length to four (4). maxlen is the parameter for the standard length, so let's set it accordingly:

```python
data_x = ['good', 'well done', 'nice', 'Excellent', 'Bad', 'OOps I hate it deadly', 'embrassing', 'A piece of shit']
label_x = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# one hot encoding
one_hot_x = [tf.keras.preprocessing.text.one_hot(d, 50) for d in data_x]

# padding
padded_x = tf.keras.preprocessing.sequence.pad_sequences(one_hot_x, maxlen=4, padding='post')
print(padded_x)
```

Your output is going to look like this:

```
array([[21,  0,  0,  0],
       [ 9, 34,  0,  0],
       [24,  0,  0,  0],
       [20,  0,  0,  0],
       [28,  0,  0,  0],
       [26,  9, 17, 26],
       [36,  0,  0,  0],
       [ 9, 41,  0,  0]], dtype=int32)
```

As we can see, our training data is now fully engineered and ready for training.

Building a Model

I'm assuming you know the TensorFlow basics and are familiar with sequential models. Everything is going to be standard, with the exception of the embedding layer.

Why an Embedding Layer?

The data we have engineered is just arrays of numbers, and it can be hard to tell how one sample relates to another by comparing raw numbers. An embedding layer helps turn those numbers into something more meaningful by mapping them to dense vectors of a fixed size, whose relations can then be computed.

The embedding layer receives three main parameters:

- input_dim (the total number of unique words in your corpus)
- output_dim (the size of the corresponding dense vectors)
- input_length (the standard length of the input data)

Here is an example:

```python
sample_data = np.array([[1], [4]], dtype='int32')
emb_layer = tf.keras.layers.Embedding(50, 4, input_length=4)
print(emb_layer(sample_data))
```

Your output will look like this:

```
tf.Tensor(
[[[-0.04779602 -0.01631527  0.01087242  0.00247218]]

 [[-0.03402965  0.02020274  0.02596027 -0.00916996]]], shape=(2, 1, 4), dtype=float32)
```

Now, instead of a bunch of meaningless integers, we have a vector representation for our data. Let's put it into our project:

```python
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(50, 8, input_length=4),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
```

Above is the complete architecture of our text classification model, with the addition of Flatten(), which just collapses the higher-dimensional tensor coming out of the embedding into a 2D one. The last Dense layer is the deciding node of our classification model; it has the final say on whether a review is positive or negative.
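To make the shapes concrete, here is a small check I added (not part of the original walkthrough); it pushes the padded reviews through the first two layers of the freshly built, still untrained model and prints the resulting shapes, assuming padded_x from the padding step above is still in scope:

```python
# Illustration only: each padded review of length 4 becomes a 4 x 8 matrix of
# embedding vectors, and Flatten() unrolls that matrix into a single 32-value vector.
embedded = model.layers[0](padded_x)   # shape (8, 4, 8): 8 reviews, 4 tokens, 8-dim vectors
flattened = model.layers[1](embedded)  # shape (8, 32):   one flat feature vector per review
print(embedded.shape, flattened.shape)
```

That 32-value vector is what the final Dense unit weighs up to produce a single score between 0 and 1, thanks to the sigmoid activation.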
Now that we have initialized our model, we can finalize its configuration by specifying the optimizer algorithm to use and the kind of loss to calculate:

```python
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```

Output:

```
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 4, 8)              400
_________________________________________________________________
flatten (Flatten)            (None, 32)                0
_________________________________________________________________
dense (Dense)                (None, 1)                 33
=================================================================
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
```

The parameter counts add up: the embedding table holds 50 × 8 = 400 weights, Flatten learns nothing, and the Dense layer has 32 weights plus 1 bias, giving 433 trainable parameters in total.

Training the Model

Once we finish configuring, we can train our model. Since our dataset is small, we don't need many epochs to train it. However, let's fit it for 1000 epochs and visualize the learning curve:

```python
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

data_x = ['good', 'well done', 'nice', 'Excellent', 'Bad', 'OOps I hate it deadly', 'embrassing', 'A piece of shit']
label_x = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# one hot encoding
one_hot_x = [tf.keras.preprocessing.text.one_hot(d, 50) for d in data_x]

# padding
padded_x = tf.keras.preprocessing.sequence.pad_sequences(one_hot_x, maxlen=4, padding='post')

# Architecting our Model
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(50, 8, input_length=4),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# specifying training params
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(padded_x, label_x, epochs=1000, batch_size=2, verbose=0)

# plotting training graph
plt.plot(history.history['loss'])
plt.show()
```

Output

The output of the training graph is going to look like this:

[Figure: training loss curve over the 1000 epochs]

We can see that our training was able to minimize the loss effectively, and our model is ready for testing.

Model Evaluation

Let's create a simple function to predict the sentiment of new text using the model we have just created, though it won't be that smart, since our dataset was small:

```python
def predict(word):
    one_hot_word = [tf.keras.preprocessing.text.one_hot(word, 50)]
    pad_word = tf.keras.preprocessing.sequence.pad_sequences(one_hot_word, maxlen=4, padding='post')
    result = model.predict(pad_word)
    if result[0][0] > 0.1:
        print('you look positive')
    else:
        print('damn you\'re negative')
```

Let's test it by calling predict with different inputs:

```
>>> predict('this tutorial is cool')
you look positive
>>> predict('This tutorial is bad as me ')
damn you're negative
```

Our model was able to successfully classify the positive and negative reviews, which shows that it really learned something!
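One last sanity check you could add (this snippet is my own addition, not part of the original walkthrough): Keras can score the model on the same training data with model.evaluate, which reports the loss and accuracy metrics we set up in model.compile. Since this toy dataset has no held-out test split, it only confirms how well the model fits its own training data:

```python
# Score the model on the training data itself (there is no separate test set here).
loss, accuracy = model.evaluate(padded_x, label_x, verbose=0)
print(f'loss: {loss:.4f}, accuracy: {accuracy:.2f}')
```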