Predict BJP Congress Sentiment using Deep Learning

The phenomenal growth in real-time data tracking and analysis techniques has inspired data scientists to visualize and predict sentiments, build real-time models to predict winners, and more. Trust me, the most exciting part of it is capturing information online from all sources and predicting in real time with the highest possible accuracy. The great challenge in this scenario is accuracy and the ever-increasing volume of data flooding in from all sources every second. With these challenges in view, I decided to use a few deep learning techniques to predict moods using Twitter data.

Note that this article assumes a basic knowledge of data science and NLP (Natural Language Processing). But if you are a newcomer to this world, I have provided links throughout the article to help you out.

This blog is structured as follows: describe the deep learning algorithms (LSTM, Bi-directional LSTM, Bi-directional GRU, CNN); train these algorithms on a contextual election corpus as well as on pre-trained word embeddings to predict sentiments towards the contesting parties; and compare the accuracy and log loss of the different models.

GloVe Pre-trained Word Embeddings (License: Apache Version 2.0)

We started our sentiment classification with pre-trained models that represent words as vectors. GloVe builds its vectors from aggregated global word-word co-occurrence statistics of a corpus, while Google's Word2Vec model trains a neural network to predict words close to a target word, so that linear substructures of the word vector space are preserved.

As we represent each word with a vector and a sentence (tweet) as the average of its word vectors to capture its sentiment, it becomes natural to train the word vectors against different moods to aid in the classification and prediction process. As such, Word2Vec is trained with different RNN models.
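Before moving on to the RNN models, here is a minimal sketch of the tweet-averaging step described above. It is not taken from the repository; `keyed_vectors` (a loaded GloVe/Word2Vec gensim KeyedVectors object), `tokens`, and the 300-dimension default are assumptions for illustration.

```python
import numpy as np

def tweet_vector(tokens, keyed_vectors, dim=300):
    """Represent a tokenized tweet as the average of its word vectors."""
    # keep only tokens that exist in the embedding vocabulary
    vectors = [keyed_vectors[t] for t in tokens if t in keyed_vectors]
    if not vectors:                     # every token was out-of-vocabulary
        return np.zeros(dim)
    return np.mean(vectors, axis=0)     # element-wise average over the words
```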
Recurrent Neural Networks

A recurrent neural network (RNN) is a network of inter-linked artificial neural units whose connections between nodes form a directed graph along a sequence. RNNs are particularly known for processing sequential data — text, time series, videos, etc. — where the output at any given instant t is affected by the output at the previous instant t-1 along with the input at t.

We will see how RNN based models (LSTM, GRU, Bi-directional LSTM) perform with an external embedding, which has been trained and distilled on a very large corpus of data, as well as with an internal embedding, where a part of the contextual corpus has been used for training. Basic RNNs suffer from vanishing and exploding gradient problems; LSTM-based networks evolved to handle this.

Auto-encoder

Auto-encoders (here built from RNNs) are known for compressing a relatively long sequence into a limited, fixed-size, dense vector. They are well suited to classifying textual sentiments and hence are used here for the same purpose: training and predicting mood categories for election tweets. An auto-encoder attempts to copy its input to its output through an encoder and decoder architecture. The dimension of the middle hidden layer is lower than that of the input data, so the neural network is forced to represent the input in a smart and compact way in order to reconstruct it successfully.

The auto-encoders used here follow a simple Sequence2Sequence architecture built from an input layer followed by an encoding LSTM layer, an embedding layer, a decoding LSTM layer, and a softmax layer. Both the input and the output of the entire architecture are vectorized representations of the tweets and their labeled sentiments. Finally, the output of the LSTM is passed through a softmax activation to represent the sentiment category.

Auto-Encoder Training with Pre-trained GloVe
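The auto-encoder code itself is not reproduced in this post, so here is a rough sketch, under stated assumptions, of how such a sequence auto-encoder with a softmax sentiment head can be wired up in Keras. The dimensions (MAX_SEQUENCE_LENGTH, VOCAB_SIZE, EMBEDDING_DIM, LATENT_DIM) are illustrative assumptions, and the actual implementation in the repository may differ.

```python
from keras.layers import Input, Embedding, LSTM, RepeatVector, Dense
from keras.models import Model

MAX_SEQUENCE_LENGTH = 1000   # assumed padded tweet length
VOCAB_SIZE = 20000           # assumed tokenizer vocabulary size
EMBEDDING_DIM = 100          # assumed GloVe vector dimension
LATENT_DIM = 64              # assumed size of the compressed code
NO_CLASSES = 8

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded = Embedding(VOCAB_SIZE, EMBEDDING_DIM)(sequence_input)
encoded = LSTM(LATENT_DIM)(embedded)                            # bottleneck code
decoded = RepeatVector(MAX_SEQUENCE_LENGTH)(encoded)            # feed the code back as a sequence
decoded = LSTM(EMBEDDING_DIM, return_sequences=True)(decoded)   # reconstruct the embedded tweet
sentiment = Dense(NO_CLASSES, activation='softmax')(encoded)    # classify from the compressed code

autoencoder = Model(sequence_input, [decoded, sentiment])
autoencoder.compile(optimizer='adam',
                    loss=['mse', 'categorical_crossentropy'],
                    loss_weights=[0.5, 1.0])
autoencoder.summary()
```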
LSTM

LSTMs, a kind of Recurrent Neural Network, possess internal contextual state cells that act as long-term or short-term memory cells. LSTMs solve many problems of vanilla Recurrent Neural Networks:

They help preserve a constant error through continuous learning and backpropagation through time and layers.

LSTMs contain gated cells that control the flow of information. Gated cells are responsible for reading, writing, and storing information. They remain the primary decision-makers for how much cell-state information to retain (input gate), how much of the cell state to pass on to the next neural network layers (output gate), and how much existing information in memory can be forgotten (forget gate).

Gates in LSTMs carry analog information ranging from 0 to 1 through sigmoid activation functions. This analog information flow allows backpropagation to happen through multiple bounded nonlinearities.

LSTMs address the vanishing gradient problem by keeping the gradients steep enough, which keeps training relatively short and accuracy high.

The figure below shows how a word embedding can feed an input sentence to an LSTM. The LSTM layers take the previous hidden state into consideration to extract the key feature vectors that determine the sentiment of the sentence.

The source code below shows how to build a Word Embedding layer with a single hidden LSTM layer of 128 neurons and classify tweets into predefined classes using the "softmax" classifier and the "Adam" optimizer. Source code is available at https://github.com/sharmi1206/elections-2019.

```python
# fileName classifyw2veclstm.py
NO_CLASSES = 8

embedded_sequences = embedding_layer(sequence_input)
l_lstm = LSTM(128)(embedded_sequences)
preds = Dense(NO_CLASSES, activation='softmax')(l_lstm)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()
model.fit(x_train, y_train, nb_epoch=15, batch_size=64)  # use epochs=15 on Keras 2+
output_test = model.evaluate(x_test, y_test, verbose=0)
```

Model Summary with single Layer LSTM

GRU

The GRU bears a close resemblance to the LSTM, with a few minor modifications, and adaptively captures dependencies between time instances.

The absence of an LSTM-like memory unit makes the GRU unable to control the flow of information the way an LSTM unit can. A GRU functions with a "reset" and an "update" gate. The reset gate sits between the previous activation and the next candidate activation to allow the previous state to be forgotten. The update gate decides the level of information propagation and accordingly infers how much of the candidate activation to use in updating the cell state. A GRU possesses fewer parameters and thus may train a bit faster or need less data to generalize, but it falls short of the LSTM on larger datasets, where LSTMs have been shown to perform better.

The source code below shows how to build a GRU with a single hidden layer and classify tweets using the "softmax" classifier and the "Adam" optimizer.

```python
# fileName classifyw2veclstm.py at https://github.com/sharmi1206/elections-2019
NO_CLASSES = 8

embedded_sequences = embedding_layer(sequence_input)
l_lstm = GRU(128)(embedded_sequences)
preds = Dense(NO_CLASSES, activation='softmax')(l_lstm)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()
model.fit(x_train, y_train, nb_epoch=15, batch_size=64)
output_test = model.evaluate(x_test, y_test, verbose=0)
```

Model Summary with single Layer GRU

Bi-directional LSTM

A Bidirectional Recurrent Neural Network (BRNN) connects two hidden layers of opposite directions to the same output. As the information flow in both directions is captured, this increases the amount of input information available to the network and lets the output layer get information from past (backward) and future (forward) states simultaneously.

A BRNN has been used for analyzing public sentiment towards the elections: the election context is fed as its input, and the BRNN gains performance when knowledge of the words preceding and following the most polarized word is taken into consideration from either direction.

A BRNN aims to divide the neurons of a regular RNN into two directions, one for the positive time direction (forward states) and another for the negative time direction (backward states). This facilitates information inclusion from both the past and the future of the current time frame. The outputs of the two states are not connected to the inputs of the opposite-direction states.

BRNNs can be trained with algorithms similar to those for RNNs, because the training process does not involve any interaction between the two directional sets of neurons. Training involves three steps: a forward pass, a backward pass, and weight updates. For the forward pass, the forward states and backward states are passed first to the next hidden layer, and then the states from the output neurons are passed. For the backward pass, the states from the output neurons are passed first, and afterwards the forward and backward states are passed. After the forward and backward passes are completed, the hidden layers' weights are updated.

Bi-directional LSTM model summary
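The post only shows the model summary for the Bi-directional LSTM. Here is a minimal sketch of how such a model can be assembled, reusing the embedding_layer, sequence_input, and NO_CLASSES from the snippets above; the 128-unit layer size is an assumption, not necessarily the repository's value.

```python
from keras.layers import LSTM, Bidirectional, Dense
from keras.models import Model

embedded_sequences = embedding_layer(sequence_input)
# one forward and one backward LSTM; their final outputs are concatenated
l_bilstm = Bidirectional(LSTM(128))(embedded_sequences)
preds = Dense(NO_CLASSES, activation='softmax')(l_bilstm)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()
```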
Convolutional Neural Networks (CNN)

The CNN used for sentiment prediction with pre-trained word embeddings is composed of a 1D convolution layer with 128 filters followed by a 1D Global Max Pooling layer. The 1D convolution layer performs convolutions (feature mapping) over the ordered embedded word vectors in a sentence using a filter size of 5, sliding over 5 words at a time.

Single-layer CNN with 128 filters

CODE

```python
# fileName classifygloveattlstm.py at https://github.com/sharmi1206/elections-2019
model = Sequential()
model.add(layers.Embedding(len(word_index) + 1, EMBEDDING_DIM,
                           weights=[embedding_matrix],
                           input_length=MAX_SEQUENCE_LENGTH,
                           trainable=True))
model.add(layers.Conv1D(128, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(Dense(8, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train, nb_epoch=15, batch_size=64,
                    validation_data=(x_test, y_test))

loss, accuracy = model.evaluate(x_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(x_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))
```

LSTM, Bi-directional LSTM, Bi-directional GRU with Attention Mechanism

Attention mechanisms allow neural networks to decide which vectors (or words) from the past are important for future decisions by considering them in context with the word in question. In the process, the network filters the important and relevant chunks of information and hops over the parts of the sequence that are not relevant to the final goal or task. Such relationships among words and connections to neighboring words can be represented by the directed arcs of a semantic dependency graph.

Further, an attention mechanism takes into account the input from several time steps and distributes attention over the hidden states by assigning different weights, or degrees of importance, to those inputs. For a fixed target word, the first task is to loop over all the encoder states, comparing target and source states to generate a score for each encoder state. A softmax then normalizes all the scores, generating a probability distribution conditioned on the target state. At last, weights are introduced to make the context vector easy to train, so that it gives a predicted output.

The principal advantage of the attention mechanism lies in the context vector's ability to take all the cells' outputs as input when computing the probability distribution over the source, giving the decoder the ability to represent global information rather than a single hidden state.

Bi-directional GRU and LSTM networks with Attention mechanism, Source: wiki

Model Summary Bi-directional LSTM/GRU with Attention layer, Source: Own
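The AttLayer used in the next snippet is a custom Keras layer rather than part of the Keras API. The sketch below shows one common way such an attention layer is implemented (in the style of the textClassifier reference linked in the code); it is an assumption for illustration, not necessarily the exact definition used in the repository.

```python
from keras import backend as K
from keras.layers import Layer

class AttLayer(Layer):
    """Soft attention over the time axis: score each hidden state, softmax the
    scores, and return the attention-weighted sum of the hidden states."""

    def __init__(self, attention_dim, **kwargs):
        self.attention_dim = attention_dim
        super(AttLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # input_shape: (batch, time_steps, hidden_dim)
        self.W = self.add_weight(name='W', shape=(input_shape[-1], self.attention_dim),
                                 initializer='random_normal', trainable=True)
        self.b = self.add_weight(name='b', shape=(self.attention_dim,),
                                 initializer='zeros', trainable=True)
        self.u = self.add_weight(name='u', shape=(self.attention_dim, 1),
                                 initializer='random_normal', trainable=True)
        super(AttLayer, self).build(input_shape)

    def call(self, x):
        uit = K.tanh(K.bias_add(K.dot(x, self.W), self.b))    # (batch, time, att_dim)
        ait = K.squeeze(K.dot(uit, self.u), axis=-1)          # (batch, time) scores
        a = K.softmax(ait)                                    # attention weights
        return K.sum(x * K.expand_dims(a), axis=1)            # weighted sum -> (batch, hidden)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])
```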
The source code below shows how to build a single Bi-directional GRU layer with an Attention layer of 64 neurons and classify tweets into predefined classes using the "softmax" classifier and the "Adam" optimizer. Source code is available at https://github.com/sharmi1206/elections-2019.

CODE

```python
# fileName classifygloveattlstm.py at https://github.com/sharmi1206/elections-2019
from keras.layers import Input, Dense
from keras.layers import GRU, Bidirectional, Embedding
from keras.models import Model
from sklearn.metrics import log_loss, accuracy_score
from sklearn import metrics
from sklearn.metrics import confusion_matrix
import numpy as np

NO_CLASSES = 8

embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
l_gru = Bidirectional(GRU(100, return_sequences=True))(embedded_sequences)
# Ref: https://github.com/richliao/textClassifier/issues/28
l_att = AttLayer(64)(l_gru)
preds = Dense(NO_CLASSES, activation='softmax')(l_att)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()
model.fit(x_train, y_train, nb_epoch=15, batch_size=64)

# Evaluate model accuracy
output_test = model.predict(x_test)
final_pred = np.argmax(output_test, axis=1)
org_y_label = [np.where(r == 1)[0][0] for r in y_test]
results = confusion_matrix(org_y_label, final_pred)
precisions, recall, f1_score, true_sum = metrics.precision_recall_fscore_support(org_y_label, final_pred)

pred_indices = np.argmax(output_test, axis=1)
classes = np.array(range(0, NO_CLASSES))
preds = classes[pred_indices]
print('Log loss: {}'.format(log_loss(classes[np.argmax(y_test, axis=1)], output_test)))
print('Accuracy: {}'.format(accuracy_score(classes[np.argmax(y_test, axis=1)], preds)))
```

Accuracy with Pre-trained Word Embeddings

Accuracy and Log Loss for sentiment prediction, BJP vs Congress

Word Embeddings with Convolutional Neural Networks (CNN) on Election Tweets

Convolutional Neural Networks with Word2Vec models built with Gensim on the election corpus, Source: Wiki

The Word2Vec tool takes a text corpus (a list of tweets) as input and produces word vectors as output. It first constructs a unique vocabulary set from the training text data (the list of tokenized tweets) and then learns vector representations of the words, capturing n-gram features that aid in the sentiment classification process. The process is the same word embedding used with the pre-trained embeddings, the only difference being that training takes place on the election tweets instead of on pre-trained data. We used Keras to convert positive-integer representations of words into a word embedding through a word2vec Embedding layer.

CODE

```python
# fileName classifyw2veccnn.py at https://github.com/sharmi1206/elections-2019
num_words = 20000
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(combined_df['tweet'].values)
word_index = tokenizer.word_index

# Pad the tweet data
X = tokenizer.texts_to_sequences(combined_df['tweet'].values)
X = pad_sequences(X, maxlen=2000)
Y = pd.get_dummies(combined_df['mood']).values

word2vec = Word2Vec(sentences=tokenized_corpus,
                    size=vector_size,
                    window=window_size,
                    iter=500,
                    seed=300,
                    workers=multiprocessing.cpu_count())

# Copy word vectors
X_vecs = word2vec.wv
```
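For the CNN below, each tweet has to be turned into a fixed-size matrix of word vectors rather than a padded sequence of token ids. That conversion step is not reproduced in this post, so here is a minimal sketch under that assumption; the helper name tweets_to_tensor and the example shapes are illustrative, not taken from the repository.

```python
import numpy as np

def tweets_to_tensor(tokenized_tweets, word_vectors, max_tweet_length, vector_size):
    """Look up each token's Word2Vec vector and stack them into a
    (num_tweets, max_tweet_length, vector_size) tensor for the 1D CNN."""
    data = np.zeros((len(tokenized_tweets), max_tweet_length, vector_size), dtype=np.float32)
    for i, tokens in enumerate(tokenized_tweets):
        for j, token in enumerate(tokens[:max_tweet_length]):
            if token in word_vectors:          # skip out-of-vocabulary tokens
                data[i, j] = word_vectors[token]
    return data

# e.g. X_train_cnn = tweets_to_tensor(train_tokens, X_vecs, max_tweet_length=100, vector_size=512)
```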
CNN

The CNN used for sentiment prediction with Word2Vec embeddings is composed of 1D convolution layers and 1D pooling layers stacked in a series of 4 blocks, with 32, 64, 128, and 256 filters respectively. Each 1D convolution layer performs convolutions (feature mapping) over the ordered embedded word vectors in a sentence using a filter size of 3, sliding over 3 words at a time. This lets the model consider 3-grams and understand how words contribute to sentiment in the context of the words around them. After each convolution, we add a max-pooling layer to extract the most significant elements and turn them into a feature vector. We also add dropout regularization of 20% to ensure the model does not overfit. The resultant tensor is flattened into one long, single-column feature vector, which a dense layer with softmax activation then uses to yield the classified output.

CODE

```python
# fileName classifyw2veccnn.py at https://github.com/sharmi1206/elections-2019
from keras.layers.core import Dense, Dropout, Flatten
from keras.layers.convolutional import Conv1D, MaxPooling1D
from keras.optimizers import Adam
from keras.models import Sequential

batch_size = 64
nb_epochs = 20
vector_size = 512
max_tweet_length = 100

model = Sequential()
model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same',
                 input_shape=(max_tweet_length, vector_size)))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
model.add(Conv1D(64, kernel_size=3, activation='elu', padding='same'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
model.add(Conv1D(128, kernel_size=3, activation='elu', padding='same'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
model.add(Conv1D(256, kernel_size=3, activation='elu', padding='same'))
model.add(Dropout(0.2))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(8, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.001, decay=1e-6),
              metrics=['accuracy'])

# Fit the model
model.fit(X_train, Y_train, batch_size=batch_size, shuffle=True, epochs=nb_epochs)
```

Model Summary Convolution Neural Networks

Word Embeddings with Recurrent Neural Networks (LSTM/GRU/Bi-directional LSTMs) on Election Tweets

The neural network architecture (each of LSTM, GRU, Bi-directional LSTM/GRU) is modeled on the 20,000 most frequent words, with each tweet padded to a maximum length of 2000. The first layer is the Embedding layer, which uses 128-length vectors to represent each word (each word is tokenized with Keras's Tokenizer). The next layer is the LSTM layer with 256 memory neurons. Finally, the results are fed to a single output Dense layer with 8 neurons and a softmax activation function to predict the associated mood.
CODE

```python
# fileName classifyw2veclstm.py at https://github.com/sharmi1206/elections-2019
NO_CLASSES = 8
embed_dim = 128
lstm_out = 256

model = Sequential()
model.add(Embedding(num_words, embed_dim, input_length=X.shape[1]))
model.add(LSTM(lstm_out, recurrent_dropout=0.2, dropout=0.2))
model.add(Dense(NO_CLASSES, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['categorical_crossentropy'])
print(model.summary())

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2,
                                                    random_state=42, stratify=Y)

# Fit the model
model.fit(X_train, Y_train, batch_size=batch_size, shuffle=True, epochs=nb_epochs)
output_test = model.predict(X_test)
```

The model yields 99.58% accuracy over 5 epochs with a batch size of 128.

RESULTS

```
Epoch 5/5
  64/7344 [..............................] - ETA: 58:45 - loss: 0.0218 - acc: 1.0000
 128/7344 [..............................] - ETA: 54:28 - loss: 0.0259 - acc: 1.0000
 192/7344 [..............................] - ETA: 57:35 - loss: ...
........
7232/7344 [============================>.] - ETA: 58s - loss: 0.0328 - acc: 0.9960
7296/7344 [============================>.] - ETA: 24s - loss: 0.0330 - acc: 0.9959
7344/7344 [==============================] - 3811s 519ms/step - loss: 0.0331 - acc: 0.9958
```

Conclusion

In this post, we reviewed deep learning methods for creating vector representations of sentences with RNNs and CNNs, and presented their effectiveness at supervised sentiment prediction. With GloVe pre-trained word embeddings, the Bi-directional LSTM and the Bi-directional GRU with Attention layer perform the best, while the Auto-encoder model performs the worst, both in the case of BJP and of Congress. With the Word Embedding matrix trained solely on election-context tweets, the accuracy of the RNN models (LSTM, GRU, Bi-directional LSTM/GRU) increases to almost 99.5%, but the CNN model performs the worst, with 50% accuracy.

However, each of these models can be further improved with extensive hyper-parameter tuning, different numbers of epochs, different learning rates, and the addition of more labeled data for the minority classes. Further altering the neural network architecture by increasing or decreasing the number of neurons and hidden layers might give added improvements.

References

https://www.researchgate.net/figure/The-architecture-of-sentence-representation-learning-network_fig2_325642880
https://blog.myyellowroad.com/unsupervised-sentence-representation-with-deep-learning-104b90079a93
https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
https://code.google.com/archive/p/word2vec/

Please let me know if there were any mistakes; suggestions and feedback are welcome. The election repository is available at https://github.com/sharmi1206/elections-2019. Please feel free to follow me on Linkedin.