This article aims to cover the basics of LSTMs (Long Short Term Memory) and to implement a word detector using the architecture. The detector implemented in this article is a cuss word detector that flags a custom set of cuss words.

What are LSTMs?

LSTMs, or Long Short Term Memory cells, are long term memory units that were designed to solve the vanishing gradient problem of RNNs. Normally the memory in an RNN is short lived: we cannot reliably carry information from 8 - 9 time steps back using a plain RNN. To store information over longer periods, say 1000 time steps, we use an LSTM.

LSTM History :

1997 : LSTM was proposed by Sepp Hochreiter and Jürgen Schmidhuber [1]. By introducing Constant Error Carousel (CEC) units, LSTM deals with the vanishing gradient problem [5]. The initial version of the LSTM block included cells, input gates and output gates.

1999 : Felix Gers, his advisor Jürgen Schmidhuber and Fred Cummins introduced the forget gate (also called the "keep gate") into the LSTM architecture [6], enabling the LSTM to reset its own state [5].

2000 : Gers, Schmidhuber & Cummins added peephole connections (connections from the cell to the gates) into the architecture [7]. Additionally, the output activation function was omitted [5].

2009 : An LSTM based model won the ICDAR connected handwriting recognition competition. Three such models were submitted by a team led by Alex Graves [8]. One was the most accurate model in the competition and another was the fastest [9].

2013 : LSTM networks were a major component of a network that achieved a record 17.7% phoneme error rate on the classic TIMIT natural speech dataset [10].

2014 : Kyunghyun Cho et al. put forward a simplified variant called the Gated Recurrent Unit (GRU) [11].

2015 : Google started using an LSTM for speech recognition on Google Voice [12][13]. According to the official blog post, the new model cut transcription errors by 49% [14].

2016 : Google started using an LSTM to suggest messages in the Allo conversation app [15]. In the same year, Google released the Google Neural Machine Translation system for Google Translate, which used LSTMs to reduce translation errors by 60% [16][17][18]. Apple announced at its Worldwide Developers Conference that it would start using the LSTM for QuickType in the iPhone [19][20][21] and for Siri [22][23]. Amazon released Polly, which generates the voices behind Alexa, using a bidirectional LSTM for its text-to-speech technology [24].

2017 : Facebook performed some 4.5 billion automatic translations every day using long short-term memory networks [25]. Researchers from Michigan State University, IBM Research and Cornell University published a study at the Knowledge Discovery and Data Mining (KDD) conference [26][27][28]; their study describes a novel neural network that performs better on certain data sets than the widely used long short-term memory network. Microsoft reported reaching 94.9% recognition accuracy on the Switchboard corpus, incorporating a vocabulary of 165,000 words; the approach used "dialog session-based long-short-term memory" [29].

2019 : Researchers from the University of Waterloo proposed a related RNN architecture which represents continuous windows of time. It was derived using the Legendre polynomials and outperforms the LSTM on some memory-related benchmarks [30]. An LSTM model climbed to third place on the Large Text Compression Benchmark [31][32].

LSTM Architecture

The full LSTM cell diagram is complex math. To simplify it, we can look at the function of each component, i.e. what all that math represents.
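For readers who want the math itself before the simplified view, below is the standard textbook form of the LSTM update equations (without peephole connections). This is a reference sketch, not something specific to the model built later in this article; x_t is the input at time t, h_t the hidden (short term) state and c_t the cell (long term) state.

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input (learn) gate} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output (use) gate} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate memory} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{new long term memory} \\
h_t &= o_t \odot \tanh(c_t) && \text{new short term memory}
\end{aligned}
$$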
Simplifying it, we can represent the LSTM in terms of a few gates and two memories. In this article we are going to use some abbreviations:

LTM : Long term memory
STM : Short term memory
NLTM : New long term memory
NSTM : New short term memory

Working

1. The data from the LTM is pushed into the forget gate, which keeps only certain features.
2. This data is then pushed into the use and remember gates.
3. The data from the STM and the current event is pushed into the learn gate.
4. This data is again pushed into the remember and use gates.
5. The combined data in the remember gate, coming from the learn gate and the forget gate, is the NLTM.
6. The data in the use gate, which is a combination of data from the forget and learn gates, is the NSTM.

In case you wish to get into the core mathematics behind the LSTM, make sure you check out this beautiful article.
Link : https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Our Model Architecture

LSTM Requirements

In the case of an LSTM, for each piece of data in a sequence (say, for a word in a given sentence), there is a corresponding hidden state h_t. This hidden state is a function of the pieces of data that the LSTM has seen over time; it contains some weights and represents both the short term and long term memory components for the data that the LSTM has already seen.

So, for an LSTM that is looking at words in a sentence, the hidden state will change based on each new word it sees, and we can use that hidden state to predict the next word in a sequence, to help identify the type of word in a language model, and lots of other things!

To create an LSTM in PyTorch we use:

nn.LSTM(input_size=input_dim, hidden_size=hidden_dim, num_layers=n_layers)

input_dim = the number of inputs (a dimension of 20 could represent 20 inputs)
hidden_dim = the size of the hidden state; this will be the number of outputs that each LSTM cell produces at each time step
n_layers = the number of hidden LSTM layers to use; this is typically a value between 1 and 3; a value of 1 means that each LSTM cell has one hidden state. This has a default value of 1.

Hidden State

Once an LSTM has been defined with input and hidden dimensions, we can call it and retrieve the output and hidden state at every time step:

out, hidden = lstm(input.view(1, 1, -1), (h0, c0))

The inputs to an LSTM are (input, (h0, c0)), where:

input = a Tensor containing the values of the input sequence, with dimensions (seq_len, batch, input_size)
h0 = a Tensor containing the initial hidden state for each element in the batch
c0 = a Tensor containing the initial cell memory for each element in the batch

h0 and c0 will default to 0 if they are not specified. Their dimensions are (n_layers, batch, hidden_dim).

We know that an LSTM takes in an expected input size and hidden_dim, but sentences are rarely of a consistent size, so how can we define the input of our LSTM? Well, at the very start of this net, we'll create an Embedding layer that takes in the size of our vocabulary and returns a vector of a specified size, embedding_dim, for each word in an input sequence of words. It's important that this be the first layer in this net. You can read more about this embedding layer in the PyTorch documentation. A minimal sketch of how these pieces fit together follows below.
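To make the shapes above concrete, here is a minimal, self-contained sketch that wires an Embedding layer into an nn.LSTM and prints the resulting tensor sizes. The vocabulary size, dimensions and word indices are arbitrary toy values chosen only for illustration; they are not the values used by the detector below.

import torch
import torch.nn as nn

# Toy sizes, chosen only for illustration.
vocab_size = 10
embedding_dim = 6
hidden_dim = 6
n_layers = 1

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=n_layers)

# A fake "sentence" of 4 word indices.
sentence = torch.tensor([1, 5, 2, 7])

# Embed the words and reshape to (seq_len, batch, input_size).
embeds = embedding(sentence).view(len(sentence), 1, -1)

# Initial hidden state h0 and cell memory c0: (n_layers, batch, hidden_dim).
h0 = torch.zeros(n_layers, 1, hidden_dim)
c0 = torch.zeros(n_layers, 1, hidden_dim)

out, (hn, cn) = lstm(embeds, (h0, c0))

print(out.shape)  # torch.Size([4, 1, 6]) : one output per word in the sequence
print(hn.shape)   # torch.Size([1, 1, 6]) : final hidden state
print(cn.shape)   # torch.Size([1, 1, 6]) : final cell memory

Passing (h0, c0) explicitly is optional; if it is omitted, PyTorch defaults both to zeros, as noted above.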
Pictured below is the expected architecture for this tagger model.

Code

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np

# Training data: each sentence is paired with a per-word tag,
# "O" for an ordinary word and "CS" for a cuss word.
data = [("What the fuck".lower().split(), ["O", "O", "CS"]),
        ("The boy asked him to fuckoff".lower().split(), ["O", "O", "O", "O", "O", "CS"]),
        ("I hate that bastard".lower().split(), ["O", "O", "O", "CS"]),
        ("He is a dicked".lower().split(), ["O", "O", "O", "CS"]),
        ("Hey prick".lower().split(), ["O", "CS"]),
        ("What a pussy you are".lower().split(), ["O", "O", "CS", "O", "O"]),
        ("Dont be a cock".lower().split(), ["O", "O", "O", "CS"])]

# Build the vocabulary: every unique word gets an index.
word2idx = {}
for sent, tags in data:
    for word in sent:
        if word not in word2idx:
            word2idx[word] = len(word2idx)

tag2idx = {"O": 0, "CS": 1}
tag2rev = {0: "O", 1: "CS"}

def prepare_sequence(seq, to_idx):
    # Convert a list of words (or tags) into a tensor of indices.
    idxs = [to_idx[w] for w in seq]
    idxs = np.array(idxs)
    return torch.tensor(idxs)

testsent = "fuckoff boy".lower().split()
inp = prepare_sequence(testsent, word2idx)
print("The test sentence {} is translated to {}\r\n".format(testsent, inp))

class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim
        # Embedding layer: maps word indices to dense vectors.
        self.word_embedding = nn.Embedding(vocab_size, embedding_dim=embedding_dim)
        # LSTM: reads the embedded words one time step at a time.
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim)
        # Linear layer: maps each hidden state to tag scores.
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # (h0, c0), each of shape (n_layers, batch, hidden_dim)
        return (torch.randn(1, 1, self.hidden_dim),
                torch.randn(1, 1, self.hidden_dim))

    def forward(self, sentence):
        embeds = self.word_embedding(sentence)
        lstm_out, hidden_out = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
        tag_outputs = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_outputs, dim=1)
        return tag_scores

EMBEDDING_DIM = 6
HIDDEN_DIM = 6

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word2idx), len(tag2idx))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

n_epochs = 300
for epoch in range(n_epochs):
    epoch_loss = 0.0
    for sent, tags in data:
        model.zero_grad()
        input_sent = prepare_sequence(sent, word2idx)
        tag = prepare_sequence(tags, tag2idx)
        # Reset the hidden state before each sentence.
        model.hidden = model.init_hidden()
        output = model(input_sent)
        loss = loss_function(output, tag)
        epoch_loss += loss.item()
        loss.backward()
        optimizer.step()
    if epoch % 20 == 19:
        print("Epoch : {} , loss : {}".format(epoch, epoch_loss / len(data)))

# Try the trained tagger on a new sentence (every word must be in word2idx).
testsent = "You ".lower().split()
inp = prepare_sequence(testsent, word2idx)
print("Input sent : {}".format(testsent))
tags = model(inp)
_, pred_tags = torch.max(tags, 1)
print("Pred tag : {}".format(pred_tags))
pred = np.array(pred_tags)
for i in range(len(testsent)):
    print("Word : {} , Predicted tag : {}".format(testsent[i], tag2rev[pred[i]]))

For more well documented code, kindly check this GitHub repository, which contains detailed instructions.
Link : https://github.com/srimanthtenneti/Cuss-Word-Detector---LSTM

Conclusion

This is how we use LSTMs to make a word detector.

Contact

Feel free to connect.
Link : https://www.linkedin.com/in/srimanth-tenneti-662b7117b/