One of the use-cases where beam search is applied to get relevant results is Machine Translation. If you are not familiar with machine translation, you surely know what Google Translate is, right? That’s what I am talking about. Systems like that use the Beam Search technique to figure out the most relevant translation from a set of possible candidates. Read this Wiki article for the textbook definition.
Let’s discuss this strategy in the context of the Machine Translation use-case. If you are someone who loves to dig behind the scenes of any fascinating research, then you must have read somewhere that Google’s Machine Translation is built upon the encoder-decoder architecture of neural networks. I won’t be explaining that right now, but might do it in future blogs. In case you are not aware of this architecture, hop in here first to get an idea and then come back to read the rest.
A machine translation model can be thought of as a “Conditional Language Model”: for a system that translates French to English, the model gives the probability of an English sentence conditioned on the French sentence.
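To make the “conditional language model” idea concrete, here is a minimal Python sketch (not Google’s actual system) that scores an English sentence given a French one by multiplying the per-word conditional probabilities via the chain rule. The TOY_MODEL table and the next_word_probs helper are made-up stand-ins for a real decoder.

```python
import math

# Toy stand-in for the decoder of an encoder-decoder network.
# It returns P(next English word | French sentence, English words so far).
TOY_MODEL = {
    (): {"i": 0.9, "<EOS>": 0.1},
    ("i",): {"am": 0.8, "<EOS>": 0.2},
    ("i", "am"): {"going": 0.5, "visiting": 0.4, "<EOS>": 0.1},
}

def next_word_probs(french_sentence, english_prefix):
    # A real system would run the decoder network here.
    return TOY_MODEL[tuple(english_prefix)]

def log_prob_of_translation(french_sentence, english_words):
    # Chain rule: log P(y1..yT | X) = sum over t of log P(yt | X, y1..y(t-1))
    total = 0.0
    for t, word in enumerate(english_words):
        probs = next_word_probs(french_sentence, english_words[:t])
        total += math.log(probs[word])
    return total

print(log_prob_of_translation("<french sentence>", ["i", "am", "visiting"]))
```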
Due to the probabilistic behavior of our model, there can be more than one translation for the input French sentence. Should we pick any one at random? Absolutely NOT. We aim to find the most suitable translation (i.e. maximize the joint probability of the output words given the input sentence).
What about being greedy, i.e. always picking the single most probable next word? I won’t say YES or NO, but I won’t recommend being greedy, because it may get you sub-optimal translations. Unless you are ready to compromise on accuracy, you should not go for this approach.
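For contrast, here is what the greedy approach looks like in code. This is just a rough sketch, assuming a next_word_probs(french, prefix) helper like the toy one above: at every step it commits to the single most probable word and never looks back.

```python
def greedy_decode(french_sentence, next_word_probs, max_len=20):
    # Pick the single most probable word at every step -- the approach we advise against.
    english = []
    for _ in range(max_len):
        probs = next_word_probs(french_sentence, english)   # P(word | X, words so far)
        best_word = max(probs, key=probs.get)               # keep only the top-1 choice
        english.append(best_word)
        if best_word == "<EOS>":
            break
    return english
```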
For a given French sentence, consider two candidate English translations —
Translation 1: I am visiting NY this year end.
Translation 2: I am going to be visiting NY this year end.
Considering the English language and its rules in general,
P(going|i, am) > P(visiting|i, am)
If you go with the greedy approach, the model is likely to choose Translation 2 as the output instead of Translation 1, because the probability of “going” after “i am” is higher over the distribution of possible words in the vocabulary than that of “visiting” after “i am”.
I think you got me. ✌️
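If you like numbers more than words, here is a toy illustration of the same pitfall. All the probabilities below are completely made up, just to show how the step-wise winner can lose on the joint probability.

```python
# Made-up numbers, purely illustrative.
p_visiting = 0.40            # P("visiting" | X, "i am")
p_going    = 0.50            # P("going"    | X, "i am") -- greedy locks this in

# Probability of completing the rest of each sentence after that choice (toy values):
rest_after_visiting = 0.60   # "NY this year end." is a short, high-probability continuation
rest_after_going    = 0.20   # "to be visiting NY this year end." needs more words, each < 1

p_translation_1 = p_visiting * rest_after_visiting   # 0.24
p_translation_2 = p_going * rest_after_going         # 0.10

print(p_translation_1 > p_translation_2)   # True: Translation 1 wins on joint probability
```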
Beam Search is an approximate search strategy that tries to solve this in an efficient way. In its simplest form, B (the beam width) is the only tunable hyper-parameter for tweaking translation results. B decides the number of candidates to keep in memory at each step while permuting the possibilities.
3 steps to Beam Search
We will illustrate this with an example at each step level.
Step — 1
As a part of Step 1 of Beam Search, the decoder network outputs softmax probabilities over the vocabulary for the first word Y1, and we keep the top B most probable words in memory. Let’s say am, going, visiting are the top 3 probable words (with B = 3).
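Here is a rough sketch of Step 1 in Python, with toy numbers rather than real model outputs: take the softmax distribution for the first word and keep only the B most probable words, together with their log-probabilities.

```python
import numpy as np

B = 3  # beam width

# Hypothetical softmax output of the decoder for the first output word Y1,
# over a tiny vocabulary (a real vocabulary has tens of thousands of words).
vocab = ["i", "am", "going", "visiting", "ny", "<EOS>"]
p_y1 = np.array([0.05, 0.40, 0.30, 0.20, 0.03, 0.02])    # P(Y1 = word | X)

top_b = np.argsort(p_y1)[::-1][:B]                        # indices of the B most probable words
beam = [([vocab[i]], np.log(p_y1[i])) for i in top_b]     # keep (sequence, log-probability) pairs
print(beam)   # [(['am'], ...), (['going'], ...), (['visiting'], ...)]
```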
Step — 2
As a part of Step 2 of Beam Search, we hardwire each of the B choices of Y1 into the decoder network and try to predict the next probable word Y2, given that Y1 has already occurred.
Our aim is to maximize the probability of Y1 and Y2 occurring together.
P(Y1, Y2 | X) = P(Y1 | X) * P(Y2 | X, Y1)
Here X = x1, x2, …, xn (all the words in the input sentence).
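Here is a sketch of Step 2 under the same toy setup: every kept Y1 is extended with every word in the vocabulary, scored with the product rule above (in log space, a sum), and only the top B (Y1, Y2) pairs survive. The decoder_probs function is a made-up placeholder for the real decoder call.

```python
import numpy as np

B = 3
vocab = ["i", "am", "going", "visiting", "ny", "<EOS>"]

# The B candidates for Y1 kept from Step 1, with their log-probabilities (toy numbers).
beam = [(["am"], np.log(0.40)), (["going"], np.log(0.30)), (["visiting"], np.log(0.20))]

def decoder_probs(x, prefix):
    # Hypothetical decoder call: P(next word | X, prefix) for every word in vocab.
    return np.full(len(vocab), 1.0 / len(vocab))   # uniform toy distribution, just so this runs

candidates = []
for prefix, log_p in beam:
    p_next = decoder_probs("<french sentence>", prefix)          # P(Y2 | X, Y1)
    for i, word in enumerate(vocab):
        # log P(Y1, Y2 | X) = log P(Y1 | X) + log P(Y2 | X, Y1)
        candidates.append((prefix + [word], log_p + np.log(p_next[i])))

# Out of B * |vocab| pairs, keep only the B most probable (Y1, Y2) sequences.
beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
print(beam)
```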
Step — 3
Again, in Step 3, we take the top B probable (Y1, Y2) pairs from Step 2, hardwire them into the decoder network, and try to find the conditional probability of Y3, i.e. P(Y3 | X, Y1, Y2).
Similar to the previous two steps, we again find the top B probable 3-word sequences, and so on. We keep repeating this process till <EOS> is reached.
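Putting the three steps together, here is a compact sketch of the whole loop. It assumes a next_word_probs(X, prefix) helper that returns a word-to-probability dict for any prefix (as the hypothetical decoder earlier would), and it stops once every kept candidate ends in <EOS>.

```python
import math

def beam_search(x, next_word_probs, B=3, max_len=20):
    # Keep the B most probable partial translations at every step.
    beam = [([], 0.0)]                                   # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, log_p in beam:
            if seq and seq[-1] == "<EOS>":               # finished sequences are carried over as-is
                candidates.append((seq, log_p))
                continue
            probs = next_word_probs(x, seq)              # P(next word | X, sequence so far)
            for word, p in probs.items():
                candidates.append((seq + [word], log_p + math.log(p)))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
        if all(seq[-1] == "<EOS>" for seq, _ in beam):
            break
    return beam[0][0]                                    # most probable complete translation
```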
Feel free to comment and share your thoughts. Do share and clap if you ❤ it.