An approximate search strategy to choose better results from all possible candidates.

One of the use-cases wherein beam search is applied to get relevant results is Machine Translation. For those who are not familiar with machine translation, you surely must know what Google Translate is, right? That's what I am talking about. Systems like those use the Beam Search technique to figure out the most relevant translation from a set of possible candidates. Read this Wiki to know the literature definition of the same.

Let's discuss this strategy apropos of the Machine Translation use-case. If you are someone who loves to look behind the scenes of any fascinating research, then you must have read somewhere, somehow, that Google's Machine Translation is built upon the encoder-decoder architecture of Neural Networks. I won't be explaining that right now but might do it in future blogs. In case you are not aware of this architecture, then hop in here first to get an idea and then come back to read the remaining.

A Perspective…

A machine translation model can be thought of as a "Conditional Language Model": for a system that translates French to English, the model gives the probability of an English sentence conditioned on the French sentence.

Due to the probabilistic behavior of our model, there can be more than one translation of the input French sentence. We aim to find the most suitable translation (i.e. maximize the joint probability distribution over output words given the input sentence).

Should we pick any one at random? Absolutely, NO.

Should we act Greedy? I won't say YES or NO, but I won't recommend being greedy, because it may get you sub-optimal translations. Unless you are ready to compromise on accuracy, you should not go for this approach.

Why not Greedy?

For a given French sentence, consider these English translations:

Translation 1: I am visiting NY this year end.
Translation 2: I am going to be visiting NY this year end.

Considering the English language and its rules in general, P(going | i, am) > P(visiting | i, am). If you go with the greedy approach, the model is likely to choose Translation 2 as the output instead of Translation 1, because the probability of "going" after "i am" is higher over the distribution of possible words in the vocabulary than that of "visiting" after "i am". Translation 1 is the shorter, better translation, yet greedy decoding would never reach it.

✌️ I think you got me. Let's get to the point…

Beam Search is an approximate search strategy that tries to solve this in an efficient way. In its simplest representation, B (Beam width) is the only tunable hyper-parameter for tweaking translation results. B, in general, decides the number of words to keep in-memory at each step to permute the possibilities.

3 steps to Beam Search

We will illustrate this with an example at each step level.

Step — 1

As a part of Step-1 of Beam Search, the decoder network outputs the Softmax probabilities over the vocabulary, and we keep the words with the top B probabilities in-memory. Let's say am, going, visiting are the top 3 probable words.

Step — 2

As a part of Step-2 of Beam Search, we hardwire Y1 in the decoder network and try to predict the next probable word, given that Y1 has already occurred. Our aim is to maximize the probability of Y1 and Y2 occurring together:

P(Y1, Y2 | X) = P(Y1 | X) * P(Y2 | X, Y1)

Here X = x1, x2, …, xn (all the words in the input sentence).
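To make Step-1 and Step-2 concrete, here is a minimal Python sketch. The candidate words, the probability numbers, and the beam width are all made up for illustration; in a real system these distributions would come from the decoder network's Softmax output.

```python
B = 3  # beam width

# Step-1: made-up Softmax output for the first word Y1, given the input sentence X.
p_y1 = {"am": 0.30, "going": 0.25, "visiting": 0.20, "to": 0.15, "be": 0.10}

# Keep only the top B first-word candidates in memory.
beam = sorted(p_y1.items(), key=lambda kv: kv[1], reverse=True)[:B]
print(beam)   # the 3 words kept: am, going, visiting

# Step-2: made-up conditional distributions P(Y2 | X, Y1) for each kept Y1.
p_y2_given_y1 = {
    "am":       {"going": 0.40, "visiting": 0.35, "happy": 0.25},
    "going":    {"to": 0.70, "home": 0.30},
    "visiting": {"NY": 0.60, "home": 0.40},
}

# Joint probability: P(Y1, Y2 | X) = P(Y1 | X) * P(Y2 | X, Y1)
pairs = [((y1, y2), p1 * p2)
         for y1, p1 in beam
         for y2, p2 in p_y2_given_y1[y1].items()]

# Again keep only the top B (Y1, Y2) pairs; these get hardwired into Step-3.
beam = sorted(pairs, key=lambda kv: kv[1], reverse=True)[:B]
print(beam)   # top-3 pairs by joint probability, led by ('going', 'to')
```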
Step — 3

Again, in Step-3, we take the top B probable (Y1, Y2) pairs from Step-2, hardwire them in the decoder network, and try to find the conditional probability of Y3, i.e. P(Y3 | X, Y1, Y2). Similar to the previous 2 steps, we again find the top B probable 3-word sequences, and so on. We keep up this process till <EOS> is reached. A compact sketch that stitches all the steps together is given at the end of the post.

Let's comment on B

If we set B = 1, then the technique is the same as Greedy Search.

Larger the B: good chances of better output sequences, but it would consume more resources and computation power.

Smaller the B: not-so-good results, but much faster and more memory efficient.

Feel free to comment and share your thoughts. Do share and clap if you ❤ it.
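For reference, here is how all 3 steps could look when stitched into a single loop. This is a minimal, generic sketch, not Google's implementation: the `decoder_step(X, prefix)` function is a hypothetical stand-in for the decoder network's Softmax output, and log-probabilities are summed instead of multiplying raw probabilities, to avoid numerical underflow on long sequences.

```python
import math

def beam_search(X, decoder_step, B=3, max_len=20):
    """Generic beam search sketch (no length normalization, to keep it short).

    decoder_step(X, prefix) is assumed to return a dict
    {word: P(word | X, prefix)} -- the decoder's Softmax output,
    conditioned on the input X and the words generated so far,
    with strictly positive probabilities.
    """
    beam = [((), 0.0)]            # (partial sequence, log-probability)
    completed = []

    for _ in range(max_len):
        candidates = []
        for prefix, logp in beam:
            for word, p in decoder_step(X, prefix).items():
                candidates.append((prefix + (word,), logp + math.log(p)))

        # Keep only the top B partial sequences in memory.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for seq, logp in candidates[:B]:
            if seq[-1] == "<EOS>":
                completed.append((seq, logp))   # this candidate has finished
            else:
                beam.append((seq, logp))
        if not beam:                            # every kept candidate hit <EOS>
            break

    best_seq, _ = max(completed or beam, key=lambda c: c[1])
    return best_seq
```

With B = 1 the loop keeps a single candidate per step, which is exactly the Greedy Search behavior described above.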