An **Encoder** reads and encodes a source sentence into a **fixed-length vector**. A **Decoder** then outputs a translation from that encoded vector.

#### **Limitation**

A potential issue with this encoder–decoder approach is that the network has to compress all the necessary information of a source sentence into a single **fixed-length vector**, which becomes harder as sentences get longer.

#### **How does Attention solve the problem?**

The Attention Mechanism allows the decoder to attend to different parts of the source sentence at each step of the output generation. Instead of encoding the input sequence into a **single fixed context vector**, we let the model learn **how to generate a context vector** for each output time step. That is, we let the model **learn** what to attend to based on the input sentence and what it has produced so far.

### Attention Mechanism

Here, the **Encoder** generates the annotations **h1, h2, …, hT** from the inputs **x1, x2, …, xT**. Then, we have to compute the **context vector ci** for each output time step.

#### **How is the Context Vector for each output time step computed?**

**a** is the **Alignment model**, a **feedforward neural network** that is trained jointly with all the other components of the proposed system.

The **Alignment model** produces a score (e) for how well each encoded input (hj) matches the decoder's previous hidden state (s(i−1)).

The alignment scores are normalized into attention weights using a **softmax function**.

The context vector ci is then a weighted sum of the **annotations** (hj), weighted by those **normalized alignment scores** (see the equations and the short code sketch at the end of this post).

### Decoding

The Decoder generates the output for the i-th time step by looking at the i-th context vector ci, its previous hidden state s(i−1), and the previously generated output.

#### Reference

- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473), 2015.
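
#### Putting it together

Concretely, the steps above amount to the following equations from the referenced paper, where **a** is the alignment model and Tx is the length of the source sentence:

$$
e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j
$$

The decoder state is then updated from the previous state, the previously generated word, and the new context vector: $s_i = f(s_{i-1}, y_{i-1}, c_i)$.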
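
#### A minimal sketch in code

Below is a small NumPy sketch of a single decoder step with this kind of additive attention. The scoring, softmax, and weighted-sum steps follow the equations above; everything else (the parameter names `Wa`, `Ua`, `va`, `Ws`, `Wy`, `Wc`, the dimensions, and the plain tanh state update standing in for the paper's gated recurrent unit) is an illustrative simplification, not the paper's exact architecture.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_context(h, s_prev, Wa, Ua, va):
    """Compute the context vector c_i for one decoder step.

    h      : (T, enc_dim)  encoder annotations h_1 .. h_T
    s_prev : (dec_dim,)    previous decoder hidden state s_{i-1}
    Wa, Ua, va : parameters of the feedforward alignment model a(.)
    """
    # Alignment scores e_ij = va^T tanh(Wa s_{i-1} + Ua h_j)
    scores = np.tanh(s_prev @ Wa + h @ Ua) @ va   # (T,)
    # Normalize the scores with a softmax -> attention weights alpha_ij
    alpha = softmax(scores)                       # (T,)
    # Context vector: weighted sum of the annotations
    c = alpha @ h                                 # (enc_dim,)
    return c, alpha

def decoder_step(s_prev, y_prev, c, Ws, Wy, Wc):
    # Toy state update s_i = tanh(...): the paper uses a gated unit here,
    # this is only meant to show which quantities feed into the new state.
    return np.tanh(s_prev @ Ws + y_prev @ Wy + c @ Wc)

# --- tiny usage example with random parameters ---
rng = np.random.default_rng(0)
T, enc_dim, dec_dim, emb_dim, attn_dim = 5, 8, 6, 4, 7

h = rng.normal(size=(T, enc_dim))      # encoder annotations
s_prev = rng.normal(size=dec_dim)      # previous decoder state s_{i-1}
y_prev = rng.normal(size=emb_dim)      # embedding of the previous output word

Wa = rng.normal(size=(dec_dim, attn_dim))
Ua = rng.normal(size=(enc_dim, attn_dim))
va = rng.normal(size=attn_dim)
Ws = rng.normal(size=(dec_dim, dec_dim))
Wy = rng.normal(size=(emb_dim, dec_dim))
Wc = rng.normal(size=(enc_dim, dec_dim))

c, alpha = attention_context(h, s_prev, Wa, Ua, va)
s_new = decoder_step(s_prev, y_prev, c, Ws, Wy, Wc)
print("attention weights:", np.round(alpha, 3), "sum:", alpha.sum())
print("new decoder state shape:", s_new.shape)
```

The attention weights `alpha` are recomputed at every output time step, so each step gets its own context vector `c`; this is exactly the "learning what to attend to" described earlier, in contrast to a single fixed encoding of the whole sentence.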