Authors:
(1) Bobby He, Department of Computer Science, ETH Zurich (Correspondence to: [email protected].);
(2) Thomas Hofmann, Department of Computer Science, ETH Zurich.
Simplifying Transformer Blocks
In Figs. 9 and 10 we show the layouts of our SAS block (Sec. 4.2) and parallel SAS-P block (Sec. 4.3), which are the counterparts of the block layouts in Fig. 1. Mathematically, our SAS attention sub-block computes (in the notation of Eq. (2)):
This paper is available on arXiv under a CC 4.0 license.