Authors:

(1) Bobby He, Department of Computer Science, ETH Zurich (Correspondence to: bobby.he@inf.ethz.ch);

(2) Thomas Hofmann, Department of Computer Science, ETH Zurich.

This paper is available on arxiv under CC 4.0 license.

Table of Links

Abstract and Introduction
Related Work
Preliminaries
Simplifying Transformer Blocks
Further Experimental Analysis
Discussion, Reproducibility Statement, Acknowledgements and References
A Duality Between Downweighted Residual and Restricting Updates In Linear Layers
B Block Layouts
C Additional Experiments
D Implementation Details

3 PRELIMINARIES

A deep transformer architecture of depth L is formed by sequentially stacking L transformer blocks. The most common block is Pre-LN, depicted in Fig. 1 (left), which we treat as a baseline when comparing training speed, both per update step and in terms of runtime. It differs from the original Post-LN block only in the position of the normalisation layers relative to the skip connections, but is more popular because the Post-LN block suffers from poor training stability and signal propagation in deep layers (Xiong et al., 2020; Liu et al., 2020; Noci et al., 2022; He et al., 2023).
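One standard way to write the Pre-LN block, with X denoting the block input, X̂ the intermediate state after the attention sub-block, and Xout the block output (notation chosen here for exposition), is:

X̂ = αSA X + βSA MHA(Norm(X)),

Xout = αFF X̂ + βFF MLP(Norm(X̂)),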
with scalar gain weights αFF, βFF, αSA, βSA fixed to 1 by default. Here, "MHA" stands for Multi-Head Attention (detailed below), and "Norm" denotes a normalisation layer (Ba et al., 2016; Zhang & Sennrich, 2019). In words, the Pre-LN transformer block consists of two sequential sub-blocks (one attention and one MLP), each with its own normalisation layer and residual connection; crucially, the normalisation layers are placed within the residual branch. The MLP usually has a single hidden layer, with a hidden dimension that is some multiple of the width d (e.g. 4 (Vaswani et al., 2017) or 8/3 (Touvron et al., 2023)), and acts on each token in the sequence independently. The MHA sub-block allows tokens to share information with one another using self-attention. For an input sequence X, the self-attention mechanism outputs:
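In its standard scaled dot-product form, with learned query, key and value matrices WQ, WK, WV, key dimension dk, and a masking matrix M (notation chosen here by convention), self-attention can be written as:

Attn(X) = Softmax( XWQ (XWK)ᵀ / √dk + M ) XWV,

where M is, for example, a causal mask in autoregressive language models. Multi-Head Attention computes H such attention operations in parallel on lower-dimensional projections of X and combines the head outputs through a final learned linear projection.

For concreteness, the sketch below shows a minimal PyTorch implementation of the Pre-LN block described above; the use of nn.MultiheadAttention, a GELU activation and a hidden-dimension ratio of 4 are illustrative assumptions rather than choices taken from the paper.

```python
import torch
import torch.nn as nn


class PreLNBlock(nn.Module):
    """Minimal sketch of a Pre-LN transformer block with scalar gains
    (alpha_SA, beta_SA, alpha_FF, beta_FF), all fixed to 1 by default."""

    def __init__(self, d_model: int, n_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm_sa = nn.LayerNorm(d_model)
        self.norm_ff = nn.LayerNorm(d_model)
        # Assumption: PyTorch's built-in MHA stands in for the paper's MHA sub-block.
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
        )
        # Scalar gain weights, fixed to 1 by default.
        self.alpha_sa = self.beta_sa = 1.0
        self.alpha_ff = self.beta_ff = 1.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sub-block: the normalisation layer sits inside the residual branch.
        h = self.norm_sa(x)
        attn_out, _ = self.mha(h, h, h, need_weights=False)
        x = self.alpha_sa * x + self.beta_sa * attn_out
        # MLP sub-block: acts on each token in the sequence independently.
        x = self.alpha_ff * x + self.beta_ff * self.mlp(self.norm_ff(x))
        return x


# Example usage: a batch of 2 sequences of 16 tokens with width d = 256.
x = torch.randn(2, 16, 256)
y = PreLNBlock(d_model=256, n_heads=8)(x)
```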