Authors: (1) Bobby He, Department of Computer Science, ETH Zurich (Correspondence to: bobby.he@inf.ethz.ch); (2) Thomas Hofmann, Department of Computer Science, ETH Zurich.

Table of Links

Abstract and Introduction
Related Work
Preliminaries
Simplifying Transformer Blocks
Further Experimental Analysis
Discussion, Reproducibility Statement, Acknowledgements and References
A Duality Between Downweighted Residuals and Restricting Updates in Linear Layers
B Block Layouts
C Additional Experiments
D Implementation Details

A DUALITY BETWEEN DOWNWEIGHTED RESIDUALS AND RESTRICTING UPDATES IN LINEAR LAYERS

In Sec. 4.1, we motivated our reparameterisation of the value and projection parameters, Eq. (6), through a duality between downweighted residual branches and restricting parameter updates (materialised through smaller learning rates) in linear layers. This is a relatively simple argument, found elsewhere in the literature, e.g. Ding et al. (2023), which we outline here for completeness.

We suppose we have a (differentiable) loss function L(W), which is a function of some parameter matrix W. We consider taking a gradient step to minimise L, with learning rate ηW, from initialisation W0. This would give new parameters W1:

W1 = W0 − ηW ∇L(W0),

where ∇L(W0) denotes the gradient of L evaluated at W0.

This paper is available on arxiv under CC 4.0 license.
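The block below sketches how the duality argument typically continues from the gradient step above. It is our own hedged reconstruction under the assumption of plain (non-momentum, non-adaptive) SGD; the reparameterised matrix V and the downweighting scalar β are illustrative names, not necessarily the paper's exact notation for Eq. (6).

```latex
% Sketch of the duality (reconstruction, assuming plain SGD on a differentiable loss L).
% Original step on W with learning rate \eta_W:
\begin{align*}
W_1 &= W_0 - \eta_W \,\nabla_W L(W_0).
\end{align*}
% Downweighted reparameterisation W = \beta V with a fixed scalar \beta.
% By the chain rule, \nabla_V L(\beta V) = \beta \,\nabla_W L(W)\big|_{W = \beta V},
% so an SGD step on V with learning rate \eta_V gives
\begin{align*}
V_1 &= V_0 - \eta_V \,\beta \,\nabla_W L(\beta V_0), \\
W_1 = \beta V_1 &= W_0 - \eta_V \,\beta^2 \,\nabla_W L(W_0).
\end{align*}
% Hence training the downweighted branch with learning rate \eta_V matches
% training the unscaled parameters with the smaller learning rate
% \eta_W = \beta^2 \eta_V: downweighting a residual branch is dual to
% restricting parameter updates in the corresponding linear layer.
```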