Authors:
(1) Bobby He, Department of Computer Science, ETH Zurich (Correspondence to: [email protected].);
(2) Thomas Hofmann, Department of Computer Science, ETH Zurich.
Simplifying Transformer Blocks
A Duality Between Downweighted Residual and Restricting Updates In Linear Layers
In Sec. 4.1, we motivated our reparameterisation of the value and projection parameters, Eq. (6), through a duality between downweighted residual branches and restricted parameter updates (materialised through smaller learning rates) in linear layers. This is a relatively simple argument, found elsewhere in the literature (e.g. Ding et al. (2023)), which we outline here for completeness.
Suppose we have a differentiable loss function $L(W)$ of some parameter matrix $W$. Consider taking a gradient step to minimise $L$, with learning rate $\eta_W$, from initialisation $W_0$. This gives new parameters $W_1$:
$$W_1 = W_0 - \eta_W \left.\frac{dL}{dW}\right|_{W=W_0}$$
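The gradient step described above, and the duality it is meant to support, can be checked numerically. The following is a minimal sketch rather than the authors' code: the quadratic toy loss, the target matrix `T`, the numeric values, and in particular the reparameterisation `W = U + beta * V` (with `U` fixed and only `V` trained) are assumptions chosen for illustration. Under those assumptions, training `V` with learning rate `eta_V` moves `W` exactly as a direct step with learning rate `beta**2 * eta_V` would.

```python
# Toy loss L(W) = 0.5 * ||W - T||_F^2, whose gradient is simply W - T.
# T, the matrix sizes, and all numeric values are illustrative assumptions.

def grad_loss(W, T):
    """Gradient of the toy loss at W, i.e. dL/dW = W - T."""
    return [[w - t for w, t in zip(w_row, t_row)] for w_row, t_row in zip(W, T)]

def gradient_step(P0, grad, lr):
    """One gradient-descent step: P1 = P0 - lr * grad."""
    return [[p - lr * g for p, g in zip(p_row, g_row)] for p_row, g_row in zip(P0, grad)]

T = [[1.0, 0.0], [0.0, 1.0]]
W0 = [[0.5, 0.2], [-0.3, 0.8]]
eta_W = 0.1

# Direct update: W1 = W0 - eta_W * dL/dW evaluated at W0.
W1_direct = gradient_step(W0, grad_loss(W0, T), eta_W)

# Assumed reparameterisation W = U + beta * V, with U fixed and V trainable.
beta = 0.5
U = [[0.5, 0.0], [0.0, 0.5]]
V0 = [[(w - u) / beta for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, U)]

# Chain rule: dL/dV = beta * dL/dW.
grad_V = [[beta * g for g in g_row] for g_row in grad_loss(W0, T)]

# Pick eta_V so that beta**2 * eta_V equals eta_W.
eta_V = eta_W / beta**2
V1 = gradient_step(V0, grad_V, eta_V)
W1_reparam = [[u + beta * v for u, v in zip(u_row, v_row)] for u_row, v_row in zip(U, V1)]

# Largest entrywise gap between the two updated matrices (zero up to rounding).
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(W1_direct, W1_reparam)
    for a, b in zip(row_a, row_b)
)
print(max_diff)
```

Read the other way round, keeping the learning rate on `V` fixed while shrinking `beta` reduces the effective learning rate on `W` by a factor of `beta**2`, which is the sense in which a downweighted branch restricts parameter updates.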
This paper is available on arXiv under a CC 4.0 license.