Authors:
(1) Bobby He, Department of Computer Science, ETH Zurich (Correspondence to: bobby.he@inf.ethz.ch.);
(2) Thomas Hofmann, Department of Computer Science, ETH Zurich.

Table of Links

Abstract and Introduction
Related Work
Preliminaries
Simplifying Transformer Blocks
Further Experimental Analysis
Discussion, Reproducibility Statement, Acknowledgements and References

A Duality Between Downweighted Residual and Restricting Updates In Linear Layers
B Block Layouts
C Additional Experiments
D Implementation Details

3 PRELIMINARIES

A deep transformer architecture of depth L is formed by sequentially stacking L transformer blocks. The most common block is Pre-LN, depicted in Fig. 1 (left), which we treat as a baseline for comparing training speed, in terms of both per-update progress and runtime. It differs from the original Post-LN block only in the position of the normalisation layers relative to the skip connections, but is more popular because the Post-LN block suffers from poor training stability and signal propagation in deep layers (Xiong et al., 2020; Liu et al., 2020; Noci et al., 2022; He et al., 2023).

Formally, for a layer input $X_{\text{in}} \in \mathbb{R}^{T \times d}$ with sequence length T and width d, the Pre-LN block outputs $X_{\text{out}}$, where:

$$X_{\text{out}} = \alpha_{\text{FF}} \hat{X} + \beta_{\text{FF}} \, \text{MLP}(\text{Norm}(\hat{X})), \quad \text{where} \quad \hat{X} = \alpha_{\text{SA}} X_{\text{in}} + \beta_{\text{SA}} \, \text{MHA}(\text{Norm}(X_{\text{in}})),$$

with scalar gain weights $\alpha_{\text{FF}}, \beta_{\text{FF}}, \alpha_{\text{SA}}, \beta_{\text{SA}}$ fixed to 1 by default. Here, “MHA” stands for Multi-Head Attention (detailed below), and “Norm” denotes a normalisation layer (Ba et al., 2016; Zhang & Sennrich, 2019).

In words, the Pre-LN transformer block consists of two sequential sub-blocks (one attention and one MLP), each with a normalisation layer and a residual connection. Crucially, the normalisation layers are placed within the residual branch. The MLP usually has a single hidden layer, with a hidden dimension that is some multiple of d (e.g. 4 (Vaswani et al., 2017) or 8/3 (Touvron et al., 2023)), and acts on each token in the sequence independently.

The MHA sub-block allows tokens to share information with one another using self-attention. For input sequence X, the self-attention mechanism outputs:

This paper is available on arXiv under a CC 4.0 license.
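To make the structure of the Pre-LN block concrete, the following is a minimal sketch in dependency-free Python, not the authors' implementation. It assumes the standard Pre-LN form, X_hat = α_SA · X_in + β_SA · MHA(Norm(X_in)) followed by X_out = α_FF · X_hat + β_FF · MLP(Norm(X_hat)). Several choices are assumptions made for illustration only: a single unmasked attention head stands in for MHA, the normalisation is an RMSNorm-style layer without a learned gain, the MLP uses a ReLU activation with hidden width 4d, and all function and parameter names (`pre_ln_block`, `W_q`, `W_in`, and so on) are hypothetical.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def norm(X, eps=1e-6):
    """RMSNorm-style normalisation of each token (row), without a learned gain (assumption)."""
    out = []
    for row in X:
        rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v / rms for v in row])
    return out

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(X, W_q, W_k, W_v, W_o):
    """Single-head, unmasked scaled dot-product self-attention (stand-in for MHA)."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(Q[0])
    K_T = [list(col) for col in zip(*K)]          # transpose of K, shape d_k x T
    scores = matmul(Q, K_T)                       # shape T x T
    A = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(matmul(A, V), W_o)              # mix tokens, then project

def mlp(X, W_in, W_out):
    """Single-hidden-layer MLP with ReLU (assumption), applied to each token independently."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(X, W_in)]
    return matmul(hidden, W_out)

def combine(alpha, skip, beta, branch):
    """Return alpha * skip + beta * branch, elementwise."""
    return [[alpha * s + beta * b for s, b in zip(rs, rb)] for rs, rb in zip(skip, branch)]

def pre_ln_block(X_in, p, a_sa=1.0, b_sa=1.0, a_ff=1.0, b_ff=1.0):
    """Pre-LN block: Norm sits inside each residual branch, not on the skip path."""
    sa = attention(norm(X_in), p["W_q"], p["W_k"], p["W_v"], p["W_o"])
    X_hat = combine(a_sa, X_in, b_sa, sa)
    ff = mlp(norm(X_hat), p["W_in"], p["W_out"])
    return combine(a_ff, X_hat, b_ff, ff)

def random_matrix(rows, cols, rng):
    """Gaussian initialisation scaled by 1/sqrt(rows); illustrative only."""
    return [[rng.gauss(0.0, 1.0 / math.sqrt(rows)) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
T, d = 3, 4                                       # sequence length and width
params = {name: random_matrix(d, d, rng) for name in ("W_q", "W_k", "W_v", "W_o")}
params["W_in"] = random_matrix(d, 4 * d, rng)     # hidden width 4d
params["W_out"] = random_matrix(4 * d, d, rng)
X = random_matrix(T, d, rng)
Y = pre_ln_block(X, params)                       # same shape as X: T x d
```

Setting both β gains to zero switches off the residual branches, so the block reduces to the identity on its input. This is a convenient sanity check that the normalisation layers sit on the branches rather than on the skip path.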

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Improving Training Stability in Deep Transformers: Pre-LN vs. Post-LN Blocks
