Improving Training Stability in Deep Transformers: Pre-LN vs. Post-LN Blocks

by Auto Encoder: How to Ignore the Signal Noise, June 19th, 2024

Too Long; Didn't Read

Pre-LN transformer blocks differ from the original Post-LN blocks only in where the normalisation layers sit relative to the skip connections. That placement improves training stability and signal propagation in deep transformers, where Post-LN blocks tend to struggle.

Authors:

(1) Bobby He, Department of Computer Science, ETH Zurich (Correspondence to: bobby.he@inf.ethz.ch.);

(2) Thomas Hofmann, Department of Computer Science, ETH Zurich.

Table of Links

Abstract and Introduction

Related Work

Preliminaries

Simplifying Transformer Blocks

Further Experimental Analysis

Discussion, Reproducibility Statement, Acknowledgements and References

A Duality Between Downweighted Residual and Restricting Updates In Linear Layers

B Block Layouts

C Additional Experiments

D Implementation Details

3 PRELIMINARIES

A deep transformer architecture of depth L is formed by sequentially stacking L transformer blocks. The most common block is Pre-LN, depicted in Fig. 1 (left), which we treat as a baseline for comparing training speed, both per update and in runtime. It differs from the original Post-LN block only in the position of the normalisation layers relative to the skip connections, but is more popular because the Post-LN block suffers from poor training stability and signal propagation in deep layers (Xiong et al., 2020; Liu et al., 2020; Noci et al., 2022; He et al., 2023).

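Formally, given a layer input sequence representation $X_{\text{in}} \in \mathbb{R}^{T \times d}$, with sequence length $T$ and width $d$, the Pre-LN block outputs $X_{\text{out}}$, where:

$$X_{\text{out}} = \alpha_{\text{FF}}\,\hat{X} + \beta_{\text{FF}}\,\text{MLP}(\text{Norm}(\hat{X})), \quad \text{where} \quad \hat{X} = \alpha_{\text{SA}}\,X_{\text{in}} + \beta_{\text{SA}}\,\text{MHA}(\text{Norm}(X_{\text{in}})),$$

with scalar gain weights $\alpha_{\text{FF}}, \beta_{\text{FF}}, \alpha_{\text{SA}}, \beta_{\text{SA}}$ fixed to 1 by default. Here, “MHA” stands for Multi-Head Attention (detailed below), and “Norm” denotes a normalisation layer (Ba et al., 2016; Zhang & Sennrich, 2019). In words, the Pre-LN transformer block consists of two sequential sub-blocks (one attention and one MLP), each with a normalisation layer and a residual connection, and crucially the normalisation layers are placed within the residual branch. The MLP usually has a single hidden layer, with a hidden dimension that is some multiple of $d$ (e.g. 4 (Vaswani et al., 2017) or 8/3 (Touvron et al., 2023)), and acts on each token in the sequence independently.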
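The block layout described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the normalisation is a LayerNorm without learned parameters, and `toy_mha` and `toy_mlp` are hypothetical stand-ins for the real attention and MLP sub-blocks. A Post-LN variant is included for contrast, so that the only difference between the two functions is where the normalisation is applied.

```python
import math

def layer_norm(token, eps=1e-5):
    """Normalise one token vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]

def norm(X):
    """Apply the normalisation layer to every token in the sequence independently."""
    return [layer_norm(token) for token in X]

def combine(X, Y, alpha=1.0, beta=1.0):
    """Return alpha * X + beta * Y element-wise for two T x d sequences."""
    return [[alpha * x + beta * y for x, y in zip(xr, yr)] for xr, yr in zip(X, Y)]

def pre_ln_block(X, mha, mlp, a_sa=1.0, b_sa=1.0, a_ff=1.0, b_ff=1.0):
    """Pre-LN: normalise inside each residual branch; the skip path is left untouched."""
    X_hat = combine(X, mha(norm(X)), a_sa, b_sa)          # attention sub-block
    return combine(X_hat, mlp(norm(X_hat)), a_ff, b_ff)   # MLP sub-block

def post_ln_block(X, mha, mlp):
    """Post-LN: normalise after each residual sum, so the skip path is normalised too."""
    X_hat = norm(combine(X, mha(X)))
    return norm(combine(X_hat, mlp(X_hat)))

def toy_mha(X):
    """Hypothetical stand-in for MHA: uniform attention, each token gets the mean token."""
    mean_token = [sum(col) / len(X) for col in zip(*X)]
    return [list(mean_token) for _ in X]

def toy_mlp(X):
    """Hypothetical stand-in for the MLP: a per-token ReLU, applied independently."""
    return [[max(0.0, v) for v in token] for token in X]

# Two tokens (T = 2) of width d = 4.
X = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
Y_pre = pre_ln_block(X, toy_mha, toy_mlp)
Y_post = post_ln_block(X, toy_mha, toy_mlp)
```

In the Pre-LN function the input reaches the output through an unnormalised skip path, so setting both branch gains (`b_sa`, `b_ff`) to zero turns the block into the identity. The Post-LN function has no such path, because every residual sum is passed through the normalisation layer.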

The MHA sub-block allows tokens to share information between one another using self-attention. For input sequence X, the self-attention mechanism outputs:
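As a rough illustration of how self-attention lets tokens share information, the sketch below implements a single attention head in plain Python. It assumes the standard scaled dot-product form of Vaswani et al. (2017) with an optional causal mask; the weight names `W_q`, `W_k`, `W_v`, the helper functions, and the identity weights in the usage lines are illustrative choices rather than details taken from this paper, and a full MHA sub-block would run several such heads in parallel and project their concatenated outputs.

```python
import math

def matmul(A, B):
    """Multiply an n x m matrix by an m x p matrix (both given as lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v, causal=True):
    """Single-head scaled dot-product self-attention over a T x d sequence X.

    Returns the attended sequence and the T x T attention matrix.
    """
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    A = []
    for i, row in enumerate(scores):
        scaled = [s / math.sqrt(d_k) for s in row]
        if causal:
            # Mask out future positions j > i before the softmax.
            scaled = [s if j <= i else float("-inf") for j, s in enumerate(scaled)]
        A.append(softmax(scaled))
    return matmul(A, V), A

# Three tokens of width 2, with identity projections to keep the example readable.
I2 = [[1.0, 0.0], [0.0, 1.0]]
X_seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, A = self_attention(X_seq, I2, I2, I2)
```

Each row of `A` is a probability distribution over the tokens that position may attend to, and each output token is the corresponding weighted average of the value vectors. With the causal mask on, the first token can only attend to itself, so its output equals its own value vector.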

This paper is available on arXiv under a CC 4.0 license.


