
Simplifying Transformer Blocks: Block Layouts


Too Long; Didn't Read

Reparameterising value and projection parameters in linear layers, via the duality between downweighted residuals and restricted updates, optimizes learning rates and model performance.

Authors:

(1) Bobby He, Department of Computer Science, ETH Zurich (Correspondence to: [email protected].);

(2) Thomas Hofmann, Department of Computer Science, ETH Zurich.

Abstract and Introduction

Related Work

Preliminaries

Simplifying Transformer Blocks

Further Experimental Analysis

Discussion, Reproducibility Statement, Acknowledgements and References

A Duality Between Downweighted Residual and Restricting Updates In Linear Layers

B Block Layouts

C Additional Experiments

D Implementation Details

B BLOCK LAYOUTS

In Fig. 9 and Fig. 10 we show the layouts of our SAS block (Sec. 4.2) and our parallel SAS-P block (Sec. 4.3). These are the counterparts of the layouts shown in Fig. 1. Mathematically, our SAS attention sub-block computes (in the notation of Eq. (2)):
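The equation itself is rendered as an image in the original post, so only a rough sketch is given here. Assuming Eq. (2) denotes standard masked softmax self-attention with query/key weights W_Q, W_K, head dimension d_k and additive causal mask M, and combining the simplifications of Sec. 4 (shaped attention with trainable scalars α, β, γ, and value and projection matrices fixed to the identity), a single-head version of the SAS computation takes roughly the form

\[
\hat{X} \;=\; \big( \alpha\, I_T + \beta\, A(X) - \gamma\, C \big)\, X,
\qquad
A(X) \;=\; \mathrm{Softmax}\!\Big( \tfrac{1}{\sqrt{d_k}}\, X W_Q W_K^{\top} X^{\top} + M \Big),
\]

where I_T is the identity over the T sequence positions and C is taken here to be the attention matrix with zero query-key logits, i.e. C = Softmax(M); per-head outputs are then combined as in Eq. (2).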



This paper is available on arXiv under a CC 4.0 license.