A Duality Between Downweighted Residual and Restricting Updates In Linear Layers

by Auto Encoder: How to Ignore the Signal Noise, June 19th, 2024

Too Long; Didn't Read

The reparameterisation of the value and projection parameters in linear layers is motivated by a duality between downweighted residual branches and restricted parameter updates, i.e. smaller learning rates.

Authors:

(1) Bobby He, Department of Computer Science, ETH Zurich (Correspondence to: bobby.he@inf.ethz.ch.);

(2) Thomas Hofmann, Department of Computer Science, ETH Zurich.

Table of Links

Abstract and Introduction

Related Work

Preliminaries

Simplifying Transformer Blocks

Further Experimental Analysis

Discussion, Reproducibility Statement, Acknowledgements and References

A Duality Between Downweighted Residual and Restricting Updates In Linear Layers

B Block Layouts

C Additional Experiments

D Implementation Details

A DUALITY BETWEEN DOWNWEIGHTED RESIDUALS AND RESTRICTING UPDATES IN LINEAR LAYERS

In Sec. 4.1, we motivated our reparameterisation of the value and projection parameters, Eq. (6), through a duality between downweighted residual branches and restricted parameter updates (materialised through smaller learning rates) in linear layers. This is a relatively simple argument, found elsewhere in the literature, e.g. in Ding et al. (2023), which we outline here for completeness.

We suppose we have a (differentiable) loss function L(W), which is a function of some parameter matrix W. We consider taking a gradient step to minimise L, with learning rate η_W, from initialisation W_0. This would give new parameters W_1 = W_0 − η_W dL/dW, with the gradient evaluated at W = W_0.
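The gradient step described above, and the duality it is used to motivate, can be checked numerically. The following is a minimal sketch rather than the authors' code. It assumes that the downweighted branch takes the form W = U + βV with U held fixed and only V trained, and that V uses its own learning rate η_V. The names `beta`, `U`, `V0`, `eta_V`, the toy squared-error loss and the relation η_W = β² η_V are assumptions introduced for illustration; they are not stated in the text above.

```python
import numpy as np


def loss(W, x, y):
    """Toy squared-error loss of a linear layer y_hat = W @ x (illustrative only)."""
    r = W @ x - y
    return 0.5 * float(r @ r)


def grad_W(W, x, y):
    """Gradient of the toy loss with respect to W."""
    return np.outer(W @ x - y, x)


rng = np.random.default_rng(0)
x = rng.normal(size=4)
y = rng.normal(size=3)

# Direct parameterisation: one gradient step on W from W0 with learning rate eta_W.
W0 = rng.normal(size=(3, 4))
eta_W = 0.1
W1 = W0 - eta_W * grad_W(W0, x, y)

# Assumed reparameterisation: W = U + beta * V, with U fixed and only V trained.
beta = 0.5
U = rng.normal(size=(3, 4))
V0 = (W0 - U) / beta              # chosen so that U + beta * V0 equals W0
eta_V = eta_W / beta**2           # assumed matching rule: eta_W = beta**2 * eta_V

# Chain rule: dL/dV = beta * dL/dW, evaluated at W = U + beta * V0.
V1 = V0 - eta_V * beta * grad_W(U + beta * V0, x, y)
W1_reparam = U + beta * V1

# Largest elementwise difference between the two updated weight matrices.
max_gap = float(np.max(np.abs(W1 - W1_reparam)))
print(max_gap)
```

Under these assumptions the two updates agree up to floating-point error, so downweighting the trainable branch by β has the same effect on W as shrinking the learning rate by a factor of β².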

This paper is available on arXiv under a CC 4.0 license.


