Authors: (1) Bobby He, Department of Computer Science, ETH Zurich (Correspondence to: bobby.he@inf.ethz.ch); (2) Thomas Hofmann, Department of Computer Science, ETH Zurich.

Table of Links

Abstract and Introduction
Related Work
Preliminaries
Simplifying Transformer Blocks
Further Experimental Analysis
Discussion, Reproducibility Statement, Acknowledgements and References
A Duality Between Downweighted Residuals and Restricting Updates in Linear Layers
B Block Layouts
C Additional Experiments
D Implementation Details

A DUALITY BETWEEN DOWNWEIGHTED RESIDUALS AND RESTRICTING UPDATES IN LINEAR LAYERS

In Sec. 4.1, we motivated our reparameterisation of the value and projection parameters, Eq. (6), through a duality between downweighted residual branches and restricting parameter updates (materialised through smaller learning rates) in linear layers. This is a relatively simple argument, found elsewhere in the literature, e.g. Ding et al. (2023), which we outline here for completeness.

We suppose we have a (differentiable) loss function L(W), which is a function of some parameter matrix W. We consider taking a gradient step to minimise L, with learning rate ηW, from initialisation W0. This would give new parameters W1:

W1 = W0 − ηW ∇L(W0),

where ∇L(W0) denotes the gradient of L evaluated at W0.

This paper is available on arxiv under CC 4.0 license.
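The block below sketches how the duality argument typically continues from the gradient step above. It is our own hedged reconstruction under the assumption of plain (non-momentum, non-adaptive) SGD; the reparameterised matrix V and the downweighting scalar β are illustrative names, not necessarily the paper's exact notation for Eq. (6).

```latex
% Sketch of the duality (reconstruction, assuming plain SGD on a differentiable loss L).
% Original step on W with learning rate \eta_W:
\begin{align*}
W_1 &= W_0 - \eta_W \,\nabla_W L(W_0).
\end{align*}
% Downweighted reparameterisation W = \beta V with a fixed scalar \beta.
% By the chain rule, \nabla_V L(\beta V) = \beta \,\nabla_W L(W)\big|_{W = \beta V},
% so an SGD step on V with learning rate \eta_V gives
\begin{align*}
V_1 &= V_0 - \eta_V \,\beta \,\nabla_W L(\beta V_0), \\
W_1 = \beta V_1 &= W_0 - \eta_V \,\beta^2 \,\nabla_W L(W_0).
\end{align*}
% Hence training the downweighted branch with learning rate \eta_V matches
% training the unscaled parameters with the smaller learning rate
% \eta_W = \beta^2 \eta_V: downweighting a residual branch is dual to
% restricting parameter updates in the corresponding linear layer.
```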