This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Andrey Zhmoginov, Google Research & {azhmogin,sandler,mxv}@google.com;
(2) Mark Sandler, Google Research & {azhmogin,sandler,mxv}@google.com;
(3) Max Vladymyrov, Google Research & {azhmogin,sandler,mxv}@google.com.
In this section, we describe our approach to few-shot learning that we call a HYPERTRANSFORMER (HT) and justify the choice of the self-attention mechanism as its basis.
Along with the input samples, the sequence passed to the transformer was also populated with special learnable placeholder tokens, each associated with a particular slice of the to-be-generated weight tensor. Each such token was a learnable d-dimensional vector padded with zeros to the size of the input sample token. After the entire input sequence was processed by the transformer, we read out model outputs associated with the weight slice placeholder tokens and assembled output weight slices into the final weight tensors (see Fig. 2).
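As a rough illustration of this token layout, the sketch below appends zero-padded learnable placeholder tokens to the support-sample tokens, runs the joint sequence through a transformer encoder, and reads out the activations at the placeholder positions to assemble weight slices. It is a hedged sketch of the described mechanism, not the paper's implementation: the module and dimension names (`WeightSliceGenerator`, `token_dim`, `slice_size`) and the linear `readout` head mapping transformer outputs to slice values are assumptions.

```python
import torch
import torch.nn as nn

class WeightSliceGenerator(nn.Module):
    """Sketch of the HT weight-generation pass for a single CNN layer.

    Sample tokens are assumed to already combine the image-feature and
    label embeddings described in the paper; all names are hypothetical.
    """

    def __init__(self, token_dim, d_placeholder, num_slices, slice_size,
                 num_heads=4, num_layers=2):
        super().__init__()
        # Learnable d-dimensional placeholder vectors, one per weight slice.
        self.placeholders = nn.Parameter(torch.randn(num_slices, d_placeholder))
        self.token_dim = token_dim
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        # Assumed readout: projects a transformer output into one weight slice.
        self.readout = nn.Linear(token_dim, slice_size)

    def forward(self, sample_tokens):
        # sample_tokens: (batch, n_support, token_dim)
        batch = sample_tokens.shape[0]
        num_slices, d = self.placeholders.shape
        # Pad the placeholders with zeros up to the sample-token width.
        pad = torch.zeros(num_slices, self.token_dim - d,
                          device=sample_tokens.device)
        placeholder_tokens = torch.cat([self.placeholders, pad], dim=-1)
        placeholder_tokens = placeholder_tokens.expand(batch, -1, -1)
        # Support-sample tokens and placeholder tokens form one sequence.
        seq = torch.cat([sample_tokens, placeholder_tokens], dim=1)
        out = self.transformer(seq)
        # Read outputs only at the placeholder positions and form the slices;
        # downstream they would be reshaped into the layer's weight tensor.
        slice_outputs = out[:, -num_slices:, :]
        return self.readout(slice_outputs)   # (batch, num_slices, slice_size)
```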
Training the model. The weight generation model uses the support set to produce the weights of some or all CNN model layers. Then, the cross-entropy loss is computed for the query set samples that are passed through the generated CNN model. The weight generation parameters φ (including the transformer model and shared/local feature extractor weights) are learned by optimizing this loss function using stochastic gradient descent.
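This training procedure can be sketched as a standard episodic loop. The snippet below assumes a hypothetical `generator` (standing in for the HT with parameters φ) and a hypothetical `sample_episode` data source, and applies a deliberately simplified generated CNN (one convolution plus a logits layer) purely for illustration.

```python
import torch
import torch.nn.functional as F

def episode_loss(generator, support_tokens, query_images, query_labels):
    """One few-shot episode: support set -> generated weights -> query loss."""
    # Hypothetical generator returning per-layer weight tensors for the CNN.
    weights = generator(support_tokens)          # e.g. {"conv": ..., "logits": ...}
    # Pass query samples through the *generated* CNN (functional form).
    h = F.relu(F.conv2d(query_images, weights["conv"], padding=1))
    h = h.mean(dim=(2, 3))                       # global average pooling
    logits = h @ weights["logits"].t()
    return F.cross_entropy(logits, query_labels)

def train(generator, sample_episode, steps=1000, lr=1e-3):
    # Optimize the weight-generation parameters phi (transformer plus
    # feature extractors) with stochastic gradient descent over episodes.
    opt = torch.optim.SGD(generator.parameters(), lr=lr)
    for _ in range(steps):
        support_tokens, query_images, query_labels = sample_episode()
        loss = episode_loss(generator, support_tokens,
                            query_images, query_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
```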
The choice of the self-attention mechanism for the weight generator is not arbitrary. One motivating reason behind this choice is that the output produced by a generator built on basic self-attention is, by design, invariant to input permutations, i.e., to permutations of the samples in the training dataset. This also makes it suitable for processing unbalanced batches and batches with a variable number of samples (see Sec. 4.3). We now show that the computation performed by a self-attention model with properly chosen parameters can mimic basic few-shot learning algorithms, further motivating its utility.
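The permutation-invariance claim is easy to check numerically. The minimal NumPy sketch below uses a single attention layer with identity query/key/value projections (an assumption made for brevity) and verifies that the output read at a placeholder token does not change when the sample tokens are shuffled.

```python
import numpy as np

def self_attention(tokens):
    """Single-head self-attention without positional encodings (identity Q/K/V)."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ tokens

rng = np.random.default_rng(0)
samples = rng.normal(size=(5, 8))          # five support-sample tokens
placeholder = rng.normal(size=(1, 8))      # one weight-placeholder token

seq = np.concatenate([samples, placeholder])
seq_perm = np.concatenate([samples[[3, 0, 4, 1, 2]], placeholder])

out = self_attention(seq)[-1]              # output at the placeholder position
out_perm = self_attention(seq_perm)[-1]
print(np.allclose(out, out_perm))          # True: invariant to sample permutations
```

Because no positional encodings are used, permuting the sample tokens permutes the attention weights and values identically, so the weighted sum at the placeholder position is unchanged.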
Supervised learning. Self-attention in its rudimentary form can implement a method similar to cosine-similarity-based sample weighting encoded in the logits layer[3] with weights W.
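One plausible instantiation of such cosine-similarity-based weighting (a sketch of the general idea under the zero-mean embedding assumption of footnote [3], not necessarily the exact expression used in the paper) sets each class row of W to the aggregated normalized embeddings of that class's support samples, so that the logits become cosine similarities to class prototypes:

$$
W_{c,\cdot} \;\propto\; \sum_{i:\,y_i = c} \frac{e(x_i)}{\lVert e(x_i)\rVert},
\qquad
\mathrm{logit}_c(x) \;=\; W_{c,\cdot}\cdot\frac{e(x)}{\lVert e(x)\rVert}.
$$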
[3] Here we assume that the embeddings e are unbiased, i.e., ⟨e_i⟩ = 0.
[4] In other words, the self-attention layer should match tokens (µ(i), 0) with (ξ(i), ...).