HyperTransformer: Problem Setup and Related Work

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Andrey Zhmoginov, Google Research (azhmogin@google.com);

(2) Mark Sandler, Google Research (sandler@google.com);

(3) Max Vladymyrov, Google Research (mxv@google.com).


2.1 FEW-SHOT LEARNING

2.2 RELATED WORK

Few-shot learning has received a lot of attention from the deep learning community, and while there are hundreds of few-shot learning methods, several common themes have emerged in recent years. Here we outline several existing approaches, show how they relate to our method, and discuss the prior work related to it.

Metric-Based Learning. One family of approaches maps input samples into an embedding space and then applies a nearest-neighbor algorithm that computes distances from the query sample embedding to embeddings computed from support samples with known labels. The metric used to compute the distance can either be the same for all tasks or be task-dependent. This family includes, for example, Siamese networks (Koch et al., 2015), Matching Networks (Vinyals et al., 2016), Prototypical Networks (Snell et al., 2017), Relation Networks (Sung et al., 2018) and TADAM (Oreshkin et al., 2018). It has recently been argued (Tian et al., 2020) that methods based on building a powerful sample representation can frequently outperform numerous other approaches, including many optimization-based methods. However, such approaches essentially amount to the “one-model solves all” approach and thus require larger models than are needed to solve individual tasks.
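As an illustration of the metric-based family, the following is a minimal sketch in the spirit of Prototypical Networks: each class is represented by the mean of its support embeddings, and a query is assigned to the nearest class mean. The function names and the toy two-dimensional embeddings are assumptions made for this example; in practice the embeddings would come from a learned embedding network, which is not shown here.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_prototypes(support_embeddings, support_labels):
    """Group support embeddings by label and average each group."""
    by_class = {}
    for embedding, label in zip(support_embeddings, support_labels):
        by_class.setdefault(label, []).append(embedding)
    return {label: mean_vector(group) for label, group in by_class.items()}


def classify(query_embedding, prototypes):
    """Return the label whose prototype is closest to the query."""
    return min(
        prototypes,
        key=lambda label: squared_distance(query_embedding, prototypes[label]),
    )


# Toy 2-way 2-shot task. These embeddings are made up; they stand in for
# the output of a learned embedding network.
support = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
labels = ["a", "a", "b", "b"]

prototypes = build_prototypes(support, labels)
prediction = classify([1.0, 1.5], prototypes)
```

Here the same distance is used for every task; a task-dependent metric would replace `squared_distance` with a function conditioned on the support set.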

Weight Modulation and Generation. The idea of using a task specification to directly generate or modulate model weights has been previously explored in the generalized supervised learning context (Ratzlaff & Li, 2019) and in specific language models (Mahabadi et al., 2021; Tay et al., 2021; Ye & Ren, 2021). Some few-shot learning methods described above also employ this approach and use task-specific generation or modulation of the weights of the final classification model. For example, in LGM-Net (Li et al., 2019b), the matching network approach is used to generate a few layers on top of a task-agnostic embedding. Another approach, abbreviated as LEO (Rusu et al., 2019), utilized a similar weight generation method to generate initial model weights from the training dataset in a few-shot learning setting, much like what is proposed in this article. However, in Rusu et al. (2019), the generated weights were also refined using several SGD steps, similar to how it is done in MAML. Here we explore a similar idea, but, largely inspired by the HyperNetwork approach (Ha et al., 2017), we instead propose to directly generate an entire task-specific CNN model. Unlike LEO, we do not rely on pre-computed embeddings for images, and we generate the model in a single step without additional SGD steps, which simplifies and stabilizes training.
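To make the notion of single-step weight generation concrete, here is a deliberately simplified sketch. It is not the method proposed in this paper, nor LEO or LGM-Net: the hypothetical `generate_linear_head` below is a fixed, closed-form generator that produces only a final linear layer, whereas the approaches discussed above use learned generators. It maps a support set to weights and biases with no gradient steps, setting each weight row to the class mean and each bias to minus half its squared norm, which yields the same decisions as nearest-centroid classification under squared Euclidean distance.

```python
def generate_linear_head(support_embeddings, support_labels):
    """Map a support set to (classes, weights, biases) of a linear layer.

    Hypothetical closed-form generator used only for illustration:
    w_c is the mean embedding of class c and b_c = -||w_c||^2 / 2.
    """
    sums, counts = {}, {}
    for embedding, label in zip(support_embeddings, support_labels):
        if label not in sums:
            sums[label] = [0.0] * len(embedding)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], embedding)]
        counts[label] += 1

    classes = sorted(sums)
    weights = [[s / counts[c] for s in sums[c]] for c in classes]
    biases = [-0.5 * sum(w * w for w in row) for row in weights]
    return classes, weights, biases


def predict(x, classes, weights, biases):
    """Apply the generated linear layer and return the top-scoring class."""
    logits = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]
    return classes[logits.index(max(logits))]


# Toy 2-way 2-shot task with made-up embeddings.
support = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
labels = ["a", "a", "b", "b"]

# The task-specific layer is produced in one pass, with no SGD refinement.
classes, weights, biases = generate_linear_head(support, labels)
prediction = predict([1.0, 1.5], classes, weights, biases)
```

A learned generator would replace the closed-form rule with a trainable network that is optimized across many tasks, and could emit weights for more than the final layer.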

Transformers in Computer Vision and Few-Shot Learning. Transformer models (Vaswani et al., 2017), originally proposed for natural language understanding applications, have since become a useful tool in practically every subfield of deep learning. In computer vision, transformers have recently seen an explosion of applications ranging from state-of-the-art image classification results (Dosovitskiy et al., 2021; Touvron et al., 2021) to object detection (Carion et al., 2020; Zhu et al., 2021), segmentation (Ye et al., 2019), image super-resolution (Yang et al., 2020), image generation (Chen et al., 2021) and many others. There are also several notable applications in few-shot image classification. For example, in Liu et al. (2021), the transformer model was used for generating universal representations in the multi-domain few-shot learning scenario. Closely related to our approach, Ye et al. (2020) proposed to accomplish embedding adaptation with the help of transformer models. Unlike our method, which generates an entire end-to-end image classification model, this approach applied a task-dependent perturbation to an embedding generated by an independent task-agnostic feature extractor. In Gidaris & Komodakis (2018), a simplified attention-based model was used for the final layer generation.
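The idea of a task-dependent perturbation applied to task-agnostic embeddings can be sketched with a single residual self-attention step over the embeddings of one task. This is an illustrative simplification and not the architecture of Ye et al. (2020): the hypothetical `adapt_embeddings` below uses identity query, key and value projections, whereas a real transformer layer would use learned projections, multiple heads and normalization.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adapt_embeddings(embeddings):
    """One residual self-attention step with identity projections.

    Each embedding attends to every embedding in the same task, and the
    attention-weighted average is added back as a perturbation.
    """
    dim = len(embeddings[0])
    scale = math.sqrt(dim)
    adapted = []
    for query in embeddings:
        # Scaled dot-product scores of this embedding against all others.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in embeddings
        ]
        attention = softmax(scores)
        context = [
            sum(a * value[j] for a, value in zip(attention, embeddings))
            for j in range(dim)
        ]
        # Residual connection: original embedding plus task-dependent shift.
        adapted.append([q + c for q, c in zip(query, context)])
    return adapted


# Made-up task-agnostic embeddings for two support samples.
adapted = adapt_embeddings([[1.0, 0.0], [0.0, 1.0]])
```

Because the perturbation depends on all embeddings in the task, the same input sample can receive a different adapted embedding in different tasks, while the underlying feature extractor stays fixed.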