This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Andrey Zhmoginov, Google Research (azhmogin@google.com);
(2) Mark Sandler, Google Research (sandler@google.com);
(3) Max Vladymyrov, Google Research (mxv@google.com).
We visualized the attention maps of several transformer-based models that we used for CNN layer generation. Figure 7 shows attention maps for a 2-layer, 4-channel CNN generated using a 1-head, 1-layer transformer on MINIIMAGENET (labeled samples are sorted in the order of their episode labels). The attention map for the final logits layer (“CNN Layer 3”) exhibits a “stairways” pattern, indicating that the weight slice W_{c,·} for episode label c is generated by attending to all samples except those with label c. This is reminiscent of the supervised learning mechanism outlined in Sec. 3.2. Whereas that mechanism attends to all samples with label c and averages their embeddings, an alternative is to average the embeddings of samples with other labels and then invert the result. We hypothesize that the trained transformer performs a similar calculation with additional learned parameters, which may explain the mild fluctuations of attention across different input samples.
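To make the two hypothesized routes for generating the logits-layer weights concrete, the sketch below contrasts them on toy sample embeddings. It is a minimal illustration, not the paper's implementation: the function names, the toy episode, and the simple mean-and-negate computation are all assumptions; the only property taken from the text is that each weight slice W_{c,·} is built either from the samples with label c or from all samples except those with label c.

```python
import numpy as np

def logits_weights_same_class(embeddings, labels, num_classes):
    """Route outlined in Sec. 3.2 (as described in the text): the weight slice
    for class c is the average embedding of the labeled samples WITH label c."""
    return np.stack([
        embeddings[labels == c].mean(axis=0) for c in range(num_classes)
    ])

def logits_weights_other_classes(embeddings, labels, num_classes):
    """Alternative route suggested by the "stairways" attention pattern: the
    slice for class c attends to all samples EXCEPT those with label c,
    averages their embeddings, and inverts (negates) the result."""
    return np.stack([
        -embeddings[labels != c].mean(axis=0) for c in range(num_classes)
    ])

# Hypothetical toy episode: 4 labels, 2 labeled samples per label,
# 8-dimensional sample embeddings, sorted by episode label as in Figure 7.
rng = np.random.default_rng(0)
num_classes, dim = 4, 8
labels = np.repeat(np.arange(num_classes), 2)
embeddings = rng.normal(size=(labels.size, dim))

w_same = logits_weights_same_class(embeddings, labels, num_classes)
w_other = logits_weights_other_classes(embeddings, labels, num_classes)
print(w_same.shape, w_other.shape)  # (4, 8) each: one weight slice W_{c,.} per label
```

Note that if the episode embeddings were exactly zero-mean, the average over the other classes would be a negative multiple of the average over class c, so negating it recovers the same direction as the same-class average; this is one way the "attend to everything except class c" pattern could implement essentially the same computation.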
The exact details of these calculations and the generation of the intermediate convolutional layers are generally much more difficult to interpret from the attention maps alone, and a more careful analysis of the trained model is necessary to draw final conclusions.