This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Andrey Zhmoginov, Google Research (azhmogin@google.com);
(2) Mark Sandler, Google Research (sandler@google.com);
(3) Max Vladymyrov, Google Research (mxv@google.com).
Table of Links
- Abstract and Introduction
- Problem Setup and Related Work
- HyperTransformer
- Experiments
- Conclusion and References
- A Example of a Self-Attention Mechanism For Supervised Learning
- B Model Parameters
- C Additional Supervised Experiments
- D Dependence On Parameters and Ablation Studies
- E Attention Maps of Learned Transformer Models
- F Visualization of The Generated CNN Weights
- G Additional Tables and Figures
C ADDITIONAL SUPERVISED EXPERIMENTS
Although the advantage of decoupling the parameters of the weight generator from those of the generated CNN model is expected to vanish as the CNN model grows, we compared our approach to two other methods, LGM-Net (Li et al., 2019b) and LEO (Rusu et al., 2019), to verify that it can match their performance on sufficiently large models.
For our comparison with the LGM-Net method, we used the same image augmentation technique as Li et al. (2019b), where it was applied at both the training and the evaluation stages (Ma, 2019). We also used the same CNN architecture, with 4 learned 64-channel convolutional layers followed by two generated convolutional layers and the final logits layer. In our weight generator, we used 2-layer transformers with local feature extractors that relied on 48-channel convolutional layers and did not use any global features. We trained our model end-to-end on the miniImageNet 1-shot-5-way task and obtained a test accuracy of 69.3% ± 0.3%, almost identical to the 69.1% accuracy reported in Li et al. (2019b).
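For readers unfamiliar with the task naming, a 1-shot-5-way episode contains 5 classes with 1 labeled support example per class. The following is a minimal sketch of how such an episode could be sampled. It is not taken from the paper: the function `sample_episode`, the `n_query` parameter, and the toy dataset are hypothetical names introduced only for illustration.

```python
import random


def sample_episode(labels_to_items, n_way=5, k_shot=1, n_query=1, rng=None):
    """Sample one n-way k-shot episode from a dict {class_label: [items]}.

    Hypothetical helper for illustration; the paper does not describe
    its episode sampler.
    """
    rng = rng or random.Random()
    # Pick the classes that take part in this episode.
    classes = rng.sample(sorted(labels_to_items), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        # Draw disjoint support and query items for this class.
        picked = rng.sample(labels_to_items[cls], k_shot + n_query)
        support += [(item, episode_label) for item in picked[:k_shot]]
        query += [(item, episode_label) for item in picked[k_shot:]]
    return support, query


# Toy stand-in for an image dataset: 20 classes with 10 items each.
dataset = {f"class_{c}": [f"img_{c}_{i}" for i in range(10)] for c in range(20)}
support, query = sample_episode(
    dataset, n_way=5, k_shot=1, n_query=3, rng=random.Random(0)
)
```

Class labels are re-indexed from 0 to `n_way - 1` within each episode, since the generated model only has to separate the classes present in that episode.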
We also carried out a comparison with LEO by using our method to generate a fully-connected layer on top of the tieredImageNet embeddings pre-computed with the WideResNet-28 model employed by Rusu et al. (2019). For our experiments, we used a simpler 1-layer transformer model with 2 heads that did not have the final fully-connected layer and nonlinearity. We also applied L2 regularization to the generated fully-connected weights, setting the regularization weight to 10⁻³. Training this model yielded test accuracies of 66.2% ± 0.2% and 81.6% ± 0.2% on the 1-shot-5-way and 5-shot-5-way tieredImageNet tasks, respectively. These results are almost identical to the 66.3% and 81.4% accuracies reported in Rusu et al. (2019).
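The L2-regularized objective described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the penalty is taken to be the plain sum of squared weights, the bias is left unregularized, and the loss is averaged over query examples. The text fixes none of these details. All function names are hypothetical.

```python
import math

L2_WEIGHT = 1e-3  # regularization weight stated in the text


def logits(embedding, weights, biases):
    """Apply a fully-connected layer; `weights` holds one row per class."""
    return [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]


def cross_entropy(logit_values, label):
    """Softmax cross-entropy for one example, via a stable log-sum-exp."""
    m = max(logit_values)
    log_z = m + math.log(sum(math.exp(v - m) for v in logit_values))
    return log_z - logit_values[label]


def regularized_loss(queries, weights, biases, l2_weight=L2_WEIGHT):
    """Mean query cross-entropy plus an L2 penalty on the generated weights.

    Assumption: the penalty is the sum of squared weights, biases excluded.
    """
    ce = sum(
        cross_entropy(logits(x, weights, biases), y) for x, y in queries
    ) / len(queries)
    l2 = sum(w * w for row in weights for w in row)
    return ce + l2_weight * l2


# Toy example: 5 classes, 4-dimensional embeddings, all weights equal to 1.
weights = [[1.0] * 4 for _ in range(5)]
biases = [0.0] * 5
queries = [([0.5, -0.2, 0.1, 0.3], 2), ([0.0, 0.4, -0.1, 0.2], 0)]
loss = regularized_loss(queries, weights, biases)
```

In this toy case every class receives the same logit, so the cross-entropy term equals log(5) and the penalty contributes 10⁻³ × 20 = 0.02.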