Authors:
(1) Seokil Ham, KAIST;
(2) Jungwuk Park, KAIST;
(3) Dong-Jun Han, Purdue University;
(4) Jaekyun Moon, KAIST.
The NEO-KD objective function has three hyperparameters (α, β, γ): α and β control how much knowledge is distilled through NKD and EOKD, respectively, and γ increases the amount of knowledge distilled to later exits.
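To make the role of the three hyperparameters concrete, the following is a minimal sketch. It assumes that the distillation part of the objective is a per-exit weighted sum of the form γ_i · (α · NKD_i + β · EOKD_i); this form is only inferred from the description above and is not quoted from the paper's equations. The function names and the per-exit loss inputs are hypothetical.

```python
def distill_weights(alpha, beta, gamma):
    """Per-exit (NKD weight, EOKD weight) pairs.

    Assumption: exit i scales its NKD term by gamma[i] * alpha and its
    EOKD term by gamma[i] * beta. This is an illustrative form only.
    """
    return [(g * alpha, g * beta) for g in gamma]


def weighted_distill_loss(nkd_losses, eokd_losses, alpha, beta, gamma):
    """Combine hypothetical per-exit NKD and EOKD loss values into one scalar."""
    if not (len(nkd_losses) == len(eokd_losses) == len(gamma)):
        raise ValueError("expected one NKD loss, EOKD loss and gamma per exit")
    return sum(
        g * (alpha * nkd + beta * eokd)
        for g, nkd, eokd in zip(gamma, nkd_losses, eokd_losses)
    )
```

Under this assumed form, raising α or β strengthens the corresponding distillation term at every exit, while γ only changes how that strength is distributed across exits.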
Extreme values of α and β can undermine adversarial training. If α is too large, NKD becomes too strong and creates high dependency among submodels; if α is too small, NKD is too weak to distill enough knowledge to the student exits. Likewise, if β is too large, EOKD becomes too strong and can interfere with adversarial training because it distills only sparse knowledge (the likelihoods of most classes are zero); if β is too small, EOKD is too weak to mitigate the dependency among submodels. We select α and β from the range [0.35, 3] and evaluate each setting by the adversarial test accuracy averaged over all exits. The candidate (α, β) pairs are (0.35, 1), (1, 0.35), (0.35, 0.35), (0.5, 1), (1, 0.5), (0.5, 0.5), (1, 1), (2, 1), (1, 2), (2, 2), (3, 1), (1, 3), and (3, 3). With (α, β) = (3, 1), NEO-KD achieves 28.96% adversarial test accuracy against the max-average attack and 22.88% against the average attack, the highest among all candidate pairs. We therefore use (α, β) = (3, 1) in our experiments.
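The selection procedure described above can be sketched as a small grid search. This is an illustration rather than the authors' code: `evaluate` is a hypothetical callback that would train NEO-KD with a given (α, β) and return one adversarial test accuracy per exit, and the helper names are assumptions. Only the candidate list and the averaging rule come from the text.

```python
# Candidate (alpha, beta) pairs listed in the text.
CANDIDATE_PAIRS = [
    (0.35, 1), (1, 0.35), (0.35, 0.35),
    (0.5, 1), (1, 0.5), (0.5, 0.5),
    (1, 1), (2, 1), (1, 2), (2, 2),
    (3, 1), (1, 3), (3, 3),
]


def mean_exit_accuracy(per_exit_accuracy):
    """Average adversarial test accuracy over all exits."""
    return sum(per_exit_accuracy) / len(per_exit_accuracy)


def select_best_pair(evaluate, candidates=CANDIDATE_PAIRS):
    """Return the (alpha, beta) pair with the highest exit-averaged accuracy.

    `evaluate(alpha, beta)` is a hypothetical user-supplied function that
    returns a list of per-exit adversarial test accuracies.
    """
    return max(candidates, key=lambda pair: mean_exit_accuracy(evaluate(*pair)))
```

Scoring each pair by the mean over exits matches the evaluation rule stated in the text, so a setting that helps only one exit does not win by itself.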
Because the prediction gap between the last exit (the teacher prediction) and the later exits is smaller than the gap between the last exit and the early exits, later exits benefit less from knowledge distillation. We therefore assign slightly larger weights to later exits so that more knowledge is distilled to them than to early exits. The candidate γ values are [1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1.5, 1.5, 1.5, 1.5], and [1, 1, 1, 1.7, 1.7, 1.7, 1.7]. When we distill 1.5 times more knowledge to later exits, NEO-KD achieves 28.96% adversarial test accuracy against the max-average attack and 22.88% against the average attack. This is higher than giving later exits the same weight as earlier exits (28.13% for max-average and 21.66% for average attack) and higher than distilling 1.7 times more knowledge to later exits (28.68% for max-average and 22.58% for average attack). Each adversarial test accuracy is the average over all exits. We therefore use γ = [1, 1, 1, 1.5, 1.5, 1.5, 1.5] in our experiments. This result indicates that an appropriately chosen exit-balancing parameter γ is needed for high performance.
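The γ comparison above can be tabulated and checked with a few lines of Python. The accuracies below are the exit-averaged values reported in the text; the dictionary layout and function names are illustrative assumptions.

```python
# Exit-averaged adversarial test accuracy (%) reported in the text,
# keyed by the per-exit gamma weights.
REPORTED = {
    (1, 1, 1, 1, 1, 1, 1): {"max_average": 28.13, "average": 21.66},
    (1, 1, 1, 1.5, 1.5, 1.5, 1.5): {"max_average": 28.96, "average": 22.88},
    (1, 1, 1, 1.7, 1.7, 1.7, 1.7): {"max_average": 28.68, "average": 22.58},
}

UNIFORM_GAMMA = (1, 1, 1, 1, 1, 1, 1)


def best_gamma(results, attack):
    """Return the gamma setting with the highest accuracy for one attack."""
    return max(results, key=lambda gamma: results[gamma][attack])


def gain_over_uniform(results, gamma, attack):
    """Accuracy gain (percentage points) of `gamma` over uniform weights."""
    return round(results[gamma][attack] - results[UNIFORM_GAMMA][attack], 2)


if __name__ == "__main__":
    for attack in ("max_average", "average"):
        chosen = best_gamma(REPORTED, attack)
        print(attack, chosen, gain_over_uniform(REPORTED, chosen, attack))
```

Both attacks select the same setting, γ = [1, 1, 1, 1.5, 1.5, 1.5, 1.5], which is consistent with the choice made in the text.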
This paper is available on arXiv under a CC 4.0 license.