The Impact of Hyperparameters on Adversarial Training Performance, by @textmodels

Too Long; Didn't Read

The Hyperparameter Tuning section details the critical role of the hyperparameters α, β, and γ in the NEO-KD objective function. Extreme values can hinder adversarial training, with optimal values for α and β found to be (3, 1), achieving the highest adversarial test accuracy against both max-average and average attacks. The exit-balancing parameter γ is set to [1, 1, 1, 1.5, 1.5, 1.5, 1.5], optimizing knowledge distillation to later exits, confirming the importance of balanced hyperparameter selection for enhanced performance.

Authors:

(1) Seokil Ham, KAIST;

(2) Jungwuk Park, KAIST;

(3) Dong-Jun Han, Purdue University;

(4) Jaekyun Moon, KAIST.

Abstract and 1. Introduction

2. Related Works

3. Proposed NEO-KD Algorithm and 3.1 Problem Setup: Adversarial Training in Multi-Exit Networks

3.2 Algorithm Description

4. Experiments and 4.1 Experimental Setup

4.2. Main Experimental Results

4.3. Ablation Studies and Discussions

5. Conclusion, Acknowledgement and References

A. Experiment Details

B. Clean Test Accuracy and C. Adversarial Training via Average Attack

D. Hyperparameter Tuning

E. Discussions on Performance Degradation at Later Exits

F. Comparison with Recent Defense Methods for Single-Exit Networks

G. Comparison with SKD and ARD and H. Implementations of Stronger Attacker Algorithms

D Hyperparameter Tuning

The NEO-KD objective function has three hyperparameters (α, β, γ): α and β control how much knowledge is distilled through NKD and EOKD, respectively, and γ increases the amount of knowledge distilled to later exits.
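
To make the roles of the three hyperparameters concrete, the following is a minimal Python sketch. It assumes that the distillation part of the objective is a per-exit sum of the form γ_j · (α · NKD_j + β · EOKD_j); this form is an assumption made for illustration, not the objective as stated in this appendix. The function name and the placeholder loss values are hypothetical.

```python
from typing import Sequence


def neo_kd_distillation_term(
    nkd_losses: Sequence[float],
    eokd_losses: Sequence[float],
    alpha: float,
    beta: float,
    gamma: Sequence[float],
) -> float:
    """Weighted sum of per-exit distillation losses.

    Assumed form (illustrative only): sum_j gamma[j] * (alpha * NKD_j + beta * EOKD_j).
    """
    if not (len(nkd_losses) == len(eokd_losses) == len(gamma)):
        raise ValueError("expected one NKD loss, one EOKD loss, and one gamma weight per exit")
    return sum(
        g * (alpha * nkd + beta * eokd)
        for g, nkd, eokd in zip(gamma, nkd_losses, eokd_losses)
    )


# Hyperparameter values selected in this appendix.
ALPHA, BETA = 3.0, 1.0
GAMMA = [1, 1, 1, 1.5, 1.5, 1.5, 1.5]

# Placeholder per-exit losses for a 7-exit network (not measured values).
nkd = [0.2] * 7
eokd = [0.1] * 7
total = neo_kd_distillation_term(nkd, eokd, ALPHA, BETA, GAMMA)
```

In this sketch, raising α or β scales the corresponding distillation loss at every exit, while γ rescales both losses together at individual exits.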

D.1 Hyperparameter (α, β)

Extreme values of α and β can undermine adversarial training. If α is too large, NKD becomes too strong, which results in high dependency among submodels; if α is too small, NKD becomes too weak to distill enough knowledge to the student exits. In contrast, if β is too large, EOKD becomes too strong and can interrupt adversarial training by distilling only sparse knowledge (the likelihoods of most classes are zero); if β is too small, EOKD becomes too weak to mitigate the dependency among submodels. We select α and β values in the range [0.35, 3] and measure the adversarial test accuracy as the average of the adversarial test accuracies over all exits. The candidate (α, β) pairs are (0.35, 1), (1, 0.35), (0.35, 0.35), (0.5, 1), (1, 0.5), (0.5, 0.5), (1, 1), (2, 1), (1, 2), (2, 2), (3, 1), (1, 3), and (3, 3). With (α, β) = (3, 1), NEO-KD achieves an adversarial test accuracy of 28.96% against the max-average attack and 22.88% against the average attack, the highest among the candidate pairs. Therefore, we use (α, β) = (3, 1) in our experiments.
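
The selection procedure described above can be sketched in Python as a search over the listed candidate pairs. The `evaluate` callable is a hypothetical stand-in for training NEO-KD with a given (α, β) and returning the per-exit adversarial test accuracies; it is not part of the source, and the sketch assumes a single attack type per call.

```python
from typing import Callable, List, Sequence, Tuple

# Candidate (alpha, beta) pairs listed in Section D.1.
CANDIDATE_PAIRS: List[Tuple[float, float]] = [
    (0.35, 1), (1, 0.35), (0.35, 0.35),
    (0.5, 1), (1, 0.5), (0.5, 0.5),
    (1, 1), (2, 1), (1, 2), (2, 2),
    (3, 1), (1, 3), (3, 3),
]


def mean_over_exits(per_exit_accuracy: Sequence[float]) -> float:
    """Average the adversarial test accuracy over all exits."""
    return sum(per_exit_accuracy) / len(per_exit_accuracy)


def select_pair(
    candidates: Sequence[Tuple[float, float]],
    evaluate: Callable[[float, float], Sequence[float]],
) -> Tuple[float, float]:
    """Return the candidate pair with the highest exit-averaged accuracy.

    `evaluate(alpha, beta)` is a hypothetical function that trains and
    evaluates a model, returning one accuracy value per exit.
    """
    return max(candidates, key=lambda pair: mean_over_exits(evaluate(*pair)))
```

Each call to `evaluate` corresponds to a full training run, so the cost of this search grows linearly with the number of candidate pairs.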

D.2 Hyperparameter γ

Since the prediction difference between the last exit (the teacher prediction) and the later exits is smaller than that between the last exit and the early exits, the later exits benefit less from knowledge distillation. We therefore assign slightly larger weights to the later exits so that more knowledge is distilled to them than to the early exits. The candidate γ values are [1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1.5, 1.5, 1.5, 1.5], and [1, 1, 1, 1.7, 1.7, 1.7, 1.7]. When we distill 1.5 times more knowledge to the later exits, NEO-KD achieves an adversarial test accuracy of 28.96% against the max-average attack and 22.88% against the average attack. This is higher than assigning the later exits the same weights as the earlier exits (28.13% for the max-average attack and 21.66% for the average attack) or distilling 1.7 times more knowledge to the later exits (28.68% for the max-average attack and 22.58% for the average attack). The adversarial test accuracy is the average of the adversarial test accuracies over all exits. Therefore, we use γ = [1, 1, 1, 1.5, 1.5, 1.5, 1.5] in our experiments. This result indicates that an appropriately chosen exit-balancing parameter γ is needed for high performance.
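
As a quick check of the comparison above, the short Python sketch below tabulates the three reported γ settings and picks the best one per attack. The accuracy values are the ones quoted in this section; the dictionary layout and key names are illustrative choices.

```python
# Exit-averaged adversarial test accuracy (%) reported in Section D.2,
# keyed by the per-exit gamma weights.
REPORTED = {
    (1, 1, 1, 1, 1, 1, 1): {"max_average": 28.13, "average": 21.66},
    (1, 1, 1, 1.5, 1.5, 1.5, 1.5): {"max_average": 28.96, "average": 22.88},
    (1, 1, 1, 1.7, 1.7, 1.7, 1.7): {"max_average": 28.68, "average": 22.58},
}

# For each attack, find the gamma setting with the highest reported accuracy.
best_by_attack = {
    attack: max(REPORTED, key=lambda gamma: REPORTED[gamma][attack])
    for attack in ("max_average", "average")
}

for attack, gamma in best_by_attack.items():
    print(f"{attack}: best gamma = {list(gamma)} ({REPORTED[gamma][attack]:.2f}%)")
```

Both attacks favor the same setting, which matches the choice of γ made in the text.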


This paper is available on arXiv under CC 4.0 license.