
One Line of Code Can Make AI Models Faster and More Reliable

February 7th, 2025
Researchers are proposing a simple modification to standard ResNet architectures that substantially improves OoD performance on the DDU benchmark.
Authors:

(1) Jarrod Haas, SARlab, Department of Engineering Science, Simon Fraser University; Digitalist Group Canada and [email protected];

(2) William Yolland, MetaOptima and [email protected];

(3) Bernhard Rabus, SARlab, Department of Engineering Science, Simon Fraser University and [email protected].


  • Abstract and 1 Introduction
  • 2 Background
    • 2.1 Problem Definition
    • 2.2 Related Work
    • 2.3 Deep Deterministic Uncertainty
    • 2.4 L2 Normalization of Feature Space and Neural Collapse
  • 3 Methodology
    • 3.1 Models and Loss Functions
    • 3.2 Measuring Neural Collapse
  • 4 Experiments
    • 4.1 Faster and More Robust OoD Results
    • 4.2 Linking Neural Collapse with OoD Detection
  • 5 Conclusion and Future Work, and References
    • A Appendix
    • A.1 Training Details
    • A.2 Effect of L2 Normalization on Softmax Scores for OoD Detection
    • A.3 Fitting GMMs on Logit Space
    • A.4 Overtraining with L2 Normalization
    • A.5 Neural Collapse Measurements for NC Loss Intervention
    • A.6 Additional Figures

5 Conclusion and Future Work

We propose a simple, one-line-of-code modification of the Deep Deterministic Uncertainty benchmark that provides superior OoD detection and classification accuracy results in a fraction of the training time. We also establish that L2 normalization induces NC faster than regular training, and that NC is linked to OoD detection performance under the DDU method. Although we do not suggest that NC is the sole explanation for OoD performance, we do expect that its simple structure can provide insight into the complex and poorly understood behaviour of uncertainty in deep neural networks. We believe that this connection is a compelling area of future research into uncertainty and robustness in DNNs.
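The source does not reproduce the modified line itself, so the following is only a minimal sketch of what L2 normalization of feature space might look like, written in NumPy. The function name `l2_normalize`, the `eps` guard, and the 512-dimensional feature size are assumptions for illustration; in PyTorch the analogous call would be `torch.nn.functional.normalize(z, dim=1)`.

```python
import numpy as np

def l2_normalize(features, eps=1e-12):
    """Scale each row (one feature vector per input) to unit L2 norm.

    eps is an illustrative guard against division by zero for all-zero rows.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)

# Hypothetical batch: 4 penultimate-layer feature vectors of dimension 512.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 512))
normalized = l2_normalize(features)
```

After this step every feature vector lies on the unit hypersphere, so only its direction, not its magnitude, reaches the final linear layer.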

A Appendix

A.1 Training Details

All models (except those explicitly noted in the ablation study) use spectral normalization, leaky ReLUs and Global Average Pooling (GAP), as these produce the strongest baselines. Each experiment was conducted with fifteen randomly initialized model parameter sets; no fixed seeds were used at any time for initialization. We set the batch size to 1024 for all training runs, except the NC intervention models, which were more stable when training with a batch size of 2048. All training was conducted on four NVIDIA V100 GPUs in PyTorch 1.10.1 (Paszke et al., 2019).


Stochastic gradient descent (SGD) with an initial learning rate of 1e-1 was used as the optimizer for all experiments. We used a learning rate schedule that decreased the learning rate by one order of magnitude at epochs 150 and 250 for the 350 epoch models, as per the DDU benchmark. We adjusted the learning rate at epochs 75 and 90 for the 100 epoch ResNet50 models, and at epochs 40 and 50 for the 60 epoch ResNet18 models. Models were trained on the standard CIFAR-10 training data set, with a 10% validation split created using a fixed random seed.
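The three schedules described above can be sketched as a single step function. This is an illustrative reading, not the authors' training code: the names `step_lr` and `SCHEDULES` are hypothetical, and it assumes 0-indexed epochs with each drop taking effect at the start of the milestone epoch. In PyTorch, `torch.optim.lr_scheduler.MultiStepLR` provides a comparable schedule.

```python
def step_lr(epoch, milestones, base_lr=0.1, gamma=0.1):
    """Learning rate for a 0-indexed epoch.

    The rate starts at base_lr and is multiplied by gamma once for every
    milestone that has been reached (assumption: the drop applies from the
    milestone epoch onward).
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# (total epochs, milestones) for the three training lengths in the text.
SCHEDULES = {
    "350 epochs": (350, (150, 250)),
    "ResNet50, 100 epochs": (100, (75, 90)),
    "ResNet18, 60 epochs": (60, (40, 50)),
}

for name, (total, milestones) in SCHEDULES.items():
    rates = [step_lr(e, milestones) for e in (0, milestones[0], total - 1)]
    print(name, ["%.0e" % r for r in rates])
```

Each schedule ends at a learning rate two orders of magnitude below the initial 1e-1, regardless of total training length.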

A.2 Effect of L2 Normalization on Softmax Scores for OoD Detection

(a) The variability of AUROC scores is substantially reduced under L2 normalization of feature space. With much less training, worst case OoD performance across model seeds improves substantially over the baseline, and mean performance improves or is competitive in all cases.


(b) Softmax performs worse in all cases versus GMMs on L2 normalized feature space, with a single exception: SVHN on ResNet50.


Table 4: OoD detection results using (a) log probabilities from a GMM fitted over feature space and (b) softmax scores. ResNet18 and ResNet50 models were used, 15 seeds per experiment, trained on CIFAR10, with SVHN, CIFAR100 and Tiny ImageNet test sets used as OoD data. For all models, we indicate whether L2 normalization over feature space was used (L2/No L2) and how many training epochs occurred (60/100/350), and compare against baseline (No L2 350). There is no clear pattern of behaviour when using softmax scores for OoD detection, but using GMMs provides superior results.
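To make the two scoring rules compared in Table 4 concrete, here is a minimal NumPy sketch on synthetic 2-dimensional data. It assumes one full-covariance Gaussian per class, weighted by class frequency, as the GMM; the function names, the `jitter` term, and the toy clusters are illustrative assumptions and do not reproduce the paper's pipeline or results.

```python
import numpy as np

def fit_class_gaussians(features, labels, jitter=1e-6):
    """Fit one Gaussian per class; returns (log_prior, mean, cov) triples.

    Assumes feature dimension >= 2. jitter keeps covariances invertible.
    """
    params = []
    for c in np.unique(labels):
        x = features[labels == c]
        cov = np.cov(x, rowvar=False) + jitter * np.eye(x.shape[1])
        params.append((np.log(len(x) / len(features)), x.mean(axis=0), cov))
    return params

def gmm_log_density(features, params):
    """log p(z) = logsumexp over classes of log pi_c + log N(z; mu_c, Sigma_c)."""
    d = features.shape[1]
    per_class = []
    for log_prior, mean, cov in params:
        diff = features - mean
        _, logdet = np.linalg.slogdet(cov)
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        per_class.append(log_prior - 0.5 * (d * np.log(2 * np.pi) + logdet + maha))
    stacked = np.stack(per_class, axis=1)
    peak = stacked.max(axis=1, keepdims=True)  # stabilizes the logsumexp
    return (peak + np.log(np.exp(stacked - peak).sum(axis=1, keepdims=True))).ravel()

def max_softmax(logits):
    """Softmax-based score: the largest class probability per input."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

# Synthetic stand-ins: two tight training clusters and a far-away OoD cluster.
rng = np.random.default_rng(0)
train = np.concatenate([rng.normal([0, 0], 0.1, size=(200, 2)),
                        rng.normal([3, 3], 0.1, size=(200, 2))])
labels = np.repeat([0, 1], 200)
params = fit_class_gaussians(train, labels)
in_dist = rng.normal([0, 0], 0.1, size=(50, 2))
ood = rng.normal([10, -10], 0.1, size=(50, 2))
in_scores = gmm_log_density(in_dist, params)
ood_scores = gmm_log_density(ood, params)
```

With either rule, a lower score marks an input as more likely OoD. The density score depends on where a feature sits relative to the fitted class clusters, whereas the softmax score depends only on the logits.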

A.3 Fitting GMMs on Logit Space





Table 5 shows the results of experiments with GMMs fit over logit space. This approach performs worse than GMMs fit over feature space in all cases. Intuitively, this makes sense: even under perfect NC, we would expect OoD inputs to increase the variability of class clusters in arbitrary dimensions of feature space. A Singular Value Decomposition (SVD) over feature space supports our intuitions. In Figure 6, we show the singular values of all training embeddings for CIFAR10, along with the singular values for the test set and the SVHN OoD test set projected onto the same basis used for the training singular values. As we would expect, the first 10 singular values contain nearly all of the information. However, the remaining 502 singular values contain significantly more information in the OoD case. This information is critical to identifying OoD examples in feature space and, due to dimensionality reduction, is severely reduced in logit space.
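The projection described above can be sketched as follows. This is one plausible reading of "projected onto the same basis", assuming the basis is the set of right-singular vectors of the training embeddings; the function names and synthetic data are hypothetical. The comparison here uses data sets of equal size, and sets of different sizes would need a per-sample normalization that the source does not specify.

```python
import numpy as np

def singular_spectrum(train_features):
    """SVD of training embeddings: singular values and right-singular basis."""
    _, s, vt = np.linalg.svd(train_features, full_matrices=False)
    return s, vt

def projected_spectrum(features, vt):
    """Magnitude of a data set along each training basis direction.

    For the training set itself this reproduces its singular values, because
    X @ V = U @ diag(S) and the columns of U have unit norm.
    """
    return np.linalg.norm(features @ vt.T, axis=0)

# Synthetic stand-ins: training data concentrated in 3 of 16 dimensions,
# "OoD" data spread evenly over all 16.
rng = np.random.default_rng(0)
train = 0.01 * rng.normal(size=(500, 16))
train[:, :3] += rng.normal(size=(500, 3))
ood = rng.normal(size=(500, 16))

s, vt = singular_spectrum(train)
train_tail = projected_spectrum(train, vt)[3:].sum()
ood_tail = projected_spectrum(ood, vt)[3:].sum()
```

In this toy setting the OoD set carries far more energy in the trailing directions than the training set does, which mirrors the qualitative pattern the paragraph attributes to SVHN.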


Table 5: OoD detection results for ResNet18 and ResNet50 models using log probabilities taken from GMMs fitted over logit space instead of feature space (same experimental setup as Table 4). This approach performs worse in all cases versus using GMMs on L2 normalized feature space (see Table 1).


Figure 6: The first 200 singular values for CIFAR10 train and test sets, as well as the SVHN OoD set. As expected, singular value magnitudes fall off drastically after the first 10. Singular value magnitudes for OoD examples remain higher after 10, indicating greater deviation from the Simplex ETF structure. This information is exploited by the GMM to identify OoD examples, and is much less prevalent in the heavily reduced dimensionality of logit space.

A.4 Overtraining with L2 Normalization

Table 6 shows the results of overtraining with L2 Normalization (L2 350). While there is not a substantial penalty for overtraining by 10 to 100 epochs (Figure 5, Right), training for the full 350 epochs (as with the DDU baseline) starts to reduce OoD performance by a few percentage points. We note that there is a tradeoff with accuracy, which does increase when overtraining to 350 epochs.

(a) OoD performance begins to decay substantially when models are overtrained with L2 normalization, in line with our discussion in Section 4.2.2.


(b) Accuracy increases slightly when substantially overtraining with L2 normalization, but OoD performance drops. For ResNet50, higher accuracy is achieved in only 100 epochs compared with the baseline.


Table 6: OoD detection (a) and classification accuracy results (b) for ResNet18 and ResNet50 models, 15 seeds per experiment, trained on CIFAR10, with SVHN, CIFAR100 and Tiny ImageNet test sets used as OoD data. For all models, we indicate whether L2 normalization over feature space was used (L2/No L2) and how many training epochs occurred (60/100/350), and compare against baseline (No L2 350).
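The tables report AUROC aggregated over 15 seeds, and the discussion refers to mean and worst-case performance. A minimal sketch of both computations follows, using the rank-based definition of AUROC and assuming that higher scores indicate in-distribution inputs. The function names and the per-seed values are hypothetical placeholders, not results from the paper.

```python
import numpy as np

def auroc(in_scores, ood_scores):
    """Probability that a random in-distribution score exceeds a random OoD
    score, with ties counted as one half (pairwise Mann-Whitney form)."""
    a = np.asarray(in_scores, dtype=float)[:, None]
    b = np.asarray(ood_scores, dtype=float)[None, :]
    return (a > b).mean() + 0.5 * (a == b).mean()

def summarize(per_seed_values):
    """Mean, sample standard deviation, and worst case across seeds."""
    values = np.asarray(per_seed_values, dtype=float)
    return {"mean": values.mean(), "std": values.std(ddof=1), "worst": values.min()}

# Hypothetical per-seed AUROC values, for illustration only.
example = summarize([0.90, 0.80, 1.00])
```

The pairwise form uses memory proportional to the product of the two set sizes, so a sort-based implementation would be preferable for full test sets.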

A.5 Neural Collapse Measurements for NC Loss Intervention

Figure 7: Mean (solid lines) ResNet18 NC measurements from the NC loss (see Equation 3) experiment over 15 model seeds, along with CE loss and classification accuracy. Shading around the solid line shows standard deviation. Blue is the control group, orange is the intervention group. The intervention started at epoch 50, and proceeded for 20 epochs. Over the same training period, the intervention group has substantially more NC, while CE loss and training set classification accuracy are relatively unchanged. This indicates that NC and CE loss effects were controlled successfully.

A.6 Additional Figures

Figure 8: Building intuition about embedding spaces: Training models for low cross-entropy or high accuracy does not necessarily structure embedding space in an optimal way for OoD detection. Left: MNIST training features in a 2-dimensional feature space. The only structure explicitly required by cross-entropy loss is that features are linearly separable. Right: CIFAR10 training features in a 2-dimensional feature space. The network finds it more difficult to make these more complex features linearly separable, but even with spectral normalization, the structure is still arbitrary. In both cases, the within-class variance could be reduced substantially, preserving appropriate notions of sensitivity and smoothness while not affecting classification. These tighter class means could be separated to allow OoD features more inter-class distance to fall away from fitted GMM modes.


This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.


