
ADA vs C-Mixup: Performance on California and Boston Housing Datasets

by Anchoring
November 14th, 2024
Too Long; Didn't Read

Experiments on the California and Boston housing datasets show ADA outperforming C-Mixup in low-data settings for nonlinear regression. As data availability increases, the performance gap narrows, suggesting a balance between original data and augmented samples for optimal generalization.

STORY’S CREDIBILITY

Academic Research Paper

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Authors:

(1) Nora Schneider, Computer Science Department, ETH Zurich, Zurich, Switzerland (nschneide@student.ethz.ch);

(2) Shirin Goshtasbpour, Computer Science Department, ETH Zurich, Zurich, Switzerland and Swiss Data Science Center, Zurich, Switzerland (shirin.goshtasbpour@inf.ethz.ch);

(3) Fernando Perez-Cruz, Computer Science Department, ETH Zurich, Zurich, Switzerland and Swiss Data Science Center, Zurich, Switzerland (fernando.perezcruz@sdsc.ethz.ch).

Abstract and 1 Introduction

2 Background

2.1 Data Augmentation

2.2 Anchor Regression

3 Anchor Data Augmentation

3.1 Comparison to C-Mixup and 3.2 Preserving nonlinear data structure

3.3 Algorithm

4 Experiments and 4.1 Linear synthetic data

4.2 Housing nonlinear regression

4.3 In-distribution Generalization

4.4 Out-of-distribution Robustness

5 Conclusion, Broader Impact, and References


A Additional information for Anchor Data Augmentation

B Experiments

4.2 Housing nonlinear regression

We extend the results from the previous section to the California and Boston housing data and compare ADA to C-Mixup [49]. We also repeat the same experiments on three different regression datasets; those results, provided in Appendix B.2, likewise show the superiority of ADA over C-Mixup for data augmentation in the implemented experimental setup.


Figure 2: Mean squared error for the ridge regression model and the MLP model with a varying number of training samples. For ridge regression, vanilla augmentation and C-Mixup generate k = 10 augmented observations per observation. Similarly, anchor augmentation generates k = 10 augmented observations per observation with parameter α = 10.


Data: We use the California housing dataset [19] and the Boston housing dataset [14]. The training dataset contains up to n = 406 samples; the remaining samples are held out for validation. We report the results as a function of the number of training points.
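To make the setup concrete, here is a minimal sketch of the data preparation, assuming scikit-learn's fetch_california_housing; note that the Boston dataset was removed from recent scikit-learn releases and would need to be loaded from its original source. The split below mirrors the paper's maximum of n = 406 training samples; the random seed is illustrative.

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)

# Hold out everything beyond the first n_train shuffled samples
# for validation, as in the paper's setup.
rng = np.random.default_rng(0)
n_train = 406  # maximum training size used in the paper
idx = rng.permutation(len(X))
X_train, y_train = X[idx[:n_train]], y[idx[:n_train]]
X_val, y_val = X[idx[n_train:]], y[idx[n_train:]]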


Models and comparisons: We fit a ridge regression model (baseline) and train an MLP with one hidden layer of varying width and sigmoid activation. The baseline models use only the original data. We train the same models using C-Mixup with a Gaussian kernel and a bandwidth of 1.75. We compare these approaches to models fitted on ADA-augmented data. We generate 20 different augmentations per original observation using different values of γ, controlled via α = 4, similar to what was described in Section 4.1. The anchor matrix is constructed using k-means clustering with q = 10.
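As a rough illustration of the augmentation just described (not the authors' code), the sketch below builds the anchor matrix from one-hot k-means cluster assignments, so that the projection Π_A maps each sample to its cluster mean, and applies the anchor transform x̃ = x + (√γ − 1) Π_A x from Section 3 to both features and targets. The function name ada_augment and the log-spaced γ schedule over [1/α, α] are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def ada_augment(X, y, q=10, alpha=4.0, n_aug=20, seed=0):
    """Illustrative sketch of Anchor Data Augmentation: the anchor
    matrix is one-hot k-means memberships, so Pi_A replaces each
    sample by its cluster mean."""
    labels = KMeans(n_clusters=q, n_init=10, random_state=seed).fit_predict(X)

    # Pi_A applied to X (resp. y): each row becomes its cluster mean.
    X_proj = np.zeros_like(X, dtype=float)
    y_proj = np.zeros_like(y, dtype=float)
    for c in range(q):
        mask = labels == c
        X_proj[mask] = X[mask].mean(axis=0)
        y_proj[mask] = y[mask].mean()

    X_aug, y_aug = [], []
    # gamma values spread on a log scale in [1/alpha, alpha];
    # the paper's exact schedule may differ (assumption).
    for gamma in np.logspace(-np.log10(alpha), np.log10(alpha), n_aug):
        s = np.sqrt(gamma) - 1.0  # anchor transform: x + s * Pi_A x
        X_aug.append(X + s * X_proj)
        y_aug.append(y + s * y_proj)
    return np.vstack(X_aug), np.concatenate(y_aug)

With q = 10 clusters, α = 4, and 20 augmentations per observation, this matches the configuration described above; at γ = 1 the transform leaves the samples unchanged, while γ ≠ 1 amplifies or shrinks the cluster-mean component.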


Results: We report the results in Figure 3. First, we observe that the MLPs outperform ridge regression, suggesting a nonlinear data structure. Second, when the number of training samples is low, applying ADA improves the performance of all models compared to C-Mixup and the baseline. The performance gap decreases as the number of samples increases. Comparing C-Mixup and ADA, we see that with sufficiently many samples both methods achieve similar performance. While on the Boston data the performance gap between the baseline and ADA persists, on California housing the non-augmented model performs better than the augmented one as data availability increases. This suggests that there is a sweet spot beyond which the addition of original data samples is required for better generalization, and augmented samples cannot contribute any further.
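The evaluation protocol behind Figure 3 can be sketched as follows (an illustrative reconstruction, not the authors' script): for each training-set size, fit the model on original or augmented data and average the validation MSE over repeated splits. The hidden-layer width, size grid, and iteration budget are placeholders.

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

def mse_curve(X, y, sizes=(50, 100, 200, 406), n_splits=10, augment=None):
    """Average validation MSE over repeated splits for each training size."""
    results = {}
    for n in sizes:
        errors = []
        for split in range(n_splits):
            X_tr, X_val, y_tr, y_val = train_test_split(
                X, y, train_size=n, random_state=split)
            if augment is not None:  # e.g. the ada_augment sketch above
                X_tr, y_tr = augment(X_tr, y_tr)
            mlp = MLPRegressor(hidden_layer_sizes=(64,), activation="logistic",
                               max_iter=2000, random_state=split)
            mlp.fit(X_tr, y_tr)
            errors.append(mean_squared_error(y_val, mlp.predict(X_val)))
        results[n] = float(np.mean(errors))
    return results

Comparing mse_curve(X, y) against mse_curve(X, y, augment=ada_augment) would correspond to the baseline-versus-ADA comparison plotted in Figure 3.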


Figure 3: MSE for the housing datasets, averaged over 10 different train-validation-test splits. On California housing, ridge regression performs much worse, which is why it is not considered further (see Appendix B.2).


This paper is available on arXiv under the CC0 1.0 DEED license.

