
Researchers Invent Lightning-Fast AI Boost for Small, Complex Datasets

by Procrustes Technologies

January 27th, 2025
Too Long; Didn't Read

Researchers have developed a new method to generate additional data points by utilizing cross-validation resampling and latent variable modeling to train artificial intelligence.
STORY’S CREDIBILITY

Academic Research Paper

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Authors:

(1) Sergey Kucheryavskiy, Department of Chemistry and Bioscience, Aalborg University, Corresponding author (svk@bio.aau.dk);

(2) Sergei Zhilin, CSort, LLC., Germana Titova st. 7, Barnaul, 656023, Russia, Contributing author (szhilin@gmail.com).

Editor's note: This is Part 4 of 4 of a study detailing a new method for the augmentation of numeric and mixed datasets. Read the rest below.

  • Abstract and 1 Introduction
  • 2 Methods
    • 2.1 Generation of PV-sets based on Singular Value Decomposition
    • 2.2 Generation of PV-sets based on PLS decomposition
  • 3 Results
    • 3.1 Datasets
    • 3.2 ANN regression of Tecator data
    • 3.3 ANN classification of Heart data
  • 4 Discussion
  • 5 Conclusions and References

4 Discussion

The experimental results confirm the benefits of PV-set augmentation; however, optimization of the ANN learning parameters is needed to make the benefits significant. At the same time, optimization of the PV-set generation algorithm is not necessary in most cases. Based on our experiments, we advise using cross-validation resampling with 5 or 10 splits and a number of latent variables large enough to capture the majority of the variation in X. In some specific cases one can use the tools for quality control of generated PV-sets described in [10].
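The "large enough to capture the majority of variation in X" advice can be made concrete with the singular values of the centered data matrix. A minimal sketch (the 95% threshold is an illustrative choice, not a value from the paper):

```python
import numpy as np

def n_components_for_variance(X, threshold=0.95):
    """Return the number of SVD components needed to capture
    `threshold` of the total variance in column-centered X."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    explained = s**2 / np.sum(s**2)
    return int(np.searchsorted(np.cumsum(explained), threshold) + 1)

# Example: data with 3 informative directions plus small noise
rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 3))
loadings = rng.normal(size=(3, 20))
X = scores @ loadings + 0.01 * rng.normal(size=(100, 20))
print(n_components_for_variance(X))  # → 3
```

The same cumulative-variance curve is what a scree plot shows; picking the elbow by eye and picking a threshold programmatically usually agree for strongly collinear data.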


It must also be noted that PV-set augmentation is not always beneficial. According to our experiments (not reported here), for methods that are robust to overfitting, such as random forest (RF), artificially increasing the training set does not have a significant effect on model performance. In the case of eXtreme Gradient Boosting, changing the training parameters that regulate overfitting, such as the learning rate, maximum depth, and minimum sum of instance weights, can have an effect, but in our experiments the observed effect was mostly marginal.
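The three parameters mentioned map onto `learning_rate`, `max_depth` and `min_child_weight` in the common xgboost Python API. A hypothetical configuration sketch; the values are illustrative, not the authors' settings:

```python
# Hypothetical XGBoost parameters that regulate overfitting.
# Names follow the xgboost Python package; values are illustrative only.
xgb_params = {
    "learning_rate": 0.1,    # shrinkage applied to each boosting step
    "max_depth": 4,          # maximum depth of each tree
    "min_child_weight": 5,   # minimum sum of instance weights in a leaf
    "n_estimators": 200,     # number of boosting rounds
}
print(sorted(xgb_params))
```

Lowering the learning rate, limiting depth, and raising the minimum child weight all constrain how closely the ensemble can fit the training set, which is why augmenting that set tends to matter less for such models.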

5 Conclusions

This paper proposes a new method for data augmentation. The method is particularly beneficial for datasets with a moderate to high degree of collinearity, as it directly exploits this feature in the generation algorithm.
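Collinearity, the property the method exploits, can be gauged quickly from the singular values of the centered data matrix. A minimal diagnostic sketch (the tolerance is an illustrative choice, not from the paper):

```python
import numpy as np

def condition_number(X):
    """Ratio of largest to smallest retained singular value of centered X;
    large values indicate strong collinearity among columns."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    s = s[s > 1e-12 * s[0]]  # drop numerically-zero singular values
    return s[0] / s[-1]

rng = np.random.default_rng(1)
independent = rng.normal(size=(50, 5))          # nearly orthogonal columns
collinear = independent @ rng.normal(size=(5, 5))
collinear[:, 0] = collinear[:, 1] + 1e-6 * rng.normal(size=50)
print(condition_number(independent) < condition_number(collinear))  # → True
```

A condition number near 1 means the columns carry independent information; values in the thousands or beyond signal the kind of collinear structure this augmentation method is designed for.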


Two proposed implementations of the method (SVD and PLS based) cover most common data analysis tasks, such as regression, discrimination and one-class classification (authentication). Both implementations are very fast — the generation of a PV-set for an X of size 200×500 with 20 latent variables and 10 cross-validation segments takes several seconds (less than a second on a powerful PC), much less than the training of an ANN model with several layers.
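The speed claim is plausible given the cost of the underlying decomposition. A rough sketch, using NumPy's thin SVD on the matrix size quoted above (this is not the authors' implementation, only a scale check on the dominant step):

```python
import time
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 500))  # same shape as in the text

t0 = time.perf_counter()
U, s, Vt = np.linalg.svd(X, full_matrices=False)
elapsed = time.perf_counter() - t0

# Keep only the first 20 latent variables, as in the quoted example
T = U[:, :20] * s[:20]   # scores, shape (200, 20)
P = Vt[:20]              # loadings, shape (20, 500)
print(T.shape, P.shape, f"SVD took {elapsed:.3f}s")
```

One such decomposition per cross-validation segment (10 here) still costs a small fraction of a second on current hardware, consistent with the timing reported above.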


The method works with datasets of small size (from tens of observations) and can be used for both numeric and mixed datasets, where one or several variables are categorical.
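For mixed datasets, the usual route is to encode categorical variables numerically before applying a numeric augmentation method. A minimal one-hot sketch (an assumption about preprocessing, not the paper's own procedure):

```python
import numpy as np

def one_hot(labels):
    """Encode a 1-D sequence of categorical labels as a 0/1 matrix."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(labels), len(categories)))
    for row, lab in enumerate(labels):
        out[row, index[lab]] = 1.0
    return out, categories

numeric = np.array([[120.0, 1.2], [140.0, 2.3], [130.0, 0.8]])
sex = ["male", "female", "male"]          # hypothetical categorical variable
encoded, cats = one_hot(sex)
X = np.hstack([numeric, encoded])         # mixed matrix ready for augmentation
print(X.shape, cats)  # → (3, 4) ['female', 'male']
```

After augmentation, the generated rows can be mapped back to categories by taking the argmax over each group of indicator columns.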

References

[1] Ratner, A. J., Ehrenberg, H. R., Hussain, Z., Dunnmon, J. & Ré, C. Learning to compose domain-specific transformations for data augmentation (2017). arXiv:1709.01643.


[2] Goodfellow, I. J. et al. Generative adversarial networks (2014). arXiv:1406.2661.


[3] Dao, T. et al. A kernel theory of modern data augmentation (2019). arXiv:1803.06084.


[4] Perez, E. & Ventura, S. Progressive growing of generative adversarial networks for improving data augmentation and skin cancer diagnosis. Artificial Intelligence in Medicine 141, 102556 (2023). URL https://www.sciencedirect.com/science/article/pii/S0933365723000702.


[5] Perez, F., Vasconcelos, C., Avila, S. & Valle, E. Data augmentation for skin lesion analysis. In Stoyanov, D. et al. (eds) OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis, 303–311 (Springer International Publishing, Cham, 2018).


[6] Iglesias, G., Talavera, E., Gonzalez-Prieto, A., Mozo, A. & Gomez-Canaval, S. Data augmentation techniques in time series domain: a survey and taxonomy. Neural Computing and Applications 35, 10123–10145 (2023). URL https://doi.org/10.1007/s00521-023-08459-3.


[7] Saiz-Abajo, M., Mevik, B.-H., Segtnan, V. & Næs, T. Ensemble methods and data augmentation by noise addition applied to the analysis of spectroscopic data. Analytica Chimica Acta 533, 147–159 (2005). URL https://www.sciencedirect.com/science/article/pii/S000326700401428X.


[8] Chadebec, C. & Allassonnière, S. Data augmentation with variational autoencoders and manifold sampling (2021). arXiv:2103.13751.


[9] Kucheryavskiy, S., Zhilin, S., Rodionova, O. & Pomerantsev, A. Procrustes cross-validation—a bridge between cross-validation and independent validation sets. Analytical Chemistry 92, 11842–11850 (2020).


[10] Kucheryavskiy, S., Rodionova, O. & Pomerantsev, A. Procrustes cross-validation of multivariate regression models. Analytica Chimica Acta 1255, 341096 (2023). URL https://www.sciencedirect.com/science/article/pii/S0003267023003173.


[11] de Jong, S. SIMPLS: an alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems 18, 251–263 (1993).


[12] Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library, 8024–8035 (Curran Associates, Inc., 2019). URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.


[13] Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020). URL https://doi.org/10.1038/s41586-020-2649-2.


[14] Borggaard, C. & Thodberg, H. H. Optimal minimal neural interpretation of spectra. Analytical Chemistry 64, 545–551 (1992).


[15] Detrano, R. et al. International application of a new probability algorithm for the diagnosis of coronary artery disease. The American Journal of Cardiology 64, 304–310 (1989). URL https://www.sciencedirect.com/science/article/pii/0002914989905249.


[16] Detrano, R. et al. Bayesian probability analysis: a prospective demonstration of its clinical utility in diagnosing coronary disease. Circulation 69, 541–547 (1984).


[17] Janosi, A., Steinbrunn, W., Pfisterer, M. & Detrano, R. Heart Disease Dataset (1988).


This paper is available on arxiv under CC BY 4.0 DEED license.

