Procrustes' method aligns and adjusts, making data conform, with precision and control, in the realm of math and shape.
Authors:
(1) Sergey Kucheryavskiy, Department of Chemistry and Bioscience, Aalborg University, corresponding author (svk@bio.aau.dk);
(2) Sergei Zhilin, CSort, LLC, Germana Titova st. 7, Barnaul, 656023, Russia, contributing author (szhilin@gmail.com).
Editor's note: This is Part 2 of 4 of a study detailing a new method for the augmentation of numeric and mixed datasets. Read the rest below.
This chapter presents the general idea behind the Procrustes cross-validation approach, provides the necessary mathematical background and describes two implementations of the approach in detail.
Let {X, Y} be two matrices of size I × J and I × M respectively, representing the original training set to be augmented. Columns of matrix X are predictors — variables measured or observed directly, for example, light absorbance at different wavelengths in the case of spectra, abundance of operational taxonomic units (OTUs) in the case of 16S rRNA sequencing, or results of clinical tests for patients. If the original dataset has qualitative predictors, they should be replaced by corresponding dummy variables, which are combined with the quantitative predictors to form X.
Columns of matrix Y are responses — variables whose values should be predicted. In some situations matrix Y is not needed at all; for example, when developing a one-class classifier, only the matrix with predictors is required.
To augment the data, a dedicated model should be developed. It aims at capturing the variance-covariance structure of X, or the part of it directly related to the variance of the values in Y. This model will be referred to as the PV-model and denoted as M.
The selection of the method for creating the PV-model depends on the objectives. Thus, if data augmentation is needed for developing a one-class classifier, where the matrix with responses, Y, is not needed, we suggest using singular value decomposition (SVD), which is efficient in capturing the whole variance-covariance structure of X.
In the case of augmenting data for discrimination models, there are two solutions. The first is to consider each class as an independent population and use SVD decomposition to capture the variance-covariance structure of X within each class. The second is to create a matrix of dummy responses, Y, where each column corresponds to a particular class, and employ PLS decomposition. More details about using SVD and PLS decompositions are provided in the next two sections.
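For illustration, here is a minimal Python sketch (ours, not part of the original study) of how such a dummy response matrix Y could be built from class labels for the PLS-based variant; the labels and variable names are invented for the example:

```python
import numpy as np

# Hypothetical class labels for I = 6 training objects
labels = np.array(["healthy", "healthy", "diseased", "control", "diseased", "control"])

# Dummy response matrix Y: one column per class,
# 1 if the object belongs to that class, 0 otherwise
classes = np.unique(labels)                              # column order of Y
Y = (labels[:, None] == classes[None, :]).astype(float)  # I x number-of-classes

print(classes)  # ['control' 'diseased' 'healthy']
print(Y)        # 6 x 3 matrix of zeros and ones
```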
A general definition of the approach can be formulated as follows.
Let us create a global PV-model M by fitting all data points from the training set {X, Y}. If this model is then applied to a new dataset, {X∗, Y∗}, it will result in a set of outcomes R∗:
M(X∗, Y∗) → R∗
The nature of the outcomes depends on the method used for developing M and can include, for example, predicted response values in the case of PLS decomposition, or scores and distances in the case of SVD decomposition.
Using cross-validation with random splits enables the generation of a very large number of PV-sets, which hold the properties described above but differ from each other. Combining the generated PV-sets with the original training data results in the augmented training set we are aiming at.
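As a rough illustration of this idea (not the authors' code), the augmentation loop could look like the sketch below; `generate_pv_set` stands for a hypothetical function implementing one round of the PV-set generation described in the next two sections, and `n_rounds`, `K` and `seed` are names we introduce for the example:

```python
import numpy as np

def augment(X, Y, generate_pv_set, n_rounds=100, K=4, seed=0):
    """Stack the original training set with PV-sets produced by
    repeated cross-validation with random splits.

    `generate_pv_set(X, Y, splits)` is assumed to return one
    pseudo-validation set (X_pv, Y_pv) of the same size as (X, Y).
    """
    rng = np.random.default_rng(seed)
    X_aug, Y_aug = [X], [Y]
    for _ in range(n_rounds):
        # random assignment of the I objects to K cross-validation segments
        splits = rng.permutation(len(X)) % K
        X_pv, Y_pv = generate_pv_set(X, Y, splits)
        X_aug.append(X_pv)
        Y_aug.append(Y_pv)
    return np.vstack(X_aug), np.vstack(Y_aug)
```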
The next two sections describe the details of PV-set generation using SVD and PLS decompositions. More information about the approach can be found in [10].
In the case of truncated SVD with A latent variables, the PV-model M consists of:
• number of latent variables (singular vectors), A
• 1 × J row-vector for mean centering, m
• 1 × J row-vector for standardization (if required), s
• J × A matrix with right singular vectors, V
• 1 × A vector with corresponding singular values, σ.
The SVD decomposition of X is defined as:
X = UΣVᵀ + E
Here, U is an I × A matrix with left singular vectors and Σ is an A × A diagonal matrix with singular values σ = {σ1, ..., σA}. Element E is an I × J matrix with residuals, which can be used to estimate the lack of fit both for individual data points and for the whole dataset. For the sake of simplicity we assume that columns of X are already mean centered and standardized.
The right singular vectors V act as an orthonormal basis of the A-dimensional latent variable space located inside the original J-dimensional variable space (A ≤ J). The product of the left singular vectors and the singular values, T = UΣ, forms the scores — the coordinates of the data points from X projected onto the latent variable space.
The expression can also be written as:
X = TVᵀ + E
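A minimal numpy sketch of how the components of such a PV-model (m, s, V, σ and the scores T) could be computed for a given A; this is our own illustration under the assumptions above, not the authors' implementation:

```python
import numpy as np

def svd_pv_model(X, A, standardize=True):
    """Fit a truncated SVD model with A latent variables."""
    m = X.mean(axis=0)                                  # 1 x J vector for mean centering
    s = X.std(axis=0, ddof=1) if standardize else np.ones(X.shape[1])
    Xp = (X - m) / s                                    # preprocessed data

    U, sigma, Vt = np.linalg.svd(Xp, full_matrices=False)
    U, sigma, V = U[:, :A], sigma[:A], Vt[:A].T         # keep A components

    T = U * sigma                                       # scores, T = U @ diag(sigma)
    E = Xp - T @ V.T                                    # residuals
    return dict(A=A, m=m, s=s, V=V, sigma=sigma, T=T, E=E)
```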
Any new data point, x = {x1, ..., xJ}, can be projected onto the model M by mean centering and (if required) standardizing its values using m and s, and then projecting the result onto the latent variable space defined by the columns of V:
t = xV
The score vector t can be normalized to obtain the corresponding left singular vector, u:
u = tΣ⁻¹
The explained part of x can be computed as:
x̂ = tVᵀ
The residual part is:
e = x − x̂
The squared Euclidean norm of the residuals gives the orthogonal distance, q = eeᵀ.
And a squared Mahalanobis distance, h, between the projection of the point in the latent variable space and the origin:
h = uuᵀ
The first distance, q, is a measure of lack of fit for the point, while the second distance, h, is a measure of extremeness of the point (as the majority of the data points will be located around the origin due to mean centering).
Hence we meet the general Procrustes rule requirement defined by Equation 4 with the outcomes R consisting of the two distances. Moreover, this rule holds true for the distances computed using any number of latent variables a = 1, ..., A.
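A short numpy sketch (our illustration, reusing the model dictionary from the snippet above) of projecting a new data point and computing the two distances; the scaling of h follows the normalized scores u, which may differ by a constant factor from other software conventions:

```python
import numpy as np

def project(model, x):
    """Project a new point onto the SVD PV-model and compute q and h."""
    xp = (x - model["m"]) / model["s"]      # center and standardize
    t = xp @ model["V"]                     # scores, t = xV
    u = t / model["sigma"]                  # normalized scores (left singular vector)
    x_hat = t @ model["V"].T                # explained part of x
    e = xp - x_hat                          # residual part
    q = float(e @ e)                        # orthogonal (lack-of-fit) distance
    h = float(u @ u)                        # score (Mahalanobis-type) distance
    return t, q, h
```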
The generation of the PV-set in the case when A < rank(X) is similar but requires an additional step to maintain the rule defined by Equation 14; further details can be found in [10].
Repeating the generation using random cross-validation splits provides a large number of unique PV-sets that can be used to augment the original training set. Therefore, the SVD-based data augmentation procedure has two parameters:
• number of segments, K
• number of latent variables, A
As will be shown in the results section, neither parameter significantly influences the quality of the augmented data. In the case of the number of latent variables, A, any number large enough to capture the systematic variation in X will work well. The SVD-based PV-set generation does not suffer from overfitting, so optimization of A is not needed.
Partial least squares (PLS) is a decomposition used for solving the multiple linear regression problem in the case when the predictors (and the responses, if there are several) are collinear. The decomposition can be expressed by the following set of equations:
In the case of one response, matrix Y can be replaced by a column vector y containing the I response values. Then Equation 18 can be rewritten in the following compact form:
In contrast to SVD, the latent variables in PLS are oriented along the directions in the column space of X that give the largest covariance between the scores (columns of matrix T) and the values of y. Hence, by using PLS, we can prioritize the part of the variance-covariance structure of X directly related to y.
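To make the contrast concrete, here is a small sketch (our own, using scikit-learn rather than the authors' code, with invented data) that fits a PLS model with one response and shows how its scores carry the y-related variation:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
factors = rng.normal(size=(30, 3))                  # 3 underlying factors, 30 objects
X = factors @ rng.normal(size=(3, 10))              # 30 x 10 matrix of collinear predictors
X += 0.05 * rng.normal(size=X.shape)                # small measurement noise
y = factors[:, 0] + 0.1 * rng.normal(size=30)       # response driven by the first factor

A = 3                                               # number of latent variables
pls = PLSRegression(n_components=A, scale=True).fit(X, y)

T = pls.x_scores_                                   # I x A matrix of X-scores
# Covariance of each score column with y: the leading components
# capture the y-related part of the variance-covariance structure of X
print([float(np.cov(T[:, a], y)[0, 1]) for a in range(A)])
```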
The difference between the original and estimated response values can be defined as:
We can define the Procrustes rule for the regression as:
Taking into account the equations above, this equation can be simplified to:
or in the case of PLS:
Therefore, PV-set generation in the case of PLS has the same two parameters as the SVD-based procedure:
• number of segments, K
• number of latent variables, A
It must also be noted that the number of latent variables, A, must be limited in the case of PLS (in contrast to the SVD-based algorithm). If too many latent variables are selected, the variation in the scalar values, cₖ/c, can be very large, which leads to noisy PV-sets. This happens because the higher latent variables do not have any covariance with the response variable, y; hence, the cₖ values vary chaotically. It is recommended to keep only the latent variables for which the absolute values of cₖ/c fall within the [0, 2] range.
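A possible way to inspect this limit is sketched below (our illustration only): a global PLS model gives c, and leave-segment-out models give the cₖ values. The exact ratios depend on the normalization used by the PLS implementation (here scikit-learn), so this only shows the structure of the check, not the authors' procedure:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def ck_ratios(X, y, A, splits):
    """Illustrative check of the |c_k / c| <= 2 rule.

    `splits` assigns every object to one of K cross-validation segments.
    """
    c = PLSRegression(n_components=A).fit(X, y).y_loadings_.ravel()   # global c, length A
    ratios = []
    for k in np.unique(splits):
        mask = splits != k                                            # leave segment k out
        ck = PLSRegression(n_components=A).fit(X[mask], y[mask]).y_loadings_.ravel()
        ratios.append(ck / c)
    return np.array(ratios)   # K x A; keep components whose |ratio| stays within [0, 2]
```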
As shown in the experimental part, if this limit is introduced, the number of latent variables does not have a substantial influence on the quality of the PV-sets and does not require specific optimization. The same can be concluded about the number of segments, K.
This paper is available on arXiv under a CC BY 4.0 DEED license.