Anomaly Detection with Privileged Information—Part 2

by Dmitrii Smoliakov, December 15th, 2023

Too Long; Didn't Read

Explore the intricacies of implementing Support Vector Data Description (SVDD) and its enhanced version, SVDD+, for anomaly detection. Understand how SVDD encapsulates datasets and learn how SVDD+ integrates privileged information in the training phase. Dive into the quadratic optimization process of SVDD+ with a step-by-step guide and insights into the underlying math. Compare SVDD+ with OneClassSVM for anomaly detection and discover the performance on the kdd99 dataset. Gain practical implementation insights using the cvxopt library and navigate the world of anomaly detection with and without privileged information.

Before exploring the SVDD+ implementation, let us briefly revisit what SVDD (Support Vector Data Description) and SVDD+ entail.

Support Vector Data Description (SVDD) is a method for describing data by enclosing the target dataset within a hypersphere, which makes it suitable for one-class classification or outlier detection. SVDD+, in turn, is a novel approach that integrates privileged information into the traditional SVDD framework.

Unlike classical SVDD, which disregards privileged information often present in human learning, SVDD+ leverages it to optimize the training phase through the construction of a set of corrective functions.

Moving forward, we will look in detail at SVDD+ as a quadratic optimization, explore its implementation using the cvxopt library, and conduct a comparative analysis between SVDD+ and OneClassSVM.

Optimization problem

In the previous post, we discussed building an anomaly detection algorithm that can use privileged information available only during the training phase.

There, we showed that training requires solving this optimization problem:

Alternatively, we could solve a dual problem:

Here, K(xi, xj) is a dot product in some feature space. The dual problem turns out to be a special case of a quadratic optimization problem.

We reproduce the solution by running the following:

There are techniques designed specifically for this problem. However, in this post we will use a more generic solver, cvxopt, a free software package for convex optimization based on the Python programming language.

SVDD+ as Quadratic Optimization

Usually, quadratic optimization is written down in the following form:

Here, P is a symmetric positive semidefinite matrix. This doesn't look like what we previously had, so it will require an additional workaround.
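For reference, the standard form that cvxopt's quadratic programming solver expects can be written as follows. The symbols P, q, G, h, A and b are the same names used as dictionary keys in the code later in this post; the inequality is understood elementwise.

```latex
\begin{aligned}
\min_{x}\quad & \tfrac{1}{2}\, x^{\top} P x + q^{\top} x \\
\text{subject to}\quad & G x \preceq h, \\
& A x = b.
\end{aligned}
```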

Change of Variables

First, let's define a new x, a combination of the variables α and δ. We are going to stack them together in the following form:

We could always reconstruct the original values based on this new vector:
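The stacking and its inverse can be sketched in a few lines of NumPy. This is a minimal illustration under one assumption: judging from the constraint matrices in the code later in this post, the second half of x appears to hold α − δ. The helper names `stack_variables` and `unstack_variables` are hypothetical and are not part of the ISVDD package.

```python
import numpy as np


def stack_variables(alpha, delta):
    """Stack alpha and (alpha - delta) into one vector x of length 2l.

    Assumption: the second half stores alpha - delta, as inferred from the
    constraint matrices used later in the post.
    """
    alpha = np.asarray(alpha, dtype=float)
    delta = np.asarray(delta, dtype=float)
    return np.concatenate([alpha, alpha - delta])


def unstack_variables(x):
    """Recover (alpha, delta) from the stacked vector x."""
    x = np.asarray(x, dtype=float)
    l = x.shape[0] // 2
    alpha = x[:l]
    delta = alpha - x[l:]
    return alpha, delta


alpha = np.array([0.5, 0.3, 0.2])
delta = np.array([0.4, 0.4, 0.2])
x = stack_variables(alpha, delta)
recovered_alpha, recovered_delta = unstack_variables(x)
```

The round trip returns the original α and δ, which is all the change of variables needs to guarantee.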

Matrix P

With this new variable x, we could introduce matrix P as:

K is the matrix of pairwise kernel function values on the training set, Kij = K(xi, xj). K* is defined in the same way for the privileged information.

Now, we can write down the sum in the optimization problem as matrix multiplication:

This is precisely what we have in the dual problem formulation.
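As a quick numerical sanity check, the sketch below builds P with the same block layout as the `prepare_problem` function in this post and compares the matrix product with an explicit double sum. The random data, the helper `rbf_gram`, and the layout x = [α, α − δ] are assumptions made for illustration only.

```python
import numpy as np


def rbf_gram(points, gamma):
    """Pairwise RBF kernel matrix exp(-gamma * ||a - b||^2)."""
    squared_norms = np.sum(points ** 2, axis=1)
    squared_dists = (squared_norms[:, None] + squared_norms[None, :]
                     - 2 * points @ points.T)
    return np.exp(-gamma * np.maximum(squared_dists, 0.0))


rng = np.random.default_rng(0)
l = 4
K = rbf_gram(rng.normal(size=(l, 3)), gamma=0.5)       # ordinary features
K_star = rbf_gram(rng.normal(size=(l, 2)), gamma=0.1)  # privileged features
reg = 0.1  # privileged regularization (called gamma in prepare_problem)

zeros = np.zeros((l, l))
P = 2 * np.block([[K, zeros], [zeros, 0.5 * reg * K_star]])

alpha = rng.random(l)
delta = rng.random(l)
x = np.concatenate([alpha, alpha - delta])  # assumed layout of x

# Quadratic term in matrix form, as the solver sees it.
matrix_form = 0.5 * x @ P @ x

# The same quantity written as explicit sums over pairs of samples.
double_sum = sum(
    alpha[i] * alpha[j] * K[i, j]
    + 0.5 * reg * (alpha[i] - delta[i]) * (alpha[j] - delta[j]) * K_star[i, j]
    for i in range(l) for j in range(l)
)
```

Both values agree up to floating point error, and P is symmetric because both kernel matrices are.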

Equality Constraints

Now, we have to deal with equality constraints. We know that the sum of all α is equal to 1, so let's rewrite it as matrix multiplication:

Here, the first l elements are equal to 1, and the second l are equal to 0. This gives us the sum of the first l elements of x, which are equal to αi.

The second equality will be slightly trickier. If we take the product of x with a vector whose first l elements are equal to zero and whose second l elements are equal to one, we will get:

So, the final matrix formula for the equality constraints is:
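The equality constraints can be checked numerically with a small NumPy sketch. The rows of A below are copied from the `prepare_problem` function in this post; the example values of α and δ, and the assumption that the second half of x stores α − δ, are illustrative.

```python
import numpy as np

l = 3
alpha = np.array([0.5, 0.3, 0.2])  # sums to 1
delta = np.array([0.4, 0.4, 0.2])  # sums to 1
x = np.concatenate([alpha, alpha - delta])  # assumed layout of x

# First row: ones over the alpha part, zeros over the second part.
# Second row: ones everywhere.
A = np.vstack([
    np.concatenate([np.ones(l), np.zeros(l)]),
    np.ones(2 * l),
])
b = np.array([1.0, 1.0])

lhs = A @ x
```

With this layout, the first row returns the sum of α, and the second row returns the sum of α plus the sum of (α − δ). Both equal 1 when α and δ each sum to 1.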

Inequality Constraints

Now, we can describe inequality constraints. Notice that we could isolate α from x by multiplying it by a block matrix:

Here, we have a block matrix of size l × 2l combined from the l × l identity matrix and a zero matrix.

Isolating δ will require slightly more effort. We will combine two identity matrices to get δ:

Finally, combining it all together:
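The block matrices described above can be sketched in NumPy as follows. The rows of G and h follow the `prepare_problem` function in this post; the value of C, the example vectors, and the layout x = [α, α − δ] are assumptions chosen for illustration.

```python
import numpy as np

l = 3
C = 0.5  # illustrative upper bound; the post uses C = 1 / (l * nu)
alpha = np.array([0.5, 0.3, 0.2])
delta = np.array([0.4, 0.4, 0.2])
x = np.concatenate([alpha, alpha - delta])  # assumed layout of x

identity = np.eye(l)
zeros = np.zeros((l, l))

select_alpha = np.hstack([identity, zeros])      # l x 2l, picks alpha
select_delta = np.hstack([identity, -identity])  # l x 2l, gives alpha - (alpha - delta)

# G x <= h encodes: alpha >= 0, delta >= 0, delta <= C.
G = np.vstack([-select_alpha, -select_delta, select_delta])
h = np.concatenate([np.zeros(2 * l), np.full(l, C)])

feasible = bool(np.all(G @ x <= h + 1e-12))
```

Each group of l rows in G corresponds to one family of inequalities, which is why G has 3l rows and 2l columns.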

Using the cvxopt Library

Now we are going to use the cvxopt library. Given the matrices of ordinary and privileged features and the corresponding kernel functions, we can prepare the optimization problem with the following Python function:

import numpy as np
from cvxopt import matrix, sparse


def prepare_problem(self, X, Z, original_kernel,
                    privileged_kernel, nu, gamma):
    # X: ordinary features, Z: privileged features (one row per sample).
    # gamma is the privileged regularization parameter.
    size = X.shape[0]
    C = 1.0 / (size * nu)
    kernel_x = original_kernel(X)
    kernel_z = privileged_kernel(Z)
    zeros_matrix = np.zeros_like(kernel_x)
    # Quadratic term: block-diagonal matrix built from the two kernel matrices.
    P = matrix(2 * np.block([[kernel_x, zeros_matrix],
                             [zeros_matrix, 0.5 * gamma * kernel_z]]))
    # Linear term: kernel diagonal for the first half, zeros for the second.
    q = matrix(np.concatenate([np.diag(kernel_x), np.zeros(size)]))
    # Equality constraints A x = b.
    A = matrix([[1.] * size + [0.] * size, [1.] * size * 2]).T
    b = matrix([1., 1.])
    # Inequality constraints G x <= h, stored as a sparse matrix.
    eye = np.eye(size)
    G = sparse(matrix(np.block([[-eye, zeros_matrix],
                                [-eye, eye],
                                [eye, -eye]])))
    h = matrix([0.] * size * 2 + [C] * size)
    optimization_problem = {'P': P, 'q': q, 'G': G,
                            'h': h, 'A': A, 'b': b}
    return optimization_problem

Implementation Example

You can find the full implementation in the following repository: https://github.com/sklef/ISVDD. To install the package run: pip install git+https://github.com/sklef/ISVDD.git

kdd99 Dataset

We will use the kdd99 dataset to show how this approach works. The dataset contains information about TCP connections collected over nine weeks. The first seven weeks are used as the train set, and the other two are used as the test set. The train set and the test set have different distributions, and the test set even contains attacks that are not present in the train set.

To get data, we could use the following code:

from sklearn.datasets import fetch_kddcup99
X, y = fetch_kddcup99(return_X_y=True, as_frame=True)

Feature Types

There are three types of features in the kdd99 dataset:

The first type, basic features, is generated directly from the TCP dump. It includes the protocol type, number of sent fragments, network service on the destination, etc.
The second type, content features, is proposed by domain experts and generated from the first ones.
The third type, traffic features, is generated from the history of connections in a two-second window.

We will consider basic features as ordinary information and content and traffic features as privileged ones.

basic_features = ["duration", "src_bytes", "dst_bytes", "land", "wrong_fragment", "urgent"]
content_features = ["hot", "num_failed_logins", "logged_in", "num_compromised", "root_shell",
                    "su_attempted", "num_root", "num_file_creations", "num_shells", "num_access_files",
                    "num_outbound_cmds", "is_host_login", "is_guest_login"]
traffic_features = ["count", "serror_rate", "rerror_rate", "same_srv_rate", "diff_srv_rate",
                    "srv_count", "srv_serror_rate", "srv_rerror_rate", "srv_diff_host_rate"]


ordinary_features = X[basic_features]
privileged_features = X[traffic_features + content_features]

Before the experiment, we remove categorical features and normalize each remaining feature by subtracting its mean value and dividing by its standard deviation plus a small constant equal to 1e-5. This constant is added to prevent division by zero.

# X is the feature DataFrame returned by fetch_kddcup99 above.
ordinary_features = X[basic_features].astype(float)
ordinary_features = (ordinary_features - ordinary_features.mean()) / (ordinary_features.std() + 1e-5)
privileged_features = X[traffic_features + content_features].astype(float)
privileged_features = (privileged_features - privileged_features.mean()) / (privileged_features.std() + 1e-5)

Comparing SVDD+ and OneClassSVM

Next, we will compare SVDD+ with OneClassSVM. It's possible to prove that with the RBF kernel, OneClassSVM and SVDD produce the same result. So, by comparing with OneClassSVM, we are in effect comparing with the original SVDD. We use this substitute because, unfortunately, there are no good enough Python APIs for SVDD itself.
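A brief sketch of why the two coincide for the RBF kernel, using the standard textbook form of the SVDD dual. The notation here (α, C, γ) is assumed for illustration and is not taken from the equations of this post.

```latex
\begin{aligned}
\text{SVDD dual:}\quad
  & \max_{\alpha}\ \sum_{i} \alpha_i K(x_i, x_i)
    - \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j),
  \qquad \sum_i \alpha_i = 1,\quad 0 \le \alpha_i \le C. \\
\text{RBF kernel:}\quad
  & K(x_i, x_i) = \exp\!\left(-\gamma \lVert x_i - x_i \rVert^2\right) = 1
  \ \Longrightarrow\ \sum_i \alpha_i K(x_i, x_i) = \sum_i \alpha_i = 1. \\
\text{Hence:}\quad
  & \max_{\alpha}\ 1 - \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j)
  \iff
  \min_{\alpha}\ \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j).
\end{aligned}
```

The last expression is the objective of the one-class SVM dual, which has the same constraints when C = 1/(νl).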

Let us suppose that we have a train and a test set. We know which records correspond to an attack in the train set and want to determine which ones are malicious on the test set.

To do so, we take only normal records from the train set and fit a One-Class SVM. After that, we apply the trained model to detect elements of the test set that differ from the normal ones in the train set.

As a result, we get an anomaly score, which expresses the algorithm's confidence in its prediction. Then, we calculate the area under the precision-recall curve (average precision).
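To make the metric concrete, here is a minimal NumPy sketch of average precision: the mean of the precision values measured at the rank of each true anomaly. The function `average_precision` is a hypothetical helper written for illustration. It assumes at least one anomaly and no tied scores; the experiments in this post use scikit-learn's `average_precision_score` instead.

```python
import numpy as np


def average_precision(is_anomaly, anomaly_score):
    """Mean precision at the ranks of the true anomalies (no tie handling)."""
    is_anomaly = np.asarray(is_anomaly, dtype=bool)
    anomaly_score = np.asarray(anomaly_score, dtype=float)
    # Rank records from most to least anomalous.
    order = np.argsort(-anomaly_score, kind="stable")
    hits = is_anomaly[order]
    # Precision after each record in the ranking.
    true_positives = np.cumsum(hits)
    precision = true_positives / np.arange(1, len(hits) + 1)
    # Average the precision only over positions holding a true anomaly.
    return float(precision[hits].mean())


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(labels, scores)  # (1/1 + 2/3) / 2
```

In this example, the two anomalies are ranked first and third, so the precisions at their ranks are 1 and 2/3.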

By using cross-validation, we will also get the standard deviation of the metric estimate.

import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.model_selection import KFold


def evaluate_parameters(model, ordinary_features,
                        labels, privileged_features=None):
    splitter = KFold(n_splits=10)
    all_scores = []
    # KFold.split yields (nine folds, held-out fold). With the names below,
    # the model is fitted on the held-out fold and scored on the other nine.
    for test, train in splitter.split(labels):
        train_labels = labels.iloc[train]
        # Fit on normal records only.
        normal_data_indices = train_labels[train_labels == b"normal."].index
        train_ordinary_features = ordinary_features.loc[normal_data_indices]
        if privileged_features is not None:
            train_privileged_features = privileged_features.loc[normal_data_indices]
            model.fit(train_ordinary_features, train_privileged_features)
        else:
            model.fit(train_ordinary_features)
        # Privileged features are not used at test time.
        test_features = ordinary_features.iloc[test]
        predictions = model.decision_function(test_features)
        test_labels = labels.iloc[test] != b"normal."
        # Negate the decision function so that larger means more anomalous.
        all_scores.append(average_precision_score(test_labels, -predictions))
    return np.mean(all_scores), np.std(all_scores)

We are going to use RBF kernels with different values of the kernel width parameter γ.

We will fix parameter ν, the privileged information kernel, and the regularization.
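To illustrate what sweeping γ does, the sketch below evaluates the RBF kernel exp(−γ‖x − y‖²) for one fixed pair of points over the same logarithmic grid used in the experiment. The helper `rbf_kernel_value` and the two example points are assumptions for illustration.

```python
import numpy as np


def rbf_kernel_value(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2) for a single pair of points."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))


x = [0.0, 0.0]
y = [1.0, 1.0]  # squared distance to x is 2

gammas = np.logspace(-7, 7, 30)  # same grid as in the experiment
similarities = [rbf_kernel_value(x, y, gamma) for gamma in gammas]
```

For very small γ every pair of points looks almost identical (values near 1), and for very large γ every point is similar only to itself (values near 0), so the interesting behavior lies between the two extremes of the grid.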

from functools import partial

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import OneClassSVM
# ISVDD comes from the package installed above.

labels = y  # labels returned by fetch_kddcup99

all_gammas = np.logspace(-7, 7, 30)
ordinary_average_precision_mean = np.zeros_like(all_gammas)
ordinary_average_precision_std = np.zeros_like(all_gammas)

# Baseline: OneClassSVM on ordinary features only.
for index, gamma in enumerate(all_gammas):
    average_precision, average_precision_std = evaluate_parameters(
        OneClassSVM(gamma=gamma), ordinary_features, labels)
    ordinary_average_precision_mean[index] = average_precision
    ordinary_average_precision_std[index] = average_precision_std


privileged_average_precision_mean = np.zeros_like(all_gammas)
privileged_average_precision_std = np.zeros_like(all_gammas)

# SVDD+ with a fixed privileged kernel and fixed regularization.
for index, gamma in enumerate(all_gammas):
    priv_kernel = partial(rbf_kernel, gamma=1e-4)
    kernel = partial(rbf_kernel, gamma=gamma)
    model = ISVDD(0.1, kernel, priv_kernel, privileged_regularization=0.1,
                  tol=0.0001, max_iter=100, silent=True)
    average_precision, average_precision_std = evaluate_parameters(
        model, ordinary_features, labels, privileged_features)
    privileged_average_precision_mean[index] = average_precision
    privileged_average_precision_std[index] = average_precision_std

Conclusion

Although using privileged information leads to better performance on test data with the same feature space, the result is still worse than using all features in both the training and test phases. These results suggest that privileged information can be seen as a special case of regularization.

In addition, we looked at performance on test data for different combinations of the original features' kernel width and the privileged features' kernel width. A wider original kernel allows us to perform better with a narrower privileged kernel, which can be explained by the regularization role of the privileged information.