MEME Algorithm: Optimizing Malware Evasion Through Model Extraction and Reinforcement Learning

Written by memeology | Published 2024/04/18
Tech Story Tags: adversarial-malware | reinforcement-learning | model-extraction | model-stealing | meme-algorithm | malware-detection | malware-evasion | proximal-policy-optimization

TL;DR: MEME leverages model extraction and reinforcement learning to enhance the creation of evasive malware. It uses a limited number of queries to a black-box target model to train a surrogate model that improves evasion. With a focus on realistic running times and evaluation metrics such as the evasion rate and surrogate agreement, MEME offers a novel approach to this cybersecurity challenge.

Authors:

(1) Maria Rigaki, Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic and [email protected];

(2) Sebastian Garcia, Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic and [email protected].

Table of Links

Abstract & Introduction

Threat Model

Background and Related Work

Methodology

Experiments Setup

Results

Discussion

Conclusion, Acknowledgments, and References

Appendix

4 Methodology

The MEME algorithm combines model extraction and reinforcement learning (RL) techniques to improve the generation of evasive malware binaries. The algorithm takes advantage of the fact that the attacker fully controls the binary modifier part of the environment but does not control the target model, which is a black box. However, an attacker can use data collected during the training of the RL agent to create and train a surrogate model of the target. The algorithm is implemented as several training/testing rounds until a final reinforcement learning policy is learned that can produce adversarial binaries for a given target. Figure 2 presents a high-level description of the algorithm, and the detailed listing is presented in Algorithm 1.

Following Algorithm 1, MEME initializes a Proximal Policy Optimization (PPO) RL algorithm [35] (lines 1 and 2) to perform an initial training of the PPO policy using the target model and a limited number of steps (using Malware-Gym). The goal of this step is to store a small number of observations and labels in Dsur (line 3). Then, Dsur is combined with an auxiliary set of labeled data (Daux) that the attacker is assumed to possess, and this combined dataset is used to train a surrogate model for the model extraction attack (line 5). The surrogate then replaces the target model in a new Malware-Gym environment, which is used to train a better policy (agent) for evasion (line 6). Finally, the improved policy is evaluated against the original target model (line 7) to obtain the final evasion metrics. In this last step, we take advantage of the queries made to the target by appending their outputs to the Dsur dataset.

This loop is repeated for k rounds. During the last round, the evaluation is performed using the malware binary test set to produce the final evasion metrics (instead of the malware binary evaluation set used in the inner loop). The total number of queries to the target during training is k · n, where k is the total number of rounds and n is the number of steps (target queries) performed per round.
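
To make the round structure concrete, the sketch below shows one way the loop could be organized in Python with Stable Baselines3. The environment constructors (make_target_env, make_surrogate_env), the train_surrogate helper, and all step budgets are hypothetical placeholders for illustration, not the authors' released implementation.

```python
# Sketch of the MEME training loop (Algorithm 1). make_target_env,
# make_surrogate_env and train_surrogate are hypothetical helpers standing in
# for the paper's Malware-Gym setup; step budgets are illustrative only.
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def meme(make_target_env, make_surrogate_env, train_surrogate, d_aux,
         k=5, n_target_steps=2048, n_surrogate_steps=8192):
    d_sur = []  # (observation, hard label) pairs collected from target queries
    target_env = make_target_env(record_to=d_sur)
    agent = PPO("MlpPolicy", target_env, verbose=0)
    surrogate = None

    for _ in range(k):
        # 1) Brief training against the black-box target; every query is logged in d_sur.
        agent.set_env(target_env)
        agent.learn(total_timesteps=n_target_steps)

        # 2) Model extraction: fit a surrogate on the logged queries plus the auxiliary set.
        surrogate = train_surrogate(d_sur + d_aux)

        # 3) Longer training against the surrogate, which is free to query.
        agent.set_env(make_surrogate_env(surrogate))
        agent.learn(total_timesteps=n_surrogate_steps)

        # 4) Evaluate against the real target; the logging env appends these queries to d_sur too.
        evaluate_policy(agent, target_env, n_eval_episodes=100)

    return agent, surrogate
```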

For learning the policy, we use the PPO algorithm, a model-free, on-policy gradient method that clips the objective function to limit how much the policy can change in a single update, which stabilizes training. We used the Stable Baselines3 [31] implementation of PPO, which uses an actor-critic architecture.
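
As a minimal usage illustration, a PPO learner of this kind can be configured as follows; the hyperparameter values are placeholders rather than the settings reported in the paper, and make_target_env is the same hypothetical constructor used in the sketch above.

```python
# Illustrative PPO configuration with Stable Baselines3; values are placeholders.
from stable_baselines3 import PPO

env = make_target_env(record_to=[])  # hypothetical Malware-Gym style environment
agent = PPO(
    policy="MlpPolicy",   # actor-critic MLP over the extracted feature vector
    env=env,
    clip_range=0.2,       # PPO clipping threshold that bounds each policy update
    n_steps=2048,         # rollout length collected before every update
    verbose=0,
)
agent.learn(total_timesteps=10_000)
```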

Training the surrogate model on Dsur alone, i.e., only on the data collected during the learning and evaluation of the agent, is insufficient. The number of samples is small because we aim to minimize the queries to the target, and all observations come from the features extracted from the malicious binary dataset or their adversarial modifications. Using an auxiliary dataset is therefore necessary to learn a good surrogate. This dataset is assumed to come from a distribution similar to the target's training data; however, this is hard to achieve when the target is an AV or an unknown system.
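
As an illustration of the surrogate-fitting step, the following sketch pools Dsur with the auxiliary set and fits a gradient-boosted classifier. LightGBM is assumed here only because the targets are LightGBM-based; the exact surrogate model and hyperparameters used in the paper may differ.

```python
# Sketch of surrogate training: pool the few (observation, hard-label) pairs
# queried from the target (d_sur) with the larger auxiliary labelled set (d_aux).
import numpy as np
import lightgbm as lgb

def train_surrogate(d_sur, d_aux):
    X_sur, y_sur = map(np.asarray, zip(*d_sur))   # features/labels from target queries
    X_aux, y_aux = map(np.asarray, zip(*d_aux))   # attacker's auxiliary labelled data
    X = np.vstack([X_sur, X_aux])
    y = np.concatenate([y_sur, y_aux])
    clf = lgb.LGBMClassifier(n_estimators=500, num_leaves=64)  # illustrative settings
    clf.fit(X, y)
    return clf
```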

Adaptations to the Gym Environment

MEME was implemented around the Malware-Gym environment (Section 3.1) in its latest version. Several modifications were applied to the environment to fit the assumptions and constraints for this work:

– The "modify machine type" action was removed because our tests showed that it produces invalid binaries for Windows 10 systems.

– All targets were set to return hard labels (0 or 1), not scores.

– Regarding the benign sections, the Malware-Gym implementation used only data from ".text" sections. In our implementation, we use data from other sections if the ".text" section is unavailable.

– The latest environment supports the Ember and Sorel-LGB classifiers as targets. We added new environments to support the Sorel-FFNN, surrogate, and AV targets. The AV requires a web service in a virtual machine that invokes the AV static scanning capabilities.

– For all target environments, we added support for saving the observations (features) and scores during training and evaluation runs so that they can be used for training the surrogate (a minimal logging wrapper is sketched below).
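
As an illustration of that last modification, a thin wrapper of the following shape could record each query. It uses the Gymnasium API and assumes the underlying environment exposes the target's label in the info dictionary, so it is a sketch rather than the released code.

```python
# Illustrative logging wrapper; the "label" key in `info` is an assumption.
import gymnasium as gym

class QueryLogger(gym.Wrapper):
    """Records every (observation, target label) pair seen during rollouts."""

    def __init__(self, env, store):
        super().__init__(env)
        self.store = store  # shared list, e.g. d_sur

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Assumes the wrapped environment reports the target's hard label in `info`.
        self.store.append((obs, info.get("label")))
        return obs, reward, terminated, truncated, info
```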

Our version of the environment and the experiments performed as part of this work can be found at https://github.com/stratosphereips/meme_malware_rl/releases/tag/v1.0. Please note that we release the source code of our implementation for reproducibility and further improvement; however, we do not release the trained models or agents, to avoid potential misuse.

4.1 Evasion Evaluation

To evaluate the performance of the evasion methods reasonably and realistically, we limited both the number of queries they make to the target model and how long they run. The idea behind limiting the running time is that malware authors create new malware at a high rate, and a method that takes too long to create evasive malware is impractical. Industry measurements such as those provided by AV-ATLAS [18] report that 180 new malware samples are generated per minute. VirusTotal [41] provides an even higher measurement of 560 distinct new files uploaded per minute as of the 21st of May, 2023. These global measurements vary, but they show how quickly new malware variants appear, possibly due to the proliferation of Malware-as-a-Service frameworks. A recent and more conservative measurement from BlackBerry reports that their customers are attacked by 1.5 new malware samples per minute [39]. Given the above measurements, we decided to constrain the running time of each experiment to four hours in total. Given that the test set consists of 300 malware binaries, this corresponds to 1.25 processed binaries per minute, which is lower than the most conservative reported metric we could find.

The primary metric used to evaluate the malware evasion task was the evasion rate E, defined as the fraction of malware binaries that become evasive within the four-hour time window over the total number of malware binaries initially detected by each target: E = n_ev / n_det.

We also report the average number of binary modifications required for a malware binary to evade the target. For the Random, PPO, and MEME methods, this is equivalent to the mean episode length over all the detected malware binaries in the test set. The episode length of a non-evasive binary is equal to the maximum number of attempts, while for an evasive binary it is the number of changes required to become evasive. For the MAB framework, for evasive binaries we considered only the minimal binaries, i.e., those produced with the fewest actions. The average number of binary modifications could not be measured for the GAMMA attack because the SecML framework (through which GAMMA was implemented) does not provide a straightforward way to measure it.
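
Given episode records of the form (evaded, number of modifications) for the initially detected binaries, both metrics reduce to a few lines; this record format is an assumption made here for illustration.

```python
# Sketch of the evasion metrics. `episodes` is a list of (evaded: bool,
# n_modifications: int) records for the initially detected test binaries.
def evasion_metrics(episodes, max_attempts):
    n_det = len(episodes)
    n_ev = sum(1 for evaded, _ in episodes if evaded)
    evasion_rate = n_ev / n_det  # E = n_ev / n_det
    # Non-evasive binaries count as the maximum number of attempts.
    mean_episode_length = sum(
        n_mods if evaded else max_attempts for evaded, n_mods in episodes
    ) / n_det
    return evasion_rate, mean_episode_length
```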

4.2 Surrogate Evaluation

The surrogate models trained in MEME were evaluated using two different metrics: label agreement and explainability-based feature agreement [37]. The label agreement of two models f and f̂ is defined as the average number of matching predictions over a test set X_test, and it is a standard metric for model extraction fidelity attacks [19]:

Agreement(f, f̂) = (1 / |X_test|) Σ_{x ∈ X_test} 1[f(x) = f̂(x)]
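
A minimal sketch of this metric, assuming both models expose a scikit-learn style predict() that returns hard labels:

```python
# Label agreement between a target f and a surrogate f_hat over a held-out set.
import numpy as np

def label_agreement(f, f_hat, X_test):
    return float(np.mean(f.predict(X_test) == f_hat.predict(X_test)))
```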

The feature agreement metric computes the fraction of common features between the sets of top-k features of two explanations. Given two explanations Et (target) and Es (surrogate), the feature agreement metric can be formally defined as:

FeatureAgreement(Et, Es, k) = |top_features(Et, k) ∩ top_features(Es, k)| / k

where top_features(E, k) returns the set of top-k features of the explanation E, ranked by the magnitude of the feature-importance values. The maximum value of the feature agreement is 1. For this work, we measured the feature agreement for k = 10 and k = 20, and the explainability method used was SHAP (SHapley Additive exPlanations) [25]. SHAP explains machine learning model outputs by attributing them to input features using Shapley values from cooperative game theory. SHAP is model-agnostic and has been used effectively in the past for adversarial malware creation [33]. For the LightGBM targets we used TreeSHAP, which is designed specifically for tree models, while for Sorel-FFNN we used the KernelSHAP variant.
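
The sketch below computes the feature agreement from mean absolute SHAP values, assuming two tree-based (LightGBM) models so that TreeExplainer applies; for the Sorel-FFNN target a KernelExplainer-based variant would be needed, as noted above.

```python
# Sketch of top-k feature agreement using SHAP values for tree-based models.
import numpy as np
import shap

def top_features(model, X, k):
    shap_values = shap.TreeExplainer(model).shap_values(X)
    if isinstance(shap_values, list):  # some binary classifiers return one array per class
        shap_values = shap_values[1]
    importance = np.abs(shap_values).mean(axis=0)  # mean |SHAP| per feature
    return set(np.argsort(importance)[-k:])        # indices of the k most important features

def feature_agreement(target, surrogate, X, k=10):
    return len(top_features(target, X, k) & top_features(surrogate, X, k)) / k
```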

This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.

