

by EScholar: Electronic Academic Papers for Scholars

This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.


(1) Maria Rigaki, Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic and [email protected];

(2) Sebastian Garcia, Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic and [email protected].

Table of Links

Abstract & Introduction
Threat Model
Background and Related Work
Experiments Setup
Conclusion, Acknowledgments, and References


7 Discussion

7.1 Performance Considerations

The results showed that it is possible to learn a reinforcement learning policy that evades machine learning classifiers and AVs with a limited number of queries, and that "easier" targets require fewer modifications. However, it must be noted that the different frameworks implement the modifications differently and use slightly different action sets, which may play some role in the results and may explain why the random agent outscored the SOTA frameworks on two targets. Still, the fact that the random agent achieved an evasion rate of 40% on the AV target shows that deployed products sometimes rely on simple rules and heuristics that can be bypassed by making random changes to a malicious binary.

The created surrogates showed very high label agreement with the target using very few queries, but they required an auxiliary dataset. However, obtaining an auxiliary dataset of malicious and benign features is more straightforward than obtaining the actual files, especially benign ones. An interesting result is that even with an auxiliary dataset such as Ember, which is almost five years old, it was possible to evade the AV with a high evasion rate.
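The surrogate idea above can be sketched as follows: spend a small query budget labeling auxiliary feature vectors with the black-box target, fit a local model on those labels, and measure label agreement on held-out samples. This is a minimal illustration, not the paper's implementation; the `target_predict` rule, the random features standing in for Ember-style vectors, and the choice of `GradientBoostingClassifier` are all assumptions for the sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Stand-in for the black-box target (in the paper: an ML classifier or AV).
# Hypothetical decision rule; a real target is opaque to the attacker.
def target_predict(X):
    return (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Auxiliary dataset of feature vectors (stand-in for Ember-style features).
X_aux = rng.normal(size=(5000, 16))

# Spend a small query budget labeling a subset of the auxiliary data.
budget = 512
X_q = X_aux[:budget]
y_q = target_predict(X_q)

surrogate = GradientBoostingClassifier().fit(X_q, y_q)

# Label agreement on held-out auxiliary samples (we query the target here
# only to measure agreement; the attack itself uses the surrogate).
X_test = X_aux[budget:budget + 1000]
agreement = (surrogate.predict(X_test) == target_predict(X_test)).mean()
print(f"label agreement: {agreement:.2%}")
```

Even with this tiny budget, the surrogate's agreement is high, which mirrors the paper's observation that a stale auxiliary dataset can still yield a useful surrogate.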

7.2 The Advantage of Learning a Policy

MEME and PPO use 2,048 queries to train the policy and require additional queries during evaluation. MAB and GAMMA act directly on the binaries, without separate training and testing phases. While this may seem advantageous, a trained policy can be applied to any binary, and its generalization ability is demonstrated on a previously unseen test set. Moreover, the attacker controls the environment and can apply multiple actions using the learned policy, bypassing the target entirely by using the surrogate. They can then test the final modified binary on the target, achieving the highest possible query efficiency: one query per binary.
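The one-query-per-binary evaluation loop can be sketched like this: the learned policy picks modifications, each intermediate result is checked only against the local surrogate, and the real target is queried exactly once at the end. The policy, the surrogate, the detector stubs, and the action names below are all hypothetical placeholders for the control flow, not the paper's code.

```python
import random

# Hypothetical PE-modification actions (illustrative names only).
ACTIONS = ["append_overlay", "add_section", "pack_upx", "modify_header"]

def evade_with_surrogate(binary, policy, surrogate_detects, target_detects,
                         max_steps=10):
    """Apply policy-chosen modifications, checking only the local surrogate.

    The real target is queried exactly once, on the final modified binary.
    """
    for _ in range(max_steps):
        action = policy(binary)          # learned policy picks a modification
        binary = binary + [action]       # stand-in for an actual PE change
        if not surrogate_detects(binary):
            break                        # surrogate says evaded; stop early
    return not target_detects(binary)    # the single target query

# --- Stubs so the loop runs end to end ---
policy = lambda b: random.choice(ACTIONS)
surrogate_detects = lambda b: len(b) < 3   # "detected" until 3 modifications
target_detects = lambda b: len(b) < 3      # surrogate agrees with target here
evaded = evade_with_surrogate([], policy, surrogate_detects, target_detects)
print("evaded:", evaded)
```

The query efficiency comes from the structure of the loop, not the stubs: however many modification steps the policy takes, `target_detects` is called once per binary.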

7.3 Future Work

To extend this work, several improvements and optimizations can be explored. One possibility is using an ensemble of surrogates, similar to [21], consisting of diverse model types and architectures. This could enhance the evasion rate and feature explainability, at the cost of added complexity and training time.
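An ensemble of diverse surrogates might look like the following sketch, which soft-votes a linear model, a forest, and a shallow tree over the same target-labeled data. The data, the label rule, and the specific estimators are illustrative assumptions, not choices made in the paper or in [21].

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical auxiliary features with labels obtained by querying the target.
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Diverse model types soft-voted into a single surrogate.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ],
    voting="soft",
).fit(X, y)

train_agreement = (ensemble.predict(X) == y).mean()
print(f"agreement with target labels: {train_agreement:.2%}")
```

The diversity is the point: disagreements between ensemble members can also flag which features drive the target's decisions, at the cost of training several models instead of one.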

Another avenue is investigating recurrent PPO or similar algorithms that leverage recurrent neural networks to learn policies generating action sequences from states. More broadly, any query-efficient method, even one not based on RL, that takes malicious binaries or their extracted features as input and produces modification actions could be explored.
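The core idea of a recurrent policy can be shown in a few lines: an LSTM carries hidden state across steps, so the action chosen at each step can depend on the whole sequence of observed feature-vector states, not just the current one. This is a bare forward-pass sketch in PyTorch with assumed dimensions, not a full recurrent PPO training loop.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Minimal recurrent policy: maps a sequence of feature-vector states to
    a sequence of modification-action logits, carrying hidden state across
    steps (the ingredient recurrent PPO variants add over plain PPO)."""

    def __init__(self, state_dim=16, hidden_dim=32, n_actions=8):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, states, hidden=None):
        out, hidden = self.lstm(states, hidden)
        return self.head(out), hidden    # logits per step + carried state

policy = RecurrentPolicy()
states = torch.randn(1, 5, 16)           # one episode of 5 observed states
logits, _ = policy(states)
actions = logits.argmax(dim=-1)          # greedy action at each step
print(actions.shape)
```

In a real training setup the returned `hidden` state would be threaded through the PPO rollout so that truncated episodes resume with the correct memory.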

Expanding the set of targets to include more AVs would help test different approaches to malware detection. Additionally, while surrogate models improved MEME's performance, they could provide even more information about the target. Future work includes using surrogates to reduce the RL algorithm's action space based on feature importance, or on gradient information from neural network surrogates.
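One way the action-space reduction could work is sketched below: a tree-based surrogate exposes feature importances, and actions whose perturbed features the surrogate considers uninfluential are pruned. The action-to-feature mapping, the label rule, and the threshold are all hypothetical; the paper proposes the idea but does not prescribe this recipe.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Surrogate trained on (hypothetical) extracted features vs. target labels;
# by construction only features 0 and 3 matter here.
X = rng.normal(size=(2000, 6))
y = (X[:, 0] - X[:, 3] > 0).astype(int)
surrogate = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Hypothetical mapping from each action to the feature indices it perturbs.
action_features = {
    "append_overlay": [5],
    "add_section":    [0, 1],
    "modify_header":  [3],
    "add_imports":    [2, 4],
}

# Keep only actions touching features the surrogate finds influential.
importance = surrogate.feature_importances_
threshold = importance.mean()
reduced = [a for a, feats in action_features.items()
           if importance[feats].max() > threshold]
print("reduced action set:", reduced)
```

A smaller action space shrinks the RL exploration problem, which should translate directly into fewer queries spent on modifications the target never notices.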

