Authors:
(1) S M Rakib Hasan, Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh (sm.rakib.hasan@g.bracu.ac.bd);
(2) Aakar Dhakal, Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh (aakar.dhakal@g.bracu.ac.bd).

IV. RESULTS AND DISCUSSION

Our experiments produced strong results for the malware detection system.

A. Binary Classification

Our trained model achieved 99.99% accuracy on the test set, detecting all the malware correctly. The result is shown in Fig. II. All of the models performed very well in detecting potential malware. The results are tabulated in TABLE II.

B. Malware Classification

As the dataset is highly imbalanced, we conducted this part in three steps. First, we ran the experiment on the original dataset; then we undersampled the majority class; finally, we oversampled the minority classes.

1) Classification on Original Data: Here, we ran the untouched data through our chosen algorithms and achieved moderate results. Although the metrics are not as impressive as those for binary classification, it is worth noting that no malware was classified as safe; rather, the errors were malware samples assigned to the wrong malware category. Our results are tabulated in TABLE III. The results show that the XGBoost classifier performed best in the detection and classification of malware.
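The observation that no malware was classified as safe, while some malware was assigned to the wrong category, can be checked directly from the true and predicted labels. The following is a minimal sketch and not the authors' code: the class names, the helper functions (`confusion_counts`, `missed_malware`, `wrong_category`, `accuracy`) and the label lists are hypothetical placeholders chosen for illustration, and only the Python standard library is used.

```python
from collections import Counter

# Hypothetical label for the safe class; the malware class names below are assumptions too.
BENIGN = "Benign"


def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return Counter(zip(y_true, y_pred))


def missed_malware(counts, benign=BENIGN):
    """Number of malware samples that were predicted as benign."""
    return sum(n for (t, p), n in counts.items() if t != benign and p == benign)


def wrong_category(counts, benign=BENIGN):
    """Number of malware samples flagged as malware but given the wrong category."""
    return sum(
        n for (t, p), n in counts.items()
        if t != benign and p != benign and t != p
    )


def accuracy(counts):
    """Fraction of samples whose predicted label equals the true label."""
    total = sum(counts.values())
    correct = sum(n for (t, p), n in counts.items() if t == p)
    return correct / total


# Made-up labels for illustration only; these are not results from the paper.
y_true = ["Benign", "Benign", "Ransomware", "Ransomware",
          "Spyware", "Spyware", "Trojan", "Trojan"]
y_pred = ["Benign", "Benign", "Ransomware", "Spyware",
          "Spyware", "Trojan", "Trojan", "Trojan"]

counts = confusion_counts(y_true, y_pred)
print("malware predicted as benign:", missed_malware(counts))
print("malware in wrong category:", wrong_category(counts))
print("multiclass accuracy:", accuracy(counts))
```

In this toy example the multiclass accuracy is only 0.75, yet no malware sample is predicted as benign. This is the same pattern described above: the errors are confusions between malware categories, not missed detections.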
2) Undersampling Majority Class: We used four undersampling methods and trained our models on each of them. The performance metrics differed across methods, and no single method dominated the scores. However, the Random Undersampling and Near Miss approaches performed better than the other two methods. These results are tabulated in TABLE IV. The results show that the XGBoost classifier again performed best, with the Random Forest classifier very close behind. In this approach too, no malware was labeled safe during detection.

3) Oversampling Minority Classes: Among the popular oversampling methods, we chose ADASYN (Adaptive Synthetic Sampling), a data augmentation technique used primarily in imbalanced classification tasks. After applying ADASYN to each minority class separately, we obtained a balanced dataset and applied our chosen classification algorithms. We got our best results with this approach. The findings are tabulated in TABLE V. Here also, XGBoost outperformed the other classifiers and provided the best predictions. The detection is shown in Fig. 3.

These results show that our malware detection models perform well and are robust. In binary classification through memory dump analysis, they detected all the malware in the test set. In classifying the malware, among the explored approaches, the application of ADASYN emerged as the most promising solution. By systematically addressing the class imbalance through synthetic data generation, we achieved better results than with both the original-data classification and the undersampling techniques. The outcomes of our experiments underscore the importance of tailored strategies for handling class imbalance and reaffirm the potential of advanced techniques like ADASYN in enhancing multiclass classification accuracy.

V. CONCLUSION AND FUTURE WORK

In conclusion, our research addresses the rising threat of obfuscated malware in connected devices and the internet landscape. Through memory dump analysis and diverse machine learning algorithms, we have explored effective detection strategies and illuminated their strengths and limitations using the CIC-MalMem-2022 dataset. Emphasizing the synergy between machine learning and traditional security methods, our work underscores the need for a comprehensive defense strategy in the dynamic cybersecurity realm. While acknowledging the ever-evolving malware landscape, our research lays the groundwork for future endeavours, advocating continuous adaptation.

Future efforts should focus on refining algorithms, exploring new data sources, and fostering interdisciplinary collaboration. We envision research on hybrid approaches that combine machine learning and signature-based methods, and on the impact of adversarial attacks and explainable AI, to enhance the robustness and transparency of detection systems. In summary, our study provides valuable insights for resilient cybersecurity solutions, addressing the challenges of obfuscated malware and advancing detection capabilities to safeguard digital ecosystems against emerging threats.

This paper is available on arXiv under CC BY-SA 4.0 DEED license.

Mal-Where? How We Boosted Malware Detection to XG-ceptional Levels
