This paper is available on arXiv under the CC 4.0 license.
Authors:
(1) Domenico Cotroneo, University of Naples Federico II, Naples, Italy;
(2) Alessio Foggia, University of Naples Federico II, Naples, Italy;
(3) Cristina Improta, University of Naples Federico II, Naples, Italy;
(4) Pietro Liguori, University of Naples Federico II, Naples, Italy;
(5) Roberto Natella, University of Naples Federico II, Naples, Italy.
In this paper, we addressed the automatic assessment of the correctness of code produced by AI code generators. We proposed a fully automated method, named ACCA, that uses symbolic execution to assess the correctness of security-oriented code without any human effort.
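To make the idea of a symbolic-execution-based correctness check concrete, the following is a minimal sketch and not the ACCA implementation. It assumes a toy 32-bit register machine with a handful of instructions (`mov`, `add`, `sub`, `inc`, `dec`, and `xor` of a register with itself), ignores flags and memory, and represents each register as a linear expression over symbolic initial values. All function names (`symbolic_run`, `equivalent`, and the helpers) are hypothetical and chosen for illustration only.

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit registers, arithmetic modulo 2**32


def normalize(expr):
    """Reduce coefficients modulo 2**32 and drop terms that become zero."""
    reduced = {name: coeff & MASK for name, coeff in expr.items()}
    return {name: coeff for name, coeff in reduced.items() if coeff}


def const(value):
    """Linear expression holding only a constant (stored under the key '1')."""
    return normalize({"1": value})


def combine(left, right, sign):
    """Return left + sign * right as a normalized linear expression."""
    out = dict(left)
    for name, coeff in right.items():
        out[name] = out.get(name, 0) + sign * coeff
    return normalize(out)


def operand_value(state, operand):
    """Resolve an operand to a symbolic expression (register or immediate)."""
    if operand in state:
        return state[operand]
    return const(int(operand, 0))


def symbolic_run(snippet, registers=("eax", "ebx", "ecx", "edx")):
    """Execute the snippet symbolically and return the final register state."""
    # Each register starts as its own unconstrained symbol, e.g. eax -> eax_0.
    state = {reg: {reg + "_0": 1} for reg in registers}
    for raw in snippet.strip().splitlines():
        line = raw.split(";")[0].strip()  # strip assembly comments
        if not line:
            continue
        op, _, rest = line.partition(" ")
        args = [a.strip() for a in rest.split(",")] if rest.strip() else []
        dst = args[0]
        src = operand_value(state, args[1]) if len(args) > 1 else None
        if op == "mov":
            state[dst] = dict(src)
        elif op == "add":
            state[dst] = combine(state[dst], src, 1)
        elif op == "sub":
            state[dst] = combine(state[dst], src, -1)
        elif op == "inc":
            state[dst] = combine(state[dst], const(1), 1)
        elif op == "dec":
            state[dst] = combine(state[dst], const(1), -1)
        elif op == "xor" and len(args) == 2 and args[0] == args[1]:
            state[dst] = {}  # a register xor-ed with itself is always zero
        else:
            raise NotImplementedError("unsupported instruction: " + line)
    return state


def equivalent(generated, reference):
    """Two snippets are equivalent if they yield the same symbolic state."""
    return symbolic_run(generated) == symbolic_run(reference)


if __name__ == "__main__":
    print(equivalent("xor eax, eax\ninc eax", "mov eax, 1"))
```

The point of the sketch is that two snippets with different text can still be judged equivalent because their final symbolic states coincide, which a textual similarity metric cannot establish. A realistic assessment would rely on a binary analysis framework and an SMT solver rather than this linear toy domain.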
We used our method to evaluate four state-of-the-art code generators on the generation of offensive assembly code from NL descriptions, and compared the results with human evaluation and with several baseline solutions, including state-of-the-art output similarity metrics and the well-known ChatGPT.
Our experiments showed that ACCA provides results almost equal to those of human evaluation, which is considered the gold standard in the field, and that it is the assessment solution most correlated with it. Moreover, the analysis of the computational cost revealed that assessing a code snippet takes ∼0.17s on average, which, based on our experience, is lower than the average time human analysts need to inspect the code manually.
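As an illustration of how the agreement between an automatic assessment and human evaluation can be quantified, the sketch below computes a correlation coefficient between two lists of per-snippet scores. It assumes Pearson's r is the coefficient of interest; the function name `pearson` and the scores in the usage example are hypothetical and are not data from our experiments.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-snippet scores: 1 = judged correct, 0 = judged incorrect.
    human_scores = [1, 0, 1, 1, 0, 1]
    automatic_scores = [1, 0, 1, 0, 0, 1]
    print(pearson(automatic_scores, human_scores))
```

A coefficient close to 1 indicates that the automatic assessment ranks snippets in nearly the same way as the human evaluators, which is the sense in which one assessment solution is "more correlated" with human evaluation than another.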