Authors:
(1) Gaurav Kolhatkar, SCTR’s Pune Institute of Computer Technology, Pune, India ([email protected]);
(2) Akshit Madan, SCTR’s Pune Institute of Computer Technology, Pune, India ([email protected]);
(3) Nidhi Kowtal, SCTR’s Pune Institute of Computer Technology, Pune, India ([email protected]);
(4) Satyajit Roy, SCTR’s Pune Institute of Computer Technology, Pune, India ([email protected]).
This paper examines the potential of natural language processing to automate programming work. We propose a two-stage methodology for converting English-language user stories into pseudocode, which has the advantage of being easy to translate into any programming language of the developer's choice. The two stages of this approach are text-to-code conversion and code-to-pseudocode conversion, each treated as a language translation task. We use the CodeT5 model for both stages, achieving BLEU scores of 0.4 for Stage 1 and 0.74 for Stage 2. Our proposed system simplifies the software development process in organizations.
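To make the pipeline concrete, the sketch below chains the two stages at inference time with the Hugging Face transformers library. It assumes two CodeT5 checkpoints fine-tuned separately for each stage; the checkpoint paths and the example user story are hypothetical placeholders, not artifacts released with this paper.

```python
# Minimal sketch of the two-stage pipeline at inference time.
# Assumes two CodeT5 models fine-tuned separately: one for
# user story -> Python code (Stage 1) and one for Python code ->
# pseudocode (Stage 2). Checkpoint paths below are hypothetical.
from transformers import RobertaTokenizer, T5ForConditionalGeneration

def load(checkpoint):
    # CodeT5 pairs a RoBERTa-style BPE tokenizer with a T5 encoder-decoder.
    tokenizer = RobertaTokenizer.from_pretrained(checkpoint)
    model = T5ForConditionalGeneration.from_pretrained(checkpoint)
    return tokenizer, model

def translate(tokenizer, model, text, max_length=256):
    # Both stages are framed as sequence-to-sequence translation.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=max_length, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Stage 1: English user story -> Python code.
s1_tok, s1_model = load("checkpoints/codet5-story-to-code")       # hypothetical path
code = translate(s1_tok, s1_model,
                 "As a user, I want to sort my tasks by due date.")

# Stage 2: Python code -> pseudocode.
s2_tok, s2_model = load("checkpoints/codet5-code-to-pseudocode")  # hypothetical path
print(translate(s2_tok, s2_model, code))
```

Framing both stages as generic translation is what lets a single architecture, CodeT5, serve the whole pipeline.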
In the future, we plan to curate a larger dataset for converting English text into Python code. A larger and more varied dataset would allow us to obtain higher accuracy and better generalisation to new examples. This would necessitate considerable data gathering, but it might open the door to more efficient and useful text-to-code conversion methods. Additionally, a dedicated text-to-pseudocode dataset could be produced to simplify the conversion architecture, for example by collapsing the two stages into a single translation step.
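Whichever dataset is used, improvements would be measured the same way as the results above. As a minimal sketch of that evaluation, the snippet below computes a sentence-level BLEU score with NLTK; the reference and candidate strings are invented for illustration, and the paper's figures are corpus-level scores on its own test sets.

```python
# Illustrative BLEU computation for one generated example, using NLTK.
# The reference/candidate strings here are made up for demonstration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "for each item in list : print item".split()
candidate = "for item in list : print item".split()

# Smoothing avoids zero scores when higher-order n-grams do not match.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")  # value in [0, 1], the same scale as the paper's scores
```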
This paper is available on arXiv under a CC 4.0 license.