Authors:
(1) Gaurav Kolhatkar, SCTR’s Pune Institute of Computer Technology, Pune, India ([email protected]);
(2) Akshit Madan, SCTR’s Pune Institute of Computer Technology, Pune, India ([email protected]);
(3) Nidhi Kowtal, SCTR’s Pune Institute of Computer Technology, Pune, India ([email protected]);
(4) Satyajit Roy, SCTR’s Pune Institute of Computer Technology, Pune, India ([email protected]).
Saeki, Horai, and Enomoto’s groundbreaking work in 1989 [15] marked the beginning of the effort to close the gap between natural language and programming languages. Their insight was that software development could be accelerated by converting natural language specifications into code. The work served as a conceptual starting point and also provided a framework for further investigation.
Cohn’s influential 2004 book "User Stories Applied" [13] became a cornerstone as the software development industry turned toward more agile approaches. User stories are central to Agile development, and Cohn underlined the value of turning them into code quickly; they have since taken on a key role in Agile workflows, marking a dramatic turn in the field.
The emergence of automated code generation tools accelerated the shift from user stories written in plain language to programming code. Notably, Oda et al. presented Pseudogen in 2015 [19]. Although it focused on generating pseudocode from source code rather than from natural language text, this work was a significant advance in automating the translation between code and plain-language descriptions.
The conversion of text to pseudocode reached a turning point with the introduction of deep learning and neural networks. In 2018, Devlin et al. unveiled BERT [1], a revolutionary approach to interpreting natural language. BERT’s ability to capture context transformed code generation, making it possible to comprehend programming-related language with far more sophistication.
In 2021, GraphCodeBERT [2] extended BERT’s capabilities to code by pre-training on data flow. This method used deep bidirectional transformers to improve code representation and comprehension, a substantial advance for the discipline. By addressing the need for models that comprehend both the code and the data flow within it, it enabled higher-quality generated pseudocode.
In 2021, Ahmad et al. [3] proposed a unified approach to program understanding and generation, signaling the convergence of various research strands. By providing a thorough understanding of program structures, this work aimed to improve code generation, and it emphasized the potential of understanding the entire development cycle, from user stories to generated code.
As the field developed, the value of standardized datasets became clear. Datasets such as the MBPP Dataset [26] and the ASE15 Django Dataset [27] served as benchmarks for text-to-code and code-to-pseudocode conversion methods, allowing researchers to rigorously assess and compare their models.
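To make the benchmarking setup concrete, the sketch below shows one way such a dataset might be inspected. It assumes the Hugging Face `datasets` library and that MBPP is published on the Hub under the `mbpp` identifier; the field names are those of that distribution, not anything prescribed by this paper.

```python
# Minimal sketch of inspecting an MBPP benchmark example.
# Assumes the Hugging Face `datasets` library and the "mbpp"
# dataset identifier on the Hub (an assumption, not from the paper).
from datasets import load_dataset

mbpp = load_dataset("mbpp", split="test")

example = mbpp[0]
print(example["text"])       # plain-language task description
print(example["code"])       # reference Python solution
print(example["test_list"])  # assert statements used for evaluation
```

Each entry pairs a plain-language description with a reference solution and test cases, which is what allows generated code to be evaluated and compared automatically.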
More recently, models such as Code2Text [8] and DeepPseudo [10] have gained popularity. By combining pre-training, deep learning, and code feature extraction, they improve the precision and applicability of text-to-pseudocode conversion. Designed to comprehend both the code itself and the structure of its programming context, these models offer reliable, state-of-the-art solutions for this task.
This paper is available on arXiv under a CC 4.0 license.