Just over a week ago, most of you would have heard that Facebook's AI research team (FAIR) developed a neural transcompiler that converts code from one high-level programming language, such as C++, Java or Python, into another using 'unsupervised translation'. The traditional approach has been to tokenize the source language and convert it into an Abstract Syntax Tree (AST), which the transcompiler uses to translate into the target language of choice, based on handwritten rules that define the translations, so that the abstraction and context are not lost.

Over the decades, there have been such significant advances in neural machine translation that neural models tend to outperform hard-coded, handwritten rules by a significant margin; the only constraint has been the availability of sufficient parallel corpora. This has been addressed to a great extent by the 'unsupervised machine translation' approach, wherein a large corpus of monolingual source code in different programming languages from GitHub was used by Facebook's research team to train the model.

This comes as a relief to many organizations, especially in the insurance, government and banking sectors, which have had to continue with legacy applications with little room for enhancement or fine tuning, because those applications were written in the good old times by programmers who were really skilled in the languages of the day: COBOL, Pascal and Fortran, to name a few. Although programming languages have evolved over time, porting a code base to a more efficient or modern language like Java, Swift, Ruby or Python is a real pain, as it requires expertise in both the source and the target languages. For example, there are reports that the Commonwealth Bank of Australia spent a whopping US$750 million and five years migrating its core software away from COBOL on a mainframe to a modern platform. Rule-based translations are complex to implement, less flexible and less interpretable. Facebook's TransCoder comes as a solution to this long-standing problem.

Principal components of the FAIR transcoder

The FAIR TransCoder is based on a transformer architecture, comprised of an encoder and a decoder, following the 'Attention Is All You Need' paper. It relies on a single encoder and a single decoder shared across all programming languages, and is built on 3 principles:

Masked language model pretraining
Denoising auto-encoding
Back-translation

1. Masked Language model pretraining

The masked language model pretraining is based on the BERT paper and trains the encoder to identify masked tokens in the source code. When certain tokens are masked, the encoder is trained to understand the programming constructs well enough to identify the missing tokens and reconstruct them.

The encoder thus learns which tokens and constructs often appear together, and thereby learns the structure of the code, by learning an embedding space that takes into account the statistical co-occurrence of tokens. This means that tokens which mean the same thing in different programming languages share the same or similar embeddings in a high-dimensional space, which makes it possible to identify masked tokens. The classification layer, which is the outermost layer of the encoder, makes the prediction for the token at each position based on conditional probabilities.

This is illustrated in an example from the research paper, where the keyword 'if' was reconstructed even though it was masked. However, this applies not only to language-specific keywords but to any tokens in the source code.
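To make the masking idea concrete, here is a minimal sketch (not the paper's actual pipeline) of how tokens from a snippet of source code might be masked for this objective. The [MASK] symbol and the 15% default masking rate are assumptions borrowed from BERT-style training, not values taken from the TransCoder paper.

```python
import random

MASK = "[MASK]"  # special mask symbol; a BERT-style convention, assumed here

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace a fraction of tokens with a mask symbol.

    Returns the corrupted sequence plus the positions and original values
    that the encoder is trained to predict."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok          # the encoder must reconstruct this token
        else:
            corrupted.append(tok)
    return corrupted, targets

# A toy Python snippet, already split into tokens.
tokens = ["if", "x", ">", "0", ":", "return", "x", "else", ":", "return", "-", "x"]
corrupted, targets = mask_tokens(tokens, mask_prob=0.3)
print(corrupted)  # sequence with some tokens replaced by [MASK]
print(targets)    # masked positions mapped to the tokens to be predicted
```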
Denoising auto-encoding (DAE):

In a similar way, the decoder is trained with denoising auto-encoding: it takes the input from the encoder and outputs one token at a time in an autoregressive way, with each output looped back into the same decoder to predict the subsequent tokens. The decoder is presumably trained in parallel with the encoder.

'Denoising', as the paper states, helps the decoder to identify corrupted segments of code within the source, much like the encoder is trained to handle masked tokens. Part of the corruption is masking some tokens, which can be extended to scrambling tokens or sub-tokens, or even dropping tokens altogether (a minimal sketch of such corruption appears at the end of this part). The corrupted code is fed to the encoder, and the decoder has to generate the recovered code, based on the corpora of monolingual source code it has been trained on, producing output in the desired target language as determined by the language token of choice (Java, Python, C++ and so on).

"The DAE objective operates like a supervised machine translation algorithm, where the model is trained to predict a sequence of tokens, given a corrupted version of that sequence"

Back Translation:

Back-translation, which has proved to be very effective in supervised machine translation, is an important component of the unsupervised setting. It checks the efficacy of a translation by coupling a 'source-to-target' model with a 'target-to-source' model. Since there is no 'ground truth', the source-to-target model first produces a rather 'noisy' translation into the target language, and the target-to-source model is then trained to reconstruct the original source code from that noisy translation (with the roles reversed for the opposite direction). The two models train in parallel until convergence, when the original source code can be reconstructed accurately.

Model architecture:

As the paper states, the model uses a transformer architecture with 6 layers, 8 attention heads and a dimensionality of 1024, with a single encoder and a single decoder for all programming languages (a rough PyTorch sketch is given at the end of this part). This simplifies the model to a large extent and reduces memory requirements, with most of the heavy lifting done during the process of learning the shared embeddings.

The models were implemented in PyTorch and trained on 32 V100 GPUs. To further minimize memory requirements and speed up computations, float16 operations were used. The TransCoder was optimized with the Adam optimizer, using a learning rate of 10⁻⁴.

"During pretraining, we alternate between batches of C++, Java, and Python, composed of 32 sequences of source code of 512 tokens. At training time, we alternate between the denoising auto-encoding and back-translation objectives, and use batches of around 6000 tokens" (excerpt from the research paper)

The use of a common, universal tokenizer for all languages was found to be suboptimal because of the differences in constructs across languages, so language-specific tokenizers had to be used: the javalang tokenizer for Java, the clang tokenizer for C++ and the built-in tokenizer of the Python standard library for Python source code. This ensures that the tokenized sequences are valid representations of the language-specific keywords, operators and constructs.

Cross-lingual token embedding space

Below is a representation of the shared embedding space, a 2-dimensional t-SNE projection that showcases the embeddings of language-specific keywords. As we can see, keywords that mean the same thing or are used in similar contexts lie very close to each other.
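To make the corruption step of the denoising objective described above concrete, here is a minimal sketch of how a token sequence might be masked, thinned out and locally shuffled. The noise probabilities, the [MASK] symbol and the shuffle window are my own illustrative choices, not the paper's settings.

```python
import random

MASK = "[MASK]"  # placeholder symbol for masked tokens (my own convention)

def corrupt(tokens, p_mask=0.1, p_drop=0.1, shuffle_window=3, seed=0):
    """Corrupt a token sequence for a denoising auto-encoding objective:
    randomly mask some tokens, drop others, and lightly shuffle the rest."""
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        r = rng.random()
        if r < p_drop:
            continue                  # drop the token entirely
        elif r < p_drop + p_mask:
            noisy.append(MASK)        # mask the token
        else:
            noisy.append(tok)
    # Local shuffle: each surviving token may move only a few positions.
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(noisy))]
    return [tok for _, tok in sorted(zip(keys, noisy))]

# A toy C++ function, already split into tokens.
tokens = ["int", "sum", "(", "int", "a", ",", "int", "b", ")", "{",
          "return", "a", "+", "b", ";", "}"]
print(corrupt(tokens))  # a noisy version; the decoder must recover the original
```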
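The architecture and optimizer settings quoted above (6 layers, 8 attention heads, a dimensionality of 1024, a single shared encoder and decoder, Adam with a learning rate of 10⁻⁴) can be sketched roughly in PyTorch as follows. The use of nn.Transformer and the vocabulary size are my own assumptions; this is not the actual TransCoder implementation.

```python
import torch
import torch.nn as nn

# Settings reported in the paper; VOCAB_SIZE is my own placeholder.
VOCAB_SIZE = 64000   # assumed size of the shared sub-token vocabulary
D_MODEL = 1024       # model dimensionality
N_LAYERS = 6         # encoder and decoder layers
N_HEADS = 8          # attention heads

# A single shared embedding table, encoder and decoder for all languages;
# the target language is selected with a special language token.
embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)
seq2seq = nn.Transformer(
    d_model=D_MODEL,
    nhead=N_HEADS,
    num_encoder_layers=N_LAYERS,
    num_decoder_layers=N_LAYERS,
)
classifier = nn.Linear(D_MODEL, VOCAB_SIZE)  # per-position scores over the vocabulary

params = (list(embedding.parameters())
          + list(seq2seq.parameters())
          + list(classifier.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)  # learning rate from the paper

# The paper pretrains on batches of 32 sequences of 512 tokens (on 32 V100 GPUs,
# with float16 operations); tiny shapes are used here just to show the data flow.
src = embedding(torch.randint(0, VOCAB_SIZE, (64, 2)))  # (seq_len, batch, d_model)
tgt = embedding(torch.randint(0, VOCAB_SIZE, (64, 2)))
logits = classifier(seq2seq(src, tgt))                  # (seq_len, batch, vocab)
```

Sharing one encoder and one decoder across C++, Java and Python is what keeps the memory footprint small: most of the per-language knowledge lives in the shared token embeddings rather than in separate models.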
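As an illustration of the language-specific tokenization mentioned above, the sketch below runs Python's built-in tokenize module on a small Python function; the javalang and clang tokenizers used for Java and C++ are not shown. Which structural tokens to keep is a design choice I make here for readability, not something taken from the paper.

```python
import io
import tokenize

def tokenize_python(source: str):
    """Tokenize Python source with the standard library tokenizer,
    keeping keywords, names, operators and literals as separate tokens."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                        tokenize.DEDENT, tokenize.ENDMARKER):
            continue  # structural tokens could also be kept, depending on the model
        tokens.append(tok.string)
    return tokens

source = "def absolute(x):\n    return x if x > 0 else -x\n"
print(tokenize_python(source))
# ['def', 'absolute', '(', 'x', ')', ':', 'return', 'x', 'if', 'x', '>', '0', 'else', '-', 'x']
```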
Model training:

The model was trained on more than 2.8 million open-source GitHub repositories from the GitHub public dataset. From the millions of projects that were available, the transcompiler was trained to translate functions rather than entire projects (since this is what is eventually expected), as functions are concise enough to fit into memory and also allow for simpler evaluation with unit tests.

Model evaluation:

The TransCoder's performance was evaluated on 852 parallel functions extracted from GeeksforGeeks, an online platform hosting problems and their solutions in several programming languages. The authors introduced a new metric, 'computational accuracy', to test whether the hypothesis functions generate the same output as the reference when given the same inputs (a toy sketch of this metric is given just before the summary), though there are other metrics such as 'reference match' and the 'BLEU' score.

The TransCoder generated multiple translations using beam search decoding, and only the hypotheses with the highest log-probabilities were kept. Doing so, Facebook was able to infer that even though the best-performing version of the TransCoder did not yield functions that were exact replicas of the reference code, its translations had high computational accuracy.

Below are the results of the unit tests using the computational accuracy metric with Beam 1, evaluated against baseline scores from rule-based systems written by experts. 'Beam 1' is greedy decoding, meaning only the single most probable translation is generated and kept. This is done for speed; the paper also reports results for larger beam sizes, which I leave out to keep the clutter down.

Transcoder evaluation on sample code

According to the authors, the TransCoder exhibited superior translation capabilities compared to rule-based models, and was successfully able to convert between data types and to find the equivalent data structures, methods and libraries across languages.

For example, in a C++ -> Java translation from the paper, the C++ source function takes a character pointer (char *str) as an argument and performs array-level accesses within the code block. The transcoder correctly interprets the argument as being of type 'string' and uses string-specific methods in the translated version.

The same source code was tested again, this time changing the argument name from 'str' to 'arr' in the source, which resulted in the transcoder evaluating the Java equivalent to be a character array (char[] arr) and consequently using character-specific methods in the translated version. Such statistical inference on human-written code demonstrates a significant advantage over rule-based systems and is also less error-prone.

Far from perfect

Though the TransCoder has proven to be far superior to rule-based systems, it is far from perfect. The authors mention cases where translations have failed, mainly due to compilation errors. Most of the compilation errors have been caused by overloaded methods or functions, the need for type casting, and so on. This could probably be addressed once the model has seen a sufficiently large code base to understand the differences in method calls across programming languages. The research paper shows several examples where such translations have failed.
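Before wrapping up, here is a toy illustration of the computational accuracy metric described in the evaluation section: a translated function counts as correct if it produces the same outputs as the reference function on a set of test inputs. The functions and inputs below are made up for illustration.

```python
def reference_abs(x):
    """Reference implementation (stands in for the original function)."""
    return x if x > 0 else -x

def translated_abs(x):
    """Hypothetical translated function under test."""
    return abs(x)

def computational_accuracy(pairs, test_inputs):
    """pairs: list of (reference_function, translated_function).

    A translation counts as correct only if it returns the same output as
    its reference on every test input; the metric is the fraction of
    correct translations. (In the paper, each parallel function has its
    own test inputs; a single shared set is used here for brevity.)"""
    correct = sum(
        all(hyp(x) == ref(x) for x in test_inputs)
        for ref, hyp in pairs
    )
    return correct / len(pairs)

print(computational_accuracy([(reference_abs, translated_abs)], [-3, 0, 7]))  # 1.0
```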
Summary:

What we have seen from Facebook AI Research (FAIR) is a giant leap towards machine code translation. Though this may be in its nascent stages, there is definitely huge potential, which could translate into billions of dollars in savings, less human intervention and less susceptibility to errors. Though far from perfect for now, such models can assist humans by doing most of the heavy lifting in code translation, and who knows, we may have the perfect AI transcoder in the near future.

Thanks for reading this article. I have tried to be as accurate as possible in my interpretation of the research paper. Looking forward to hearing from you. Cheers.

References

Unsupervised Translation of Programming Languages
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
It's COBOL all the way down
BLEU Score