A Deep Dive Into Facebook’s AI Transcoder

Written by dilin-john | Published 2020/07/06
Tech Story Tags: unsupervised-learning | nlp | artificial-intelligence | machine-learning | hackernoon-top-story | facebook | datasets | big-data

TLDR Facebook's AI research team (FAIR) developed a neural transcompiler that converts code written in a high-level programming language such as C++, Python, Java or COBOL into another language using 'unsupervised translation'. The traditional approach had been to tokenize the source language and convert it into an Abstract Syntax Tree (AST) for translation into the target language of choice. A large corpus of monolingual source code in different programming languages from GitHub was used to train the model.

Just over a week ago, most of you would have heard that Facebook's AI research team (FAIR) developed a neural transcompiler that converts code written in a high-level programming language such as C++, Python, Java or COBOL into another language using 'unsupervised translation'. The traditional approach had been to tokenize the source language and convert it into an Abstract Syntax Tree (AST), which the transcompiler would use to translate into the target language of choice, based on handwritten rules that define the translations so that the abstraction and context are not lost.
Over the decades, there have been such significant advances in neural language translation that these neural models tend to outperform hard-coded, handwritten rules by a significant margin; the main constraint has been the availability of sufficient parallel corpora.
This has been addressed to a great extent by the 'unsupervised machine translation' approach, in which Facebook's research team trained the model on a large corpus of monolingual source code in different programming languages from GitHub.
This comes as a relief to many organizations, especially in the insurance, government and banking sectors, that have had to continue running legacy applications with little room for enhancement or fine-tuning because they were written in the good old days, in languages such as COBOL, Pascal and Fortran, by programmers who were genuinely skilled in them.
Although programming languages have evolved over time, porting a code base to a more efficient or modern language like Java, Swift, Ruby or Python is a real pain, as it requires expertise in both the source and the target languages. For example, there are reports that the Commonwealth Bank of Australia spent a whopping US$750 million and five years migrating its core software away from COBOL on a mainframe to a modern platform.
Rule-based translations are complex to implement and offer little flexibility. Facebook's Transcoder comes as a solution to this long-standing problem.
Principal components of the FAIR transcoder
The FAIR Transcoder is based on a transformer architecture comprising an encoder and a decoder, following the 'Attention Is All You Need' paper. It relies on a single shared model for the encoder and decoder across all languages and is built on three principles:
  1. Masked Language model Pretraining
  2. Denoising auto-encoding
  3. Back Translation
1. Masked Language model pretraining
Masked language model pretraining follows the BERT paper: the encoder is trained to identify masked tokens in the source code. When certain tokens are masked, the encoder has to understand the programming constructs well enough to identify the missing tokens and reconstruct them.
The encoder thus learns which tokens and constructs often appear together, and thereby learns the structure of the code, by learning an embedding space that takes into account the statistical co-occurrence of tokens. This means that tokens that mean the same thing in different programming languages share the same or similar embeddings in a high-dimensional space, which makes it possible to identify masked tokens. The classification layer, the outermost layer of the encoder, predicts the token at each position based on conditional probabilities.
This is illustrated in an example from the research paper, where the keyword 'if' was reconstructed even though it was masked. This applies not only to language-specific keywords, however, but to any token in the source code.
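To make the objective concrete, here is a minimal sketch of masked language model pretraining in PyTorch. The vocabulary size, masking rate and tiny two-layer encoder are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch of masked language model pretraining on code tokens.
# Vocabulary size, mask id and masking rate are illustrative assumptions.
import torch
import torch.nn as nn

vocab_size, d_model, mask_id = 1000, 64, 0   # hypothetical vocabulary and [MASK] id

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)
classifier = nn.Linear(d_model, vocab_size)  # predicts the original token at each position

tokens = torch.randint(1, vocab_size, (4, 32))   # a batch of tokenized functions
mask = torch.rand(tokens.shape) < 0.15           # mask roughly 15% of positions
corrupted = tokens.masked_fill(mask, mask_id)

logits = classifier(encoder(embed(corrupted)))
# The loss is computed only on the masked positions the encoder has to reconstruct.
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
```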
Denoising auto-encoding (DAE):
In a similar way, the decoder performs denoising auto-encoding: it takes the output of the encoder and emits one token at a time in an autoregressive way, with each output looped back into the decoder to predict the subsequent tokens. The decoder is presumably trained in parallel with the encoder.
'Denoising', as the paper states, helps the decoder identify corrupted segments of code within the source, much as the encoder is trained to handle masked tokens.
Part of the corruption is masking some tokens, which can be extended to scrambling or even dropping tokens or sub-tokens. The corrupted code is fed to the encoder, and the decoder has to generate the recovered code, based on the corpora of monolingual source code it has been trained on, producing output in the desired target language as determined by the language token of choice (Java, Python, C++, ...).
"The DAE objective operates like a supervised machine translation algorithm, where the model is trained to predict a sequence of tokens, given a corrupted version of that sequence"
Back Translation:
Back translation, which has proved very effective in machine translation, is an important component of the unsupervised setting. It checks the efficacy of a translation by coupling a 'source-to-target' model with a 'target-to-source' model.
Since there is no ground truth, the target-to-source model first creates its own, rather noisy, version of the source language, and the source-to-target model tries to recreate the original source code from that noisy translation. The two models are trained in parallel until convergence, at which point the original source code is translated faithfully.
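The data flow can be sketched as follows. The two translation directions below are trivial placeholder functions, purely hypothetical, since the point is only to show how a noisy back translation is fed back in to reconstruct the original; the real system uses the shared transformer for both directions.

```python
# Conceptual sketch of the back-translation loop with placeholder "models".

def tgt_to_src(java_tokens):
    # hypothetical Java -> Python direction (noisy, especially early in training)
    return ["#", "noisy", "python", "version", "of"] + java_tokens

def src_to_tgt(python_tokens):
    # hypothetical Python -> Java direction being trained; here it simply
    # strips the noise to stand in for a successful reconstruction
    return python_tokens[5:]

java_function = ["int", "add", "(", "int", "a", ",", "int", "b", ")",
                 "{", "return", "a", "+", "b", ";", "}"]

# Step 1: produce a noisy source-language version of the target-language code.
noisy_python = tgt_to_src(java_function)
# Step 2: train src_to_tgt to reconstruct the original Java from that noisy input.
reconstruction = src_to_tgt(noisy_python)
# In real training, a loss compares `reconstruction` with `java_function`,
# and both directions are updated until they converge.
print(reconstruction == java_function)
```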
Model architecture:
As the paper states, the model uses a transformer architecture with 6 layers, 8 attention heads and a dimensionality of 1024, with a single encoder and a single decoder shared across all programming languages. This simplifies the model to a large extent and reduces memory requirements, with most of the heavy lifting done while learning the shared embeddings.
The models were implemented in PyTorch and trained on 32 V100 GPUs. To further minimize memory requirements and speed up computation, float16 operations were used. The Transcoder was optimized with the Adam optimizer at a learning rate of 10⁻⁴.
“During pretraining, we alternate between batches of C++, Java, and Python, composed of 32 sequences of source code of 512 tokens. At training time, we alternate between the denoising auto-encoding and back-translation objectives, and use batches of around 6000 tokens” (excerpt from the research paper)
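For reference, a seq2seq transformer with those reported dimensions (6 layers, 8 heads, model dimension 1024) can be sketched in PyTorch as below. The shared vocabulary size and the feed-forward defaults are assumptions for illustration, not values from the paper.

```python
# Sketch of a seq2seq transformer with the dimensions reported in the paper.
# Vocabulary size and sequence length are assumptions for illustration.
import torch
import torch.nn as nn

vocab_size, max_len = 32000, 512   # assumed shared vocabulary across languages

model = nn.Transformer(
    d_model=1024,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    batch_first=True,
)
embed = nn.Embedding(vocab_size, 1024)
proj = nn.Linear(1024, vocab_size)

src = torch.randint(0, vocab_size, (2, max_len))   # e.g. C++ token ids
tgt = torch.randint(0, vocab_size, (2, max_len))   # e.g. Java token ids
out = proj(model(embed(src), embed(tgt)))          # real training would add a causal target mask
print(out.shape)                                   # (batch, sequence, vocabulary)

params = list(model.parameters()) + list(embed.parameters()) + list(proj.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)      # Adam with learning rate 10^-4, as reported
```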
A common, universal tokenizer for all languages was found to be suboptimal because language constructs differ across languages, so language-specific tokenizers had to be used: the 'javalang' tokenizer for Java, the 'clang' tokenizer for C++ and the built-in tokenizer for Python source code. This ensures that the tokenized sequences are valid representations of each language's keywords, operators and constructs.
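The Java and C++ tokenizers ('javalang' and 'clang') are external tools, so the sketch below shows only the Python case, using the standard library's `tokenize` module.

```python
# Tokenizing Python source with the standard library, as the paper does for Python.
import io
import tokenize

source = "def add(a, b):\n    return a + b\n"

tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type not in (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER)
]
print(tokens)  # ['def', 'add', '(', 'a', ',', 'b', ')', ':', ...]
```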
Cross-lingual token embedding space
Below is a two-dimensional t-SNE representation of the shared embedding space, showing the embeddings of language-specific keywords. As we can see, keywords that mean the same thing, or are used in similar contexts, sit very close to each other.
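A visualization of this kind can be produced with scikit-learn's t-SNE. The keyword list and the random embedding matrix below are placeholders standing in for the model's learned embeddings.

```python
# Minimal sketch of projecting token embeddings to 2-D with t-SNE.
# The embeddings here are random placeholders, not the model's actual weights.
import numpy as np
from sklearn.manifold import TSNE

keywords = ["if", "else", "for", "while", "elif", "switch"]   # keywords from several languages
embeddings = np.random.rand(len(keywords), 1024)              # placeholder 1024-d embeddings

points = TSNE(n_components=2, perplexity=3, init="random").fit_transform(embeddings)
for word, (x, y) in zip(keywords, points):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```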
Model training:
The model was trained on more than 2.8 million open-source repositories from the GitHub public dataset. From the millions of projects available, the transcompiler was trained on functions rather than entire projects (since function-level translation is what is eventually expected), as functions are concise enough to fit into memory and also allow for simpler evaluation with unit tests.
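For Python sources, extracting standalone functions as training examples could look roughly like the sketch below, using the standard `ast` module; this illustrates the idea, not the paper's actual pipeline.

```python
# Sketch: pull individual function definitions out of a Python source file
# so each one can serve as a training example.
import ast

source = """
def add(a, b):
    return a + b

def sub(a, b):
    return a - b
"""

tree = ast.parse(source)
functions = [
    ast.get_source_segment(source, node)
    for node in ast.walk(tree)
    if isinstance(node, ast.FunctionDef)
]
for fn in functions:
    print(fn, end="\n\n")
```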
Model evaluation:
The Transcoder's performance was evaluated on 852 parallel functions extracted from GeeksforGeeks, an online platform hosting problems and their solutions in several programming languages. The authors introduced a new metric, computational accuracy, which tests whether the hypothesis function generates the same outputs as the reference when given the same inputs, alongside other metrics such as 'reference match' and the 'BLEU' score.
The Transcoder generated multiple translations using beam search decoding, and only the hypotheses with the highest log-probabilities were considered.
Doing so, Facebook was able to show that even though the best-performing version of the Transcoder did not yield functions that were exact replicas of the reference code, its translations had high computational accuracy.
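The intuition behind computational accuracy can be illustrated with a toy check: a translation counts as correct if it returns the same outputs as the reference for the same test inputs, even when the code itself looks different. The functions and test inputs below are hypothetical.

```python
# Toy illustration of the 'computational accuracy' idea.
def reference(n):                 # original function
    return sum(range(n + 1))

def hypothesis(n):                # candidate translation produced by the model
    return n * (n + 1) // 2

test_inputs = [0, 1, 5, 100]
computationally_accurate = all(reference(x) == hypothesis(x) for x in test_inputs)
print(computationally_accurate)   # True: different code, same behaviour
```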
Below are the results of unit tests using the computational accuracy metric with Beam 1, evaluated against baseline scores from expert-written rule-based systems. 'Beam 1' means greedy decoding: only the single most likely translation is generated and kept. This is done for speed; the paper also discusses larger beam search configurations, which I omit here to keep the clutter down.
Transcoder evaluation on sample code
According to the authors, the Transcoder exhibited superior translation capabilities compared to rule-based models and was able to convert between data types and find the equivalent data structures, methods and libraries across languages.
For example, in a C++ -> Java translation from the paper, the C++ source function takes a character pointer (char *str) as an argument and accesses it like an array within the code block. The Transcoder correctly interprets this as a string and uses string-specific methods in the translated version.
The same source code was tested again, this time changing the argument name from 'str' to 'arr', which led the Transcoder to map the Java equivalent to a character array (char[]) and consequently to use character-array-specific methods in the translated version. This kind of statistical inference on human-written code is a significant advantage over rule-based systems, and it is also less error-prone.
Far from perfect
Though the Transcoder has proven to be far superior to rule-based systems, it is far from perfect. The authors mention cases where translations failed, mainly due to compilation errors. Most of these were caused by overloaded methods or functions, the need for type casting, and the like. This could probably be addressed by training the model on enough code for it to learn how method calls differ across programming languages. Some examples of failed translations from the research paper are shown below.
Summary:
What we have seen from Facebook AI Research (FAIR) is a giant leap towards machine code translation. Though it may be in its nascent stages, there is definitely huge potential, which can translate into billions of dollars in savings, less human intervention and less susceptibility to error. Though far from perfect for now, these models can assist humans by doing most of the heavy lifting in code translation, and who knows, we may have the perfect transcoder in the near future.
Thanks for reading this article. I have tried to be as accurate as possible in my interpretation of the research paper. Looking forward to hearing from you. Cheers.
References
  1. Unsupervised Translation of Programming Languages
  2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  3. It’s COBOL all the way down
  4. BLEU Score
