This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Domenico Cotroneo, University of Naples Federico II, Naples, Italy;
(2) Alessio Foggia, University of Naples Federico II, Naples, Italy;
(3) Cristina Improta, University of Naples Federico II, Naples, Italy;
(4) Pietro Liguori, University of Naples Federico II, Naples, Italy;
(5) Roberto Natella, University of Naples Federico II, Naples, Italy.
2. Motivating Example
To generate (offensive) code from natural language (NL), AI code generators are trained on corpora containing pairs of NL intents (inputs) and code snippets (outputs). These corpora are commonly split into training data, i.e., the data used to train the model; validation data, i.e., the data used to tune the model’s parameters; and test data, i.e., the data used to evaluate the model when it generates code from new NL descriptions (the NL intents in the test data never appear in the training and validation data).
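The split described above can be sketched as follows. This is a minimal illustration, not the procedure used by the authors: the function name `split_corpus`, the 80/10/10 proportions, and the fixed seed are all assumptions. Pairs are grouped by intent before shuffling, so that no NL intent in the test split also appears in the training or validation split.

```python
import random


def split_corpus(pairs, train_frac=0.8, val_frac=0.1, seed=0):
    """Split (intent, snippet) pairs into train/validation/test lists.

    Hypothetical helper: the fractions and the seed are illustrative defaults.
    """
    # Group snippets by intent so an intent can only land in one split.
    by_intent = {}
    for intent, snippet in pairs:
        by_intent.setdefault(intent, []).append(snippet)

    # Shuffle the distinct intents reproducibly.
    intents = sorted(by_intent)
    random.Random(seed).shuffle(intents)

    n_train = int(len(intents) * train_frac)
    n_val = int(len(intents) * val_frac)
    groups = (
        intents[:n_train],
        intents[n_train:n_train + n_val],
        intents[n_train + n_val:],
    )
    # Expand each group of intents back into (intent, snippet) pairs.
    return tuple(
        [(intent, snippet) for intent in group for snippet in by_intent[intent]]
        for group in groups
    )


# Placeholder corpus, used only to exercise the function.
corpus = [(f"intent {i}", f"snippet {i}") for i in range(10)]
train, validation, test = split_corpus(corpus)
```

Splitting at the level of intents, rather than individual pairs, is what keeps the test intents unseen when the same description occurs more than once in the corpus.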
The most practical way to assess the performance of NMT models in code generation is to compare, for every NL description in the test data (i.e., the input), the model’s prediction with the corresponding code snippet (i.e., the output) in the test set, which is taken as the ground truth for the evaluation. To this end, the state of the art provides a set of metrics that estimate the similarity between the code generated by NMT models and the code snippets in the test set. However, output similarity metrics cannot properly assess whether two pieces of code are textually different but semantically equivalent, i.e., whether they produce the same output and/or effects although they use different operations (e.g., jz label and je label are different assembly mnemonics that perform the same conditional jump).
For this reason, human evaluation is considered the gold standard for assessing the correctness of the code generated by the models [16]. By manually inspecting each of a model’s predictions, human evaluators assess whether the generated code is semantically correct, i.e., whether the output is an exact translation of the NL intent into the target programming language. Semantic correctness implies syntactic correctness, i.e., a code prediction that performs what is described in the NL intent must also adhere to the syntax rules of the target programming language. Human evaluation classifies the code as correct or incorrect by assigning a value of 1 or 0, respectively.
As a simple example, consider the intent “transfer EAX contents into EDX register”, which translates, on the 32-bit version of the x86 instruction set architecture (IA-32), to the assembly snippet:
mov EDX, EAX
An alternative way to copy the contents of one register into another is to push the value onto the stack and then pop it into the destination register. Therefore, a semantically equivalent implementation of this copy is the code:
push EAX
pop EDX
If the model generates this push/pop sequence for the intent, its prediction is both syntactically and semantically correct. Nevertheless, output similarity metrics cannot capture the equivalence between the two snippets, since they base their calculation on character and/or token similarity. As a result, this translation receives low scores[1] on several output similarity metrics widely used in the field (see § 4.3), such as BLEU-4 (0.11) and Edit Distance (0.31).
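To see why a character-based metric penalizes this prediction, the following sketch computes a normalized edit-distance similarity between the two snippets. The normalization (one minus the Levenshtein distance divided by the length of the longer string) and the newline used to join the two instructions are assumptions. The figures reported above depend on the metric implementation and preprocessing, so this sketch is not expected to reproduce them exactly.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1 means identical strings (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


REFERENCE = "mov EDX, EAX"
PREDICTION = "push EAX\npop EDX"  # newline separator is an assumption

score = edit_similarity(REFERENCE, PREDICTION)
print(f"edit similarity: {score:.2f}")
```

Under these assumptions the score stays well below 0.5, even though the prediction is a correct translation of the intent.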
The opposite occurs with the intent “clear the EDX register and move 5 in the lowest byte of the register”, which translates to the assembly snippet:
xor EDX, EDX
mov DL, 5
If the model generates the snippet:
xor EDX, EDX
mov BL, 5
then the prediction and the reference differ by a single character, yet the code does not accomplish the same task. Indeed, the lowest byte of EDX is the DL register, while BL is the lowest byte of EBX. Automatic metrics fail to account for situations like this: the Edit Distance score between these two pieces of code is 0.96 and the BLEU-4 score is 0.65, both of which are considered high values. In contrast, a human evaluator would correctly classify this snippet as semantically incorrect, since it does not perform the intended operation, even though it respects the syntax of the assembly language.
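The semantic judgments in both examples can be made concrete by executing the snippets and comparing the resulting register state. The sketch below is a toy interpreter written for this illustration only, not a method taken from the paper. It supports just `mov`, `xor`, `push`, and `pop` on four general-purpose registers and their low bytes. It ignores flags, ESP, and the memory written by `push`, and it compares snippets on two arbitrary initial states. Agreement on sample states is evidence of equivalence, not a proof. All names (`run`, `same_effect`, `SAMPLE_STATES`) are hypothetical.

```python
# Low-byte aliases of the 32-bit registers modeled here.
LOW_BYTE = {"AL": "EAX", "BL": "EBX", "CL": "ECX", "DL": "EDX"}

# Arbitrary initial register states used for the comparison (assumption).
SAMPLE_STATES = [
    {"EAX": 0x12345678, "EBX": 0x0BADF00D, "ECX": 0x0, "EDX": 0xDEADBEEF},
    {"EAX": 0xFFFFFFFF, "EBX": 0x00000001, "ECX": 0x7, "EDX": 0x000000AA},
]


def read(regs, operand):
    """Return the value of a register, a low-byte alias, or an immediate."""
    if operand in regs:
        return regs[operand]
    if operand in LOW_BYTE:
        return regs[LOW_BYTE[operand]] & 0xFF
    return int(operand, 0) & 0xFFFFFFFF


def write(regs, dest, value):
    """Store a value into a 32-bit register or into its low byte."""
    if dest in LOW_BYTE:
        parent = LOW_BYTE[dest]
        regs[parent] = (regs[parent] & 0xFFFFFF00) | (value & 0xFF)
    elif dest in regs:
        regs[dest] = value & 0xFFFFFFFF
    else:
        raise ValueError(f"unsupported destination: {dest}")


def run(snippet, initial):
    """Execute a snippet on a copy of `initial` and return the final registers."""
    regs, stack = dict(initial), []
    for line in snippet.strip().splitlines():
        mnemonic, _, rest = line.strip().partition(" ")
        ops = [op.strip().upper() for op in rest.split(",")]
        mnemonic = mnemonic.lower()
        if mnemonic == "mov":
            write(regs, ops[0], read(regs, ops[1]))
        elif mnemonic == "xor":
            write(regs, ops[0], read(regs, ops[0]) ^ read(regs, ops[1]))
        elif mnemonic == "push":
            stack.append(read(regs, ops[0]))
        elif mnemonic == "pop":
            write(regs, ops[0], stack.pop())
        else:
            raise ValueError(f"unsupported instruction: {mnemonic}")
    return regs


def same_effect(reference, prediction):
    """True if both snippets leave identical registers on every sample state."""
    return all(run(reference, s) == run(prediction, s) for s in SAMPLE_STATES)


print(same_effect("mov EDX, EAX", "push EAX\npop EDX"))
print(same_effect("xor EDX, EDX\nmov DL, 5", "xor EDX, EDX\nmov BL, 5"))
```

Under this simplified model, the `mov` and push/pop snippets leave the registers in the same state, while the DL/BL pair does not, which matches the verdict a human evaluator would give.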
However, since a human analyst needs to check the syntax and the semantics of every output generated by the models, human evaluation is often infeasible: the sheer amount of data to scrutinize makes the analysis time-consuming and error-prone.
[1] Scores of output similarity metrics range between 0 and 1.