Behind the Scenes of GitHub Copilot's Code Vulnerability Management

Authors:
(1) Vahid Majdinasab, Department of Computer and Software Engineering, Polytechnique Montreal, Canada;
(2) Michael Joshua Bishop, School of Mathematical and Computational Sciences, Massey University, New Zealand;
(3) Shawn Rasheed, Information & Communication Technology Group, UCOL - Te Pukenga, New Zealand;
(4) Arghavan Moradidakhel, Department of Computer and Software Engineering, Polytechnique Montreal, Canada;
(5) Amjed Tahir, School of Mathematical and Computational Sciences, Massey University, New Zealand;
(6) Foutse Khomh, Department of Computer and Software Engineering, Polytechnique Montreal, Canada.

Table of Links
Abstract and Introduction
Original Study
Replication Scope and Methodology
Results
Discussion
Related Work
Conclusion, Acknowledgments, and References

IV. RESULTS

The results are presented in Table III. The Rank column gives the rank of the CWE within MITRE’s top 25. For each CWE, we used up to three distinct scenarios. As elaborated in section III, and similar to the study of Pearce et al. [14], the scenarios are drawn from three sources: the examples and documentation in CodeQL’s repository, the examples for each CWE in MITRE’s database, and scenarios designed by the authors. The Orig. column in Table III denotes the source of each scenario.

To evaluate Copilot’s suggestions, we employed either CodeQL or manual inspection. The Marker column in Table III indicates how we assessed Copilot’s suggestions for the specific scenario. #Vd. indicates the number of Copilot’s suggestions that remain after eliminating duplicate solutions and solutions with syntax errors. #Vln indicates the number of Copilot’s suggestions with vulnerability issues, while TNV? indicates whether the first suggestion provided by Copilot contains no vulnerability issues. If Copilot’s initial suggestion is secure, it is denoted as Yes.
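The bookkeeping behind the #Vd., #Vln, and TNV? columns can be sketched in a few lines of Python. This is only an illustration of the definitions above, not the authors' tooling: the `Suggestion` record, the `summarize` and `vulnerable_ratio` helpers, and the use of exact string matching to detect duplicates are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Suggestion:
    """One Copilot suggestion (hypothetical record, not from the paper)."""
    code: str
    has_syntax_error: bool
    is_vulnerable: bool  # verdict from CodeQL or manual inspection


def summarize(suggestions: List[Suggestion]) -> Tuple[int, int, str]:
    """Return (#Vd., #Vln, TNV?) for one scenario.

    Assumption: duplicates are detected by exact string equality.
    """
    seen = set()
    valid = []
    for s in suggestions:
        # Drop suggestions with syntax errors and repeated solutions.
        if s.has_syntax_error or s.code in seen:
            continue
        seen.add(s.code)
        valid.append(s)

    n_valid = len(valid)                          # #Vd.
    n_vuln = sum(s.is_vulnerable for s in valid)  # #Vln

    # TNV? looks only at the first suggestion of the first iteration.
    if not suggestions:
        tnv = "N/A"
    else:
        tnv = "No" if suggestions[0].is_vulnerable else "Yes"
    return n_valid, n_vuln, tnv


def vulnerable_ratio(n_vuln: int, n_valid: int) -> float:
    """Share of valid suggestions that are vulnerable (0.0 if none are valid)."""
    return n_vuln / n_valid if n_valid else 0.0


if __name__ == "__main__":
    demo = [
        Suggestion("a", False, False),
        Suggestion("a", False, False),  # duplicate, dropped
        Suggestion("b", True, True),    # syntax error, dropped
        Suggestion("c", False, True),
    ]
    print(summarize(demo))
```

The ratio helper is the quantity that a per-scenario percentage plot would show; how the authors actually deduplicated suggestions is not stated in this section.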
Because Copilot displays a random number of suggestions, as discussed in section III, we collected up to 55 of its suggestions across multiple iterations. Given that the first suggestion of the initial iteration is the first solution Copilot presents to the developer, we compute TNV? from the first suggestion of the first iteration for each scenario.

Another limitation we encountered was the lack of confidence scores for solutions within Copilot’s setup. Even though we set (ShowScore) to True in our Copilot configuration, Copilot did not display a confidence score for each solution. Because of this constraint, we are unable to include this metric in our experimental results.

Fig. 2 shows the percentage of vulnerable code suggestions for each CWE scenario. We present below the results for each of the twelve CWEs we investigated in this study:
• CWE 79: Cross-Site Scripting. The code includes user input from the URL when rendering templates. Compared with the original study, Copilot has improved in its ability to generate code without this vulnerability, as there are no vulnerable suggestions for 79-0. We observe vulnerable code for scenario 79-1; however, it was not present in the top-scoring programs, similar to the results of the original study.
• CWE 20: Improper Input Validation. The code does not verify the destination URL provided by the user before redirecting or processing. In the original study, Copilot handled this vulnerability well, as it generated only one vulnerable suggestion for CWE 20-0 and none for CWE 20-1. Here, Copilot generated no vulnerable code for either scenario, which indicates an improvement in its handling of these cases.
• CWE 78: OS Command Injection. The code builds a “ping” system command with unsanitized user input, possibly allowing extra system commands to be run.
In the original study, 15 of the 23 codes generated for scenario CWE 78-2 were deemed vulnerable. However, vulnerable code was not present in the top-scoring program. In our replication, Copilot still generated code with this vulnerability, although at a lower ratio (10 out of 22). In contrast to the original study, vulnerable code was observed within the top-scoring programs. Therefore, we can conclude that Copilot still requires fine-tuning against this type of vulnerability. An example of a top vulnerable suggestion for CWE 78 is shown in Fig. ??.
• CWE 89: SQL Injection. The code inserts unsanitized user input from the URL into a SQL query. For each of the three SQL injection scenarios designed in the original study, more than half of Copilot’s generated codes contained vulnerabilities. In our replication, more than half of Copilot’s generated codes were again vulnerable, except for scenario 89-0, which is designed to “check for whether a user’s email exists in the database, and if so unsubscribe” and for which no vulnerabilities were detected. However, no vulnerable code was suggested in the top-scoring programs for any of the three scenarios, in contrast to the original study, where vulnerable code was suggested in the top-scoring programs for scenario 89-1.
• CWE 22: Unauthorized Path Traversal. The code uses an unsanitized user-defined path to access a file, possibly allowing unauthorized access to an out-of-scope file. In the original study, Copilot generated vulnerable code for both CWE 22-1 and 22-2. Furthermore, all of the codes generated for CWE 22-2 (unzip the tar file to /tmp/unpack) were vulnerable, and the top-scoring programs contained vulnerabilities for both scenarios.
However, we now report that Copilot’s top-scoring programs do not contain vulnerable code for either scenario, and no vulnerabilities were detected for CWE 22-2, which indicates an improvement in addressing this specific vulnerability.
• CWE 434: Unrestricted Upload of File with Dangerous Type. The code stores user-provided files on the server. Copilot still generates vulnerable code for all three scenarios. The ratio of vulnerable codes is lower for scenarios 434-0 and 434-1 but higher for scenario 434-2. Furthermore, unlike in the original study, Copilot did not generate vulnerable code for scenario 434-1 in its top-scoring programs; however, it still does so for scenario 434-0 (Fig. 4). Therefore, even though the ratio of vulnerable codes is lower, Copilot requires more fine-tuning against this type of vulnerability.
• CWE 306: Missing Authentication for Critical Function. The web server does not require re-authentication of a user before performing critical functions. Copilot generates vulnerable solutions for all three scenarios. We see an increase in the ratio of vulnerable solutions for CWE 306-2, alongside vulnerable code in its top-scoring programs (see Fig. 5 for an example). In contrast, in the original study no vulnerabilities were detected in the top-scoring programs for any of the three scenarios. This shows that Copilot still requires more fine-tuning for this type of vulnerability.
• CWE 502: Deserialization of Untrusted Data. The code attempts to read unsanitized YAML data. In our replication, we observe that Copilot’s ability to generate non-vulnerable code for this CWE has worsened. Not only is the ratio of vulnerable codes higher, but for scenario 502-2 Copilot now also generates vulnerable code in its top-scoring program (shown in Fig. 6), which it did not do in the previous study.
Therefore, we can conclude that Copilot requires more fine-tuning against this vulnerability as well.
• CWE 798: Use of Hard-Coded Credentials. The code contains credentials for a default login. In the previous study, Copilot generated vulnerable code for all three scenarios, with roughly half of the generated codes being vulnerable for 798-0 and 798-2. Furthermore, vulnerable code was detected in Copilot’s top-scoring programs for scenario 798-0. In our replication, Copilot demonstrates a significant improvement for all three scenarios designed for CWE 798, as none of its generated solutions contained vulnerabilities.
• CWE 200: Exposure of Sensitive Information to an Unauthorized Actor. The code renders a web page with extra, possibly sensitive, fields from a requested database entry. Our results show that Copilot has improved in handling this vulnerability as well. No vulnerable codes exist for scenarios 200-1 and 200-2 (in contrast to the original study). It should be noted that for CWE 200-1 most of the generated solutions either contained syntax errors or were similar to each other, leaving us with only 1 unique and compilable solution to analyze. For scenario 200-2 (log in the user if the username and password are correct), Copilot still generates vulnerable code, as it did in the previous study, at roughly the same ratio.
• CWE 522: Insufficiently Protected Credentials. The code uses a weak hash to store passwords. We observe a significant improvement in Copilot’s ability to generate non-vulnerable code for this scenario. In the original study, vulnerabilities were detected in most of the generated codes (except for 522-2, where only 12 percent of solutions were vulnerable). In our replication, we detected no vulnerabilities for 522-0 and 522-1, and only 3 of the generated codes contained vulnerabilities for 522-2. It should be noted that in the original study vulnerabilities were detected in the top-scoring programs for all scenarios.
However, we do not observe any vulnerabilities in the top-scoring programs in our replication.
• CWE 732: Incorrect Permission on Critical Resource. The code saves data in a world read/writable file. We observed an improvement for this scenario as well: more than 33% of the generated codes were vulnerable in the previous study, whereas the share of vulnerable codes has dropped to three percent in our replication.

This paper is available on arXiv under CC 4.0 license.
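To make the CWE 78 scenario concrete, the following sketch contrasts the vulnerable pattern described above (a "ping" command assembled from unsanitized input) with one common mitigation. It is not taken from Copilot's output or from the paper's figures. The function names, the hostname pattern, and the Unix-style `-c 1` flag are assumptions chosen for illustration.

```python
import ipaddress
import re
import subprocess
from typing import List

# Conservative hostname pattern (an assumption for this sketch): letters,
# digits, dots and hyphens only, starting and ending with a letter or digit.
_HOSTNAME_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9.-]{0,251}[A-Za-z0-9])?")


def build_ping_command_unsafe(host: str) -> str:
    """Vulnerable pattern: user input is concatenated into a shell string.

    If this string is run through a shell, input such as
    "example.com; cat /etc/passwd" runs a second command.
    """
    return "ping -c 1 " + host


def build_ping_argv(host: str) -> List[str]:
    """Safer pattern: validate the input and return an argument vector."""
    try:
        ipaddress.ip_address(host)  # accepts IPv4 and IPv6 literals
    except ValueError:
        if not _HOSTNAME_RE.fullmatch(host):
            raise ValueError(f"refusing suspicious host: {host!r}") from None
    return ["ping", "-c", "1", host]


def ping(host: str) -> int:
    """Run ping without a shell, so metacharacters are never interpreted."""
    completed = subprocess.run(
        build_ping_argv(host), shell=False, capture_output=True, timeout=10
    )
    return completed.returncode


if __name__ == "__main__":
    print(build_ping_command_unsafe("example.com; cat /etc/passwd"))
    print(build_ping_argv("127.0.0.1"))
```

Passing a list to `subprocess.run` with `shell=False` means the operating system receives the host as a single argument. The validation step additionally rejects values beginning with a hyphen that `ping` might otherwise read as an option.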
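As an illustration of the CWE 89 discussion, the sketch below implements the task quoted for scenario 89-0 (check whether a user's email exists and, if so, unsubscribe) twice: once with string formatting and once with a parameterized query. It uses Python's standard `sqlite3` module. The `subscribers` table, the function names, and the demo data are hypothetical and do not come from the study's scenarios.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical subscribers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE subscribers (email TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO subscribers (email) VALUES (?)",
        [("a@example.com",), ("b@example.com",)],
    )
    conn.commit()
    return conn


def count_subscribers(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM subscribers").fetchone()[0]


def unsubscribe_unsafe(conn: sqlite3.Connection, email: str) -> None:
    """Vulnerable pattern: user input is formatted into the SQL text."""
    query = "DELETE FROM subscribers WHERE email = '%s'" % email
    conn.execute(query)
    conn.commit()


def unsubscribe(conn: sqlite3.Connection, email: str) -> bool:
    """Safer pattern: the email is bound as a parameter, never parsed as SQL."""
    row = conn.execute(
        "SELECT 1 FROM subscribers WHERE email = ?", (email,)
    ).fetchone()
    if row is None:
        return False  # unknown address, nothing to do
    conn.execute("DELETE FROM subscribers WHERE email = ?", (email,))
    conn.commit()
    return True


if __name__ == "__main__":
    payload = "' OR '1'='1"

    safe_db = make_demo_db()
    print(unsubscribe(safe_db, payload), count_subscribers(safe_db))

    unsafe_db = make_demo_db()
    unsubscribe_unsafe(unsafe_db, payload)
    print(count_subscribers(unsafe_db))
```

With the parameterized version, the payload is compared as a literal string and matches no row. With the formatted version, the same payload rewrites the `WHERE` clause into a condition that holds for every row.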