GitHub Copilot in Practice: Empirical Insights into User Experiences and Practical Challenges

Written by textmodels | Published 2024/03/04
Tech Story Tags: github-copilot | ai-code | ai-code-generation | can-ai-code | software-development | ai-applications | copilot-usage-challenges | github-copilot-user-experience

TLDR An empirical study investigates the issues developers encounter during real-world usage of GitHub Copilot, together with their underlying causes and potential solutions. By systematically analyzing data from GitHub Issues, GitHub Discussions, and Stack Overflow posts, the study provides insights into the challenges developers face and offers potential solutions to enhance the usability of Copilot.

Authors:

(1) Xiyu Zhou, School of Computer Science, Wuhan University, Wuhan, China;

(2) Peng Liang, School of Computer Science, Wuhan University, Wuhan, China;

(3) Zengyang Li, School of Computer Science, Central China Normal University, Wuhan, China;

(4) Aakash Ahmad, School of Computing and Communications, Lancaster University Leipzig, Leipzig, Germany;

(5) Mojtaba Shahin, School of Computing Technologies, RMIT University, Melbourne, Australia;

(6) Muhammad Waseem, Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland.

II. METHODOLOGY

The goal of this study is to systematically identify the issues that developers encountered in the practical application of GitHub Copilot, as well as their underlying causes and potential solutions. We formulated three RQs to direct the subsequent phases of the methodology, as shown in Fig. 1. The RQs and their rationales are detailed in Section II-A.

A. Research Questions

RQ1: What are the issues faced by users while using Copilot in software development practice? Rationale: Copilot is a relatively new product, and little is known about the specific challenges and issues users face while using it in software development practice. By identifying these issues, this study helps build a better understanding of the obstacles that arise when AI code generation tools like Copilot are used in practical situations, from the perspective of software developers.

RQ2: What are the underlying causes of these issues? Rationale: Understanding the underlying causes of the issues identified in RQ1 is essential for developing effective solutions to address them. By identifying these causes, the study can provide insights into how to improve the design and functionality of Copilot.

RQ3: What are the potential solutions to address these issues? Rationale: RQ3 aims to identify potential solutions to the issues identified in RQ1. By exploring the existing practices and methods that developers use to tackle these issues, the study can gain insights into solutions that ultimately enhance the functionality and usability of Copilot for users.

B. Data Collection

We collected data from three sources: GitHub Issues [1], GitHub Discussions [2], and Stack Overflow (SO) posts [3]. GitHub Issues is a commonly used feature on GitHub for tracking bugs, feature requests, and other issues related to software development projects, which allows us to capture the specific problems and difficulties that users have encountered when coding with Copilot. GitHub Discussions, on the other hand, is a newer feature for more open-ended discussions among project contributors and community members, offering a central hub for project-related discussions and knowledge sharing. Discussion topics range from technical questions and proposals to broader topics related to Copilot. Stack Overflow is a popular technology community that provides a Q&A platform covering a wide range of programming, development, and technical questions, including Copilot-related questions.

Considering that Copilot was announced and started its technical preview on June 29, 2021, we chose to collect data created after that date. The data collection was conducted on June 18, 2023. To assist us in addressing RQ2 and RQ3, we collected closed GitHub issues and answered GitHub discussions, which contain known causes and solutions. For SO posts, we found that the number of Copilot-related posts was relatively low; hence, we retrieved all potentially relevant posts, including unanswered ones, to obtain a more comprehensive dataset. For GitHub Issues, we used “Copilot” as the keyword to search closed Copilot-related issues across all of GitHub, and a total of 4,057 issues were retrieved. We also employed “Copilot” as the keyword to search SO, resulting in 679 retrieved posts. Unlike GitHub Issues and SO posts, GitHub Discussions related to a specific product are grouped under a dedicated subcategory, and “Copilot” is a subcategory under the “Product” category. Given the high relevance of these discussions to Copilot, we collected all answered discussions under the “Copilot” subcategory as part of the data source for our study. In total, 925 discussions were obtained.
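The paper does not describe the scripts used for this retrieval. As a rough illustration only, the sketch below shows one plausible way to issue the same keyword search programmatically, assuming the public GitHub search REST API and the Stack Exchange API; the endpoints, query qualifiers, and pagination are our assumptions rather than the authors' actual tooling. GitHub Discussions under the “Copilot” subcategory are not exposed through the REST search API and would require the GraphQL API or manual export.

```python
import requests

GITHUB_SEARCH = "https://api.github.com/search/issues"
SO_SEARCH = "https://api.stackexchange.com/2.3/search/advanced"

def search_github_issues(page: int = 1) -> dict:
    """Closed issues anywhere on GitHub mentioning 'Copilot',
    created after the technical preview date (2021-06-29)."""
    params = {
        "q": "Copilot is:issue is:closed created:>2021-06-29",
        "per_page": 100,  # maximum page size for the search API
        "page": page,
    }
    return requests.get(GITHUB_SEARCH, params=params, timeout=30).json()

def search_so_posts(page: int = 1) -> dict:
    """Stack Overflow questions mentioning 'Copilot' (answered or not)."""
    params = {
        "q": "Copilot",
        "site": "stackoverflow",
        "fromdate": 1624924800,  # 2021-06-29 00:00:00 UTC
        "pagesize": 100,
        "page": page,
    }
    return requests.get(SO_SEARCH, params=params, timeout=30).json()

if __name__ == "__main__":
    issues = search_github_issues()
    posts = search_so_posts()
    print(issues.get("total_count"), "GitHub issues matched")
    print(len(posts.get("items", [])), "SO posts on the first page")
```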

C. Data Filtering

To ensure that the data can be used to answer our RQs, we defined a filtering criterion: an issue, discussion, or post must contain specific information related to the use of Copilot. Following this criterion, we labelled the collected data and filtered out items that could not be used for this study.

1) Pilot Data labelling: To minimize personal bias in the formal labelling process, the first and third authors conducted a pilot data labelling. For GitHub issues and discussions, we randomly selected 100 and 25, respectively, which constitute 2.5% of the total count. Due to the smaller quantity of SO posts, we randomly selected 35, which constitutes 5% of the total posts. Selecting a certain proportion of data from each platform allowed us to verify whether the criteria applied by the two authors were consistent across platforms. The consistency of the labelling was measured by Cohen’s Kappa coefficient [14], resulting in values of 0.824, 0.834, and 0.806 for the three platforms, which indicate a reasonable level of agreement between the two authors (a minimal sketch of this agreement computation is shown after this list). Labelling results on which the two authors disagreed were discussed with the second author to reach a consensus.

2) Data labelling: The first and third authors then conducted the formal data labelling. During this process, we excluded a large amount of data unrelated to our research. For instance, in some issues, “Copilot” carries other meanings, such as the co-pilot of an aircraft. Additionally, Copilot might be mentioned only in passing without further information, e.g., “You can try using Copilot, which is amazing”; we also excluded such items since they could not provide useful information about the use of Copilot. During the labelling process, any result on which the two authors disagreed was discussed with the second author until an agreement was reached. Ultimately, the two authors retained 476 GitHub issues, 706 GitHub discussions, and 184 SO posts for data extraction.
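As a minimal illustration of the agreement measurement mentioned above, the sketch below computes Cohen’s Kappa for two annotators’ pilot labels using scikit-learn. The label values are made up for illustration and are not the study’s actual labelling data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical pilot labels: True means the item "contains specific
# information related to the use of Copilot" (the filtering criterion).
# These values are illustrative only, not the study's data.
author1 = [True, True, False, True, False, True, True, False, True, False]
author3 = [True, True, False, True, True,  True, True, False, True, False]

kappa = cohen_kappa_score(author1, author3)
print(f"Cohen's Kappa: {kappa:.3f}")  # values around 0.8 or above indicate strong agreement
```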

D. Data Extraction and Analysis

1) Extract Data: To answer the three RQs presented in Section II-A, we established a set of data items for data extraction, as presented in Table I. Data items D1-D3 are intended to extract the information on issues, underlying causes, and potential solutions from the filtered data, to answer RQ1-RQ3, respectively.

The first author conducted a pilot data extraction with the third author on 20 randomly selected GitHub issues, 20 discussions, and 20 SO posts; in case of any discrepancies, the second author was involved to reach a consensus. Based on the observations from this pilot, we established the following standards for data extraction: (1) If the same issue was identified by multiple users, we recorded it only once. (2) If multiple problems were identified within the same issue, discussion, or post, we recorded each one separately. (3) For an issue with multiple candidate causes, we recorded only the cause confirmed by the issue contributor or the Copilot team as the root cause. (4) For an issue with multiple suggested solutions, we recorded only the solutions confirmed by the issue contributor or the Copilot team to actually solve the issue.

The first and third authors conducted data extraction from the filtered issues, discussions, and posts based on the data items, and then discussed any inconsistencies with the second author and reached a consensus, ensuring that the data extraction process adhered to the predetermined criteria. Each extracted data item was reviewed multiple times by the three authors to ensure accuracy. The final data extraction results were compiled and recorded in MS Excel [13].

2) Analyze Data: To answer the three RQs in Section II-A, we conducted data analysis using the Constant Comparison method [12]. The specific steps are as follows: 1) The first author carefully reviewed the data obtained during the data extraction phase and assigned a code to each data item. These codes are descriptive summaries of the data, aimed at capturing the underlying themes. For instance, the issue in Discussion #10598 was coded as “Stopped Giving Inline Suggestions”. 2) The first author compared different codes to identify patterns, commonalities, and distinctions among them. Through this iterative comparison, similar codes coalesced into higher-level types and categories. For example, the code for Discussion #10598, along with other similar codes, fell under the type of FUNCTIONALITY USAGE ISSUE, which in turn belongs to the category of Usage Issue. Finally, the first author engaged in discussions with the second and third authors to reach a consensus on the taxonomies of issues, causes, and solutions, which are presented in Section III.
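To make the coding hierarchy concrete, the sketch below represents the code-to-type-to-category structure described above as a simple nested mapping, using only the example given in the text (Discussion #10598); every other entry in the full taxonomy would be filled in the same way, and the helper function is purely illustrative.

```python
# Nested mapping: category -> type -> list of codes.
# Only the example from the text is filled in; the rest of the taxonomy is elided.
taxonomy = {
    "Usage Issue": {                              # category
        "FUNCTIONALITY USAGE ISSUE": [            # type
            "Stopped Giving Inline Suggestions",  # code assigned to Discussion #10598
        ],
    },
}

def codes_in_category(category: str) -> list[str]:
    """Flatten all codes grouped under one category."""
    return [code
            for codes in taxonomy.get(category, {}).values()
            for code in codes]

print(codes_in_category("Usage Issue"))
# ['Stopped Giving Inline Suggestions']
```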


[1] https://docs.github.com/en/issues

[2] https://github.com/orgs/community/discussions/categories/copilot

[3] https://stackoverflow.com/

This paper is available on arxiv under CC 4.0 license.


Written by textmodels | We publish the best academic papers on rule-based techniques, LLMs, & the generation of text that resembles human text.
Published by HackerNoon on 2024/03/04