Foundation Models Are Reshaping How Developers Code Together

Written by textmodels | Published 2025/11/13
Tech Story Tags: ai-pair-programming | ai-coding-assistants | multi-turn-ai-conversations | devgpt | ai-assisted-coding | chatgpt-prompt-analysis | developer-productivity-tools | chatgpt-developer-study

TL;DR: The study explores how developers use foundation model–powered tools like ChatGPT during open-source collaboration, revealing that shared conversations can enhance collective innovation. Findings highlight gaps in current AI benchmarks, showing that nearly half of code generation prompts contain partial code and many involve multi-turn dialogues. These insights inform better benchmark design, improved prompt-engineering strategies, and the creation of FM tools tailored to diverse developer roles and real-world workflows.

Abstract

1 Introduction

2 Data Collection

3 RQ1: What types of software engineering inquiries do developers present to ChatGPT in the initial prompt?

4 RQ2: How do developers present their inquiries to ChatGPT in multi-turn conversations?

5 RQ3: What are the characteristics of the sharing behavior?

6 Discussions

7 Threats to Validity

8 Related Work

9 Conclusion and Future Work

References

Discussions

Implications for Designing and Investigating FM-powered SE collaboration tools. The most important finding from our study is that developers do share their conversations with ChatGPT while contributing to open-source projects. This insight opens a new perspective for researchers and FM practitioners assessing the role and influence of FM-powered software development tools, such as ChatGPT, in collaborative coding. It underscores the potential of these tools not only to assist individual developers but also to enhance the collective productivity and innovation of open-source communities. Furthermore, our study provides several taxonomies that researchers can use to characterize developers’ interactions with ChatGPT and other FM-powered software development tools. For instance, the taxonomy and annotated prompts from RQ1 can be leveraged to develop a learning-based approach that automatically identifies tasks of interest and analyzes the corresponding response quality. Tool designers can also use our reported frequency of software engineering tasks to prioritize improvements to their tools. The answers to RQ3 reveal how developers in different roles use shared ChatGPT conversations during collaborative coding, which can inform the design of FM-powered tools tailored to those roles.
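As an illustration of such a learning-based approach, one could start from a simple supervised text classifier trained on the annotated prompts. The snippet below is a minimal sketch, not the method used in the paper; it assumes a hypothetical annotated_prompts.csv file with prompt and category columns drawn from the RQ1 taxonomy.

```python
# Minimal sketch: classify developer prompts into RQ1 taxonomy categories.
# Assumes a hypothetical annotated_prompts.csv with "prompt" and "category"
# columns (e.g., "code generation", "issue resolving", "code review").
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("annotated_prompts.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["prompt"], df["category"],
    test_size=0.2, random_state=42, stratify=df["category"])

# TF-IDF features plus a linear classifier: a reasonable first baseline
# before trying FM-based or fine-tuned classifiers.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```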

Implications for Benchmarking FMs for SE Tasks. Our findings from RQ1 shed light on future benchmark designs for evaluating the impact of FMs on different types of software engineering tasks. In RQ1, we find multiple types of input for code generation and issue-resolving inquiries, but these types are not fully captured by existing benchmarks. For instance, the widely recognized code generation benchmark HumanEval (Chen et al., 2021) relies on textual specifications and method signatures.

Yet, our analysis shows that nearly half of the code generation prompts (47%) include initial code drafts alongside textual descriptions. Similarly, our examination of prompts categorized under (C4) Issue resolving indicates that a significant portion (36%) of issue-resolution requests involves sharing error messages or execution traces, often without accompanying source code. We therefore recommend that researchers take these findings into account when designing future benchmarks.
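To make this concrete, a benchmark reflecting these observations would need to represent each task with richer inputs than a docstring plus a signature. The dataclass below is a hypothetical sketch of such a schema; the field names are our own and do not correspond to any existing benchmark format.

```python
# Hypothetical benchmark-instance schema reflecting the observed prompt
# types: textual specifications may be accompanied by partial code drafts
# (47% of code generation prompts) or by error messages / execution traces
# without source code (36% of issue-resolving prompts).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SEBenchmarkInstance:
    task_type: str                      # e.g. "code_generation", "issue_resolving"
    description: str                    # natural-language specification
    partial_code: Optional[str] = None  # initial code draft, if any
    error_output: Optional[str] = None  # error message or execution trace
    signature: Optional[str] = None     # method signature (HumanEval-style)
    tests: list[str] = field(default_factory=list)  # hidden evaluation tests
```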

Our observation that developers frequently rely on multi-turn conversations also motivates future evaluations of FMs that allow multi-turn interactions; currently, only a few studies support multi-turn code generation (Wang et al., 2024; Nijkamp et al., 2022). Last but not least, we observed many tasks beyond code generation and issue resolution, such as code review, conceptual questions, and documentation, which are rarely considered as benchmark tasks for FM-powered software development tools.
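A multi-turn evaluation harness could, for example, let the model revise its answer after seeing test failures, mirroring the iterative conversations we observed. The loop below is a minimal sketch under the assumption of a hypothetical chat(messages) function wrapping an FM API and a run_tests(code) function returning failure output or None.

```python
# Minimal multi-turn evaluation loop: feed test failures back to the model
# as follow-up turns, up to a fixed budget. chat() and run_tests() are
# hypothetical stand-ins for an FM API wrapper and a sandboxed test runner.
def evaluate_multi_turn(instance, chat, run_tests, max_turns=3):
    messages = [{"role": "user", "content": instance.description}]
    for turn in range(max_turns):
        code = chat(messages)       # model's proposed solution for this turn
        failure = run_tests(code)   # None means all hidden tests pass
        if failure is None:
            return {"solved": True, "turns": turn + 1}
        # Follow-up turn: share the failing output, as developers often do.
        messages.append({"role": "assistant", "content": code})
        messages.append({
            "role": "user",
            "content": f"The code fails with:\n{failure}\nPlease fix it."})
    return {"solved": False, "turns": max_turns}
```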

Implications for Prompt Engineering. The findings from RQ2 highlight the frequent use of multi-turn strategies to iteratively improve ChatGPT’s solutions. The flow chart in Figure 5 illustrates the diverse approaches developers employ in these interactions. This finding motivates future investigation into the effectiveness of developers’ prompting techniques within multi-turn conversations. Specifically, determining whether prompt-engineering best practices are being applied, and whether improved prompts can effectively alter the flow of these interactions, is a promising direction for enhancing the utility and effectiveness of FM-powered tools in software development.
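For instance, one testable hypothesis is that follow-up prompts carrying concrete context (failing input, expected output, constraints) steer the conversation more effectively than generic retry requests. The pair of follow-up prompts below is a purely illustrative, hypothetical example of the contrast such studies could measure.

```python
# Two hypothetical follow-up prompts for the same failing solution.
# Prompt-engineering studies could compare how each alters the conversation
# flow from Figure 5 and the number of turns needed to reach a working fix.
VAGUE_FOLLOW_UP = "That didn't work, please fix it."

SPECIFIC_FOLLOW_UP = """\
The function raises KeyError on input {"name": "a"} because the "id"
key is missing. Expected behavior: return None for records without an
"id". Keep the signature unchanged and add a unit test for this case."""
```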

Authors

  1. Huizi Hao
  2. Kazi Amit Hasan
  3. Hong Qin
  4. Marcos Macedo
  5. Yuan Tian
  6. Steven H. H. Ding
  7. Ahmed E. Hassan

This paper is available on arXiv under the CC BY-NC-SA 4.0 license.
