Table of Links

Abstract and 1 Introduction
1.1 The twincode platform
1.2 Pilot Studies
1.3 Other Gender Identities and 1.4 Structure of the Paper
2 Related Work
3 Original Study (Seville Dec, 2021) and 3.1 Participants
3.2 Experiment Execution
3.3 Factors (Independent Variables)
3.4 Response Variables (Dependent Variables)
3.5 Confounding Variables
3.6 Data Analysis
4 First Replication (Berkeley May, 2022)
4.1 Participants
4.2 Experiment Execution
4.3 Data Analysis
5 Discussion and Threats to Validity and 5.1 Operationalization of the Cause Construct – Treatment
5.2 Operationalization of the Effect Construct – Metrics
5.3 Sampling the Population – Participants
6 Conclusions and Future Work
6.1 Replication in Different Cultural Background
6.2 Using Chatbots as Partners and AI-based Utterance Coding
Datasets, Compliance with Ethical Standards, Acknowledgements, and References
A. Questionnaire #1 and #2 response items
B. Evolution of the twincode User Interface
C. User Interface of tag-a-chat

A Questionnaire #1 and #2 response items

This section lists the response items of the scales used in questionnaires #1 and #2. The scales were analyzed for internal consistency using the data collected during the pilot studies. The results of those analyses, consisting of Pearson correlations, Cronbach’s α, and principal-component scree plots, are also reported [57], together with whether any response items were dropped as a result.

A.1 Response items for perceived productivity scale (pp)

All the items in this questionnaire section, entitled “Solo programming or pair programming?”, are 0–10 numerical response items in which 0 means “programming solo”, 5 means “the same in both cases”, and 10 means “programming in pairs”.

pp1 (more productive): Regarding the programming exercises you just did, how do you think you would have been more productive, programming solo or programming with the partner assigned to you?

pp2 (better program quality): Regarding the programming exercises you just did, how do you think you would have achieved a better program quality, programming solo or programming with the partner assigned to you?

pp3 (more reliable): Regarding the programming exercises you just did, how do you think you would have developed a more reliable program, i.e., a program more likely to run without failures, programming solo or programming with the partner assigned to you?

pp4 (enjoyed more): Regarding the programming exercises you just did, how do you think you would have enjoyed more, programming solo or programming with the partner assigned to you?

As shown in Figure 12, all the items presented high Pearson correlations with one another, with Cronbach’s α = 0.83, and the scree plot confirmed that the scale was unidimensional according to the Kaiser criterion. As a result, all the items were kept after the reliability analysis on the data from the pilot studies.
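The reliability checks named above can be outlined in a few lines of Python. The sketch below is illustrative only and is not the analysis code used in the study: it assumes NumPy, the function names cronbach_alpha and kaiser_factor_count are hypothetical, and the response matrix is made-up data rather than data from the pilot studies. It computes Cronbach’s α from the item variances and the variance of the total score, and counts the eigenvalues of the item correlation matrix that exceed 1, which is the Kaiser criterion.

```python
import numpy as np


def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                               # number of items
    item_variances = scores.var(axis=0, ddof=1)       # sample variance per item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of summed score
    return k / (k - 1) * (1.0 - item_variances.sum() / total_variance)


def kaiser_factor_count(scores: np.ndarray) -> int:
    """Count eigenvalues of the item correlation matrix greater than 1."""
    corr = np.corrcoef(np.asarray(scores, dtype=float), rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)            # corr is symmetric
    return int((eigenvalues > 1.0).sum())


# Made-up 0-10 responses (rows: respondents, columns: pp1..pp4).
# These values are invented for illustration only.
responses = np.array([
    [7, 8, 7, 9],
    [3, 2, 4, 3],
    [5, 5, 6, 5],
    [9, 8, 9, 10],
    [2, 3, 2, 1],
    [6, 7, 6, 6],
])

alpha = cronbach_alpha(responses)
n_factors = kaiser_factor_count(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
print(f"Factors retained by the Kaiser criterion: {n_factors}")
```

A single retained factor is what the text refers to as a unidimensional scale; the same two functions could be rerun on a reduced item set to see how dropping an item changes α.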
A.2 Response items for partner’s perceived technical competency (pptc)

All the items in this questionnaire section, entitled “My partner or me?”, are 0–10 numerical response items in which 0 means “me”, 5 means “both equally”, and 10 means “my partner”.

pptc1 (knowledge and technical skills): During the programming exercises you just did, who do you think had more knowledge and technical skills, you or the partner assigned to you?

pptc2 (cooperative): During the programming exercises you just did, who do you think has been more cooperative, you or the partner assigned to you?

pptc3 (faster pace at solving the exercises): During the programming exercises you just did, who do you think has had a faster pace at solving the exercises, you or the partner assigned to you?

pptc4 (led more to the solutions): During the programming exercises you just did, who do you think has led more to the solutions, you or the partner assigned to you?

As shown in Figure 13, the initial version of the scale used in the pilot studies included a fifth item, pptc5, which asked whether the assigned partner had been condescending. That item presented low correlations with the rest of the items in the scale, and the scree plot indicated two factors. After removing that uncorrelated item, Cronbach’s α increased from 0.73 to 0.85, and the scree plot indicated only one factor, as shown in Figure 14.

A.3 Response item for partner’s perceived positive and negative aspects (ppa and pna)

The only item in this questionnaire section, entitled “Describe your partner”, is a free-text field in which subjects are instructed to describe the most positive and most negative aspects of the partner assigned to them in the programming exercises they just did, marking each positive aspect with a “+” sign and each negative aspect with a “-” sign in front of it.
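The “+” and “-” convention described above lends itself to simple mechanical splitting of a free-text answer into positive (ppa) and negative (pna) aspects. The following is a minimal sketch, not the procedure used in the study: the function name split_aspects is hypothetical, the sample answer is invented, and the code assumes that each aspect is written on its own line, which the questionnaire text does not state.

```python
from typing import List, Tuple


def split_aspects(free_text: str) -> Tuple[List[str], List[str]]:
    """Split a free-text answer into (positive, negative) aspect lists.

    Assumption: one aspect per line, prefixed with "+" or "-".
    Lines without either prefix are ignored.
    """
    positives: List[str] = []
    negatives: List[str] = []
    for raw_line in free_text.splitlines():
        line = raw_line.strip()
        if line.startswith("+"):
            target = positives
        elif line.startswith("-"):
            target = negatives
        else:
            continue                      # unmarked line: skip it
        aspect = line[1:].strip()
        if aspect:                        # skip a bare "+" or "-"
            target.append(aspect)
    return positives, negatives


# Invented answer, for illustration only.
answer = "+ explains ideas clearly\n- slow to reply\n+ friendly\nno sign here"
ppa, pna = split_aspects(answer)
print("positive aspects:", ppa)
print("negative aspects:", pna)
```

Answers that put several aspects on one line, or that use a different marker, would need a more tolerant parser or manual review.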
A.4 Response items for compared partners’ skills (cps)

All the items in this questionnaire section, entitled “First or second partner?”, are 0–10 numerical response items in which 0 means “first partner”, 5 means “both equally”, and 10 means “second partner”.

cps1 (more clear and constructive feedback): Comparing your assigned partners in sessions 1 and 3, who do you think provided more clear and constructive feedback, your first partner or your second partner?

cps2 (easier to communicate with): Comparing your assigned partners in sessions 1 and 3, who do you think was easier to communicate with, your first partner or your second partner?

cps3 (more knowledgeable about the subject material): Comparing your assigned partners in sessions 1 and 3, who do you think was more knowledgeable about the subject material, your first partner or your second partner?

cps4 (better project partner): Comparing your assigned partners in sessions 1 and 3, who do you think would be a better project partner, your first partner or your second partner?

cps5 (better teaching assistant): Comparing your assigned partners in sessions 1 and 3, who do you think would be a better teaching assistant, your first partner or your second partner?

As shown in Figure 15, all the items presented high Pearson correlations with one another, with Cronbach’s α = 0.88, and the scree plot confirmed that the scale was unidimensional according to the Kaiser criterion. As a result, all the items were kept after the reliability analysis on the data from the pilot studies.

B Evolution of the twincode User Interface

The twincode user interface used in the external replication at UC Berkeley is shown in Figures 16(a) and 16(b).

C User Interface of tag-a-chat

The user interface of the tag-a-chat tool used for collaboratively coding chat utterances is shown in Figure 17.
Authors:

(1) Amador Duran, I3US Institute, Universidad de Sevilla, Sevilla, Spain and SCORE Lab, Universidad de Sevilla, Sevilla, Spain (amador@us.es);
(2) Pablo Fernandez, I3US Institute, Universidad de Sevilla, Sevilla, Spain and SCORE Lab, Universidad de Sevilla, Sevilla, Spain (pablofm@us.es);
(3) Beatriz Bernardez, I3US Institute, Universidad de Sevilla, Sevilla, Spain and SCORE Lab, Universidad de Sevilla, Sevilla, Spain (beat@us.es);
(4) Nathaniel Weinman, Computer Science Division, University of California, Berkeley, Berkeley, USA (nweinman@berkeley.edu);
(5) Aslıhan Akalın, Computer Science Division, University of California, Berkeley, Berkeley, USA (asliakalin@berkeley.edu);
(6) Armando Fox, Computer Science Division, University of California, Berkeley, Berkeley, USA (fox@berkeley.edu).

This paper is available on arXiv under the CC BY 4.0 DEED license.