What 316 GitHub Issues Teach Us About the Dark Side of Open Source

Authors:

(1) Hsu Myat Win, Southern University of Science and Technology, China;

(2) Haibo Wang, Southern University of Science and Technology, China;

(3) Shin Hwei Tan, Southern University of Science and Technology, China (corresponding author).

ABSTRACT

Given the rapid growth of Open-Source Software (OSS) projects, ethical considerations are becoming more important. Past studies focused on specific ethical issues (e.g., gender bias and fairness in OSS), but few, if any, have examined the different types of unethical behavior in OSS projects. We present the first study of unethical behavior in OSS projects from the stakeholders’ perspective. Our study of 316 GitHub issues provides a taxonomy of 15 types of unethical behavior guided by six ethical principles (e.g., autonomy). Examples of new unethical behavior include soft forking (copying a repository without forking) and self-promotion (promoting a repository without self-identifying as a contributor to the repository). We also identify 18 types of software artifacts affected by the unethical behavior. The diverse types of unethical behavior identified in our study (1) call for the attention of developers and researchers when making contributions on GitHub, and (2) point to future research on automated detection of unethical behavior in OSS projects. Based on our study, we propose Etor, an approach that can automatically detect six types of unethical behavior by using ontological engineering and Semantic Web Rule Language (SWRL) rules to model GitHub attributes and software artifacts. Our evaluation on 195,621 GitHub issues (1,765 GitHub repositories) shows that Etor can automatically detect 548 instances of unethical behavior with a 74.8% average true positive rate. This shows the feasibility of automated detection of unethical behavior in OSS projects.

1 INTRODUCTION

With the increasing popularity of Open-Source Software (OSS) development, ethical considerations have become an important yet often neglected topic within the research community. For example, an incident in which researchers investigated the feasibility of stealthily introducing vulnerabilities in OSS by making hypocrite commits (commits that deliberately introduce critical bugs into code) provoked active discussion among the Linux community, researchers, and other OSS developers [81]. The Linux developers argued that making “hypocrite commits” is “not ethical” and wastes developers’ time in reviewing invalid patches [36]. More importantly, this incident revealed an attack on the basic premise of OSS itself (i.e., because anyone can contribute to the code, any OSS project is susceptible to a similar incident). Indeed, unethical behavior committed by OSS contributors might lead to broken trust between the OSS community and the contributor, whereas unethical software development might lead to loss of funding, reputation, or other resources of the OSS organization involved. Despite the importance of understanding how stakeholders (individuals who participate in or are interested in the OSS project, and can either affect or be affected by it) perceive unethical issues, most studies on unethical behavior in OSS projects focus on common types of unethical behavior, such as gender bias [59, 76], fairness in the code review process [53], and software licensing [64, 78, 79]. Few, if any, studies investigate the important question: “What kind of behavior is considered unethical by stakeholders in OSS projects?” Without understanding the definition of unethical behavior from the perspective of the stakeholders of OSS projects, incidents similar to the “hypocrite commits” experiments are bound to recur.

Prior studies stress the importance of considering ethical issues in OSS projects by using various examples and referring to ethical principles [54, 56, 73]. Unfortunately, one study revealed that instructing participants to consider the ACM code of ethics does not affect their ethical decision-making in software engineering tasks [68]. A similar argument has been made in AI ethics, which calls for practical methods to translate principles into practice [69]. Hence, we argue that it is not enough to merely observe the occurrence of unethical behavior via several examples in OSS projects. It is much more important to systematically study the characteristics of such behavior and to design practical tools that can automatically detect it and present evidence to stakeholders using data from OSS projects.

To bridge the gaps between general ethical principles and OSS practices, we present the first study of the types of unethical behavior in OSS projects from stakeholders’ perspectives. Specifically, our study aims to answer the research questions below:

(RQ1) How do stakeholders in OSS projects define unethical behavior, and what are the types of unethical behavior?

By referring to ethical principles, we study the diverse types of unethical behavior, their characteristics, and the corresponding ethical principles that underlie each type of unethical behavior in OSS projects.

(RQ2) Given the type of unethical behavior, what are the corresponding types of software artifacts that are deemed unethical by stakeholders of OSS projects?

For each type of unethical behavior, we study the affected software artifacts (i.e., artifacts that stakeholders claimed violate ethical principles) to guide our design of a tool that can automatically identify unethical behavior in OSS projects.

Our study leads to a taxonomy of 15 types of unethical behavior in OSS projects, including S1: No attribution to the author in code, S2: Soft forking, S3: Plagiarism, S4: License incompatibility, S5: No license provided in public repository, S6: Uninformed license change, S7: Depending on proprietary software, S8: Self-promotion, S9: Unmaintained project with paid service, S10: Vulnerable code/API, S11: Naming confusion, S12: Closing issue/PR without explanation, S13: Offensive language, S14: No opt-in or no option allowed, and S15: Privacy violation. Six of them (S2, S6, S8, S9, S11, and S12) have not been studied before. For example, our study discovered the unethical behavior of “S8: Self-promotion”, where a contributor 𝐶 deliberately opened many new pull requests (𝑃𝑅s) in several popular OSS projects, and the code of the 𝑃𝑅s depends on a newly released library 𝐿 to which 𝐶 is a contributor, without mentioning the conflict of interest (the fact that 𝐶 is promoting their own library) [1]. Another example is “S11: Naming confusion”, where a developer gives an artifact a name that is identical to an existing name, even though stakeholders are responsible for selecting unique names.
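The S8 description above can be read as a simple condition over pull-request data: the author of a PR contributes to a library that the PR newly depends on, and the PR text does not disclose that relationship. The following is a minimal Python sketch of that reading only; it is not Etor's rule. The `PullRequest` type, the `is_self_promotion` function, the keyword list in `disclosure_terms`, and the library name `fastlib` are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    author: str                    # login of the contributor C who opened the PR
    added_dependencies: frozenset  # libraries L that the PR newly depends on
    body: str                      # free-text PR description


def is_self_promotion(pr, library_contributors,
                      disclosure_terms=("i am the author", "i maintain",
                                        "my library", "disclosure")):
    """Return True if the PR adds a library its author contributes to
    and the description contains none of the (assumed) disclosure phrases."""
    body = pr.body.lower()
    disclosed = any(term in body for term in disclosure_terms)
    if disclosed:
        return False
    return any(pr.author in library_contributors.get(lib, set())
               for lib in pr.added_dependencies)


# Hypothetical data: alice contributes to fastlib, bob does not.
contributors = {"fastlib": {"alice"}}
undisclosed = PullRequest("alice", frozenset({"fastlib"}),
                          "Switch to fastlib for speed.")
disclosed = PullRequest("alice", frozenset({"fastlib"}),
                        "Disclosure: I maintain fastlib.")
unrelated = PullRequest("bob", frozenset({"fastlib"}),
                        "Switch to fastlib for speed.")

results = [is_self_promotion(pr, contributors)
           for pr in (undisclosed, disclosed, unrelated)]
```

Keyword matching is a crude stand-in for detecting disclosure; a real detector would need a more robust signal, and the contributor mapping would have to be obtained from repository data.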

Inspired by our study, we propose Ethic detector (Etor), an automatic detection tool based on ontological engineering (a description of entities and their properties, relationships, and behaviors) and Semantic Web Rule Language (SWRL) rules to model software artifacts in GitHub. In summary, we made the following contributions:

• Study. To the best of our knowledge, we conducted the first study of unethical behavior in OSS projects from the stakeholders’ perspective. Our study of 316 GitHub issues/PRs from 301 projects revealed 15 types of unethical behavior, including six new ones. Our study also revealed the diversity of the affected software artifacts. Our benchmark containing 316 issues with various types of unethical behavior lays the foundation for future automated approaches for detecting unethical behavior.

• Technique. We propose Etor, a novel ontology-based tool that automatically detects unethical behavior in OSS projects. We model GitHub attributes using ontologies, and design SWRL rules to check for unethical behavior in various artifacts.

• Evaluation. Our evaluation on 195,621 GitHub issues/PRs from 1,765 repos shows that Etor can automatically detect 548 issues with a 74.8% true positive rate on average.
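To give a rough sense of what rule-based detection over modeled GitHub attributes can look like, the sketch below stores facts as (subject, predicate, object) triples and applies one rule for “S5: No license provided in public repository”. It is written in plain Python rather than OWL/SWRL, and it does not reproduce Etor's ontology or rules: the predicate names (`type`, `visibility`, `hasLicense`), the repository identifiers, and the helper functions are assumptions made for illustration. The absence check is a closed-world simplification.

```python
# Facts as (subject, predicate, object) triples, loosely mirroring
# assertions in an ontology. All names here are hypothetical.
facts = {
    ("repo:a", "type", "Repository"),
    ("repo:a", "visibility", "public"),
    ("repo:a", "hasLicense", "MIT"),
    ("repo:b", "type", "Repository"),
    ("repo:b", "visibility", "public"),
    ("repo:c", "type", "Repository"),
    ("repo:c", "visibility", "private"),
}


def subjects(facts, predicate, obj):
    """All subjects s such that (s, predicate, obj) is a known fact."""
    return {s for (s, p, o) in facts if p == predicate and o == obj}


def has_property(facts, subject, predicate):
    """True if the subject has any value recorded for the predicate."""
    return any(s == subject and p == predicate for (s, p, _) in facts)


def detect_no_license(facts):
    """Rule sketch: Repository(r) and visibility(r, public)
    and no recorded hasLicense(r, _)  ->  flag r as S5."""
    repos = subjects(facts, "type", "Repository")
    public = subjects(facts, "visibility", "public")
    return sorted(r for r in repos & public
                  if not has_property(facts, r, "hasLicense"))


flagged = detect_no_license(facts)
```

Here only `repo:b` is flagged: `repo:a` has a license recorded and `repo:c` is private. Keeping facts separate from rules means additional behavior types can be added as new rule functions without changing how the data is modeled.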

This paper is available on arXiv under the CC BY 4.0 DEED license.