
Etor Uncovers License Violations, Plagiarism, and More in Open-Source Projects


Table of Links

Abstract and 1. Introduction

  2. Background and Related Work

  3. Study of Unethical Behavior in OSS

    3.1 RQ1: Types of unethical behavior

    3.2 RQ2: Affected software artifacts

  4. Methodology

    4.1 Modeling via SWRL rules

    4.2 Automatic detection of unethical behavior

  5. Evaluation

  6. Discussion and Implications

  7. Threats to Validity

  8. Conclusion and References

5 EVALUATION

We applied Etor to 195,621 GitHub issues and PRs from 1,765 GitHub repositories to address the following research questions:


RQ3: How many unethical issues can Etor detect in OSS projects?


RQ4: What are the accuracy and efficiency of Etor in its detection?


By counting the number of unethical issues in OSS projects, RQ3 provides a rough estimate of the prevalence of each type of unethical behavior in OSS projects. For RQ4, we measure the accuracy and efficiency of Etor using the following metrics:


True Positive (TP): Etor classifies an unethical behavior as a potential violation, and it is a true violation.


False Positive (FP): Etor classifies a behavior as a potential violation, but it is not a true violation.


Time: The average time taken (in seconds) to detect a type of unethical behavior across all the evaluated repositories/issues.
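
The two accuracy metrics can be turned into rates with a few lines of arithmetic. The sketch below assumes that the TP and FP rates reported in this section are computed over the violations that Etor reports (TP + FP); the function name and the example counts are hypothetical and are not taken from Table 3.

```python
def detection_rates(tp: int, fp: int) -> tuple[float, float]:
    """Return (TP rate, FP rate) as percentages of all reported violations.

    Assumption: both rates share the denominator TP + FP, i.e. every
    violation reported by the tool is labeled either TP or FP.
    """
    reported = tp + fp
    if reported == 0:
        raise ValueError("no reported violations to evaluate")
    return 100.0 * tp / reported, 100.0 * fp / reported


# Hypothetical counts for one type of unethical behavior.
tp_rate, fp_rate = detection_rates(tp=37, fp=13)
print(f"TP rate: {tp_rate:.1f}%, FP rate: {fp_rate:.1f}%")
```

Under this reading, the FP rate is the share of reported violations that turn out to be false, which differs from the false positive rate used in classification settings where true negatives are also counted.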


Selection of projects/issues. As there is no prior benchmark for evaluating the detection of unethical behavior, we construct a dataset by crawling GitHub. Our goal is to select the most recent popular (most stars and most forks) OSS projects, together with their GitHub issues/PRs, for evaluation. We first obtain a list of the top 2,000 OSS projects created in 2021 (the top 1,000 projects with the greatest number of stars, followed by the top 1,000 projects with the greatest number of forks). After eliminating duplicated projects, there are 195,621 GitHub issues/PRs of 1,765 projects in our evaluation set. As soft forking requires two repositories as input, we obtain each pair of repositories (𝑟𝑒𝑝𝑜1, 𝑟𝑒𝑝𝑜2) by taking 𝑟𝑒𝑝𝑜1 from the top 200 projects (the first 100 by stars, the subsequent 100 by forks) in the initial list of 2,000 projects. For these 200 projects, our crawler automatically identifies 𝑟𝑒𝑝𝑜2 by searching GitHub for projects with similar names, using the name of 𝑟𝑒𝑝𝑜1 as the query. At this step, our crawler found that only 10 of the 200 projects have repositories with similar names. For each of these 10 projects, our crawler retrieves the first 10 repositories from the search results as 𝑟𝑒𝑝𝑜2, leading to a total of 10 × 10 = 100 projects for evaluating soft forking.
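
The selection procedure can be illustrated offline, without querying GitHub. The following is a minimal sketch: it assumes the two ranked lists and the name-search results have already been fetched, and the function names (`merge_rankings`, `soft_fork_pairs`) and repository names are hypothetical rather than part of Etor's crawler.

```python
from typing import Iterable


def merge_rankings(by_stars: Iterable[str], by_forks: Iterable[str]) -> list[str]:
    """Concatenate two ranked lists of repository names and drop duplicates,
    keeping the first occurrence of each name."""
    seen: set[str] = set()
    merged: list[str] = []
    for name in [*by_stars, *by_forks]:
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged


def soft_fork_pairs(
    search_results: dict[str, list[str]], per_repo: int = 10
) -> list[tuple[str, str]]:
    """Pair each repo1 with the first `per_repo` similarly named repositories
    returned by a name search (search_results maps repo1 -> ranked results)."""
    return [
        (repo1, repo2)
        for repo1, results in search_results.items()
        for repo2 in results[:per_repo]
    ]


# Hypothetical data: a repository appearing in both rankings is kept once.
projects = merge_rankings(["a/x", "b/y"], ["b/y", "c/z"])
print(projects)

# Hypothetical search results for two repo1 candidates.
pairs = soft_fork_pairs({"a/x": ["d/x", "e/x-clone"], "b/y": ["f/y"]})
print(pairs)
```

With 10 repositories that each have at least 10 similarly named search results, `soft_fork_pairs` yields the 10 × 10 = 100 pairs described above.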


Ethical considerations. Before getting feedback from stakeholders, we obtained IRB approval from our institution. As calling out stakeholders for unethical behavior could raise ethical concerns similar to those in prior work [81], we choose to evaluate Etor by (1) manually inspecting the reported issues, and (2) reporting only the types of unethical behavior with high accuracy (based on our manual analysis). To avoid violating ethical principles as in the “hypocrite commits” incident, we explicitly mentioned in each reported issue that we are researchers conducting research on mining software repositories. To reduce author bias in the manual classification of TPs/FPs, we ask a non-author to independently label each issue.


All experiments are conducted on a machine with an Intel(R) Core(TM) i7-7500 CPU @ 2.7 GHz and 16 GB RAM.


Implementation. We use Protégé 5.5.0 [71] to define the ontology model. Our crawler uses PyGitHub [24] for querying GitHub.

5.1 RQ3: Number of detected issues

Table 3 summarizes the results of our evaluation. The “Type” column denotes the types of unethical behavior detected by Etor, whereas the second column is of the form 𝑥 / 𝑦, where 𝑥 represents the number of repositories/issues with the unethical behavior detected and 𝑦 denotes the total number of repositories/issues in our evaluation dataset. Overall, Etor has successfully detected at least one violation for all types of unethical behavior that we studied. As our evaluation dataset is different from the study dataset, and we have observed occurrences of unethical behavior in both datasets, this indicates that different types of unethical behavior are prevalent in OSS projects. Table 3 also shows that “No license provided in public repository” is the most common type among the six types of detected issues. This means that a relatively high percentage of the evaluated repositories are missing license files (around 24% of the evaluated repositories if we exclude the false positives). For the issue-level detection, we observe that “No attribution to the author in code” and “Self-promotion” are the most common ones among all evaluated issues/PRs. This indicates that contributors of OSS projects tend to (1) forget to give credit to the author in their copied code snippets, or (2) promote their own repositories without mentioning that they are contributors to the repositories.

5.2 RQ4: Accuracy and Efficiency of Etor

Accuracy. To evaluate the effectiveness of Etor, two raters (one author, and one non-author who is an undergraduate CS student working as a part-time student assistant) independently inspect its output. Specifically, for each violation reported by Etor, each rater determines if the violation is a true positive (TP) or a false positive (FP). The initial Cohen’s Kappa was 0.82, which indicates a high level of agreement. The two raters then met to resolve all disagreements, reaching a Cohen’s Kappa of 1.0. The “True positive” and “False positive” columns in Table 3 show the results of the inspection. On average, the TP rate is 74.8%, and the FP rate is 24.8%. For repository-level detection, although Etor can only detect a small number of violations for “Soft forking”, it can detect these unethical issues with high accuracy (0% FP rate). As we consider a repository a soft fork only if all the contents of the two repositories are the same (100% similarity), this design decision may lead to fewer violations being found but increases the accuracy of the detection. In the future, it is worthwhile to study the effect of the similarity threshold on the accuracy of the detection. For issue-level detection, Etor can detect S1 with reasonable accuracy (26% FP rate).
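
For readers unfamiliar with the agreement statistic, Cohen's Kappa compares the observed agreement between two raters with the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below implements this standard formula for two lists of labels; the example labels are hypothetical and do not reproduce the raters' data.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's Kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: the raters disagree on one of ten reported violations.
rater_a = ["TP", "TP", "FP", "TP", "FP", "TP", "TP", "FP", "TP", "TP"]
rater_b = ["TP", "TP", "TP", "TP", "FP", "TP", "TP", "FP", "TP", "TP"]
print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```

Once all disagreements are resolved, the two label lists are identical and the statistic equals 1.0, as reported above.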


Efficiency. The “Time” column in Table 3 shows the average time taken to detect each type of unethical behavior. Overall, the average time to analyze a repository ranges from 3.1 to 343.1 seconds, and the average time to analyze an issue ranges from 4.3 to 5.4 seconds. This indicates that Etor can detect a type of unethical behavior relatively quickly. We also observe that “Soft forking” is the most time-consuming type to detect because Etor needs to check for code similarities across all source files within the repository.
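
To make the cost of the soft-forking check concrete, here is a minimal sketch of a whole-repository comparison with a configurable similarity threshold. This section does not describe how Etor measures similarity, so the per-file SHA-256 comparison, the in-memory `dict` representation of a repository, and the function names are all assumptions made for illustration.

```python
import hashlib


def content_similarity(repo1: dict[str, bytes], repo2: dict[str, bytes]) -> float:
    """Fraction of file paths (over the union of both repositories) whose
    contents are byte-identical. Each repository maps path -> file contents."""
    paths = repo1.keys() | repo2.keys()
    if not paths:
        return 0.0

    def digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    identical = sum(
        1
        for path in paths
        if path in repo1 and path in repo2 and digest(repo1[path]) == digest(repo2[path])
    )
    return identical / len(paths)


def is_soft_fork(repo1: dict[str, bytes], repo2: dict[str, bytes], threshold: float = 1.0) -> bool:
    """Flag a pair as a soft fork when similarity reaches the threshold.
    threshold=1.0 mirrors the 100% similarity criterion described above."""
    return content_similarity(repo1, repo2) >= threshold


# Hypothetical repositories: an exact copy, and a copy with an edited README.
original = {"main.py": b"print(1)\n", "README.md": b"hello"}
exact_copy = dict(original)
edited_copy = {**original, "README.md": b"changed"}
print(is_soft_fork(original, exact_copy))
print(content_similarity(original, edited_copy))
```

Every file in both repositories has to be read and hashed, which is consistent with this type being the slowest to detect. Lowering `threshold` would flag more pairs at the risk of more false positives.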


Reasons behind inaccurate detection. We manually inspect the reasons behind the FPs reported by Etor. Etor reports the highest FP rate for “Self-promotion”. Recall that Etor checks that a stakeholder St opens an issue/PR I at repository R1, and includes a link (L) to another repository (R2). A true “Self-promotion” only occurs if St did not mention being a contributor of R2. We need to manually verify this condition by reading the comments written in natural language. Hence, FPs may occur if (1) St mentioned that he or she is a contributor of R2 (e.g., “I am working on a project called the ...” in comment [25]) or (2) St wanted to ask for suggestions on using R1 for R2 (e.g., “I’d like to try your ... module in a nonmmdetection repo (...)” [31]).


There are three main reasons for FPs in “No attribution to the author in code”: (1) no actual copying occurs but a link exists (e.g., the Stack Overflow link was mentioned as a reference [33]), (2) Etor checks the exact link and fails to detect the citation if it uses the short link of Stack Overflow, and (3) Etor matches the exact GitHub user name with the Stack Overflow user name, and fails to detect the match if the user names differ (e.g., the GitHub user name is devinrhode2 and the Stack Overflow user name is Devin Rhode [30]). For “No license provided in public repository”, FPs occur because the repository (1) has a license file that is not in the main directory (e.g., LICENSE file in the inner folder [29]), (2) has a disclaimer in README.md (e.g., “This repository is for personal study and research purposes only. Please DO NOT USE IT FOR COMMERCIAL PURPOSES.” [34]), (3) is used for educational purposes (we need to manually exclude repositories for the public schools where the license is not required), (4) has no source code or data, or (5) is under an organization license and no separate license is defined for the repository [32]. For “Uninformed license change”, FPs occur because the scenario where the repository has restored the old license should not be considered a violation (e.g., the stakeholder changed the license from “Apache License Version 2.0” to “GNU GENERAL PUBLIC LICENSE Version 3” on Feb 17, and he/she restored it to “Apache License Version 2.0” on Feb 18). For “Unmaintained Android Project with Paid Service,” we found one FP because the unmaintained project is a library that an app uses rather than the app itself, and the app is actively maintained (a new version was recently released).
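
One of the FP causes for “No license provided in public repository”, a license file that sits outside the main directory, could in principle be reduced by searching the whole file tree. The following is a minimal sketch of such a check, not a description of Etor's implementation; the accepted file names, the suffix list, and the function name are assumptions chosen for illustration.

```python
from pathlib import PurePosixPath

# Assumed naming heuristics; real repositories use more variants than these.
LICENSE_STEMS = {"license", "licence", "copying", "unlicense"}
LICENSE_SUFFIXES = {"", ".txt", ".md", ".rst"}


def find_license_files(paths: list[str]) -> list[str]:
    """Return every path whose file name looks like a license file,
    regardless of how deeply it is nested in the repository."""
    matches = []
    for path in paths:
        name = PurePosixPath(path)
        if name.stem.lower() in LICENSE_STEMS and name.suffix.lower() in LICENSE_SUFFIXES:
            matches.append(path)
    return matches


# Hypothetical file listing with a license in an inner folder.
files = ["README.md", "src/main.py", "pkg/core/LICENSE"]
print(find_license_files(files))
```

A check like this addresses only cause (1); disclaimers in README.md or organization-level licenses would still need separate handling.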


Stakeholders’ feedback. Apart from manually labeling the unethical issues, we also obtained qualitative feedback by reporting them to stakeholders of OSS projects. To avoid spamming OSS developers with inaccurate results, we only reported the types of unethical behavior with >=80% TP rate in our manual analysis (i.e., S2, S5, S6). For each of these reported types, we opened a GitHub issue for the developers when both raters labeled it as TP. We excluded 39 issues because the project owners have disabled GitHub issues (this usually indicates that they do not accept contributions or bug reports [26]). For example, the repository [27] violates the “No license provided in public repository” rule but we cannot report this to the project owner as GitHub issues have been disabled. We also found 19 issues that had already been reported and fixed before we filed a report. In total, we have reported 392 issues, and received 83 replies (a response rate of 21.17%) from stakeholders. We carefully looked through all those responses and identified 68 (81.93%) replies as valid and 15 (18.07%) responses as invalid. Among these 68 valid replies, 39 (57.35%) have been fixed, and 29 (42.65%) have been accepted by the stakeholders of the OSS. An example of valid feedback that we received is “Thank you very much for the warning. I have already added the license to the repos that didn’t have it.” For the 15 responses that we considered invalid, developers (1) directly deleted or closed our submitted issues without any explanation (7/15), (2) thought that the issue reporter was a software bot although we created the issue manually and explicitly mentioned in the issue that we are researchers (5/15), (3) were not interested in getting any GitHub issues (e.g., claiming that the repository is personal) (2/15), or (4) explained that “Software is not open source but everyone or you can use my soft without license. thank you for support my soft” (1/15).
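
The percentages in the feedback breakdown follow directly from the reported counts, and the short check below recomputes them. All numbers are taken from the paragraph above; only the helper name `pct` is introduced here.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


reported, replies = 392, 83      # issues reported, replies received
valid, invalid = 68, 15          # classification of the replies
fixed, accepted = 39, 29         # outcome of the valid replies
invalid_reasons = [7, 5, 2, 1]   # the four categories of invalid responses

# The sub-counts should add up to their totals.
assert valid + invalid == replies
assert fixed + accepted == valid
assert sum(invalid_reasons) == invalid

print(pct(replies, reported))    # response rate
print(pct(valid, replies), pct(invalid, replies))
print(pct(fixed, valid), pct(accepted, valid))
```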

6 DISCUSSION AND IMPLICATIONS

We discuss practical takeaways and suggestions below:


Implications for stakeholders of OSS projects. By reading issues that stakeholders in OSS considered as “unethical behavior”, our study revealed that the types of unethical behavior in OSS projects are diverse (Finding 1), suggesting that stakeholders of OSS projects should be better educated about the different types of unethical behavior so that they avoid violating ethical principles when contributing to OSS projects. Apart from general types of unethical behavior, our study also pinpoints six new types of unethical behavior in OSS projects (i.e., (S2) Soft forking, (S6) Uninformed license change, (S8) Self-promotion, (S9) Unmaintained project with paid service, (S11) Naming confusion, and (S12) Closing issue/PR without explanation). Some of them are related to the unique features of GitHub (e.g., “Soft forking” represents ethical concerns when forking, “Closing issue/PR without explanation” is related to closing GitHub issues/PRs, “Self-promotion” occurs due to the need to promote the popularity of one’s new repository, whereas “Unmaintained project with paid service” denotes the responsibility of an OSS project owner to actively maintain the project to support paid users). The identified new types call for consideration of the unique context of OSS projects when studying unethical behavior. Meanwhile, although most software development efforts focus on source code maintenance, our study urges OSS project owners to take responsibility for the selection of product names to avoid “Naming confusion”. As issues related to copyright and licensing are the most common ones (Finding 2), contributors of OSS projects should pay more attention to giving appropriate credit and selecting suitable software licenses when copying software artifacts or using libraries. Meanwhile, although source code is still the most commonly affected software artifact (Finding 4), our study urges OSS stakeholders to be responsible for various types of software artifacts (Finding 3) to avoid violating ethical principles when uploading them to GitHub.


Implications for researchers and tool designers. As many of the identified types of unethical behavior (Finding 1) are ethical issues that frequently occur in daily life, our study provides empirical evidence that there is some overlap between the types that occur in general settings (e.g., “Plagiarism” and “Offensive language”) and those that are deemed unethical behavior by stakeholders in OSS projects. Indeed, the prevalence of plagiarism is in line with a prior study that reported the prevalence of code borrowing practices in GitHub [55]. Due to the diverse types of unethical behavior, future empirical research should advance beyond the general types of unethical behavior.


While existing work mostly focuses on license incompatibility [52, 60, 82], our study found new issues related to licensing, e.g., “Uninformed license change”. As these issues still occur frequently (Finding 2) and our study identified new types of issues, our study provides empirical motivation for improving techniques related to copyright and software licenses. For the newly identified types of unethical behavior, we foresee great potential for future research in: (1) conducting more in-depth studies of the motivations and the common solutions behind each type of unethical behavior, and (2) introducing automated techniques that can detect and possibly resolve these issues. We believe that our taxonomy of 316 GitHub issues and our tool, which uses software artifacts and data available through the GitHub API, lay the foundation for future approaches to the automated detection of unethical behavior. Although source code is still the most common type of affected software artifact (Finding 4), other artifacts in natural language (e.g., PR/issue comments, product names, and websites) are also common in our study (Finding 3). A promising research direction is to apply natural language processing techniques to accurately detect affected software artifacts in natural language. For example, techniques can be designed to automatically extract and recommend descriptive yet non-conflicting names (e.g., package names) to avoid “Naming confusion”. Another future direction is to design techniques that can automatically identify disclaimer-like statements to accurately detect “Closing issue/PR without explanation” (to detect the explanation for the PR/issue), and “Self-promotion” (to extract statements where the stakeholder has mentioned being a contributor).


Challenges in automated detection of unethical behavior. To provide guidelines for future research on the automated detection of unethical behavior, we discuss several challenges identified in our study and evaluation:


• As shown in our study in Section 3.2, the types of artifacts affected by unethical behavior are highly diverse. An accurate detection technique needs to support the analysis of various types of artifacts, including source code, data, and websites.


• Within GitHub, we notice that discussions and announcements are spread across multiple web pages (issues, PRs, wikis, discussions, and commit logs). The rapid growth of different types of web pages in GitHub poses additional challenges for automated approaches to exhaustively analyze all relevant web pages.


• Some discussions of unethical behavior occur outside of GitHub (e.g., personal emails, Slack channels). For example, for “Self-promotion”, we cannot check whether the stakeholder has communicated with the developers in advance through emails. Without complete information about the discussion, the detection is bound to be inaccurate.


• The scope for the detection can be too broad for some types of unethical behavior (e.g., “Naming confusion”). Without a predefined scope of detection (package name collision versus app name collision), we cannot accurately detect the behavior.


• There exist ambiguities for certain unethical behavior, which make it difficult even for human beings to reach consensus (e.g., whether the language used is offensive). In this case, an automated tool can present all relevant information to help stakeholders make more grounded ethical decisions.


Authors:

(1) Hsu Myat Win, Southern University of Science and Technology, China;

(2) Haibo Wang, Southern University of Science and Technology, China;

(3) Shin Hwei Tan, a corresponding author from Southern University of Science and Technology, China.


This paper is available on arXiv under the CC BY 4.0 DEED license.

