Soft Forks, Silent License Changes, and Self-Promo: Etor Sees It All


Table of Links

Abstract and 1. Introduction

  2. Background and Related Work

  3. Study of Unethical Behavior in OSS

    3.1 RQ1: Types of unethical behavior

    3.2 RQ2: Affected software artifacts

  4. Methodology

    4.1 Modeling via SWRL rules

    4.2 Automatic detection of unethical behavior

  5. Evaluation

  6. Discussion and Implications

  7. Threats to Validity

  8. Conclusion and References

4.2 Automatic detection of unethical behavior

We designed Etor to automatically detect six types of unethical behavior. We excluded the other nine types for four reasons: (1) they involve artifacts (e.g., product names, software features) that are difficult to automatically isolate from other artifacts (i.e., “No opt-in or no option allowed”, “Privacy Violation”, “Naming confusion”, and “Offensive language”), (2) they require sophisticated analysis of configuration files, APIs, or source code (i.e., “Plagiarism”, “Depending on proprietary software”, and “Vulnerable code/API”), (3) their detection requires advanced natural language processing (i.e., “Closing issue/PR without explanation”, which requires automatically checking whether an explanation for closing the PR/issue exists), and (4) approaches for “License incompatibility” [52, 60, 82] already exist, so we exclude it to avoid reinventing the wheel.


Figure 3: Our ontology of unethical behavior in OSS projects


Figure 4: Overall architecture of Etor (GH denotes GitHub).


Overview of Etor. Figure 4 presents the overall architecture of our automatic detection tool, Etor. Etor detects unethical behavior at two levels: (1) the repository (denoted as repo), and (2) the GitHub issue/pull request (we denote an issue as issue and a pull request as PR). Given a repo or an issue/PR, and the type of unethical behavior eType to be checked, Etor relies on its set of SWRL rules for detection and outputs whether the given input violates eType. Apart from the GitHub attributes in Table 2, which can be obtained via the GitHub API, our SWRL rule reasoner uses two additional components: (1) a license detector that checks for licenses at the repository level, and (2) a code similarity checker that identifies similar code.


Supported types. Etor supports six types of unethical behavior. We include the SWRL rules for all supported types in the supplementary material. We next describe how Etor checks each supported type.

(S1) No attribution to the author in code. Etor checks whether an issue or a PR contains a Stack Overflow link to reference code, and whether the code snippet copied from Stack Overflow cites that link. Although stakeholders can copy reference code from many resources, Etor only checks for Stack Overflow links because (1) we learned from our study and from existing work [42] that contributors are required to give credit for code snippets copied from Stack Overflow, as they are protected by the CC-BY-SA Creative Commons license, and (2) supporting other online resources (e.g., GitHub links) would require automatically extracting the original reference code (which requires parsing Web pages of different formats) and identifying the appropriate license for the code snippet (which requires detecting the license for partial code, and is beyond the scope of this paper). Given an issue/PR, Etor checks if a comment b posted in the issue/PR by a stakeholder u1 contains a Stack Overflow link w (we use a regular expression to extract w). Etor reports a potential violation if: (1) u1 is not the owner of the Stack Overflow comment, (2) the code snippet from Stack Overflow is found in one of the files in the repository (F) with at least 10% similarity (copyright law permits the use of up to 10% of work without permission [20]), and (3) w is not found in F.
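The three S1 conditions can be illustrated with a minimal Python sketch. This is not Etor's implementation: the function names, the link pattern, and the line-based similarity measure are all assumptions made for illustration (the text does not specify how the 10% similarity is computed), and repository files are modeled as an in-memory mapping from path to content.

```python
import re

# Hypothetical pattern for Stack Overflow question/answer links (an assumption,
# not the regular expression used by Etor).
SO_LINK = re.compile(
    r"https?://(?:www\.)?stackoverflow\.com/(?:questions|q|a)/\d+(?:/[\w\-]+)*"
)


def extract_so_links(comment: str) -> list[str]:
    """Return every Stack Overflow link w found in a comment b."""
    return SO_LINK.findall(comment)


def line_similarity(snippet: str, file_text: str) -> float:
    """Fraction of non-blank snippet lines that also occur in the file.

    A simple stand-in for a real code similarity checker.
    """
    snippet_lines = [ln.strip() for ln in snippet.splitlines() if ln.strip()]
    if not snippet_lines:
        return 0.0
    file_lines = {ln.strip() for ln in file_text.splitlines()}
    return sum(ln in file_lines for ln in snippet_lines) / len(snippet_lines)


def is_potential_violation(link, commenter, so_author, snippet, repo_files,
                           threshold=0.10):
    """Apply the three S1 conditions to one link w and one copied snippet."""
    # (1) the commenter u1 must not be the owner of the Stack Overflow post
    if commenter == so_author:
        return False
    # (2) the snippet appears in some repository file with >= 10% similarity
    copied = any(line_similarity(snippet, text) >= threshold
                 for text in repo_files.values())
    # (3) the link w is not cited anywhere in the repository files F
    cited = any(link in text for text in repo_files.values())
    return copied and not cited
```

In this sketch, fetching the Stack Overflow post (its author and code snippet) is left to the caller; only the decision rule is shown.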


(S2) Soft forking. Given two repositories r1 and r2, Etor compares the contents of all source files in the two repositories to check if one repository is a soft fork of the other (i.e., it has the same content but is not listed as an official fork of the other repository). Specifically, we use AC2 [21] to detect the similarities between files. AC2 is a source code plagiarism detection tool that has been widely used by graders to detect plagiarism within a group of assignments. We select AC2 because (1) it supports many programming languages (e.g., C, C++, Java, and PHP), (2) it can be run in a local environment without a connection to remote servers, and (3) it is quite robust as it incorporates multiple algorithms found in the literature. Etor reports a violation if: (1) it detects 100% similarity between r1 and r2, and (2) r2 is not in the fork list of r1.

(S5) No license provided in public repository. Given a repository r, Etor detects the repo-level license by checking whether it exists in: (1) the LICENSE file [22] in the main directory of r (we check only the main directory to avoid mistakenly finding an API license or a package license), or (2) the README.md file with license information (we use the list of licenses provided by GitHub [23] for repo-level license detection). Etor reports a potential violation if no license is found after searching these two files.
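The decision rules for S2 and S5 can be sketched in a few lines of Python. Everything here is illustrative rather than Etor's code: the function names are hypothetical, exact hash equality of source files stands in for a 100% similarity report from AC2, the list of source suffixes and license names is a tiny assumed subset, and a repository is modeled as a mapping from repo-relative path to file content.

```python
import hashlib
from pathlib import PurePosixPath

# Assumed subset of source-file suffixes; AC2 supports more languages.
SOURCE_SUFFIXES = {".c", ".cpp", ".java", ".php"}


def fingerprint(files):
    """Sorted SHA-256 digests of all source files, ignoring their paths."""
    return sorted(
        hashlib.sha256(text.encode("utf-8")).hexdigest()
        for path, text in files.items()
        if PurePosixPath(path).suffix in SOURCE_SUFFIXES
    )


def is_soft_fork(r1_files, r2_files, r2_name, forks_of_r1):
    """S2: identical source content, but r2 is not an official fork of r1."""
    fp1, fp2 = fingerprint(r1_files), fingerprint(r2_files)
    # Require at least one source file so two empty repos are not flagged.
    return bool(fp1) and fp1 == fp2 and r2_name not in forks_of_r1


def has_repo_license(files, known_licenses=("MIT License", "Apache License")):
    """S5: look for a license in the main directory only."""
    for path, text in files.items():
        p = PurePosixPath(path)
        if len(p.parts) != 1:
            continue  # skip nested files such as vendored package licenses
        if p.stem.upper() == "LICENSE":
            return True  # LICENSE, LICENSE.md, LICENSE.txt, ...
        if p.name == "README.md" and any(
            name.lower() in text.lower() for name in known_licenses
        ):
            return True
    return False
```

A repository for which `has_repo_license` returns False would be reported as a potential S5 violation. Exact hashing only catches byte-identical copies, which is why a dedicated similarity tool is the more robust choice in practice.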


(S6) Uninformed license change. We consider a license change to be uninformed if (1) it is not announced in the CHANGELOG.md or (2) it is not made via a PR. Given a repository r, Etor checks if the repo-level license has been changed by: (1) extracting the commit list of the license file, and (2) checking if the commit changes include license updates. If the license file has been changed in more than one commit (we ignore the first commit as it is the initial license creation), Etor checks whether the changes have been announced by checking whether the CHANGELOG.md mentions license information. If license information is not found, Etor checks the PR count for the commit (pullRequestCountByCommit). If the count is less than one, Etor marks the change as a potential violation.
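The S6 check sequence can be written as a short Python sketch. It is an illustration under assumptions, not Etor's code: the function name is hypothetical, the commits are assumed to be already filtered to those that update the license (oldest first), "mentions license information" is approximated by a case-insensitive search for the word "license", and the PR counts are passed in as a plain dictionary standing in for pullRequestCountByCommit.

```python
def uninformed_license_changes(license_commits, changelog_text, pr_count_by_commit):
    """Return the commits flagged as potential uninformed license changes.

    license_commits: commit SHAs that update the license file, oldest first.
    changelog_text: content of CHANGELOG.md, or None if the file is absent.
    pr_count_by_commit: dict mapping a commit SHA to its associated PR count.
    """
    # Ignore the first commit: it is the initial license creation.
    changes = license_commits[1:]
    if not changes:
        return []
    # Announced in the changelog: treat the change as informed.
    if changelog_text and "license" in changelog_text.lower():
        return []
    # Otherwise flag every change that was not made through a PR.
    return [sha for sha in changes if pr_count_by_commit.get(sha, 0) < 1]
```

For example, with commits `["a1", "b2"]`, no changelog, and no PR recorded for `b2`, the sketch flags `["b2"]`.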


(S8) Self-promotion. We consider self-promotion to be the scenario where a contributor u opens a GitHub issue/PR whose content includes links to another GitHub repository in order to promote his or her own repository. Given an issue/PR for r1 as input, Etor first (1) checks that the issue/PR includes a link L to another repository r2, and (2) identifies the stakeholder u who opened the issue/PR. Then, it reports a violation if: (1) r1 is not r2, (2) u is not a contributor of r1 (i.e., u is an outsider for r1), and (3) u is a contributor of r2. To reduce false positives, Etor also checks whether L includes specific keywords that usually indicate that the contributor is sharing L for demonstration purposes (e.g., [DEMO]) rather than promoting a repository/library. These keywords are “/issues/”, “/pull/”, “/commit/”, “/tree/”, “/releases/”, “/blob/”, and “/runs/”.
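A minimal Python sketch of the S8 rule follows. The function name, the link pattern, and the contributor lookup (a dictionary from repository full name to contributor logins) are assumptions for illustration. The sketch also assumes the listed keywords correspond to GitHub URL path segments such as `/issues/`, and it does not handle the [DEMO] marker.

```python
import re

# Hypothetical pattern for links to GitHub repositories (owner/name plus
# an optional trailing path).
REPO_LINK = re.compile(r"https?://github\.com/([\w.\-]+)/([\w.\-]+)(/[^\s)]*)?")

# Path segments suggesting the link points at a specific artifact
# (issue, commit, file, ...) rather than promoting the repository itself.
NON_PROMO_KEYWORDS = ("/issues/", "/pull/", "/commit/", "/tree/",
                      "/releases/", "/blob/", "/runs/")


def is_self_promotion(body, opener, r1, contributors):
    """Check an issue/PR body opened by `opener` in repository r1.

    contributors: dict mapping "owner/name" to a set of contributor logins.
    """
    for match in REPO_LINK.finditer(body):
        link = match.group(0)
        r2 = f"{match.group(1)}/{match.group(2)}"
        # False-positive filter: skip links to issues, commits, files, etc.
        if any(keyword in link for keyword in NON_PROMO_KEYWORDS):
            continue
        if (
            r2 != r1                                      # (1) r1 is not r2
            and opener not in contributors.get(r1, set())  # (2) outsider for r1
            and opener in contributors.get(r2, set())      # (3) contributor of r2
        ):
            return True
    return False
```

The contributor sets would come from the GitHub API in practice; here they are supplied by the caller so that the rule itself stays easy to follow.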


Table 3: Number of issues detected and TP/FP rate


(S9) Unmaintained Android Project with Paid Service. This type checks whether an Android project offers a paid service on Google Play but has stopped actively maintaining its GitHub repository. On average, 115 APIs are updated per month [65], and 49% of app updates have at least one update within 47 days [67]. Based on this frequency of app updates, we define an unmaintained Android project to be an Android project whose latest update was released more than 0.5 year ago. Given a repository r as input, Etor first checks for unmaintained Android projects by examining whether (1) the latest release date (D) of r is more than 0.5 year in the past, and (2) r is an original repository (not forked from other repositories). Then, it checks whether the app offers a paid service by (1) identifying the Google Play link l from r, and (2) searching for the phrase “in-app purchase”.
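The two-stage S9 check can be sketched in Python as follows. This is an illustration under assumptions: the function names and the Google Play link pattern are hypothetical, the half-year criterion is read as "no release within the last 0.5 year" and approximated as 182 days, and the Google Play page text is passed in by the caller rather than fetched.

```python
import re
from datetime import date, timedelta

# Hypothetical pattern for a Google Play app link l found in the repository.
PLAY_LINK = re.compile(r"https?://play\.google\.com/store/apps/details\?id=[\w.]+")

HALF_YEAR = timedelta(days=182)  # approximation of 0.5 year


def is_unmaintained(latest_release, is_fork, today):
    """Original (non-fork) repo whose latest release is older than half a year."""
    return (not is_fork) and (today - latest_release) > HALF_YEAR


def offers_paid_service(repo_text, play_page_text):
    """The repo links to Google Play and the page mentions in-app purchases."""
    has_link = PLAY_LINK.search(repo_text) is not None
    return has_link and "in-app purchase" in play_page_text.lower()


def is_s9_violation(latest_release, is_fork, today, repo_text, play_page_text):
    """Combine both stages of the S9 check."""
    return is_unmaintained(latest_release, is_fork, today) and offers_paid_service(
        repo_text, play_page_text
    )
```

Passing `today` explicitly keeps the sketch deterministic; a real checker would use the current date and retrieve the release date through the GitHub API.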


Authors:

(1) Hsu Myat Win, Southern University of Science and Technology, China;

(2) Haibo Wang, Southern University of Science and Technology, China;

(3) Shin Hwei Tan, a corresponding author from Southern University of Science and Technology, China.


This paper is available on arxiv under CC BY 4.0 DEED license.

