3 RESEARCH DESIGN
We designed this study as an observational study [1] measuring information diffusion in code review at Spotify. The measurement is not an end in itself but serves as the foundation for our hypothesis test: A single empirical code review system with no or only marginal information diffusion (Spotify's code review system, for example) could not be aligned with the existing theory of code review as a communication network in general, and the theory as it stands would be falsified (reductio ad absurdum).
The theory would then have to be revised or reformulated more precisely (e.g., by adding limitations, constraints, or conditions). In the next subsection, we state our hypotheses and discuss how we reject them qualitatively rather than through classical statistical tests.
3.1 Hypotheses
If code review is a communication network that enables the exchange of information (theory T), as identified by different exploratory studies [8], then information spreads substantially in code review
• between code review participants (hypothesis H1) and
• between software components (hypothesis H2) and
• between teams (hypothesis H3).
We can formulate this sentence as the propositional statement
T \implies (H_1 \land H_2 \land H_3). \qquad (1)
That means our theory T can be falsified in its universality if one of our hypotheses cannot withstand an empirical measurement. Instead of defining arbitrary thresholds for rejecting our hypotheses, we propose a qualitative rejection criterion: we will reject our hypotheses based on a comprehensive discussion of the observations of information diffusion in code review at Spotify.
As for any observational study, the measurement model, the measuring system, and the actual measurement define the quality of the study. Therefore, we present our measurement model, measuring system, and actual measurement in detail in the next subsections, following the definitions in the International Vocabulary of Metrology [13].
3.2 Measurement model
A measurement model is the mathematical relation among all quantities known to be involved in a measurement. In this section, we describe the three approaches to quantifying information diffusion in code review, which are the foundation for the qualitative rejection of our hypotheses.
We use a code review network to model information diffusion in code review. We define a code review network, in its verbatim meaning, as a network of code reviews whose nodes represent code reviews and whose links indicate a reference between code reviews, explicitly and manually added by code review participants. We argue that this explicit and manual referencing by code review participants is a strong indicator of actual information exchange from one code review to another. This assumption allows us to measure information diffusion without analyzing the specific information that was exchanged and its context.
Mathematically, we model those code review networks as a directed graph G = (C, R) where
• C is a set of vertices representing code reviews and
• R is a set of edges, which are ordered pairs of vertices representing the references between code reviews:
R ⊆ {(x, y) | (x, y) ∈ C² and x ≠ y}.
The direction of those edges represents the reference: the directed edge (x, y) represents a code review x referencing code review y.
Figure 2 depicts such a simple and small code review network with five code reviews linked to each other.
The relative number of linked code reviews is the first approach to quantifying information diffusion in code review and, therefore, the first input for our discussion on its significance.
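To make the model and this first metric concrete, here is a minimal sketch, assuming Python with networkx and hypothetical code review identifiers (not necessarily the network shown in Figure 2):

```python
import networkx as nx

# Code reviews (vertices) and the manually added references between them (directed edges):
# an edge (x, y) means code review x references code review y.
code_reviews = ["CR-1", "CR-2", "CR-3", "CR-4", "CR-5"]              # hypothetical identifiers
references = [("CR-2", "CR-1"), ("CR-3", "CR-1"), ("CR-4", "CR-3")]

G = nx.DiGraph()
G.add_nodes_from(code_reviews)
G.add_edges_from(references)

# Relative number of linked code reviews: the share of code reviews with at least
# one incoming or outgoing reference.
linked = [c for c in G.nodes if G.degree(c) > 0]
print(f"{len(linked) / G.number_of_nodes():.0%} of code reviews are linked")  # 80%
```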
In a second approach, we approximate information diffusion in code review by measuring the similarity (or dissimilarity) of code review participants, software architecture, or organizational structure in linked code reviews: The more dissimilar the set of participants, affected code components, or involved teams of the linked code reviews, the broader the information spread in code review is.
Therefore, we enhance each code review with further information for each hypothesis (a minimal sketch follows this list):
• p1 : C → {participants}, where a code review is mapped to its participants, addressing H1
• p2 : C → {components}, where a code review is mapped to the affected components, addressing H2
• p3 : C → {teams}, where a code review is mapped to the owning teams of the affected components, addressing H3
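A minimal sketch of these enhancements, assuming Python with hypothetical participants, components, and teams; in the measuring system (Section 3.3), this data comes from GitHub and Backstage:

```python
# Hypothetical enhancements of the five code reviews from the sketch above.
p1 = {"CR-1": {"alice", "bob"}, "CR-2": {"bob"}, "CR-3": {"carol"},
      "CR-4": {"alice", "dave"}, "CR-5": {"erin"}}                        # participants (H1)
p2 = {"CR-1": {"player/ui"}, "CR-2": {"player/ui"}, "CR-3": {"search"},
      "CR-4": {"search", "player/ui"}, "CR-5": {"payments"}}              # affected components (H2)
p3 = {"CR-1": {"team-player"}, "CR-2": {"team-player"}, "CR-3": {"team-search"},
      "CR-4": {"team-search", "team-player"}, "CR-5": {"team-payments"}}  # owning teams (H3)
```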
Through those enhancements, we gain insights into information diffusion along three orthogonal dimensions: a social dimension, where information diffuses between code review participants; a software architectural dimension, where information diffuses between the software components under review; and an organizational dimension, where information diffuses between teams. Those orthogonal dimensions allow us to investigate information diffusion from different angles: information may spread between components but never leave the team boundaries because both components are owned by the same team.
After enhancing the code reviews, we apply two different similarity measures, depending on the type of enhancement, to make the linked code reviews comparable along the three dimensions (a minimal sketch of both measures follows this list):
• Code review participants and teams are sets. We apply the Jaccard index to quantify the similarity between two sets. The Jaccard index (or Jaccard similarity coefficient) for two sets A and B is defined by
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \qquad (2)
• For the tree-like component structure, set-based operations fall short. Instead, we use the graph edit distance, which is a measure of similarity (or dissimilarity) between two component graphs [10]. The graph edit distance finds the minimal set of edit operations (insertion, deletion, substitution), in terms of cost, needed to transform one graph into another. Mathematically, we define the graph edit distance as
GED(G_1, G_2) = \min_{(e_1, \ldots, e_k) \in P(G_1, G_2)} \sum_{i=1}^{k} c(e_i) \qquad (3)
where P(G1, G2) denotes the set of edit paths transforming G1 into (a graph isomorphic to) G2 and c(e) ≥ 0 is the cost of each graph edit operation e.
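A minimal sketch of both similarity measures, assuming Python with networkx. The normalization of the graph edit distance to [0, 1] is an assumption here (dividing by the worst case of deleting one graph entirely and inserting the other), since the text does not fix a particular normalization scheme:

```python
import networkx as nx

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity coefficient of two sets (Equation 2)."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

def normalized_ged_similarity(g1: nx.Graph, g2: nx.Graph) -> float:
    """Similarity derived from the graph edit distance (Equation 3), normalized to [0, 1].

    Assumed normalization: unit-cost GED divided by the worst case of deleting
    all nodes and edges of g1 and inserting all nodes and edges of g2.
    """
    ged = nx.graph_edit_distance(g1, g2)  # minimal-cost edit path, unit costs by default
    worst_case = (g1.number_of_nodes() + g1.number_of_edges()
                  + g2.number_of_nodes() + g2.number_of_edges())
    return 1.0 - ged / worst_case if worst_case else 1.0
```

Since computing the exact graph edit distance is NP-hard, an approximation (e.g., networkx's optimize_graph_edit_distance) may be necessary for larger component graphs.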
Both similarity measures are normalized to [0, 1]: the smaller the similarity measure, the more dissimilar the set of participants, affected code components, or involved teams of the linked code reviews, and the broader the information spread. The distribution of those similarities will indicate to what extent information spreads across the boundaries mentioned before.
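The following sketch computes these per-link similarity distributions for the hypothetical network and enhancements above (the architectural dimension would analogously compare the component structures of linked code reviews with normalized_ged_similarity):

```python
# Similarity of the linked code reviews along the social and organizational dimensions.
participant_similarities = [jaccard(p1[x], p1[y]) for x, y in G.edges]
team_similarities = [jaccard(p3[x], p3[y]) for x, y in G.edges]
```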
In Figure 3, we exemplify how we will use the similarity measures in the discussion on falsifying the theory by three possible archetypes of cumulative distributions of all three similarity measures and their relation to the theory test. Aside from the two quantitative approaches, we also plan to include a visual approach: the ownership of code components allows us to cluster components per owning team, providing a more intuitive, human-comprehensible perspective.
Figure 4 uses a circular graph layout of the components grouped by the owning teams. The components are linked via the code review network G = (C, R). We hope this visualization helps identify hot and cold spots and reveals first patterns of information diffusion. However, depending on the extent of information diffusion between components and teams, the visualization may highlight the hot and cold spots of information diffusion, but it can also become visually overwhelming in the case of massive information diffusion.
3.3 Measuring system
A measuring system is the set of measuring instruments and other components assembled and adapted to give information used to generate measured values within specified intervals for quantities of specified kinds. As is common in software engineering, our measuring system is a data extraction and analysis pipeline.
Since our measuring system is not trivial, involves a lesser-known GitHub API endpoint, and requires different data sources, we describe it in this dedicated section. Figure 5 provides a high-level overview of our measuring instrument, which we describe in detail in the following.
The first data source for our measuring instrument is the GitHub Enterprise instance and its REST or GraphQL API. For our measurement, we use the REST API. In GitHub, a pull request is a code review. GitHub automatically tracks when a user references an issue or pull request in another one. Since, internally, a code review (pull request) is an issue in GitHub, we can tap the GitHub REST API endpoint for timeline events of issues.
The timeline events contain all events triggered by activities in a pull request or issue, including the automated links to other pull requests or issues. GitHub's event endpoint /events is not suitable for extracting the event data because this API endpoint returns only a maximum of 300 events and only for the last 90 days. The outcome of the crawling is a list of all events.
Tapping the timeline events API requires the related pull requests. The GitHub search is not suitable for including or excluding pull requests since it limits its results to 1000 results per search, which is not enough at Spotify's scale. Therefore, we had to collect all pull requests from all repositories of all teams from GitHub.
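The following is a minimal sketch of this crawling step, assuming Python with requests; the GitHub Enterprise base URL, the token, and the repository list are placeholders:

```python
import requests

GITHUB_API = "https://ghe.example.com/api/v3"   # placeholder for the GitHub Enterprise instance
HEADERS = {"Authorization": "token <TOKEN>", "Accept": "application/vnd.github+json"}

def paginate(url, params=None):
    """Yield all items of a paginated GitHub REST API listing."""
    params = dict(params or {}, per_page=100)
    while url:
        response = requests.get(url, headers=HEADERS, params=params)
        response.raise_for_status()
        yield from response.json()
        url = response.links.get("next", {}).get("url")  # follow the pagination links
        params = None  # the 'next' URL already contains the query parameters

def pull_requests(owner, repo):
    """All pull requests of a repository (the search API caps its results at 1000)."""
    return paginate(f"{GITHUB_API}/repos/{owner}/{repo}/pulls", {"state": "all"})

def timeline_events(owner, repo, number):
    """Timeline events of a pull request (internally an issue), including references."""
    return paginate(f"{GITHUB_API}/repos/{owner}/{repo}/issues/{number}/timeline")
```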
We need the pull request information for two further steps (a minimal sketch follows the list):
• For each pull request, we also extract all changed files to map those files to components in later steps.
• Since no pull request creation event is available, we add this information from the pulls endpoint.
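A minimal sketch of both steps, reusing the paginate helper and placeholders from the sketch above:

```python
def changed_files(owner, repo, number):
    """Paths of the files changed in a pull request, mapped to components in later steps."""
    files = paginate(f"{GITHUB_API}/repos/{owner}/{repo}/pulls/{number}/files")
    return [f["filename"] for f in files]

def creation_info(owner, repo, number):
    """Creation time and author of a pull request, taken from the pulls endpoint."""
    response = requests.get(f"{GITHUB_API}/repos/{owner}/{repo}/pulls/{number}", headers=HEADERS)
    response.raise_for_status()
    pull = response.json()
    return pull["created_at"], pull["user"]["login"]
```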
We then filter the list of events according to the sampling frame and exclude all events from bots.
After filtering, we extract (as sketched after this list)
• all events of type reference and their payload, the referenced pull request (code review), which results in a code review network G = (C, R), the first input of our measurement model, and
• all human participants grouped by code review, which results in the mapping of a code review to its participants p1 : C → {participants}, the second input for our measurement model.
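A minimal sketch of this extraction, assuming the crawled pull requests and their timeline events from the sketches above; we assume that the references appear as timeline events of type "cross-referenced" whose source carries the referencing pull request, and the bot filter is simplified:

```python
import networkx as nx

G = nx.DiGraph()     # code review network G = (C, R)
p1 = {}              # mapping of a code review to its human participants

for repo, number, events in crawled_pull_requests:  # placeholder for the crawled data
    G.add_node((repo, number))
    humans = set()
    for event in events:
        actor = event.get("actor") or {}
        if actor.get("type") == "Bot":
            continue                       # exclude all events from bots
        if actor.get("login"):
            humans.add(actor["login"])     # collect human participants
        if event.get("event") == "cross-referenced":
            source = event["source"]["issue"]
            # the referencing pull request x links to the crawled one y: edge (x, y)
            G.add_edge((source["repository"]["name"], source["number"]), (repo, number))
    p1[(repo, number)] = humans
```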
We believe that the GitHub referencing system is a reliable source: two studies rely on this referencing system in GitHub [14, 23]. However, both use the so-called @-mentions that reference a user, not the references to issues or pull requests.
The second source for our measurement model is the software architecture description tracking all components. Spotify uses a tool called Backstage for tracking its software architecture. For each pull request, we extracted all files and mapped the files to components. A software component is a self-contained, reusable piece of software that encapsulates its internal construction and exposes its functionality through a well-defined interface so other components can use that functionality.
Software components can take many forms, including libraries, modules, classes, functions, or even entire microservices or applications. Components are hierarchically structured and may contain files or recursively other components. At Spotify, the component structure maps to the virtual folder structure of the source code. That means software components are specific folders that contain files.
Since the component structure evolves over time, we map the files to the component structure at the time when the code reviews are referenced. Therefore, we use the available historical daily snapshots of the software architecture at Spotify.
To identify the components of the files in a pull request efficiently, we create a file graph reflecting the paths of all changed files per code review and a time-varying component graph reflecting the component structure for each given day. The leaves of the intersection of both graphs represent the components of the files changed in a pull request. Figure 6 sketches the intersection of both graphs.
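A minimal sketch of this mapping under the assumption that components correspond to folders whose paths are listed in the daily snapshot; matching each changed file to its deepest containing component folder yields the leaves of the intersection of the file graph and the component graph:

```python
from pathlib import PurePosixPath

def component_of(file_path, component_paths):
    """Deepest component folder (from the day's snapshot) that contains the file, if any."""
    for parent in PurePosixPath(file_path).parents:  # parents are ordered deepest first
        if str(parent) in component_paths:
            return str(parent)
    return None

# Hypothetical daily snapshot of the component structure and changed files of a code review:
snapshot = {"player", "player/ui", "search"}
files = ["player/ui/views/controls.ts", "search/index/query.ts"]
components = {component_of(f, snapshot) for f in files}   # {'player/ui', 'search'}
```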
This mapping of code reviews to components p2 : C → {components} is the third input for our measurement model. For each identified component, we also identify its owner. Component ownership refers to the concept of assigning responsibility and accountability for a particular software component to an individual or an organizational unit within an organization. Spotify uses weak code ownership [20]. The mapping of code reviews to owners p3 : C → {teams} is the fourth input for our measurement model.
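A minimal sketch of this ownership mapping; we assume the owning teams can be read from an export of the Backstage catalog, whose Component entities carry the owner in spec.owner (the concrete export format is a placeholder), and that p2 is the mapping of code reviews to their affected components:

```python
# Hypothetical export of Backstage catalog entities of kind Component.
catalog = [
    {"kind": "Component", "metadata": {"name": "player/ui"}, "spec": {"owner": "team-player"}},
    {"kind": "Component", "metadata": {"name": "search"}, "spec": {"owner": "team-search"}},
]
owner_of = {e["metadata"]["name"]: e["spec"]["owner"]
            for e in catalog if e["kind"] == "Component"}

# p3: map each code review to the owning teams of its affected components (from p2).
p3 = {review: {owner_of[c] for c in components if c in owner_of}
      for review, components in p2.items()}
```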
3.4 Measurement
The measurement is the process of experimentally obtaining values that can be reasonably attributed to a quantity together with any other available relevant information. For our measurement, we use Spotify's internal GitHub Enterprise and Backstage instances, which together comprise all Spotify-internal code reviews and components.
We will run our measurement in 2024. Our sampling frame is one year and covers the timeframe [2019-01-01, 2019-12-31]. This timeframe lies outside of the ongoing developments at Spotify and allows us to publish all data in an anonymized way. However, the extent of information diffusion we find might require us to shorten the timeframe.