Inside a Practitioner Survey on Modern Code Review Priorities

Written by codereview | Published 2025/12/17

TL;DR: Based on a survey of experienced software practitioners, this study finds strong support for code review research focused on code quality, defects, and process outcomes, while human, organizational, and tooling-related factors are viewed as significantly less important.

1 INTRODUCTION

2 BACKGROUND AND RELATED WORK

3 RESEARCH DESIGN

4 MAPPING STUDY RESULTS

5 SURVEY RESULTS

6 COMPARING THE STATE-OF-THE-ART AND THE PRACTITIONERS’ PERCEPTIONS

7 DISCUSSION

8 CONCLUSIONS AND ACKNOWLEDGMENTS

REFERENCES

5 SURVEY RESULTS

In this section, we report on the results that answer RQ2 — How do practitioners perceive the importance of the identified MCR research themes? — based on 46 statements that we derived from the five identified main themes of MCR research.

5.1 Demographics

We received 28 responses in total. We excluded three respondents as they did not have any code review experience or entered invalid responses. The remaining 25 respondents work in different roles in large multinational organizations; 56% of the participants are working in Swedish software organizations. The company name was not a mandatory field in the survey. However, approximately 70% of the respondents provided their company name.

Most of the respondents are from the telecommunication domain. We also received responses from practitioners working in product-based companies as well as in IT services and consulting companies. We received one response from the insurance domain. The respondents in the other category previously worked in software companies and, at the time of the survey, worked in academia.

Each respondent provided a rating for each of the 46 statements (25x46) and six explanations, covering the three statements they rated most positively and the three they rated most negatively (25x6), resulting in 1300 data points. The demographic information of the participants (their role and their experience in development and code review) is provided in Table 8. The respondents have between two and 30 years of experience and work in 10 different roles, such as developer, architect, and tester.

Moreover, 60% of the participants have a Master's degree, 30% a Bachelor's degree, and 10% a Ph.D. degree. The respondents who provided their company names covered four different domains and seven different large companies. Details that could be traced back to the respondents, such as their names and company names, are not provided, to ensure the confidentiality of our respondents.

5.2 Agreement level - Rating of statements

The agreement levels of the 25 participants are illustrated in Figure 6. The vertical axis shows the 46 statements, while the horizontal axis shows the percentage of responses at each agreement level. The statements are sorted from most to least agreement.

Three out of the top five statements are related to the impact of and on code reviews. In addition, the benefits of code/test reviews received high positive ratings; 92% of the respondents agreed with 'It is important to investigate the impact of code reviews on code quality'. Similarly, four of the bottom five statements concern investigating the difference between core and irregular reviewers. None of the participants agreed with the statement 'It is important to investigate the difference between core vs irregular reviewers in terms of the level of agreement between reviewers'.

The statement 'It is important to investigate support for code reviews on touch-enabled devices' received the most negative ratings. However, Figure 6 shows that there is no consensus on most of the statements. In other words, most statements received positive, negative, and neutral responses. In some cases (35% of the statements), the difference between positive and negative ratings is small (less than 20 percentage points). For example, 'It is important to investigate support for determining the usefulness of code reviews' has 40% negative and 36% positive responses, a difference of only 4 percentage points. A deeper look at the differences is needed.
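As a side note on how such a figure can be read, the short sketch below computes per-statement agreement levels from ratings on the -3 to +3 scale and flags statements without a clear consensus. The ratings are illustrative example values, not the study's data, and the 20-percentage-point cut-off simply mirrors the threshold used in the paragraph above.

```python
# Illustrative sketch: computing per-statement agreement levels from ratings on the
# -3..+3 scale, similar to the summary shown in Figure 6. The ratings below are
# made-up example values, not the study's data.

# statement -> ratings from the 25 respondents (illustrative values)
ratings = {
    "impact of code reviews on code quality": [
        3, 2, 3, 1, 2, 3, 0, 2, 1, 3, 2, 2, 1, 3, 3, 2, 1, 2, 3, 2, 1, 2, 2, -1, 2,
    ],
    "support for code reviews on touch-enabled devices": [
        -3, -2, 0, -3, -2, 0, -1, -2, -3, 0, -1, -2, -2, 1, -3, -1, 0, -2, -1, -2, 0, -1, -3, -2, 0,
    ],
}

def agreement_summary(values):
    """Return the share (in %) of positive, neutral, and negative ratings."""
    n = len(values)
    positive = sum(1 for v in values if v > 0) / n * 100
    negative = sum(1 for v in values if v < 0) / n * 100
    return {"positive": positive, "neutral": 100 - positive - negative, "negative": negative}

# Sort statements from most to least positive agreement, as in Figure 6, and flag
# statements where the positive/negative difference is below 20 percentage points.
summaries = {stmt: agreement_summary(vals) for stmt, vals in ratings.items()}
for stmt, s in sorted(summaries.items(), key=lambda kv: kv[1]["positive"], reverse=True):
    contested = abs(s["positive"] - s["negative"]) < 20
    print(f"{stmt}: +{s['positive']:.0f}% / 0:{s['neutral']:.0f}% / -{s['negative']:.0f}%"
          + ("  (no clear consensus)" if contested else ""))
```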

We grouped the ratings by the five themes to see how the ratings vary within each. As seen in Figure 7, the theme "Impact of code reviews on product quality and human aspects" (IOF, see Figure 7a) received the most positive response. However, within the theme, the impact of code reviews on product quality (i.e., code quality, security issues, software design, and defect detection or repair) received a more positive response than human aspects, particularly developers' attitude and peer impression.

This indicates that practitioners perceive research on the impact of code reviews on outcomes (e.g., quality) as more important than research on human aspects. This observation is further corroborated by the fact that the theme "Human and organizational factors" (HOF) received the second-lowest agreement, ahead only of the theme "Support systems for code reviews" (SS). Figure 7b depicts the ratings on the theme "Impact of software development processes, patch characteristics, and tools on modern code reviews" (ION). More than 50% of the respondents perceive investigating the impact of code change descriptions, continuous integration, code size changes, and static analysers on the code review process as important.

On the other hand, the impact of commit history coherence, fairness, review participation history, and gamification on code reviews is not considered as important. We also observe in the ION theme that practitioners are more negative towards research on human aspects, such as the impact of fairness in code reviews, which received the highest negative rating in the ION theme. The investigations in the theme "Modern code review process properties" (CRP) are considered important, especially research on benefits, challenges, and best practices (see Figure 7c). However, some topics, such as the process for distributing review requests and merging pull requests, were not considered as important.

In the theme "Human and organizational factors" (HOF), research on the effect of the number of involved reviewers, reviewers' information needs, reviewers' perception of code and review quality, and reviewers' understanding of each other's comments was perceived as important by the majority. However, 72% of the respondents did not agree on the need to investigate reviewers' career paths and social interactions, as seen in Figure 7d. In addition, 68% of the respondents did not perceive research on reviewers' age and experience to be important.

In the theme "Support systems for code reviews" (SS), only research on support for understanding what changes need review and on the selection of appropriate reviewers was perceived as important, as shown in Figure 7e. Support for code reviews on touch-enabled devices received the most negative response, with 72% of the respondents giving negative ratings. It is rather surprising that this theme received the least agreement overall, given that it is the theme with the majority of publications.

When looking at the statements grouped into themes, there is a clear trend in the practitioners' preference for research that investigates causal relationships between code reviews and factors relevant for software engineering in general (themes ION and IOF). There is also a strong interest in modern code review process properties. Surprisingly, research on human and organizational factors as well as on support systems for code reviews was not perceived as important by practitioners, even though these two themes together represent nearly 70% (164 out of 244) of the primary studies from our mapping study.

5.3 Factor analysis

We further analyzed the survey data to identify patterns in the respondents' viewpoints using factor analysis, as suggested by the Q-methodology. In the survey, we asked respondents to place only a fixed number of statements in each rating; for example, only three statements in each of the -3 and +3 ratings. However, due to an error in the survey tool, four respondents could place more than the allowed number of statements in some ratings. Therefore, we only included 21 of the 25 valid participants' responses in the Q-method analysis.
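To illustrate the check behind that exclusion, the sketch below validates a Q-Sort against a forced distribution. The per-rating quota is an assumption for illustration; the paper only states that, for example, three statements were allowed at each of the -3 and +3 ratings.

```python
# Illustrative sketch: checking whether a respondent's Q-Sort respects a forced
# distribution. The quota below (how many statements may receive each rating) is
# assumed for illustration; only the +/-3 limit of three statements is stated in the paper.

from collections import Counter

# Hypothetical quota: rating -> maximum number of statements allowed at that rating.
QUOTA = {-3: 3, -2: 5, -1: 8, 0: 14, 1: 8, 2: 5, 3: 3}  # sums to 46 statements

def is_valid_qsort(qsort):
    """Return True if no rating holds more statements than the quota allows."""
    counts = Counter(qsort)
    return all(counts.get(rating, 0) <= limit for rating, limit in QUOTA.items())

# A respondent who placed four statements at +3 would be flagged, like the four
# responses excluded from the Q-method analysis.
example = [-3] * 3 + [-2] * 5 + [-1] * 8 + [0] * 13 + [1] * 8 + [2] * 5 + [3] * 4  # 46 ratings
print(is_valid_qsort(example))  # False: four statements at +3 exceed the limit of three
```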

As mentioned in Section 3.2, each participant's ratings of the statements are represented as a Q-Sort. For example, the Q-Sort of one participant for the 46 statements is (-3 3 2 3 2 -3 2 0 -3 2 0 0 -2 -2 0 2 -2 1 -1 3 0 1 -1 1 -1 1 0 0 -1 1 -1 -1 0 -1 1 -1 1 1 1 0 0 -2 0 -2 -1 0), where each value is the rating given by the participant for the corresponding statement. The Q-Sorts of all participants are used as input for the factor analysis. The steps followed in the Q-method analysis are listed below, and a simplified code sketch of the pipeline follows the list (see the result of each intermediate step in the Q-method report available online [6]):

(1) Creating an initial matrix - An initial two-dimensional matrix (statements x participants) is created, where the value of each cell is the rating given by the participant (between -3 and 3).

(2) Creating a correlation matrix - A correlation matrix between the Q-Sorts (i.e., participant ratings) is generated using the Pearson correlation coefficient.

(3) Extracting factors and creating a factor matrix - New Q-Sorts, called factors, are extracted; each is a weighted average of the Q-Sorts of all participants with similar ratings. A factor represents the Q-Sort of a hypothetical participant embodying a shared viewpoint. We used principal component analysis (PCA) to extract the factors. A two-dimensional factor matrix (participants x factors) is created, where the value of each cell is the correlation between the participant's Q-Sort and the factor, called the factor loading. A higher loading value indicates more similarity between the participant and the factor.

(4) Calculating rotated factor loadings - To clarify the relations among factors and increase the explanatory capacity of the factors resulting from PCA, we conducted varimax factor rotation. Only a few factors are selected that represent the maximum variance. We used both a quantitative and a qualitative approach to select the number of factors. The quantitative criteria recommended in the literature are as follows [11]: (1) a minimum of two loading Q-Sorts are highly correlated with the factor; (2) the composite reliability is greater than or equal to 0.8 for each factor; (3) the eigenvalue of each factor is above 1; and (4) the sum of the explained variance percentages of all selected factors is between 40% and 60%.

As shown in Table 9, when we select three factors, all of the above criteria are satisfied. We also performed a qualitative analysis and excluded solutions with more than three factors, as they had few or no distinguishing statements (a statement is distinguishing when its rank in one factor differs from its rank in all other factors).

(5) Finalising factor loadings - The rotated factor loadings from the previous step are finalised by flagging the Q-Sorts that best represent the factors. We flagged the Q-Sorts based on the following criteria: (1) Q-Sorts with a factor loading higher than the threshold for p-value < 0.05; (2) Q-Sorts whose squared loading is higher than the sum of their squared loadings in all other factors. As seen in Table 9, the total number of loading Q-Sorts is 19, which means that two respondents could not be significantly loaded onto any of the factors.

(6) Calculating the z-scores and factor scores - The z-scores and factor scores indicate a statement's relative position within a factor. The z-score is a weighted average of the ratings given to a statement by the participants flagged in that factor. Factor scores are obtained by ordering the z-scores and mapping them onto the Q-Sort structure (-3 to 3); they are integer values rather than continuous ones. Factor scores are important for factor interpretation.

(7) Identifying distinguishing statements - As mentioned in Step 4, a statement is distinguishing when its rank in one factor differs from its rank in all other factors. The factor scores from the previous step are used to identify the distinguishing statements that represent a factor and support its interpretation. If the factor score of a statement in one factor differs significantly (by more than 0.05) from its scores in all other factors, the statement is identified as a distinguishing statement.
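To make these steps concrete, the following is a simplified, self-contained sketch of steps (1) to (6) on synthetic ratings. It is not the tooling used for the study: the synthetic data, the number of retained factors, the p < 0.05 loading threshold (1.96 divided by the square root of the number of statements), and the loading-based weighting follow common Q-methodology conventions rather than the exact procedure behind the report in [6], and step (7) is omitted.

```python
# Simplified Q-method pipeline sketch on synthetic data; dedicated Q-methodology
# software implements these steps more rigorously.

import numpy as np

rng = np.random.default_rng(0)
n_statements, n_participants = 46, 21

# (1) Initial matrix: statements x participants, ratings between -3 and +3.
#     Synthetic ratings stand in for the real survey data here.
ratings = rng.integers(-3, 4, size=(n_statements, n_participants))

# (2) Correlation matrix between Q-Sorts (participants), Pearson correlation.
corr = np.corrcoef(ratings.T)

# (3) Factor extraction with PCA on the correlation matrix.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
n_factors = 3  # three factors, mirroring the solution reported in Table 9
loadings = eigenvectors[:, :n_factors] * np.sqrt(eigenvalues[:n_factors])

# (4) Varimax rotation to sharpen the factor structure.
def varimax(L, max_iter=100, tol=1e-6):
    p, k = L.shape
    R = np.eye(k)
    prev = 0.0
    for _ in range(max_iter):
        rotated = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p)
        )
        R = u @ vt
        if s.sum() < prev * (1 + tol):
            break
        prev = s.sum()
    return L @ R

rotated = varimax(loadings)

# (5) Flag Q-Sorts that load significantly on exactly one factor: |loading| above the
#     p < 0.05 threshold, and a squared loading larger than the sum of the squared
#     loadings on all other factors.
threshold = 1.96 / np.sqrt(n_statements)
flags = np.zeros_like(rotated, dtype=bool)
for i, row in enumerate(rotated):
    sq = row ** 2
    for f in range(n_factors):
        if abs(row[f]) > threshold and sq[f] > sq.sum() - sq[f]:
            flags[i, f] = True

# (6) z-scores per factor: a weighted average of the flagged participants' ratings,
#     standardised so statements can be compared across factors. Factor scores would
#     then be obtained by re-binning the ranked z-scores onto the -3..+3 structure.
z_scores = np.zeros((n_statements, n_factors))
for f in range(n_factors):
    idx = np.where(flags[:, f])[0]
    if idx.size == 0:
        continue
    weights = rotated[idx, f] / (1 - rotated[idx, f] ** 2)  # common Q-method weighting
    composite = ratings[:, idx] @ weights
    z_scores[:, f] = (composite - composite.mean()) / composite.std()

print("Explained variance (%):", np.round(eigenvalues[:n_factors] / n_participants * 100, 2))
print("Flagged Q-Sorts per factor:", flags.sum(axis=0))
```

Dedicated Q-methodology tools additionally compute composite reliabilities and the significance tests needed for distinguishing statements, which this sketch leaves out.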

The distinguishing statements for Factor 1 are provided in Table 10, for Factor 2 in Table 11, and for Factor 3 in Table 12. Figure 8 provides a summary of the respondents' experience in development and code reviewing in each factor. Respondents with more experience are clearly concentrated in Factor 1 compared to the other factors. We provide an interpretation of the factors in the next subsections.

5.3.1 Factor 1 interpretation - It is important to investigate the code reviewer as a subject. Table 10 shows the distinguishing statements of Factor 1, which represents 43% of the respondents and explains 16.81% of the variance in responses. As seen in Figure 8, participants loaded onto Factor 1 have more experience. They have expert/senior roles in architecture and design, an average of 16 years of experience in software development, and 13 years of code review experience. Participants loaded onto Factor 1 are more positive regarding the impact of code reviews on human factors than the ones loaded onto Factors 2 and 3. For example, statements regarding the team's understanding of the code under review, developers' attitude, and peer impression are perceived to be important in Factor 1.

Regarding the impact of code reviews on the team's understanding, one of the respondents in Factor 1 wrote: P27: "Without understanding the requirement of the code, there is no point to review the code". Another respondent was interested in the investigation of knowledge sharing and wrote: P3: "Code reviews enable knowledge sharing". Research on the impact of code reviews on developers' attitude is considered important, as a considerable amount of effort goes into reviewing code. A participant wrote: P14: "Everyone needs to see the importance of better quality". However, respondents in Factors 2 and 3 disagree. One of the respondents in Factor 2 wrote: P26: "this is more of an individual's approach towards any work. Once a reviewer is made to follow the correct set of principles, this [investigation on developers' attitude] can be eliminated".

All respondents display a neutral or even negative attitude towards the importance of investigating the impact of peer impression on code reviews. They feel that people should be objective and not be influenced by peer impressions. One of the respondents in Factor 1 wrote: P1: "It should not be necessary to do research on the obvious fact that people should be responsible". Similarly, respondents in Factor 1 are less negative compared to Factors 2 and 3 about the importance of investigating the difference between core and irregular reviewers in terms of their career paths.

On the other hand, respondents in Factor 1 are more negative compared to Factors 2 and 3 about the importance of investigating support for understanding which code changes need review. One of the respondents wrote: P14: "Everything should be reviewed, this is a non-question". However, the practitioner interpreted the question as "understanding what should be reviewed" rather than understanding the code under review. Despite the potential misinterpretation, this statement is ranked as the most important statement in the solutions theme (see Figure 7e).

5.3.2 Factor 2 interpretation - It is not important to investigate human aspects related to code review. Table 11 shows the distinguishing statements of Factor 2, which represents 24% of the participants and explains 15.43% of the variance in responses. The respondents grouped in this factor have less experience compared to the respondents in Factors 1 and 3 (see Figure 8). They have roles in development and testing, with an average of 6 years of experience in software development and 4 years of code review experience. Respondents in this factor are more positive about research on the impact of code reviews on defect detection or repair and on the impact of continuous integration than about research on human factors.

On the importance of defect detection or repair, one of the respondents wrote: P18: "This [defect detection] is generally why code reviews take place - it is interesting to perform a more formal causal analysis on this [the impact of code reviews on defect detection]". Unlike in Factor 1, where the more experienced respondents are positive towards investigations of human factors, respondents in this factor do not see the importance of investigating human aspects. In this factor, more importance is given to having good code review guidelines, as stated by one of the participants: P25: "Standard review procedure should be independent of individual/team members' age and experience".

5.3.3 Factor 3 interpretation - It is more important to investigate support for optimizing code reviews than support for analyzing human aspects. Table 12 shows the distinguishing statements of Factor 3, which represents 24% of the participants and explains 15.07% of the variance in responses. Respondents in this factor mainly have testing roles, with an average of 9 years of experience in software development and 5 years of code review experience. Overall, respondents in Factor 3 are less positive about research on the impact on and of code reviews and on the code review process. They are more interested in research on support for optimizing code reviews than in analyzing human aspects. We did not receive any explanations for the ratings, as most of the ratings are between -2 and 2. For the -3 rating, the respondent had no comments.

Authors:

  1. DEEPIKA BADAMPUDI
  2. MICHAEL UNTERKALMSTEINER
  3. RICARDO BRITTO

This paper is available on arxiv under CC BY-NC-SA 4.0 license.

