Automating Privacy Code Reviews by Mapping How Software Handles Personal Data

Background

Code review, originally aimed at ensuring software quality by identifying bugs and performance issues [11], has expanded to address security vulnerabilities and, more recently, privacy concerns under data protection laws such as the GDPR. Privacy-focused reviews add the complexity of ensuring that personal data is handled lawfully and ethically, a difficult task because data protection guidelines are often ambiguous [10].

Static analysis tools are pivotal in code reviews: they help identify data flows, security risks, and compliance issues. The effectiveness of a review is measured by its ability to pinpoint critical problems and offer actionable solutions. Privacy code reviews, however, struggle to identify personal data because its definition is unclear and depends on context. This increases reliance on static analysis tools, even though the tools themselves are limited in the types of personal data they can recognize [9].

These reviews also play a key role in creating essential compliance documents such as Records of Processing Activities (ROPAs) and Data Protection Impact Assessments (DPIAs). The automated approach proposed in this paper aims to improve the efficiency and accuracy of privacy code reviews, specifically the categorization of personal data processing in large-scale code projects.

Privacy-Relevant Methods

To streamline privacy code review, we introduce the concept of privacy-relevant methods: methods that play a direct role in the processing of personal data. Such methods can belong to standard or third-party libraries, which makes them critical focal points for personal data processing in software applications. Native libraries, that is, the standard libraries of a programming language, are foundational because they offer the only pathways to device resources such as files and networks.

Consequently, any operation involving data storage or transfer must go through these native methods. Native privacy-relevant methods are those found in the standard libraries of programming languages such as JavaScript and Java. These methods act as the origins (sources) of all personal data that users enter via devices. They are also the only methods that transmit this data directly to other devices or services. We categorize these native methods into domains such as I/O, Database, Network, and Security, following the guidelines of existing research [8].
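The domain categorization described above can be pictured as a lookup table from method identifiers to domains. The following is a minimal sketch in Java under stated assumptions: the names `Domain`, `NativeCatalog`, and `domainOf` are hypothetical, and the five entries are illustrative standard-library methods chosen for this example, not the list compiled in the paper.

```java
import java.util.Map;
import java.util.Optional;

/** The four domains named in the text. */
enum Domain { IO, DATABASE, NETWORK, SECURITY }

/** Hypothetical catalog of native privacy-relevant methods (illustrative entries only). */
final class NativeCatalog {

    // Keys use the form "fully.qualified.Class#method"; this key format is an assumption.
    private static final Map<String, Domain> NATIVE = Map.of(
        "java.io.FileInputStream#read", Domain.IO,
        "java.io.FileOutputStream#write", Domain.IO,
        "java.sql.Statement#executeQuery", Domain.DATABASE,
        "java.net.URLConnection#getOutputStream", Domain.NETWORK,
        "java.security.MessageDigest#digest", Domain.SECURITY
    );

    /** Returns the domain of a method if it is in the catalog, otherwise an empty Optional. */
    static Optional<Domain> domainOf(String className, String methodName) {
        return Optional.ofNullable(NATIVE.get(className + "#" + methodName));
    }

    public static void main(String[] args) {
        // A cataloged method resolves to its domain; an uncataloged one does not.
        System.out.println(domainOf("java.io.FileInputStream", "read"));
        System.out.println(domainOf("java.lang.String", "length"));
    }
}
```

Returning an `Optional` keeps the "not privacy-relevant" case explicit, so a reviewer or tool consuming the catalog has to handle methods that fall outside the set.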

We identify these methods through a systematic manual review that examines documentation, source code, and actual usage patterns. To facilitate the identification and categorization of native privacy-relevant methods, we analyze in depth key packages such as java.io, java.security, and java.util for Java, and the equivalent modules in JavaScript. This analysis helps us compile a complete set of native privacy-relevant methods, denoted as Native, that are involved in personal data processing.
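The review itself is manual, but a candidate list for reviewers could be generated mechanically. The sketch below is an assumption about how such a list might be produced, not the authors' procedure: it uses Java reflection to enumerate the public methods declared by a few seed classes. The class name `CandidateLister` and the choice of seed classes are hypothetical.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper that lists public methods of seed classes as review candidates. */
final class CandidateLister {

    /**
     * Returns "fully.qualified.Class#method" for each public method declared by the class.
     * Overloads collapse into a single entry because only the method name is recorded.
     */
    static Set<String> publicMethods(Class<?> type) {
        Set<String> names = new TreeSet<>();
        for (Method m : type.getDeclaredMethods()) {
            if (Modifier.isPublic(m.getModifiers())) {
                names.add(type.getName() + "#" + m.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Seed classes are assumptions, picked from packages the text names.
        List<Class<?>> seeds = List.of(
            java.io.FileInputStream.class,
            java.security.MessageDigest.class);
        for (Class<?> seed : seeds) {
            publicMethods(seed).forEach(System.out::println);
        }
    }
}
```

The output is only a starting point: deciding which of the listed methods actually process personal data still requires the documentation and usage review described above.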

Authors:

Feiyang Tang
Bjarte M. Østvold

This paper is available on arXiv under the CC BY-NC-SA 4.0 license.