In today’s data-driven world, where corporations gather and leverage vast quantities of personal information, the importance of customer privacy cannot be overstated. Preserving the confidentiality of clients is not solely a legal requirement but also a fundamental ethical obligation.
Differential privacy is a method of sharing information that describes patterns in a dataset while withholding personally identifiable information. For example, organizations may release statistical or demographic summaries, but with differential privacy it is hard to determine the precise contribution of any given person. The idea is that a researcher would get (almost) the same query answer regardless of whether a particular person's data was included in the dataset. If a data scientist cannot identify a specific person from the released data, then the system provides differential privacy.
In this article, we will learn more about differential privacy and how to safeguard customer privacy by integrating Differential Privacy with Versatile Data Kit (VDK).
In the modern digital world, we need more data than ever to make and validate business decisions. Without data, no business will thrive in the digital landscape; with data, we can train machine learning models, predict user choices, and even serve advertisements. At the same time, users feel unsafe about the access businesses have to their private data. To gather large amounts of data from users, we need to provide strong privacy guarantees and meet regulatory standards (e.g., HIPAA, GDPR) to assure users that their data will be safeguarded.
When people try to hide or disguise personally identifiable information (PII), they may unintentionally leave behind other identifying elements in the data, known as quasi-identifiers. Merely obfuscating PII may therefore not be sufficient to protect privacy if these quasi-identifiers (see image below) can still be used to identify individuals.
Differential privacy addresses this problem. It is a mathematical framework that offers robust guarantees about the privacy of an individual’s data, even when that data is part of extensive datasets or merged with other data sources. The technique introduces random noise into the data before analysis or sharing, making it significantly harder for a potential attacker to pinpoint the specific data of an individual. The noise is calibrated so that, despite its presence, the outcomes of the analysis remain statistically meaningful and valuable.
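The idea of adding calibrated noise can be sketched with the classic Laplace mechanism, a standard differential privacy technique. This is a general illustration of the concept, not part of VDK; the function name and parameters here are our own:

```python
import numpy as np

def laplace_count(true_count: float, epsilon: float = 1.0, sensitivity: float = 1.0) -> float:
    """Return a differentially private count via the Laplace mechanism.

    Noise is drawn from Laplace(0, sensitivity / epsilon): a smaller epsilon
    means more noise, and therefore stronger privacy for each individual.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# A private answer to "how many patients in the dataset are smokers?"
private_answer = laplace_count(120, epsilon=0.5)
```

Each query answer is perturbed, yet averages over many queries or large counts stay close to the truth, which is exactly the trade-off described above.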
There are three main actors in differential privacy: the owner, the curator, and the analyst of the data. As the image below shows, in global differential privacy we trust the curator but not the analyst, whereas in local differential privacy we don’t trust anyone with the raw data.
Random Response: Used when the data type we are trying to obfuscate is Boolean. When a query about a Boolean attribute (such as whether an individual has a given characteristic) is made using this approach, the result is randomized to introduce uncertainty.
Check here or see the example below:
import numpy as np

class DifferentialPrivateRandomResponse:
    def __init__(self, random_response_frequency: int):
        self._random_response_frequency = random_response_frequency

    def privatize(self, value: bool) -> bool:
        # first coin flip
        if np.random.randint(0, self._random_response_frequency) == 0:
            # answer truthfully
            return value
        else:
            # answer randomly (second coin flip)
            return np.random.randint(0, 2) == 0
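To see the mechanism end to end, here is a small usage sketch. The class is repeated so the snippet runs standalone; the seed and the 60-patient scenario mirror the example later in this article:

```python
import numpy as np

class DifferentialPrivateRandomResponse:
    """Repeated from the listing above so this snippet is self-contained."""
    def __init__(self, random_response_frequency: int):
        self._random_response_frequency = random_response_frequency

    def privatize(self, value: bool) -> bool:
        # first coin flip: answer truthfully 1 time in `random_response_frequency`
        if np.random.randint(0, self._random_response_frequency) == 0:
            return value
        # otherwise answer randomly (second coin flip)
        return np.random.randint(0, 2) == 0

np.random.seed(1)
privatizer = DifferentialPrivateRandomResponse(random_response_frequency=2)

noisy = [privatizer.privatize(False) for _ in range(60)]  # 60 non-smokers

# With frequency 2, E[noisy "yes" rate] = 0.5 * true_rate + 0.25,
# so the noise can be inverted to estimate the true rate.
noisy_rate = sum(noisy) / len(noisy)
estimated_true_rate = 2 * (noisy_rate - 0.25)
```

Any individual answer is deniable, yet the aggregate estimate converges to the true rate as the dataset grows.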
Unary Encoding: Used to add noise when we want privacy for Enum-like (categorical) data types. Unary encoding represents a categorical value as a one-hot vector.
Check here or see the example below:
def _perturb(self, encoded_response: List[int]) -> List[int]:
    return [self._perturb_bit(b) for b in encoded_response]

def _perturb_bit(self, bit: int) -> int:
    # a 1 stays 1 with probability p; a 0 becomes 1 with probability q
    sample = np.random.random()
    if bit == 1:
        if sample <= self._p:
            return 1
        else:
            return 0
    elif bit == 0:
        if sample <= self._q:
            return 1
        else:
            return 0
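The two methods above are excerpted from a larger class. A self-contained sketch of the whole unary-encoding technique might look like the following; the class name, `p`/`q` values, and blood-type domain are illustrative choices, not VDK's actual API:

```python
import numpy as np
from typing import List

class UnaryEncoder:
    """Illustrative local-DP unary encoding: one-hot encode, then flip bits."""
    def __init__(self, domain: List[str], p: float = 0.75, q: float = 0.25):
        self._domain = domain
        self._p = p  # probability a true 1 is reported as 1
        self._q = q  # probability a true 0 is reported as 1

    def _encode(self, value: str) -> List[int]:
        # one-hot encoding over the fixed category domain
        return [1 if v == value else 0 for v in self._domain]

    def _perturb_bit(self, bit: int) -> int:
        sample = np.random.random()
        if bit == 1:
            return 1 if sample <= self._p else 0
        return 1 if sample <= self._q else 0

    def privatize(self, value: str) -> List[int]:
        return [self._perturb_bit(b) for b in self._encode(value)]

np.random.seed(2)
encoder = UnaryEncoder(["A", "B", "AB", "O"])
noisy_vector = encoder.privatize("B")  # a noisy one-hot vector over blood types
```

Any single reported vector is ambiguous, but across many reports the true category's position lights up at rate p while the others light up at rate q.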
With the increasing complexity of data management, the open-source Versatile Data Kit (VDK) empowers organizations to handle and secure sensitive data. Leveraging VDK’s capabilities, we can address the challenges of implementing Differential Privacy. Learn more about the Versatile Data Kit here!
We will walk through each step required to implement Differential Privacy in detail, using a patient dataset commonly used by researchers as an example of implementing Differential Privacy with VDK.
Data Ingestion: VDK provides a clean interface for ingestion.
VDK is modular and highly extensible: it has a concept of plugins that can be installed like any other Python package. Once installed, a plugin can be enabled for a VDK job with a quick config change. To enable differential privacy, we need to intercept data at the pre-ingestion step so we can add noise before the data is synced. VDK plugins can intercept data at many different points in the data streaming lifecycle.
Random Response Plugin: Let’s see an example of how we can configure the plugin and add random-response noise to a Boolean field. Consider a study conducted by researchers to determine the influence of smoking on cancer. They must study data from various patients while also protecting the patients’ privacy through differential privacy and VDK. Check code here.
To install and configure our new plugin Random Response, we need to run the following:
pip install vdk-local-differential-privacy
After installing the Random Response plugin, we need to update the config file:
# update config
[vdk]
ingest_method_default=SQLITE
#add preprocessing step
ingest_payload_preprocess_sequence=random_response_differential_privacy
#set property specific to this plugin
differential_privacy_randomized_response_fields='{"patient_details": ["is_smoker"]}'
The code above shows how to add a preprocessing step in the config file. We then set the property specific to this plugin: the field we want to randomize is the “is_smoker” column in the “patient_details” table.
from vdk.api.job_input import IJobInput

def run(job_input: IJobInput):
    # 60 people who are not smokers
    for _ in range(60):
        obj = dict(str_key="str", is_smoker=False)
        # send each record for ingestion
        job_input.send_object_for_ingestion(
            payload=obj, destination_table="patient_details", method="memory"
        )
As the script above shows, we take 60 patients who are not smokers and save their records to the database. It generates a dictionary for each patient and sends it for ingestion into a table named “patient_details” using the “memory” method. Since no one in the dataset is a smoker (is_smoker=False), every “yes” in the ingested data comes from the injected noise, and the expected amount of noise is predictable.
As you can see in the histogram of noisy and randomized data, there are ~45 non-smokers and ~15 smokers.
Because we are using the random response plugin, the genuine value is reported with a specific probability, and a random value with the complementary probability.
Understandable data: VDK helps create the noisy data; moving from the noisy data back to the actual distribution that existed before the noise was added involves measuring and filtering out that noise.
To achieve this, a few points need to be considered:
- Approximately half of the data consists of pure noise.
- About one-fourth of the data is composed of “yes” responses generated from random noise.
To determine the number of real “yes” responses in the data, you subtract the noise-generated “yes” responses from the total number of “yes” responses. Mathematically, this is expressed as:
real yeses = total number of yeses - (1/4 × dataset size)
Since half of the data was discarded due to noise, it’s necessary to adjust for this loss when estimating the actual count of real “yes” responses. Doubling the number of real “yes” responses compensates for the elimination of half the data:
actual count of real yeses = 2 × real yeses
This process helps in obtaining a more accurate representation of the true positive responses within the dataset, accounting for the presence of noise and ensuring a better understanding of the underlying information.
By following these steps, we recover the real distribution. As the right histogram in the image below shows, we arrive back at the number 60, with little or no margin of error.
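The two-step inversion above can be sketched in a few lines; the function name is ours, and the arithmetic assumes the fair two-coin-flip scheme described earlier:

```python
def denoise_random_response(noisy_yes_count: float, dataset_size: int) -> float:
    """Invert random-response noise from two fair coin flips.

    Half the answers are random, and those random answers say "yes" for
    roughly a quarter of the dataset; subtracting that share and doubling
    the remainder estimates the true count.
    """
    real_yeses = noisy_yes_count - dataset_size / 4
    return 2 * real_yeses

# The smoker example: ~15 noisy "yes" answers among 60 patients
print(denoise_random_response(15, 60))  # 0.0 -> no true smokers
# ~45 noisy "no" answers: the same inversion applied to the "no" count
print(denoise_random_response(45, 60))  # 60.0 -> all 60 are non-smokers
```

Both estimates match the ground truth of the example dataset: zero smokers and sixty non-smokers.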
A similar approach to the random response plugin is used to implement differential privacy with the Unary Encoding VDK Plugin.
To install and configure our new plugin, Unary Encoding, we need to run the following:
pip install vdk-local-differential-privacy
After installing the Unary Encoding plugin, we need to update the config file:
# update config
[vdk]
ingest_method_default=SQLITE
#add preprocessing step
ingest_payload_preprocess_sequence=unary_encoding_differential_privacy
#set property specific to this plugin
differential_privacy_unary_encoding_fields='{"patient": {"blood": ["A","B","AB","O"]}}'
The code above shows how to add the “unary_encoding_differential_privacy” preprocessing step in the config file. We then set the property specific to this plugin: unary encoding is applied to the “blood” column of the “patient” table, with the blood groups as Enum values.
from vdk.api.job_input import IJobInput

def run(job_input: IJobInput):
    # 50 patients with blood type "B"
    for _ in range(50):
        obj = dict(str_key="str", blood_type="B")
        job_input.send_object_for_ingestion(
            payload=obj, destination_table="patient", method="memory"
        )
In the script above, we create 50 patients with blood type “B” and save them to the database. Implementing differential privacy with the unary encoding VDK plugin is quite similar to the random response plugin method.
Basically, random response introduces randomness to protect privacy in statistical data, while unary encoding is a binary representation method commonly used for categorical data.
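Just as with random response, the noise in unary-encoded data can be inverted in aggregate. A sketch of the arithmetic, assuming the illustrative flip probabilities p = 0.75 and q = 0.25 (the function name and numbers are ours, not VDK's):

```python
def estimate_count(observed_ones: int, n: int, p: float = 0.75, q: float = 0.25) -> float:
    """Estimate the true count for one category position.

    For a position, E[observed ones] = true_count * p + (n - true_count) * q,
    which rearranges to: true_count = (observed_ones - n * q) / (p - q).
    """
    return (observed_ones - n * q) / (p - q)

# 50 patients: suppose the "B" position shows 37 ones and "A" shows 13
print(estimate_count(37, 50))  # 49.0 -> nearly all patients are type "B"
print(estimate_count(13, 50))  # 1.0  -> almost no patients are type "A"
```

These estimates are close to the ground truth of the example (50 type-B patients, none of type A), with the residual error coming from sampling noise.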
In concluding this article on safeguarding consumer privacy with Differential Privacy and the Versatile Data Kit (VDK), we emphasize the critical importance of ethical data practices. Balancing privacy and innovation requires collaborative effort, and the combination of these tools provides a strong framework for responsible data management. As organizations navigate this changing landscape, they must embrace openness, adapt to legislation, and prioritize privacy.
The integration of Differential Privacy and VDK not only protects client privacy but also lays the groundwork for a trustworthy and responsible digital future. VDK is also working on providing support for differential privacy in SQL queries and global differential privacy.
This article is co-authored by Astrodevil and Paul Murphy, combining their expertise to provide a well-rounded perspective on the topic.
💡Check Versatile Data Kit GitHub Repo
💡Check the Getting Started guide of VDK to learn more
💡Check VDK plugin files