
How to Minimize Privacy Risks in AI/ML applications

by Pushpak Pujari, January 16th, 2023

Too Long; Didn't Read

In this age of data-first organizations, you’re most likely collecting, processing, and analyzing tons of customer data. Data breaches are increasing every year, with 1,862 reported data compromises in 2021, up 68% compared to 2020. Sensitive customer information falling into the wrong hands could wreak havoc on the customer’s life through identity theft, stalking, ransomware attacks, etc.


Privacy Matters

In this age of data-first organizations, no matter what industry you’re in, you’re most likely collecting, processing, and analyzing tons of customer data. It could be for fulfilling a customer’s service request, for legal or regulatory reasons, or for providing your customers with a better user experience through personalization using artificial intelligence or machine learning.

However, data breaches are increasing every year, with 1,862 reported data compromises in 2021, up 68% compared to 2020, and 83% of those involved sensitive information (according to the Identity Theft Resource Center). Such sensitive information falling into the wrong hands could wreak havoc on the customer’s life through identity theft, stalking, ransomware attacks, etc. This, coupled with the rise of privacy laws and regulations across various states, has brought privacy-enhancing data processing technologies to the forefront.

Privacy vs data utility tradeoff

In AI applications such as personalization, privacy and data utility can be visualized as opposite ends of a spectrum. Data that contains nothing personal, i.e., exposes no traits or characteristics of the customers, lends no value to personalization. Data containing personal information, on the other hand, can be used to deliver highly personalized experiences, but if the dataset ends up in the hands of any human, it can lead to a loss of customer data privacy. As a result, there is always an inherent tradeoff between privacy risk and the utility of that data.

Value of being privacy-first for organizations

The Health Insurance Portability and Accountability Act (HIPAA), California Consumer Privacy Act (CCPA), Children’s Online Privacy Protection Act (COPPA), and Biometric Identifier Act are just a few of the many privacy-centric laws and regulations in the US. Failure to comply with such regulations can cost an organization billions of dollars in fines. For example, the state of Texas recently sued Facebook’s parent company Meta for billions of dollars in damages for mishandling and exploiting the sensitive biometric data of millions of people in the state.

Being privacy-first helps you avoid huge fines and, in the worst case, the loss of your license to operate as a business. Privacy failures can also cause a massive loss of consumer trust and loyalty and damage brand image and perception. Being negligent with consumers’ data privacy can demolish customer lifetime value and hurt conversions and renewals. In fact, companies like Apple have flipped the problem on its head and are using privacy as a competitive moat, a differentiator from other technology companies.

Sources of Privacy Risk in data collected by an organization

There are three key sources of privacy risk within an organization:

Raw customer data and any of its derivatives. Raw customer data can be customer-entered data such as name, address, age, sex, and other profile details, or data on how a customer is using the product, such as page visits, session duration, items in cart, purchase history, payment settings, etc.
Metadata and logs. Metadata and logs include the location of customers, the website a product was accessed from, the IP address of the device, MAC address, service logs, logs of calls with customer support, etc.
ML models that have been trained on customer data. ML models can seem like they don’t contain anything personal, but they can memorize patterns in the data they have been trained on. Models trained on critical customer data can retain customer-attributable personal data and present an exposure risk regardless of whether the model is deployed in the cloud or on edge devices. If a malicious actor gains access to such a model, even as a black box, they can run a series of attacks to recover the personal data, leading to a privacy breach.

An ML model’s security classification should be determined based on the data classification of its training data. ML model artifacts can contain plaintext customer data and the ML model itself is susceptible to privacy attacks. If an organization is running a marketplace and sharing ML models with external partners, even under NDA and data-sharing agreements, ML models present a high risk of privacy attacks.
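One way to picture this rule is a model that inherits the most sensitive classification among its training datasets. The sketch below is only an illustration: the `DataClass` levels, the function names, and the sharing policy are all assumptions, not a standard scheme.

```python
from enum import IntEnum


class DataClass(IntEnum):
    """Hypothetical sensitivity levels; a higher value means more sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def model_classification(training_data_classes):
    """Return the classification a model inherits from its training data.

    The model is treated as at least as sensitive as the most sensitive
    dataset it was trained on.
    """
    classes = list(training_data_classes)
    if not classes:
        raise ValueError("a model needs at least one training dataset")
    return max(classes)


def may_share_externally(model_class, ceiling=DataClass.INTERNAL):
    """Assumed policy: only models at or below `ceiling` leave the organization."""
    return model_class <= ceiling


# A recommender trained on one public and one restricted dataset.
recommender = model_classification([DataClass.PUBLIC, DataClass.RESTRICTED])
print(recommender.name, may_share_externally(recommender))
```

Under this assumed policy, the recommender is classified as restricted and would not be shared with partners unless it is retrained on sanitized data.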

How to identify the gaps

Here are a few strategies to identify the biggest gaps and start using Privacy Enhancing Technologies (PET) to close the gaps:

Follow the data: chart the customer data lifecycle across your organization, from data collection or ingestion through storage and usage to deletion. A chart will help you visualize the end-to-end flow and formulate an effective strategy for the whole organization.
Create a threat map: identify the humans, processes, and systems that have access to customer data. Where are the humans in the loop, and is it a business requirement for them to have access? Also, what tools are used to access the data, and how?
Identify the use cases: for each use case, think about the likelihood of a privacy-risk event, estimate the blast radius, and categorize each threat as high, medium, or low severity.
Identify the drivers and define success criteria: start with the high-risk items, measure progress regularly, and work your way down the list until all the remaining risk items are low.
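The last two steps can be sketched as a small risk register. Everything here is an assumption made for illustration: the 1 to 5 scales, the likelihood times blast radius score, the severity thresholds, and the example use cases.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    likelihood: int    # 1 (rare) to 5 (almost certain), assumed scale
    blast_radius: int  # 1 (few customers) to 5 (all customers), assumed scale

    @property
    def score(self) -> int:
        return self.likelihood * self.blast_radius

    @property
    def severity(self) -> str:
        # Thresholds are illustrative, not an industry standard.
        if self.score >= 15:
            return "high"
        if self.score >= 6:
            return "medium"
        return "low"


def prioritize(use_cases):
    """Order use cases so that the riskiest are addressed first."""
    return sorted(use_cases, key=lambda u: u.score, reverse=True)


register = [
    UseCase("support staff viewing call logs", likelihood=4, blast_radius=2),
    UseCase("sharing trained models with partners", likelihood=3, blast_radius=5),
    UseCase("aggregated dashboard metrics", likelihood=1, blast_radius=2),
]

for uc in prioritize(register):
    print(f"{uc.severity:<6} {uc.score:>2}  {uc.name}")
```

Re-scoring the register after each mitigation gives a simple way to measure progress toward the goal of having only low-severity items left.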

Privacy Enhancing Technologies

Privacy Enhancing Technologies is an area of active research, with tremendous advancements made in the last 5 years. Broadly, PET can be classified into two buckets: data sanitization and privacy-preserving computation. Data sanitization techniques focus on detecting and modifying personal information to de-sensitize it. This includes techniques such as direct identifier detection and removal, pseudonymization, k-anonymization, and differential privacy.
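To make differential privacy a little more concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. The function names and the sample ages are hypothetical, and the noise sampler is a teaching simplification; a production system should rely on a vetted differential privacy library rather than this code.

```python
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng=random):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)


# Hypothetical records: customer ages. Four of them are 40 or older.
ages = [23, 35, 41, 29, 52, 38, 61, 27, 33, 45]
rng = random.Random(0)  # seeded only to make the sketch repeatable
for eps in (0.1, 1.0, 10.0):
    noisy = private_count(ages, lambda a: a >= 40, eps, rng)
    print(f"epsilon={eps:<4} noisy count={noisy:.2f} (true count=4)")
```

A smaller epsilon adds more noise, which means stronger privacy and lower utility. This is the same privacy versus utility tradeoff discussed earlier, expressed as a single tunable parameter.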

Privacy-preserving computation techniques focus on operating on private data but in a closed environment with no humans having access to it. This includes techniques such as homomorphic encryption, [secure multi-party computation](https://en.wikipedia.org/wiki/Secure_multi-party_computation), federated learning, and trusted execution environments (TEE). One of the most efficient ways of managing privacy risk is to do all the processing on the edge device itself. This mitigates the biggest risk, customer data leaving the device, especially for consumer applications, and has the added benefit of giving a hyper-targeted experience in a distributed environment without burning cloud compute costs. The biggest challenge is preventing overfitting, since the volume of data available on the device may not be sufficient, which can affect the quality of predictions.
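As a flavor of how secure multi-party computation works, the sketch below uses additive secret sharing to compute a sum without any party revealing its own input. It is a toy that runs in a single process and assumes every party follows the protocol honestly; the modulus and the example counts are arbitrary choices for illustration.

```python
import secrets

MODULUS = 2**61 - 1  # all arithmetic is done modulo this value (assumed choice)


def share(value, n_parties):
    """Split `value` into n random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_inputs):
    """Sum the inputs while each party only ever handles random-looking shares."""
    n = len(private_inputs)
    # Each party splits its input and sends one share to every party.
    distributed = [share(v, n) for v in private_inputs]
    # Party j adds up the shares it received from everyone.
    partial_sums = [sum(distributed[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    # Only the partial sums are published and combined into the total.
    return sum(partial_sums) % MODULUS


# Hypothetical example: three organizations with private customer counts.
print(secure_sum([1200, 450, 3075]))  # 4725
```

In a real deployment the shares would travel over authenticated channels between separate machines, and the protocol would need protection against parties that deviate from it.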

Privacy engineering and Privacy Enhancing Technologies are rapidly evolving fields, and the best way to stay up to date is to attend conferences, read research papers, and join the open-source community. Lastly, I recommend starting small with the simple techniques of identifying direct identifiers and removing them before moving to more complex approaches.
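A starting point along those lines might look like the sketch below, which redacts a few direct identifiers with regular expressions and replaces customer IDs with keyed-hash pseudonyms. The patterns, the placeholder tags, and the sample log line are assumptions for illustration; real identifier detection needs far broader coverage than three regexes.

```python
import hashlib
import hmac
import re

# Illustrative patterns only; they cover common US-style formats and nothing else.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text):
    """Replace each detected direct identifier with a placeholder tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def pseudonymize(customer_id, secret_key):
    """Map an ID to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, customer_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


log_line = "Customer jane.doe@example.com called from 555-867-5309 about SSN 123-45-6789"
print(redact(log_line))
# Customer [EMAIL] called from [PHONE] about SSN [SSN]
```

The keyed hash keeps records for the same customer joinable while hiding the raw ID. Anyone who holds the key can recompute the mapping, so the key has to be protected as carefully as the original data.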
