In this age of data-first organizations, no matter what industry you’re in, you’re most likely collecting, processing, and analyzing tons of customer data. It could be for fulfilling a customer’s service request, for legal or regulatory reasons, or for providing your customers with a better user experience through personalization using artificial intelligence or machine learning.
However, data breaches are increasing every year, with 1,862 reported data compromises in 2021, up 68% from 2020, 83% of which involved sensitive information (per the Identity Theft Resource Center). Such sensitive information falling into the wrong hands can wreak havoc on a customer's life through identity theft, stalking, ransomware attacks, and more. This, coupled with the rise of privacy laws and regulations across various states, has brought privacy-enhancing data processing technologies to the forefront.
With AI applications such as personalization, privacy and data utility can be visualized as opposite ends of a spectrum. Data that doesn't contain anything personal, i.e., exposes no traits or characteristics of the customer, lends no value for personalization. Data containing personal information, on the other hand, can be used to deliver highly personalized experiences, but if the dataset ends up in the hands of any human, it can lead to a loss of customer data privacy. As a result, there is always an inherent tradeoff between privacy risk and the utility of that data.
The Health Insurance Portability and Accountability Act (HIPAA), the California Consumer Privacy Act (CCPA), the Children's Online Privacy Protection Act (COPPA), and the Biometric Identifier Act are just a few of the many privacy-centric laws in the US. Failure to comply with such regulations can cost an organization billions of dollars in fines. For example, the state of Texas recently sued Facebook's parent company, Meta, for billions of dollars in damages for mishandling and exploiting the sensitive biometric data of millions of people in the state.
Being privacy-first can help you avoid huge fines; worse than fines, you can lose your license to operate as a business. In addition, there can be a massive loss of consumer trust and loyalty, brand image, and perception. Being negligent with consumers' data privacy can demolish customer lifetime value and hurt conversions and renewals. In fact, companies like Apple have flipped the problem on its head and are using privacy as a competitive moat, a differentiator from other technology companies.
One of the key sources of privacy risk within an organization is its ML models:
An ML model's security classification should be determined by the data classification of its training data. ML model artifacts can contain plaintext customer data, and the model itself is susceptible to privacy attacks such as membership inference. If an organization runs a marketplace and shares ML models with external partners, then even under NDAs and data-sharing agreements, those models present a high risk of privacy attacks.
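To make that risk concrete, here is a minimal sketch of a confidence-based membership inference test. It is illustrative only: the dataset is synthetic, the model is deliberately overfit, and the 0.8 threshold is an arbitrary choice, but it shows how a shared model can leak who was in its training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Train a deliberately overfit model on a small synthetic dataset.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# A naive membership inference test: models tend to be more confident
# on records they were trained on than on unseen records.
train_conf = model.predict_proba(X_train).max(axis=1)
test_conf = model.predict_proba(X_test).max(axis=1)

# Guess "member" whenever confidence exceeds a threshold; accuracy
# meaningfully above 50% means the model leaks membership information.
threshold = 0.8  # illustrative value
guesses = np.concatenate([train_conf, test_conf]) > threshold
truth = np.concatenate([np.ones_like(train_conf), np.zeros_like(test_conf)])
print(f"Membership inference accuracy: {(guesses == truth).mean():.2f}")
```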
The next step is to identify the biggest gaps and start using Privacy Enhancing Technologies (PETs) to close them.
Privacy Enhancing Technologies are an area of active research, with tremendous advancements made in the last five years. Broadly, PETs fall into two buckets: data sanitization and privacy-preserving computation. Data sanitization techniques focus on detecting and modifying personal information to de-sensitize it, and include direct identifier detection and removal, pseudonymization, k-anonymization, and differential privacy.
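As a rough sketch of what the data sanitization bucket looks like in practice, the snippet below pseudonymizes a direct identifier with a keyed hash and releases a count with the textbook Laplace mechanism for differential privacy. The `SECRET_KEY` value and the `dp_count` helper are hypothetical names for illustration; in production the key would live in a secrets manager and you would use a vetted DP library.

```python
import hashlib
import hmac
import numpy as np

# Hypothetical key; in practice, fetch it from your secrets manager / KMS.
SECRET_KEY = b"replace-with-a-key-from-your-secrets-manager"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).

    Unlike a plain hash, a keyed hash can't be reversed by brute-forcing
    common values (emails, phone numbers) without the key.
    """
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def dp_count(true_count: int, epsilon: float = 1.0) -> float:
    """Release a count with Laplace noise calibrated to sensitivity 1,
    the standard epsilon-differentially-private mechanism for counts."""
    return true_count + np.random.laplace(scale=1.0 / epsilon)

record = {"email": "jane@example.com", "zip": "94107", "purchases": 12}
record["email"] = pseudonymize(record["email"])  # direct identifier removed
print(record)
print(f"DP count of users in 94107: {dp_count(1280, epsilon=0.5):.1f}")
```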
Privacy-preserving computation techniques focus on operating on private data inside a closed environment where no human has access to it. They include homomorphic encryption, [secure multi-party computation](https://en.wikipedia.org/wiki/Secure_multi-party_computation), federated learning, and trusted execution environments (TEEs). One of the most efficient ways of managing privacy risk is to do all the processing on the edge device itself. This mitigates the biggest risk, customer data leaving the device, which matters especially for consumer applications, and has the added benefit of delivering a hyper-targeted experience in a distributed environment without burning cloud compute costs. The biggest challenge is preventing overfitting, since the volume of data on any single device may not be sufficient, which can hurt the quality of predictions.
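To illustrate the idea behind federated learning, here is a minimal federated averaging sketch, assuming a simple linear model and synthetic per-device data. The `local_update` function and client setup are invented for illustration; real systems (e.g., TensorFlow Federated or Flower) add secure aggregation, client sampling, and weighting by dataset size.

```python
import numpy as np

def local_update(weights: np.ndarray, X: np.ndarray, y: np.ndarray,
                 lr: float = 0.1, epochs: int = 5) -> np.ndarray:
    """One client's on-device training for a linear least-squares model.
    The raw data (X, y) never leaves the device; only weights are returned."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

# Simulate three devices, each holding a private local dataset.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

# Federated averaging: the server only ever sees model weights.
global_w = np.zeros(2)
for _ in range(10):
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    global_w = np.mean(local_ws, axis=0)  # aggregate, equal weighting here

print(f"Learned weights: {global_w.round(2)}  (true: {true_w})")
```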
Privacy engineering and Privacy Enhancing Technologies are a rapidly evolving field, and the best way to stay up to date is to attend conferences, read research papers, and join the open-source community. Lastly, I recommend starting small, with simple techniques like detecting and removing direct identifiers, before moving on to more complex approaches.