Hamza Harkous

@hamzaharkous

We Gave Privacy Policies an AI Overhaul, and You’ll Never Have to Read Them Again!

February 10th 2018

Summary: This is the backstory of how we built pribot.org. There is a Wired article about the project already. This article tells our story of what went on behind the scenes.

Whenever someone asks me what I work on, I find nothing better than a single link to a 9-hour YouTube series in which an actor reads Amazon Kindle’s terms and conditions.

Yes, you read it correctly: 9 HOURS.

Here is the trailer:

And this is a link to the full series.

Hopefully, you’re still with me at this early stage and you didn’t get lost in an endless spiral of similar videos recommended by YouTube, such as people streaming themselves reading privacy policies of Google, Facebook, and Apple, for hours.

Actually, the privacy policy, the equally annoying sibling of terms and conditions, is the subject of this article.

Have you ever thought about how long it would really take to read all the policies for the services we use in a year? That would be 201 hours, according to research by McDonald and Cranor in 2008.

So given the choice between this unpaid, exhausting task and anything else, it’s no wonder that people prefer spending time on their yearly exercise goals, reading an amusing book, or just relaxing. This holds even when they hear frightening stories of what’s inside privacy policies (like this and this).

Researchers have tried to make these policies simpler, primarily by manual methods, like websites offering standardized versions of their policy, inspired by nutrition labels. Approaches relying on the wisdom of the crowd have received good traction too (like the “Terms of Service; Didn’t Read” project). Yet, these attempts didn’t scale due to the huge manual effort and human expertise involved.

Fixing Privacy Policies: the Backstory 🛠

Two years ago, my colleagues and I wrote a paper for a workshop on the future of privacy notices. In it, we proposed a vision for turning privacy policies into a conversation via a chatbot called PriBot.

Let’s admit it: 2016 was the year of chatbots, and we were — shamelessly — motivated by the hype. After all, the best interface is no interface. Right?

Our idea was that you could ask PriBot about any privacy policy like you ask Siri for the capital of Ivory Coast (spoiler: Yamoussoukro). PriBot would then respond in real time with an answer from the policy itself.

Fast forward ⏩

Twenty months later, we have brought our vision to fruition in a new research paper, in which we first show how we realized the goal of automated Question Answering for privacy policies. This is a collaboration between Hamza Harkous (yours truly), Kassem Fawaz, Rémi Lebret, Florian Schaub, Kang G. Shin and Karl Aberer.

As our first research outcome, we’re introducing PriBot to the public, via a chatbot that you can converse with right now at pribot.org/bot:
PriBot in action

On the way to building PriBot, we ended up with a surprising byproduct, which even has the potential to make a wider impact: we built a general system for automatically analyzing privacy policies using machine learning. We call it Polisis.

Polisis gives you a glimpse of a privacy policy: the data being collected, the information shared with third parties, the security measures implemented by the company, the choices you have, etc. All that without having to read a single line of the policy itself.

We’re releasing Polisis too at pribot.org/polisis. You can alternatively download its Chrome extension or Firefox addon to analyze websites in a click.
Polisis in action

To our knowledge, Polisis is the first system to provide such in-depth automated analysis of privacy policies.

Who is this for? 👩‍⚖️🕵️‍👨‍💻

We envision three audiences:

  • General users: We designed PriBot and Polisis to be highly intuitive for the general user who is interested in the privacy aspects of the sites they use.
  • Regulators: We envision that the technology powering Polisis could be used for large-scale analysis of privacy policies by regulatory agencies. For example, in our paper, we used Polisis to show how privacy certification companies (such as TRUSTe, now TrustArc) have been highly permissive with businesses.
  • Researchers: A lot of research has studied apps and websites in an automated way based on their code, their embedded scripts, or on what they share over the network. A missing piece of the puzzle is what these apps promise in their privacy policies. We hope our work can empower researchers with new insights from the policies’ angle.

To that end, we would be glad to collaborate with regulators, researchers, and the industry at large. Feel free to reach out if you are interested.

No Magic Pill

Now to the research part!

You might say: “couldn’t you do the above by combining a few APIs or open source projects? Didn’t IBM Watson beat humans at answering “Jeopardy!” questions? And why couldn’t you use commercial services like Microsoft QnA Maker?”

The answer is that such systems are not magic pills. If they are trained on a specific domain, like insurance questions, they are rarely adequate for others. If they are trained on a general domain, they suffer when tested on specific problems.

Imagine asking the question: “do you gather my address info?” about a privacy policy. Almost every QA system will favor the answer “for your info, we work hard to address your issues” over the answer “we use your location for customizing our service.” Obviously, the second is the better answer. Yet, this is not easy to get.

What makes things harder is that there are no public datasets of questions and answers about privacy policies waiting to be trained on. So traditional approaches to this problem were not the way forward.
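
To make the failure concrete, here is a minimal sketch of a purely lexical scorer. It is not any real QA system; the function names are hypothetical, and the scoring (counting shared words) is a deliberately naive stand-in for keyword-driven matching.

```python
import re

def tokens(text):
    """Lowercase a string and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_score(question, answer):
    """Count the words shared by the question and a candidate answer."""
    return len(tokens(question) & tokens(answer))

question = "do you gather my address info?"
candidates = [
    "for your info, we work hard to address your issues",
    "we use your location for customizing our service",
]

scores = {c: overlap_score(question, c) for c in candidates}
best = max(candidates, key=scores.get)

for c in candidates:
    print(scores[c], c)
print("picked:", best)
```

The first candidate shares the words "address" and "info" with the question, so it wins on word overlap even though it says nothing about data collection. The second candidate, the one we actually want, shares no words at all.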

A Hierarchical Approach 💤

We went the other way around. We focused on solving the problem of automatically labeling segments in the privacy policy, producing Polisis. Then we leveraged that solution for the QA problem, producing PriBot. At a high level, our approach for automatically labeling segments was as follows:

  • Unsupervised Learning step: We first trained a word embedding model on 130K privacy policies that we collected.
  • Supervised Learning step: We then trained a hierarchy of 22 classifiers (each being a neural network) for labeling the different aspects of the policy. We relied on the valuable OPP-115 dataset from the Usable Privacy Policy Project for this part.
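
The "hierarchy" part of the second step can be sketched as follows. This is only an illustration of the dispatch logic, under loud assumptions: the keyword matchers stand in for the trained neural networks, the embedding step is left out entirely, and the category and attribute names are made up for the example rather than taken from our label set.

```python
from dataclasses import dataclass

@dataclass
class KeywordClassifier:
    """Toy stand-in for a trained neural classifier: label -> trigger keywords."""
    keywords: dict

    def predict(self, text):
        """Return every label that has at least one keyword in the text."""
        words = set(text.lower().replace(",", " ").replace(".", " ").split())
        return sorted(
            label for label, kws in self.keywords.items() if words & set(kws)
        )

# Top level: which high-level category does a segment talk about?
# (Illustrative labels, not the real ones.)
top_level = KeywordClassifier({
    "first-party-collection": ["collect", "gather"],
    "third-party-sharing": ["share", "partners"],
})

# Lower level: one classifier per category, refining the segment further.
lower_level = {
    "first-party-collection": KeywordClassifier({
        "data-type:location": ["location", "gps"],
        "data-type:contact": ["email", "phone"],
    }),
    "third-party-sharing": KeywordClassifier({
        "purpose:advertising": ["advertising", "ads"],
        "purpose:analytics": ["analytics"],
    }),
}

def label_segment(text):
    """Run the top-level classifier, then only the lower-level ones it selects."""
    return {cat: lower_level[cat].predict(text) for cat in top_level.predict(text)}

labels = label_segment(
    "We collect your location and share it with advertising partners."
)
print(labels)
```

The point of the structure is that a lower-level classifier only runs on segments the top level has already assigned to its category, so each one can specialize on a narrow question.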

Spoiler: if you just read these steps and try to reproduce the results, you will get horrible performance. Our paper discusses the devil in the details, from data preprocessing to classifier selection, which is what leads to high accuracy.

How can we move from classification to QA?

Let’s say you had a question “Do you share my info?”. To get an answer, we first break the policy into small standalone segments. Each segment is a candidate answer. Then, we rank the answers by their similarity to the question.

The similarity is measured by seeing which answers receive labels that are “close” to those of the question. To get these labels, we pass both the question and the answers through our classification hierarchy. How to define “close” is also important. For example, questions are frequently broad. You don’t want to frustrate the user with close but generic answers. Hence, we came up with a new similarity algorithm to account for this issue (details in the paper).
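
A rough sketch of the ranking idea is shown here. It does not reproduce the similarity algorithm from the paper; it swaps in plain cosine similarity over label-probability vectors, and the vectors themselves are hypothetical numbers rather than real classifier output.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_segments(question_labels, segment_labels):
    """Sort segment texts by how close their label vector is to the question's."""
    scored = [
        (cosine(question_labels, vec), text) for text, vec in segment_labels.items()
    ]
    return [text for score, text in sorted(scored, reverse=True)]

# Hypothetical label probabilities, in the order (collection, sharing, security),
# as a classifier hierarchy might emit them.
question = [0.1, 0.9, 0.0]  # "Do you share my info?"
segments = {
    "We may share your data with partners.": [0.2, 0.8, 0.0],
    "We collect your email address.": [0.9, 0.1, 0.0],
    "We encrypt data in transit.": [0.0, 0.1, 0.9],
}

ranking = rank_segments(question, segments)
for text in ranking:
    print(text)
```

Because the comparison happens in label space rather than word space, a segment can rank first without sharing a single word with the question. What plain cosine does not handle is the broad-question case described above, which is the gap the algorithm in the paper addresses.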

Examples We Love

Hopefully, by now you are willing to give our tools a try. Head over to pribot.org to see them both in action.

And to get inspired when you stress-test PriBot, you can check the following examples.

Here, PriBot works although there are no common words between the question and the answer:

…or when our hands get sloppy, and we misspell a few critical words (you can thank subword embeddings for this):
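
For some intuition on why subword embeddings tolerate typos, here is a small sketch in the style popularized by fastText, where a word is broken into character n-grams with boundary markers. Real subword embeddings combine learned vectors for those n-grams; the Jaccard overlap below is only a simplified proxy, and the function names are hypothetical.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_overlap(a, b, n=3):
    """Jaccard overlap between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)

typo = ngram_overlap("address", "adress")
unrelated = ngram_overlap("address", "cookies")
print(typo, unrelated)
```

A misspelled word keeps most of the n-grams of the intended word, so its representation stays close to the original instead of collapsing into an "unknown word" bucket.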

PriBot also notifies users when there is a contradiction in the potential answers:

…and it tries to not appear stupid when presented with irrelevant questions:

Likewise, you can give Polisis a go, and you can find a few examples below:

With Fitbit’s policy, you can get an overview of what data the company collects. By clicking on “Analytics/Research”, you can see all the data being collected for that purpose, with the options you get.

You can also see that Fitbit shares health information with third parties in the second tab. Hovering over the link will give you the exact evidence from the policy itself.

If the policy gives you choices to mitigate data collection, you can see these choices in the dedicated tab, along with the links to opt-in or opt-out.

Finally, if there is no information about a certain aspect in the policy, we explain why this is the case.

We hope you enjoy playing with these services, and we welcome your feedback. We know well the limitations of this technology. Hence, we do not claim that the results are legally binding or are completely accurate.

Yet, we believe that this is a great step forward on the road to Make Privacy Policies Cool (I’m tempted to end the sentence with ‘Again’, but they were never cool before 😀 )!
