Whenever someone asks me what I work on, I find nothing better than sending a single link: a 9-hour YouTube series of an actor reading Amazon Kindle’s terms and conditions.
Yes, you read it correctly: 9 HOURS.
Here is the trailer:
And this is a link to the full series.
Hopefully, you’re still with me at this early stage and you didn’t get lost in an endless spiral of similar videos recommended by YouTube, such as people streaming themselves reading the privacy policies of other companies for hours.
Actually, the privacy policy, the equally annoying sibling of terms and conditions, is the subject of this article.
Have you ever wondered how long it would really take to read the privacy policies of all the services we use? About 201 hours per year, according to a 2008 study by McDonald and Cranor.
So given the choice between this unpaid, exhausting task and anything else, it’s no wonder that people prefer spending their time meeting their yearly exercise goals, reading an amusing book, or just relaxing. That holds even after they hear frightening stories about what lurks inside privacy policies (like this and this).
Researchers have tried to make these policies simpler, primarily through manual methods, such as websites offering standardized versions of their policies inspired by nutrition labels. Approaches relying on the wisdom of the crowd have gained good traction too (like the “Terms of Service; Didn’t Read” project). Yet these attempts did not scale, due to the huge manual effort and human expertise involved.
Two years ago, my colleagues and I wrote a paper for a workshop on the future of privacy notices. In it, we proposed a vision for turning privacy policies into a conversation via a chatbot called PriBot.
Let’s admit it: 2016 was the year of chatbots, and we were — shamelessly — motivated by the hype. After all, the best interface is no interface. Right?
Our idea was that you could ask PriBot about any privacy policy like you ask Siri for the capital of Ivory Coast (spoiler: Yamoussoukro). PriBot would then respond in real time with an answer from the policy itself.
Twenty months later, we have brought our vision to fruition in a new research paper, in which we first show how we realized the goal of automated Question Answering for privacy policies. This work is a collaboration between Hamza Harkous (yours truly), Kassem Fawaz, Rémi Lebret, Florian Schaub, Kang G. Shin, and Karl Aberer.
As our first research outcome, we’re introducing PriBot to the public, via a chatbot that you can converse with right now at pribot.org/bot:
PriBot in action
On the way to building PriBot, we ended up with a surprising byproduct, one with the potential to make an even wider impact: a general system for automatically analyzing privacy policies using machine learning. We call it Polisis.
Polisis gives you a glimpse of a privacy policy: the data being collected, the information shared with third parties, the security measures implemented by the company, the choices you have, etc. All that without having to read a single line of the policy itself.
We’re releasing Polisis too, at pribot.org/polisis. You can alternatively download its Chrome extension or Firefox add-on to analyze websites in a click.
Polisis in action
To our knowledge, Polisis is the first system to provide such in-depth automated analysis of privacy policies.
We envision three audiences for these tools: regulators, researchers, and industry. To that end, we would be glad to collaborate with all three by providing API access to Polisis. Feel free to reach out if you are interested.
Now to the research part!
You might say: “Couldn’t you do the above by combining a few APIs or open-source projects? Didn’t IBM Watson beat humans at answering Jeopardy! questions? And why couldn’t you use commercial services like Microsoft QnA Maker?”
The answer is that such systems are not magic pills. If they are trained on a specific domain, like insurance questions, they are rarely adequate for others. If they are trained on a general domain, they suffer when tested on specific problems.
Imagine asking the question “do you gather my address info?” about a privacy policy. Almost every QA system will favor the answer “for your info, we work hard to address your issues” over the answer “we use your location for customizing our service.” Obviously, the second is the better answer. Yet it is not easy to get right.
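To make the trap concrete, here is a toy sketch, using only the strings from the example above. The scoring function is a naive stand-in, not any real QA system, but it shows how plain word overlap ranks the wrong answer first:

```python
import re

def tokens(text):
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z']+", text.lower()))

def overlap_score(question, answer):
    """Fraction of question words that literally appear in the answer."""
    q, a = tokens(question), tokens(answer)
    return len(q & a) / len(q)

question = "do you gather my address info?"
bad_answer = "for your info, we work hard to address your issues"
good_answer = "we use your location for customizing our service"

print(overlap_score(question, bad_answer))   # 0.33: shares "address" and "info"
print(overlap_score(question, good_answer))  # 0.0: no words in common
```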
What makes things harder is that there are no public datasets of questions and answers about privacy policies waiting to be trained on. So traditional approaches to this problem were not the way forward.
We went the other way around. First, we focused on solving the problem of automatically labeling segments in the privacy policy, producing Polisis. Then we leveraged that solution for the QA problem, producing PriBot. At a high level, our approach for automatically labeling segments was as follows:

1. Break the privacy policy into small standalone segments.
2. Embed the words of each segment using subword embeddings trained on a large corpus of privacy policies.
3. Pass each segment through a hierarchy of classifiers that assign it privacy labels, such as the data collected, the parties it is shared with, and the choices offered.
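To give step 3 some shape, here is a minimal, hypothetical sketch of multi-label segment classification. It uses scikit-learn with TF-IDF features as a stand-in for the privacy-specific embeddings and neural classifiers from the paper, and the four training segments and their labels are invented for illustration:

```python
# A toy stand-in for Polisis-style segment labeling: the real system uses
# privacy-specific subword embeddings and a hierarchy of neural classifiers;
# here, TF-IDF + one-vs-rest logistic regression on made-up data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

train_segments = [
    "we collect your email address and phone number",
    "we share aggregated data with advertising partners",
    "you may opt out of marketing emails at any time",
    "we use your location to personalize the service",
]
train_labels = [
    {"first-party-collection"},
    {"third-party-sharing"},
    {"user-choice"},
    {"first-party-collection"},
]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(train_labels)

model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(train_segments, y)

# Score an unseen policy segment against every label.
new_segment = "we disclose your information to our partners"
for label, p in zip(mlb.classes_, model.predict_proba([new_segment])[0]):
    print(f"{label}: {p:.2f}")
```

The interface is the point here: text goes in, label probabilities come out, and the hierarchy in the real system refines high-level categories into finer attributes.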
Spoiler: if you just read these steps and try to reproduce the results, you will end up with horrible performance. Our paper discusses the devil in the details that leads to high accuracy, from data preprocessing to classifier selection, and more.
How can we move from classification to QA?
Let’s say you had the question “Do you share my info?”. To get an answer, we first break the policy into small standalone segments; each segment is a candidate answer. Then we rank the candidates by their similarity to the question.
The similarity is measured by seeing which answers receive labels “close” to the question’s. To get these labels, we pass both the question and the candidate answers through our classification hierarchy. How to define “close” is also important. For example, questions are frequently broad, and you don’t want to frustrate the user with close but generic answers. Hence, we came up with a new similarity algorithm to account for this issue (details in the paper).
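Here is a small sketch of that ranking idea. The `label_probs` function below fakes the classification hierarchy with keyword rules, and plain cosine similarity stands in for the custom similarity measure from the paper; both are stand-ins, not the actual implementation:

```python
# Rank candidate answers by how close their labels are to the question's.
import numpy as np

KEYWORDS = {  # three made-up labels with toy trigger words
    "collection": ["collect", "gather", "location", "address"],
    "sharing": ["share", "partners", "third"],
    "choice": ["opt", "choice", "control"],
}

def label_probs(text):
    """Probability-like scores over the made-up labels (fake classifier)."""
    t = text.lower()
    scores = np.array([sum(w in t for w in ws) for ws in KEYWORDS.values()], dtype=float)
    return scores / scores.sum() if scores.sum() else np.ones(len(KEYWORDS)) / len(KEYWORDS)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

question = "do you share my info?"
segments = [  # each standalone policy segment is a candidate answer
    "we share aggregated data with advertising partners",
    "we collect your email address",
    "you may opt out of marketing emails",
]

q = label_probs(question)
ranked = sorted(segments, key=lambda s: cosine(q, label_probs(s)), reverse=True)
print(ranked[0])  # the sharing-related segment ranks first
```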
Hopefully, by now you are willing to give our tools a try. Head over to pribot.org to see them both in action.
Here, PriBot works even though the question and the answer share no common words:
…or when our hands get sloppy and we misspell a few critical words (you can thank subword embeddings for this):
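If you are curious why subword embeddings tolerate typos, here is a tiny gensim FastText sketch (a toy corpus, not the embeddings from our paper): the word vector is composed from character n-grams, so a misspelling never seen in training still lands near the correct spelling:

```python
# FastText builds word vectors from character n-grams, so an out-of-vocabulary
# misspelling still gets a sensible vector. Toy corpus; assumes gensim installed.
from gensim.models import FastText

corpus = [
    ["we", "collect", "your", "location", "data"],
    ["your", "location", "is", "shared", "with", "partners"],
    ["we", "never", "sell", "your", "location"],
] * 50  # repeat the tiny corpus so training has enough examples

model = FastText(sentences=corpus, vector_size=32, window=3, min_count=1, epochs=10)

# "locatoin" never appears in the corpus, yet its character n-grams overlap
# heavily with "location", so the two vectors end up similar.
print(model.wv.similarity("location", "locatoin"))
```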
PriBot also notifies users when there is a contradiction in the potential answers:
…and it tries not to appear stupid when presented with irrelevant questions:
With Fitbit’s policy, you can get an overview of what data the company collects. By clicking on “Analytics/Research”, you can see all the data collected for that purpose, along with the options you have.
You can also see that Fitbit shares health information with third parties in the second tab. Hovering over the link will give you the exact evidence from the policy itself.
If the policy gives you choices to mitigate data collection, you can see these choices in the dedicated tab, along with the links to opt-in or opt-out.
Finally, if there is no information about a certain aspect in the policy, we explain why this is the case.
We hope you enjoy playing with these services, and we welcome your feedback. We are well aware of the limitations of this technology; hence, we do not claim that the results are legally binding or completely accurate.
Yet we believe this is a great step forward on the road to Make Privacy Policies Cool (I’m tempted to end the sentence with ‘Again’, but they were never cool before 😀)!
And thanks for reading! You might also be interested in checking out my other articles on my Medium page:
… or my website: