Summary: This is the backstory of how we built pribot.org. There is a Wired article about the project already. This article provides our story of what went on behind the scenes.

Whenever someone asks me what I work on, I find nothing better than a single link to a 9-hour YouTube series filming an actor reading Amazon Kindle’s terms and conditions. Yes, you read it correctly: 9 HOURS.

Here is the trailer. And this is a link to the full series.

Hopefully, you’re still with me at this early stage and you didn’t get lost in an endless spiral of similar videos recommended by YouTube, such as people streaming themselves reading the privacy policies of Google, Facebook, and Apple for hours.

Actually, the privacy policy, the equally annoying sibling of terms and conditions, is the subject of this article. Have you thought about how long it would really take to read all the policies for the services we use per year? That would be 201 hours, according to research by McDonald and Cranor in 2008.

So given the choice between this unpaid, exhausting task and anything else, it’s no wonder that people prefer spending time fulfilling their yearly exercise goals, reading an amusing book, or just relaxing. This holds even when they hear frightening stories of what’s inside privacy policies (like this and this).

Researchers have tried to make these policies simpler, primarily by manual methods, like websites offering standardized versions of their policy, inspired by nutrition labels. Approaches relying on the wisdom of the crowd have received good traction too (like the “Terms of Service; Didn’t Read” project). Yet, these attempts didn’t scale due to the huge manual effort and human expertise involved.

Fixing Privacy Policies: the Backstory 🛠

Two years ago, my colleagues and I wrote a paper for a workshop on the future of privacy notices. In it, we proposed a vision for turning privacy policies into a conversation via a chatbot called PriBot.

Let’s admit it: 2016 was the year of chatbots, and we were, shamelessly, motivated by the hype. After all, the best interface is no interface. Right?

Our idea was that you could ask PriBot about any privacy policy like you ask Siri for the capital of Ivory Coast (spoiler: Yamoussoukro). PriBot would then respond in real time with an answer from the policy itself.

Fast forward ⏩

Twenty months later, we have brought our vision to fruition in a new research paper, in which we first show how we realized the goal of automated Question Answering for privacy policies. This is a collaboration between Hamza Harkous (yours truly), Kassem Fawaz, Rémi Lebret, Florian Schaub, Kang G. Shin, and Karl Aberer.

As our first research outcome, we’re introducing PriBot to the public, via a chatbot that you can converse with right now at pribot.org/bot:

PriBot in action

On the way to building PriBot, we had a surprising byproduct, which even has the potential to make a wider impact: we built a general system for automatically analyzing privacy policies using machine learning. We call it Polisis.

Polisis gives you a glimpse of a privacy policy, like the data being collected, the information shared with third parties, the security measures implemented by the company, the choices you have, etc. All that without having to read a single line of the policy itself.

We’re releasing Polisis too at pribot.org/polisis. You can alternatively download its Chrome extension or Firefox addon to analyze websites in a click.

Polisis in action

To our knowledge, Polisis is the first system to provide such in-depth automated analysis of privacy policies.

Who is this for? 👩‍⚖️🕵️‍👨‍💻

We have three audiences envisioned:

General users: We designed PriBot and Polisis to be highly intuitive for the general user who is interested in the privacy aspects of the sites they use.

Regulators: We envision that the technology powering Polisis could be used for large-scale analysis of privacy policies by regulation agencies. For example, in our paper, we used Polisis to show how privacy certification companies (such as TRUSTe, now TrustArc) have been highly permissive with businesses.

Researchers: A lot of research has studied apps and websites in an automated way based on their code, their embedded scripts, or on what they share over the network. A missing piece of the puzzle is what these apps promise in their privacy policies. We hope our work can empower researchers with new insights from the policies’ angle.

To that goal, we would be glad to collaborate with regulators, researchers, and the general industry. Feel free to reach out if you are interested.

No Magic Pill

Now to the research part! You might say: “couldn’t you do the above by combining a few APIs or open source projects? Didn’t IBM Watson beat humans at answering “Jeopardy!” questions? And why couldn’t you use commercial services like Microsoft QnA Maker?”

The answer is that such systems are not magic pills. If they are trained on specific domains, like insurance questions, they are rarely adequate for others. If they are trained on a general domain, they suffer when tested with specific problems.

Imagine asking the question “do you gather my address info?” around a privacy policy. Almost every QA system will favor the answer “for your info, we work hard to address your issues” over the answer “we use your location for customizing our service.” Obviously, the second is the better answer. Yet, this is not easy to get.

What makes things harder is that there are no public datasets of questions and answers about privacy policies waiting to be trained on. So traditional approaches to this problem were not the way forward.

A Hierarchical Approach 💤

We took the other way around.
We focused on solving the problem of automatically labeling segments in the privacy policy, producing Polisis. Then we leveraged that solution for the QA problem, producing PriBot. At a high level, our approach for automatically labeling segments was as follows:

Unsupervised Learning step: We first trained a word embedding model on 130K privacy policies that we collected.

Supervised Learning step: We then trained a hierarchy of 22 classifiers (each being a neural network) for labeling the different aspects of the policy. We relied on the valuable OPP-115 dataset from the Usable Privacy Project for this part.

Spoiler: if you just read these steps and try to reproduce the results, that will lead to horrible performance. Our paper discusses the devil in the details, which lead to a high accuracy, from data preprocessing to classifier selection, etc.

How can we move from classification to QA?

Let’s say you had a question “Do you share my info?”. To get an answer, we first break the policy into small standalone segments. Each segment is a candidate answer. Then, we rank the answers by their similarity to the question.

The similarity is measured by seeing which answers receive “close” labels to the question. To get these labels, we pass both the question and the answers through our classification hierarchy. How to define “close” is also important. For example, questions are frequently broad. You don’t want to frustrate the user with close, but generic answers. Hence, we came up with a new similarity algorithm to account for this issue (details in the paper).

Examples We Love

Hopefully, by now you are willing to give our tools a try. Head over to pribot.org to see them both in action.

And to get inspired when you stress-test PriBot, you can check the following examples.
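Before the examples, a rough illustration of the ranking idea described above may help. This is only a toy sketch under loud assumptions: the keyword table `LABEL_KEYWORDS` is a hypothetical stand-in for the trained classifier hierarchy, the label names are made up for illustration, and plain cosine similarity is swapped in for the similarity algorithm from the paper. It only shows the general shape: label the question, label each segment, and rank segments by how close their labels are.

```python
import math

# Hypothetical label -> keyword table. This stands in for the hierarchy of
# neural classifiers; the real system does not work by keyword matching.
LABEL_KEYWORDS = {
    "third_party_sharing": {"share", "sharing", "third", "partners", "disclose"},
    "data_collection": {"collect", "gather", "record", "receive"},
    "security": {"encrypt", "secure", "security", "protect"},
}


def label_scores(text):
    """Toy classifier: fraction of each label's keywords found in the text."""
    cleaned = text.lower()
    for ch in "?.,":
        cleaned = cleaned.replace(ch, " ")
    words = set(cleaned.split())
    return {
        label: len(words & keywords) / len(keywords)
        for label, keywords in LABEL_KEYWORDS.items()
    }


def cosine(a, b):
    """Cosine similarity between two label-score dicts (0.0 if either is all zeros)."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_segments(question, segments):
    """Order candidate segments by label similarity to the question, best first."""
    question_labels = label_scores(question)
    return sorted(
        segments,
        key=lambda seg: cosine(question_labels, label_scores(seg)),
        reverse=True,
    )


segments = [
    "We encrypt data in transit to protect your account.",
    "We may disclose your information to advertising partners.",
    "We collect the device identifiers that your browser sends.",
]
ranked = rank_segments("Do you share my info?", segments)
print(ranked[0])
```

In this toy setup the top-ranked segment is the one about disclosing information to partners, even though it shares no words with the question: the match happens at the label level. Note that plain cosine similarity does nothing about the "close, but generic answers" issue mentioned above; handling that is what the paper's own similarity measure is for.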
Here, PriBot works although there are no common words between the question and the answer:

…or when our hands get sloppy, and we misspell a few critical words (you can thank subword embeddings for this):

PriBot also notifies users when there is a contradiction in the potential answers:

…and it tries to not appear stupid when presented with irrelevant questions:

Likewise, you can give Polisis a go, and you can see a few examples below:

With Fitbit’s policy, you can get an overview of what data the company collects. By clicking on “Analytics/Research”, you can see all the data being collected for that purpose, with the options you get.

You can also see that Fitbit shares health information with third parties in the second tab. Hovering over the link will give you the exact evidence from the policy itself.

If the policy gives you choices to mitigate data collection, you can see these choices in the dedicated tab, along with the links to opt in or opt out.

Finally, if there is no information about a certain aspect in the policy, we give an explanation of why this is the case.

We hope you enjoy playing with these services, and we welcome your feedback. We know well the limitations of this technology. Hence, we do not claim that the results are legally binding or completely accurate. Yet, we believe that this is a great step forward on the road to Make Privacy Policies Cool (I’m tempted to end the sentence with ‘Again’, but they were never cool before 😀)!

And thanks for reading! You might also be interested in checking my other articles on my Medium page (Hamza Harkous on Medium, medium.com: Postdoc at EPFL, Switzerland; working at the intersection of Privacy, NLP…) or my website (hamzaharkous.com).


We Gave Privacy Policies an AI Overhaul, and You’ll Never Have to Read Them Again!
