Machine Learning Research
Ever wanted to rob a house?
We discovered that, with nothing more than publicly-available data, you can figure out the real names and addresses of 40% of houses listed on Airbnb. We’ll show you how anyone could use this method to rob their choice of 1.2 million houses — and other ways this could hurt Airbnb’s bottom line.
Over three million houses are listed on Airbnb, the popular house-sharing platform that many travelers use as an alternative to hotels. Because hosts are offering up their actual houses to strangers, Airbnb reassures hosts that their privacy is totally protected.
Before you book a room on Airbnb, you can only see the host’s first name and the rough location of their house:
But we found a way for anyone to figure out a host’s full name, address, and other personal information without booking a room, a method that has surprisingly high accuracy. All it takes is some ingenuity and publicly-available data.
All 50 states are legally required to make their voter files public. This means that anyone can get the name, address, date of birth, and voter registration history of any voter in a state: over 200 million individuals nationwide.
Harvard professor Latanya Sweeney gave us Wisconsin’s voter file, which contained information about all 3.4 million registered voters in the Badger State. Here’s the basic information we had on each person from that voter file:
Since Airbnb does not expose an API for requesting and filtering listings, we put on our white hats and figured out a way to reverse-engineer Airbnb’s web traffic to grab all the listings in Wisconsin. For each listing, we got information about the host’s first name, their town, and their approximate latitude and longitude — that’s it.
The next step was to write an algorithm that would match each Airbnb host with a voter in the voter file. For instance, say we knew from the voter file that there are ten Douglases living in Madison, Wisconsin. And say we found an Airbnb listing in Madison whose host was named Douglas. We'd run a geocoder to convert each Douglas's address to a latitude and longitude, and then use a distance formula to determine how far each Douglas in the voter file was from the Douglas who owned the listing.
Our hypothesis was that the closest Douglas to the listing was the most likely to be the host. If we were right, we’d immediately know the host’s full name, address, phone number, and all the other information on them in the voter file!
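The matching step described above amounts to a filter followed by a nearest-neighbor search. Here is a minimal sketch, assuming the voter addresses have already been geocoded to coordinates; the field names and sample data are hypothetical, not the actual voter-file schema:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 3956 * 2 * asin(sqrt(a))  # 3956 mi ~ Earth's radius

def best_match(listing, voters):
    """Return the same-first-name, same-town voter closest to the listing."""
    candidates = [v for v in voters
                  if v["first_name"] == listing["first_name"]
                  and v["town"] == listing["town"]]
    if not candidates:
        return None
    return min(candidates,
               key=lambda v: haversine_miles(listing["lat"], listing["lon"],
                                             v["lat"], v["lon"]))

# Hypothetical data: two Douglases in Madison, addresses already geocoded.
voters = [
    {"first_name": "Douglas", "town": "Madison", "lat": 43.0731, "lon": -89.4012},
    {"first_name": "Douglas", "town": "Madison", "lat": 43.1100, "lon": -89.3500},
]
listing = {"first_name": "Douglas", "town": "Madison", "lat": 43.0745, "lon": -89.3990}
print(best_match(listing, voters))  # prints the closest same-name voter
```

In practice the town filter matters as much as the distance: it shrinks the candidate pool from millions of voters to a handful of same-name neighbors before any geometry is done.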
To test our method, we ran the algorithm on 84 Airbnb listings in Wisconsin. Then each of us independently and manually determined the identity of each Airbnb host we tested, to see whether our algorithm's prediction was correct. We had a bag of tricks we used to identify people. For one, we could visually compare people's houses on Google Maps Street View to the pictures available in the Airbnb listing:
Other tricks involved looking up the host on LinkedIn (sometimes their Airbnb profile would tell you their alma mater), using Google’s reverse image search on the host’s profile picture, or using reviews to triangulate their location (one review said the house was “past three mailboxes, up the road, and the second house on the right.”)
It turns out that our algorithm figured out the correct host for 34 of the 84 properties we tested — that’s a 40% success rate. If our sample was representative, that means that you could find the full name and address of 1.2 million of the 3 million Airbnb hosts out there.
If you wanted to rob a house using Airbnb, you could just book the house, find the address, and ransack it once the owners leave. But that’s not such a hot idea, because Airbnb makes you provide your government-issued ID and connect your social media profiles before you book a room. So if an Airbnb host found their china missing one morning, it wouldn’t be too hard for them to find you.
Instead, here’s how a bad guy with some technical chops could rob a house. They could find Airbnb listings in their area that were available for tonight — especially the “instantly bookable” ones, since that feature is usually used by owners who are on vacation. Then they could use this algorithm to figure out the address of a target house. Since no one will be home, the bad guy can rob the house without anyone knowing it was them.
To be clear, we don’t condone robbery, but we know that there are robbers out there — so for the sake of millions of hosts, Airbnb needs to take a hard look at this gaping security hole.
We hear you cynics out there: “Airbnb doesn’t care if people’s homes get robbed! All companies care about are their profits.” And you might be right. But even if the company doesn’t care that hosts are in danger, you can bet that Airbnb would care if their bottom line were in danger. As it turns out, their cash flow is anything but safe.
Airbnb charges hosts a 3–5% commission and guests a 6–12% fee on every booking. But what if guests and hosts could connect outside the platform? Airbnb is pretty paranoid about this, blocking any attempt by guests and hosts to exchange email addresses or phone numbers. Yet what if a prospective guest could find a listing on Airbnb, use our algorithm to figure out the host's home address and phone number, and arrange for a rental and payment outside the Airbnb app? This would save both parties money and leave Airbnb holding the bag.
Sound good? Here's why Airbnb should take note: much of Airbnb's revenue comes from these service fees. Airbnb made $900 million in 2015, so if the 40% of hosts our algorithm can reidentify took their bookings off-platform, Airbnb's faulty privacy guarantee could cost them up to $360 million per year.
It’s no secret that Airbnb is unpopular with many cities, such as New York and San Francisco, for evading the usual hotel taxes that are a major source of revenue for cities. In other cases, unscrupulous landlords use Airbnb to operate illegal hotels in rent-controlled housing units, which hurts the low-income residents the housing was intended for.
In 2016, New York made it illegal to advertise rentals shorter than 30 days, a law aimed at Airbnb. But since neither Airbnb hosts' names nor their addresses are publicly available, New York has had trouble enforcing this law. With our algorithm, regulators in New York and elsewhere could crack down on illegal Airbnb hosts by reverse-engineering the addresses of listings that violate the law.
Airbnb’s faulty privacy guarantee could cause the company a world of pain, not to mention the negative effects on the hosts. So what could they do? Think back to how Airbnb provides an approximate latitude and longitude for each listing — or, as it appears on the map, a fuzzy blue circle. Importantly, we found that the circle’s radius is about 0.7 miles no matter where you’re looking to rent.
Airbnb could increase the radius of the blue circle to make our kind of reidentification algorithm less accurate. The tradeoff is that this would make Airbnb less useful to potential renters, since location is so important to travelers. It’s especially bad in major cities: someone renting in rural Wisconsin might not mind an error of one mile, but a radius of one mile could hide the difference between renting an apartment in downtown Manhattan and one in New Jersey (New York native Aron shudders)!
Our advice? Airbnb might need to adjust their fuzzing radius based on the local population density. Denser areas would be fine with a smaller radius, since there are lots of people in even a small circle, but sparsely-populated rural areas would need much more fuzzing to protect privacy.
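One simple way to implement density-aware fuzzing is to size the circle so it always covers roughly the same number of residents. This is a sketch under our own assumptions, not anything Airbnb does; the target of 5,000 people and the 0.7-mile floor are illustrative parameters:

```python
from math import pi, sqrt

def fuzz_radius_miles(pop_density_per_sq_mile, target_people=5000, min_radius=0.7):
    """Pick a fuzzing radius so the circle covers roughly `target_people`
    residents, never dipping below the current ~0.7-mile radius.
    Solves target_people = density * pi * r^2 for r."""
    radius = sqrt(target_people / (pi * pop_density_per_sq_mile))
    return max(radius, min_radius)

# Manhattan: ~70,000 people per square mile; a small circle already hides thousands.
print(round(fuzz_radius_miles(70000), 2))  # -> 0.7 (stays at the floor)
# Rural Wisconsin: ~20 people per square mile; a much larger circle is needed.
print(round(fuzz_radius_miles(20), 2))     # -> 8.92
```

The inverse-square-root relationship is the key point: protecting a rural host to the same degree as an urban one requires a radius tens of times larger, which is exactly the utility tradeoff described above.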
If you agree that Airbnb’s faulty privacy guarantees can put hosts — and Airbnb — at risk, write to Airbnb to tell them they need to do something about it.
One final note: out of respect for the privacy of those we’ve reidentified over the course of our work, we’re not going to post a public proof of concept. If you’d like to see our work in action, give us a shout.
Thanks to Professor Latanya Sweeney, Ji Su Yoo, and Jinyan Zang for their guidance.