
Web Scraping and the Battle for Open Internet

by Aaron Gershwin, October 31st, 2019

Too Long; Didn't Read

The web scraping firm HiQ won a CFAA lawsuit against Microsoft’s LinkedIn. HiQ specializes in automatically gathering social media data and aggregating it. Scraping is a broad term for an automated way to collect public data. The technological efficiency of web scraping has been going up for years. Talented engineers are building complex software and networking solutions just for data gathering. Website managers hate web scraping because of the load on their servers. But overloading servers is precisely the opposite of what professionals are trying to achieve.


A few years ago, Cambridge Analytica made netizens concerned regarding the gathering of their online data. At that time, affected or interested users had little knowledge of how big the big-data industry actually was.

Professionals call it web scraping, and it recently got a significant legal boost. The web scraping firm HiQ won a CFAA lawsuit against Microsoft’s LinkedIn and made some open web enthusiasts very happy. Continue reading to learn what happened and how this contributes to the vision of an open internet.

Web Data Scraping Is on the Rise

Yes, Cambridge Analytica scraped Facebook user data, and they did so by breaking the law. However, because of the huge media exposure, web scraping and Cambridge Analytica became almost synonymous, and that’s just not right.

Much like the infamous Cambridge Analytica, HiQ specializes in automatically gathering social media data and aggregating it. To do this, it uses web scraping scripts that visit users’ profiles and save any relevant data for later use. The important part is that Cambridge Analytica did it illegally thanks to Facebook’s leaky platform, while HiQ got the public data legally.

Online data is a massive business in 2019, and it’s no surprise that the technological efficiency of web scraping has been going up for years. Talented engineers are building complex software and networking solutions just for data gathering. Other professionals are even running ‘best proxy provider research tests’ to determine who gives a web scraper the best access to public data online.

While all this might leave you wondering how web scraping is good for our society, there are many examples of how this data helps us every day, so let’s dive in.

You Use Scraped Data Too

How do you look for the best airline ticket prices online? Do you compare each airline company separately, or do you visit a flight aggregator and get all your information in one place? What’s more, don’t you love how Google has the best answers on the Web?

Price ranking pages, discount websites, and statistical analysis companies all use web scraping to get their data. They aggregate it, and then we can use it to make our lives better.

Web scraping has fascinated me for a while, but its legal status long gave me a headache. That is why I closely followed HiQ’s lawsuit and its win against LinkedIn. It means a lot for web scrapers as well as casual internet users.

Web Scraping In Reality

Web scraping is a broad term for any automated way to collect public data. A scraper takes a website’s HTML (or other) file, detects the data it needs, and then saves it.
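To make those three steps concrete, here is a minimal sketch using only Python’s standard library. The page content, the `name` and `price` class names, and the `PriceParser`, `scrape`, and `to_csv` helpers are all made up for illustration; a real scraper would first download the HTML from a site and would have to match that site’s actual markup.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this over HTTP.
SAMPLE_HTML = """
<html><body>
  <div class="product"><span class="name">Blue widget</span><span class="price">$31.50</span></div>
  <div class="product"><span class="name">Red widget</span><span class="price">$58.00</span></div>
</body></html>
"""


class PriceParser(HTMLParser):
    """Collects (name, price) pairs from <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # which span we are inside, if any
        self.current_name = None   # last product name seen
        self.products = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self.current_field = css_class

    def handle_data(self, data):
        text = data.strip()
        if not text or self.current_field is None:
            return
        if self.current_field == "name":
            self.current_name = text
        else:
            # Assumes prices look like "$31.50" in this made-up markup.
            self.products.append((self.current_name, float(text.lstrip("$"))))

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None


def scrape(html):
    """Step 1 and 2: take the HTML and detect the data we need."""
    parser = PriceParser()
    parser.feed(html)
    return parser.products


def to_csv(rows):
    """Step 3: save the data (here to a CSV string instead of a file)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "price"])
    writer.writerows(rows)
    return buffer.getvalue()


products = scrape(SAMPLE_HTML)
print(to_csv(products))
```

Most production scrapers rely on dedicated parsing libraries rather than a hand-written parser like this one, but the flow stays the same: fetch, detect, save.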

Imagine spending the afternoon comparing the prices of roof shingles from five different providers. You could visit their sites, note the prices of some products, and so on. Or you could use a web scraper that would collect that data automatically, while another script could aggregate it for easy review.
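The aggregation half of that example might look something like the sketch below. The provider names and prices are invented, and `cheapest_offers` is a hypothetical helper; in practice the input would come from whatever the scraper saved.

```python
# Hypothetical scraped results: provider -> {product: price in dollars}.
SCRAPED_PRICES = {
    "Provider A": {"asphalt shingle": 31.50, "cedar shingle": 58.00},
    "Provider B": {"asphalt shingle": 29.99, "cedar shingle": 61.25},
    "Provider C": {"asphalt shingle": 33.10},
}


def cheapest_offers(prices_by_provider):
    """Return {product: (provider, price)} with the lowest price per product."""
    best = {}
    for provider, catalog in prices_by_provider.items():
        for product, price in catalog.items():
            if product not in best or price < best[product][1]:
                best[product] = (provider, price)
    return best


offers = cheapest_offers(SCRAPED_PRICES)
for product, (provider, price) in sorted(offers.items()):
    print(f"{product}: {price:.2f} from {provider}")
```

A price comparison site does essentially this at a much larger scale, across many more providers and products.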

Since all this data is public, web scraping automates the openness of the Web to gain the most value in the shortest amount of time. Nevertheless, website managers hate web scraping because of the load on their servers.

Servers Battling Scrapers Battling Servers

Today, there are hundreds of tools for web scraping. They are quite functional, but many programmers also create custom solutions. These are a lot more flexible and can scrape even protected websites. By ‘protected’ I mean public web pages that place technical limits on how a device can connect to them. We’re not talking about the legal issues that haunted Cambridge Analytica.

Most of the time, a website’s server limits the number of connections any visitor can use. If you somehow sent 100 connection requests per second to google.com (highly unlikely to happen by accident), your IP address would probably get a temporary block.

Since web scraping software has to visit hundreds of pages every minute, most servers automatically block its IP address after a set number of connection requests. These limits reduce the efficiency of web scraping, and they cost a lot of money. They are also the reason why web scrapers use proxy networks to change their IP address for each connection.
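Here is a rough sketch of how such a per-IP limit could work, and why rotating IP addresses gets around it. This is a simplified sliding-window counter, not how any particular server implements blocking; the `RateLimiter` class, the limit of three requests per second, and the IP addresses (taken from ranges reserved for documentation) are all assumptions for illustration.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Allows at most max_requests per IP within a sliding time window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip, now):
        timestamps = self.history[ip]
        # Forget requests that have fallen out of the time window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False  # this IP is over the limit, so the request is refused
        timestamps.append(now)
        return True


# Five requests within half a second, all from the same IP address.
limiter = RateLimiter(max_requests=3, window_seconds=1.0)
single_ip = [limiter.allow("203.0.113.5", now=0.1 * i) for i in range(5)]

# The same five requests, alternating between two hypothetical proxy IPs.
limiter = RateLimiter(max_requests=3, window_seconds=1.0)
proxy_ips = ["198.51.100.1", "198.51.100.2"]
rotating = [limiter.allow(proxy_ips[i % 2], now=0.1 * i) for i in range(5)]

print(single_ip)
print(rotating)
```

From the server’s point of view, the rotating requests look like two light visitors instead of one heavy one, which is why the limit never triggers.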

This is where we get into the legal zone. Overloading someone’s server amounts to a denial-of-service attack, which can bring a lot of trouble for the scraper. However, this is precisely the opposite of what web scraping professionals are trying to achieve. They want to help their users deal with huge data loads on the internet efficiently and transparently, and connection limits and weak servers are obstacles to that.

The Legal Side Of Web Scraping

Overloading someone’s server with connection requests is bad; there’s no denying that. But web scrapers don’t just blindly send out requests to bring down websites. Every good web scraping expert knows where to look for the connection limits of any server: the Terms of Service (ToS) of a site.

That’s right, the lengthy legalese that no one ever reads is every scraper’s first stop on a website.

The ToS is extremely convenient when you’re in the legal grey zone. It defines whether the data is public or copyrighted, and how anyone is allowed to connect to the website and gain access to the data in question. In some cases, public data websites already have API access, which makes getting the data a lot easier for everyone.

Even though most web scrapers stay true to the ToS, the few bad apples who don’t make the industry look bad. Monopolists such as Facebook, Google, and LinkedIn tend to take the data they need and use it for whatever they want, which discredits the whole industry.

As always, there’s more to the picture, and in this article, I aim to show that web scraping is an efficient and legal way to enhance one’s online experience.

Stealing Data Or Fighting for the Open Web?

Companies love to keep data for themselves, even when it’s available to almost anyone. That’s why there have been dozens of legal cases on web scraping over the decades.

Perhaps the most famous and ruthless anti-scraping company is Ryanair. Over the years, it has won several cases against services that scraped its pricing pages. That’s why you will no longer see the real price of a flight ticket until you log in. Prices are locked down behind the login page to make you agree to a strict anti-scraping User Policy.

Crazy, right? Why would an airline block services that help it sell tickets? As a frequent flier, I have no respect for a company that would rather save a penny than give data to a price comparison service. And it’s not like Ryanair is losing any business either; it gets the full ticket price, after all.

Because of companies like Ryanair, we’re slowly losing the last bits of open data online.

Why a Legal Win for Scraping Is Important

As companies try to close down the open Web, LinkedIn also stepped into the legal fray. It ended up in court against HiQ, a small scraping company that was collecting data from publicly available LinkedIn profiles.

Now, before you start comparing HiQ to Cambridge Analytica, hear me out. HiQ was scraping public profiles, which means the data was available to anyone without a login. Cambridge Analytica had to exploit a vulnerability (which is a criminal act) to gain unauthorized access to user data. Still, LinkedIn tried to claim that it owned the data of its users. And that’s where I draw the line: corporate social media giants claiming our data, and by extension us, as their own.

Sadly, it sometimes works. Craigslist has done it in the past. Back in 2013, before an out-of-court settlement with 3taps Inc., Craigslist successfully argued that user-generated content (the ads) was owned by Craigslist!

Consider the following: you post two identical ads, one on Craigslist and the other on eBay, and Craigslist sues you for copyright infringement. So much for owning your creative work, freedom of the Web and the open internet.

The issue is not fighting over who owns what. It’s about creating concrete regulations, laws, and privacy policies that give users the power of choice. Do I want this data to be public? Can I choose to make some data public, while other facts remain available only to my connections? These regulations are a hot topic now in the big-data industry.

That is why I am glad that HiQ won the case against Microsoft’s precious LinkedIn. Corporations have enough advantages already. They don’t need exclusive automated access to data on the internet. Web scraping, while sometimes shady or borderline criminal, is an efficient tool that is not inherently bad. As with all technology, we just need to be smart and ethical about it.