Mysterium builds Web 3.0 tools that allow you to browse the internet freely and earn by sharing your connection.
We're sure you've heard the claim that "data is the new oil".
Web scraping is the process of mining data from the World Wide Web for a specific purpose. In its simplest form, it is copying and pasting a specific set of information into a local database for archival, scientific analysis, or some other purpose.
Some of the most widely used examples include aggregator websites which provide price comparisons for online goods.
But web scraping can also be used in fascinating ways, with high social impact.
Most recently, a group of Lithuanian activists created a website that allows Russian speakers from around the world to call people living in Russia with limited access to news about the war in Ukraine.
The idea was to form personal human connections, using one-on-one interactions over the phone, and let people know about the atrocities of war that their government was undertaking in Ukraine.
Web pages include a lot of useful information in text form (built on HTML or XHTML). Usually, a bot called a web crawler “scrapes” (collects) the data from a site.
Some web pages have built-in mechanisms to prevent web crawlers from scraping data. In response, some web scraping systems have evolved to simulate human browsing using techniques like DOM parsing, computer vision, and even natural language processing.
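To make the parsing step concrete, here is a minimal sketch in Python's standard library. Everything in it is invented for illustration: the page markup, the `product`/`name`/`price` class names, and the products themselves. A real crawler would first fetch the page over HTTP before parsing it.

```python
from html.parser import HTMLParser

# Hypothetical product-page markup; a real scraper would fetch this
# over HTTP (e.g. with urllib) before parsing it.
PAGE = """
<html><body>
  <div class="product"><span class="name">Kettle</span>
    <span class="price">19.99</span></div>
  <div class="product"><span class="name">Toaster</span>
    <span class="price">34.50</span></div>
</body></html>
"""

class PriceScraper(HTMLParser):
    """Collects product names and prices from spans tagged by class."""
    def __init__(self):
        super().__init__()
        self._field = None          # which field the next text chunk belongs to
        self.names, self.prices = [], []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field == "name":
            self.names.append(data.strip())
        elif self._field == "price":
            self.prices.append(float(data.strip()))
        self._field = None          # only capture text directly inside the span

scraper = PriceScraper()
scraper.feed(PAGE)
products = dict(zip(scraper.names, scraper.prices))
print(products)
```

A price-comparison aggregator is essentially this loop run against many retailers' pages, each needing its own parsing rules because every site structures its HTML differently.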
Here is a 5-minute video if you are interested in learning more.
The first-ever web crawler, the World Wide Web Wanderer, was programmed by an MIT student; the index it generated was called Wandex. The crawler’s main purpose was to measure the size of the World Wide Web, and it operated from 1993 to 1995.
The first API (Application Programming Interface) crawler came five years later. Today many major websites like Twitter offer web APIs for people to access their public databases.
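As a contrast with HTML scraping, here is a hedged sketch of what consuming a web API looks like. The payload, field names, and numbers are invented; a real client would receive the JSON body from an HTTP request to the API's documented endpoint rather than from a hardcoded string.

```python
import json

# Hypothetical JSON payload, shaped like what a social-media API
# might return; a real client would obtain this body over HTTP.
RESPONSE_BODY = """
{"user": "alice",
 "tweets": [
   {"id": 1, "text": "hello world", "likes": 3},
   {"id": 2, "text": "scraping vs APIs", "likes": 7}
 ]}
"""

data = json.loads(RESPONSE_BODY)

# Unlike scraped HTML, API data arrives already structured,
# so no per-site parsing heuristics are needed.
most_liked = max(data["tweets"], key=lambda t: t["likes"])
print(most_liked["text"])
```

This is why sites offer APIs at all: structured access is easier for developers and lets the site control what is exposed, instead of having crawlers reverse-engineer its HTML.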
But why would we want to scrape or mine data in the first place, and why would another party try to prevent us from doing it?
Web scraping applications range from highly successful commercial ideas like price comparison tools to use cases in social justice and ethical big data.
Web scraping makes us face some important questions. Should all information be public - and equally accessible to all? What about the issue of copyright?
On the commercial side, building a price comparison tool might lead to some businesses losing customers to the competition. Sometimes major corporations like airlines sue scrapers and data miners for copyright infringement on these grounds.
Even though scrapers are technically collecting and displaying data that is already publicly available, the suits tend to argue for copyright infringement. There is no standard outcome for these kinds of lawsuits. It usually depends on a number of factors like the extent of the information collected or the incurred losses.
In the US, copyright infringement is the most common basis for such suits. Outcomes differ elsewhere: in Denmark, for example, the courts found web scraping and crawling to be legal under Danish law.
In France, the French Data Protection Authority ruled that even when publicly available, personal data still cannot be collected and/or repurposed without the knowledge of the person to whom it belongs.
When it comes to nonprofit organizations and open access advocates, things get even more interesting.
The Internet Archive (archive.org) is a famous web scraping project. It is a non-profit organization that archives (sometimes deleted) web pages, digital collections, books, PDFs, and videos for researchers, students, and anyone else who takes an interest. It occasionally ends up in legal grey areas, when individuals or even governments take legal action to remove specific pieces of content.
There are many web scraping projects that advocate for universal open access to information, such as the efforts to free up documents locked inside PACER.
PACER, which stands for Public Access to Court Electronic Records, is the website that houses legal documents from US courts. Despite the name, access is not free, except at a select number of public libraries.
The late Aaron Swartz, an open-access advocate and early internet prodigy, famously downloaded millions of PACER documents and released them to the public for free.
Corporations and governments might be incentivized to outlaw web scraping. However, it is an important tool journalists and researchers use to uncover injustices.
Collecting and analyzing data can be incredibly helpful for all types of research and academic study, leading to a new movement in data science. Journalists also now rely on careful data analysis to reveal new things about our societies and communities.
__Reveal__ carried out a project exposing American police officers who were members of extremist groups on Facebook, posting and engaging with racist, xenophobic, and Islamophobic content.
It was done by scraping data from these extremist groups and from groups of police officers on Facebook and cross-referencing to find the overlapping members - and there were many.
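Once both membership lists have been scraped, the cross-referencing step itself can be as simple as a set intersection. A minimal sketch, with entirely invented usernames standing in for the scraped membership data:

```python
# Hypothetical membership lists, standing in for usernames scraped
# from two different sets of Facebook groups.
extremist_group_members = {"j.doe", "a.smith", "k.lee", "m.jones"}
police_group_members = {"a.smith", "b.brown", "m.jones", "t.nguyen"}

# Cross-referencing reduces to a set intersection, once both lists
# are normalized to a common identifier (here, a username).
overlap = extremist_group_members & police_group_members
print(sorted(overlap))
```

In practice the hard part is the normalization: matching the same person across groups despite display-name variations is far messier than the intersection itself.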
Reuters used similar data analysis techniques to uncover an underground online market in which adoptive parents privately "re-homed" adopted children, handing them over to strangers found through internet message boards.
Using scrapers, the Verge and the Trace did an investigation revealing how an online gun marketplace facilitated thousands of firearm sales by unlicensed sellers who are not required to run background checks.
USA Today found that between 2010 and 2018, more than 10,000 bills introduced in statehouses nationwide were almost entirely copied from model legislation drafted by special interest groups.
The Atlantic ran the COVID Tracking Project, a volunteer effort that compiled COVID-19 testing and outcome data from every US state and made it freely available.
These are just some of the examples of the ways web scraping can be used for both commercial and social justice purposes. There are many other use cases out there and many more waiting to be realized.
Extensive data analysis and open data science can unlock so many new truths, but are we crossing the line with the kind of data we collect and the methods we use to collect it?
What are the ethics and schools of thought around data collection?
While it is important that we continue the conversation about open access to documents that are relevant to the public, we also have to consider privacy issues.
Today many people and organizations agree that collecting and using someone’s personal data without their consent is unethical.
However, what about public data such as news articles that are censored in some countries? Or health-related statistics and data that can be used for public health policy suggestions?
In the US, a widely used healthcare algorithm was designed to identify high-risk patients for a preventive program that provides additional care, so that these patients don't end up in the ER.
Researchers later found that black patients assigned the same risk score were in fact sicker than white patients. The algorithm used healthcare costs as a proxy for need, but black patients incur fewer costs than white patients with the same illnesses, for a variety of reasons including lack of access to high-quality insurance.
In another instance, __automated hiring tools__ used by companies like Amazon were found to favor men over women and white people over people of color.
When the tools searched the web, they determined that executive positions were filled mostly by white men, so the machines learned that this was the type of quality to look for in a candidate.
Scraping public data for the public good does not always lead to positive results for society. Automation and machine learning need thoughtful intervention. As builders of new technological and social systems, we need to ensure all of our data analysis tools are ethically designed and do not perpetuate our historical systems of injustice and discrimination.
Scraping is highly relevant to the work we do at Mysterium. We care about building an accessible web where freedom of information and open data science become foundational pillars of the new web.
We are collaborating with developers to build Web3. To learn more about how Mysterium empowers builders in the Web3 space for purpose-driven projects, check out our site.