The Evolution of Big Data And Web Scraping

Written by cerniauskas | Published 2020/07/18

TLDR The CEO of proxy service and data scraping solutions provider Oxylabs explains how web scraping can be used for good or evil and why it has been given a bad reputation. He describes examples of how web scraping can positively affect our lives, such as price shopping websites, search engine optimization, and tracking fake news, and offers some ideas on how to use it ethically.

As the CEO of a proxy service and data scraping solutions provider, I completely understand why the global data breaches that make news headlines from time to time have given web scraping a terrible reputation, and why so many people feel cynical about Big Data these days.
At the same time, I recognize that we have great clients who do important work with Big Data and see this in action on common websites most people use each day.
This article is going to describe some important examples of how web scraping can positively affect our lives in addition to offering some ideas on how to do it ethically. 

Web Scraping as a Force for Good

Almost anything in the world can be used for good or evil, Big Data included. It all depends on the intention. Here are my favorite examples of web scraping uses that add significant value to the internet:
Price aggregator websites
For many of us, price shopping is great fun, even when it’s for things we don’t really need. If you’re looking for a new laptop, electric mixer, or noise-canceling headphones, the options are numerous. At the same time, if you want to book a charter on a multi-million dollar yacht, Big Data has got you covered for that as well. 
Whether we are booking flights or hotel rooms, or buying cars or private jets, there seems to be an endless array of sites that bring the competitive advantage back to the consumer for a diverse range of goods and services. And all of that is thanks to web scraping.
Tracking Fake News
Journalistic integrity is increasingly becoming a worldwide concern because fake news can be dangerously disruptive to almost every facet of our lives, from events in politics to information about health.
A handful of startups are combatting the problem with solutions that include machine learning algorithms that can process large amounts of data from thousands of sources and determine the level of accuracy and political slant, among other factors. This development represents a significant advance in the sharing of information, one that will benefit everyone.
Reputation Management
Not only is the market highly competitive these days, but consumers are as sensitive as ever. Brand monitoring and reputation management are essential to protect the good standing of products, services, and even your name. As long as companies scrape data legally, billions of sources can help them ensure a spotless reputation for consumers, brands, and anyone who operates in the public eye.
Tracking World News & Events
Web scraping can be used to track statistics from events shaping our world, from economic statistics to financial market indicators to the effects of communicable diseases. Examples of the latter include the partnership between Oxylabs and students from Stanford, University of Virginia, and Virginia Tech for the TrackCorona website, in addition to our cooperation with the University of Lugano in Switzerland for the production of CoronaMapper.
Search Engine Optimization (SEO)
There’s something about the David vs. Goliath story that makes people root for the underdog. Think Rocky facing an intense match in the ring when hope seems to be lost, or the 300 Spartans facing an army of hundreds of thousands of soldiers from Persia.
One doesn’t need to go to history or fiction to find these stories. We see them every day in the form of small businesses competing with the major players. SEO is a particularly challenging arena, so web scraping can be used to research specific search terms, title tags, targeted keywords, and backlinks. This valuable data can be used to map out an effective strategy that will get content ranked high in search results. 
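As a rough illustration of the kind of on-page data mentioned above, here is a minimal Python sketch that pulls the title tag, meta keywords, and outgoing links (the raw material for backlink research) from a page's HTML. It uses only the standard library; the names `SeoSignalParser` and `extract_seo_signals` are hypothetical and chosen for this example, and real SEO tooling would need far more than this.

```python
from html.parser import HTMLParser


class SeoSignalParser(HTMLParser):
    """Collects a few on-page SEO signals (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_keywords = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            # Keywords are conventionally a comma-separated list.
            content = attrs.get("content") or ""
            self.meta_keywords = [k.strip() for k in content.split(",") if k.strip()]
        elif tag == "a" and attrs.get("href"):
            # Record every link target found on the page.
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside <title>...</title>.
        if self._in_title:
            self.title += data


def extract_seo_signals(html):
    """Return the title, meta keywords, and link targets found in an HTML string."""
    parser = SeoSignalParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "keywords": parser.meta_keywords,
        "links": parser.links,
    }
```

The function takes HTML that has already been downloaded, so how the page is fetched (and whether it may be fetched at all) stays a separate decision.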
Academic Research
Researchers at academic institutions are in an enviable position in the modern age, as the internet gives them an almost unlimited trove of data that can be used for academic papers and research studies. The fact that so much of this data is public should encourage open discussion of how legitimate web scraping can benefit wider society.

Ethical Web Scraping

At Oxylabs, we want to get the message out there that web scraping can be used positively. There are transparent ways to get the job done so individuals and businesses can get the data they need to drive their businesses forward. 
Here are some guidelines to follow to keep the playing field fair for those who gather data and the websites that provide it:
  1. Only scrape publicly-available web pages. 
  2. Ensure that the data is requested at a fair rate that neither compromises the server nor could be mistaken for a DDoS attack.
  3. Respect the data obtained and any privacy issues relevant to the source website.
  4. Scrape with the intent to add value and/or context to the data with the end user’s interest in mind (such as the “fake” news example above).
  5. Study the target website's legal documents to determine whether you would be legally accepting its terms of service and, if so, whether your scraping would breach those terms.
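Some of these guidelines can be partly expressed in code. The following is a minimal Python sketch, not a complete or authoritative implementation: it treats a site's robots.txt as one signal of which pages the site is willing to have fetched (guideline 1) and enforces a minimum delay between requests (guideline 2). The class name `PoliteScraperPolicy`, the `example-bot` user agent, and the two-second default delay are assumptions made for illustration. A robots.txt check does not replace reading the terms of service.

```python
import time
from urllib import robotparser


class PoliteScraperPolicy:
    """Decides whether and when a URL may be requested (hypothetical helper)."""

    def __init__(self, robots_txt, user_agent="example-bot", min_delay=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        self.min_delay = min_delay      # assumed minimum seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last_request = None
        # Parse robots.txt content that the caller has already downloaded.
        self._robots = robotparser.RobotFileParser()
        self._robots.parse(robots_txt.splitlines())

    def allowed(self, url):
        """Return True if robots.txt permits this user agent to fetch the URL."""
        return self._robots.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Block until at least min_delay seconds have passed since the last request."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_delay - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now
```

A scraper built on this sketch would call `allowed(url)` before each request, skip any URL for which it returns False, and call `wait_turn()` immediately before sending the request. The clock and sleep functions are injectable only so the pacing logic can be tested without real waiting.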
In addition, proxy providers source their proxies from different places, and the quality varies. Hence, partnering with a verified and trusted proxy service provider completes the picture. A synergistic relationship based on transparency and cooperation can balance the equation for all parties involved and drive the evolution of Big Data forward for the benefit of everyone.
While the above recommendations aren’t the law, they can start a conversation about forming a code of ethics that may prove useful in further legitimizing the use of web scraping for purposes that add value to the internet. Partnering with a trusted proxy provider that embodies similar core values is essential to that process.

A Final Word

The internet is still the most significant source of Big Data known to humankind, and that’s not going to change anytime soon. In this day and age, it would be foolish to let web scraping's bad reputation blind us to the endless possibilities that lie before us. Everyone can benefit from harnessing the insights hidden in the never-ending supply of Big Data, thanks to the practice of web scraping.
With ethics in mind, web scraping can open up new worlds of information that will connect people, organizations, and disciplines. As with any tool, we can choose to use it with the best intentions and make Big Data a force for good.

Written by cerniauskas | CEO @ Oxylabs. Covering topics on web scraping, big data, machine learning, tech trends & business leadership.
Published by HackerNoon on 2020/07/18