Web Scraping is the process of automatically collecting web data with specialized software.
Every day, an enormous volume of new data is created, far too much for anyone to track manually. At the same time, more and more companies worldwide rely on external data sources to inform their decisions and gain a competitive advantage. Keeping up with that pace by hand is simply not possible.
That's where Web Scraping comes into play.
As communication between systems becomes critical, APIs are growing in popularity. An API is a gateway a website exposes to communicate with other systems, opening up part of its functionality to the public. Unfortunately, many services don't provide an API, and others only allow limited functionality.
Web Scraping overcomes this limitation. It collects information all around the internet without the restrictions of an API.
Web scraping is therefore used in a wide variety of scenarios:
Price Monitoring
Lead Generation
Market Research
Aggregation
Inventory and Product Tracking
SEO (Search Engine Optimization): keyword relevance and performance, competitor tracking, brand relevance, new players' rankings.
ML/AI and Data Science: collecting massive amounts of data to train machine learning models for image recognition, predictive modeling, or NLP.
Bulk downloads: extracting PDFs or images at scale.
Web Scraping mostly works like standard HTTP client-server communication.
The browser (client) connects to a website (server) and requests the content. The server then returns HTML content, a markup language both sides understand. The browser is responsible for rendering HTML to a graphical interface.
That's it. Easy, isn't it?
There are more content types, but let's focus on this one for now. Let's dig deeper into how the underlying communication works - it'll come in handy later on.
A request is the text the browser sends to the website. It consists of four elements:
The method (e.g., GET or POST)
The URL
The headers
The body (optional)
When a website responds to the browser, it returns three items:
The status code (e.g., 200 OK or 404 Not Found)
The headers
The body, usually the HTML content
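To make this concrete, here's a minimal sketch in Python using the requests library; the URL is just a placeholder:

```python
# A minimal sketch of the request/response cycle with the "requests" library.
# The target URL is an example, not a real scraping target.
import requests

response = requests.get(
    "https://example.com",                    # the URL we want to fetch
    headers={"User-Agent": "Mozilla/5.0"},    # request headers sent to the server
)

print(response.status_code)                   # status code, e.g. 200
print(response.headers.get("Content-Type"))   # one of the response headers
print(response.text[:200])                    # body: the first characters of the HTML
```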
Up to this point, this reflects an ordinary client-server process. Web Scraping, though, adds a new concept: data extraction.
HTML is just a long piece of text. Once we have the HTML, we want to obtain specific data and structure it to make it usable. Parsing is the process of extracting selected data and organizing it into a well-defined structure.
Technically, HTML is a tree structure: upper elements (nodes) are parents, and the lower ones are children. Two popular technologies facilitate walking the tree to extract the most relevant pieces: CSS Selectors and XPath.
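As a quick illustration, here's a small Python sketch using BeautifulSoup (CSS Selectors) and lxml (XPath); the HTML snippet and class names are invented for the example:

```python
# A small sketch of both approaches; the HTML and class names are made up for illustration.
from bs4 import BeautifulSoup
from lxml import html

page = """
<html><body>
  <h1 class="title">ACME Widget</h1>
  <span class="price">19.99</span>
</body></html>
"""

# CSS Selectors with BeautifulSoup
soup = BeautifulSoup(page, "html.parser")
title = soup.select_one("h1.title").get_text()

# XPath with lxml
tree = html.fromstring(page)
price = tree.xpath("//span[@class='price']/text()")[0]

print(title, price)  # ACME Widget 19.99
```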
The extraction process begins by analyzing a website. Some elements are valuable at first sight. For example, Title, Price, or Description are all easily visible on the screen. Other information, though, is only visible in the HTML code:
Hidden Inputs on Amazon Products
Asynchronous Request on Instagram
JSON within HTML on Alibaba
HTML attributes on Craigslist
Once the data is structured, it's stored in a database for later use. At this stage, we can export it to other formats such as Excel or PDF, or transform it so other systems can consume it.
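For instance, a minimal Python sketch of exporting structured records might look like this (the field names and filenames are illustrative):

```python
# A minimal sketch of structuring extracted data and exporting it to JSON and CSV.
import csv
import json

products = [
    {"title": "ACME Widget", "price": 19.99},
    {"title": "ACME Gadget", "price": 24.50},
]

# Export to JSON so other systems can consume it
with open("products.json", "w") as f:
    json.dump(products, f, indent=2)

# Export to CSV for spreadsheets
with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(products)
```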
Such a valuable process does not come free of obstacles, though.
Websites actively try to avoid being tracked and scraped, and it's common for them to build protective measures. High-traffic websites deploy advanced, industry-grade anti-scraping solutions, which makes the task extremely challenging.
These are some of the challenges web scrapers face when dealing with relevant websites (low-traffic websites are usually of low value and thus have weak anti-scraping systems):
IP Rate Limit
Every device connected to the internet has an identifying address, called an IP address. It works like an ID card. Websites use this identifier to count the requests coming from a single device and block it if there are too many. Imagine an IP requesting 120 pages per minute: that's two requests per second, a pace no real user can sustain. So to scrape at scale, we need a new concept: proxies.
Rotating Proxies
A proxy, or proxy server, is a computer on the internet with its own IP address that sits between the requester and the website. It hides the original IP behind the proxy's IP, tricking the website into thinking the request comes from somewhere else. Proxies are typically used in vast pools, switching between IPs depending on various factors. Skilled scrapers fine-tune this process, selecting proxies by domain, geolocation, and so on.
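Here's a minimal sketch of proxy rotation with Python's requests library; the proxy addresses below are placeholders, not real servers:

```python
# A minimal sketch of rotating proxies; the proxy addresses are placeholders.
import random

import requests

PROXY_POOL = [
    "http://198.51.100.10:8080",
    "http://198.51.100.11:8080",
    "http://198.51.100.12:8080",
]

def fetch(url: str) -> requests.Response:
    proxy = random.choice(PROXY_POOL)  # use a different exit IP per request
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )

response = fetch("https://example.com")
print(response.status_code)
```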
Headers / Cookies validation
Remember request and response headers? A mismatch between the expected and actual values tells the website something is wrong. The more headers browser and server exchange, the harder it gets for automated software to communicate smoothly without being detected. It becomes even more challenging when websites return a "Set-Cookie" header and expect the browser to send that cookie back in the following requests.
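One way to cope is to send a consistent, browser-like set of headers and reuse a session so cookies set by the server are replayed automatically. A minimal sketch, with example header values and an illustrative URL:

```python
# A minimal sketch: browser-like headers plus a Session that replays "Set-Cookie" values
# on follow-up requests. The header values are examples, not a guaranteed bypass.
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
})

# First request: the server may return Set-Cookie headers
session.get("https://example.com")
print(session.cookies.get_dict())        # cookies stored by the session

# Second request: the session sends those cookies back automatically
response = session.get("https://example.com/products")
print(response.status_code)
```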
Ideally, you'd want to make requests with as few headers as possible. Unfortunately, sometimes that's not possible, which leads to another challenge:
Reverse Engineering Headers / Cookies generation
Advanced websites don't respond unless the right headers and cookies are in place, forcing us to reverse-engineer them. Reverse engineering is the process of understanding how something is built in order to simulate it. It involves tweaking IPs, the User-Agent (browser identification), cookies, and more.
Javascript Execution
Most websites these days rely heavily on JavaScript, a programming language executed in the browser. It adds extra difficulty to data collection because many tools don't support JavaScript. Websites run complex calculations in JavaScript to verify that a browser really is a browser, which leads us to:
Headless Browsers
A headless browser is a web browser without a graphical user interface, controlled programmatically. It requires a lot of RAM and CPU, making the process much more expensive. Selenium and Puppeteer (created by Google) are two of the most widely used tools for the task. You guessed it: Google is the largest web scraper in the world.
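A minimal sketch with Selenium and headless Chrome (it assumes Selenium 4+ and a local Chrome installation; the URL is a placeholder):

```python
# A minimal sketch of driving headless Chrome with Selenium 4+.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")   # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")    # JavaScript executes as in a real browser
    heading = driver.find_element(By.TAG_NAME, "h1").text
    print(heading)
finally:
    driver.quit()
```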
Captcha / reCAPTCHA (Developed by Google)
A CAPTCHA is a challenge-response test to determine whether or not the user is human. It used to be an effective way to keep bots out. Companies like Anti-Captcha and 2Captcha offer services to bypass CAPTCHAs, from OCR (Optical Character Recognition) to human labor that solves the puzzles.
Pattern Recognition
When collecting data, you may be tempted to take the easy route and follow a regular pattern. That's a huge red flag for websites. Completely arbitrary requests aren't reliable either: how would someone land on page 8 without visiting page 7 first? Skipping steps signals that something is off. Nailing a plausible path is tricky.
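One simple hedge is to pace requests the way a human would: visit pages in a plausible order with irregular delays. A minimal sketch, with an illustrative URL pattern:

```python
# A minimal sketch of human-like pacing: sequential pages with randomized pauses.
# The URL pattern is illustrative, not a real site.
import random
import time

import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

for page in range(1, 9):                      # visit pages 1..8 in order, no skipping
    url = f"https://example.com/catalog?page={page}"
    response = session.get(url, timeout=10)
    print(page, response.status_code)
    time.sleep(random.uniform(2, 6))          # irregular delay between requests
```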
Hopefully, this gives you an overview of what data automation looks like. We could talk about it forever, but we'll dig deeper into the details in the coming posts.
Data collection at scale is full of secrets. Keeping up the pace is arduous and expensive. It's hard, very hard.
A preferred solution is to use a batteries-included service like ZenRows that turns websites into data. We offer a hassle-free API that takes care of all the work, so you only need to worry about the data. We urge you to try it for FREE.
We are delighted to help and can even tailor a custom solution that works for you.
Disclaimer: Aurken Bilbao is Founder of Zenrows.com
Previously published at https://www.zenrows.com/blog/what-is-web-scraping/