❗Disclaimer: This is Part 3 of our six-piece series on Advanced Web Scraping. New to the series? Start from the beginning by reading Part 1!
In Part 2 of our Advanced Web Scraping series, you learned how to scrape data from SPAs, PWAs, and AI-powered sites. By now, you should have all the knowledge needed to build a scraper that works against most modern websites.
What’s next? Time to optimize your scraper with some pro scraping tips and tricks!
Building a web scraper? It’s all about scripting. 👨‍💻
And let’s be honest—if you’ve ever written code, you know scripting isn’t that hard most of the time. A few lines here, a for loop there, and boom, you’re scraping data like a pro. Sounds simple, right? 😄
But here’s the problem: the simplicity of writing a small scraper can lull you into a false sense of security. Why bother with proper comments, error handling, logs, or even neat indentation when it’s just a dozen lines of code anyone can read?
We get it—why overengineer something that doesn’t need it? Overengineering is the enemy of progress. But what happens when you need to scale your scrapers to multiple pages or even entire sites? 🤔
That’s when your quick-and-dirty, spaghetti-coded scraper falls apart! 🍝
Here’s why you need some advanced web scraping tips.
You’ve probably already heard the usual web scraping tips: prioritize pages with critical data first, randomize your requests, and so on. Great advice—but let’s be honest, those tricks are old news. 📰
When you’re dealing with more advanced scenarios, those basics might not cut it. If you really want to level up your scraping game, you’ll need to explore some next-level techniques.
Ready? Buckle up—it’s time to take your web scraping skills to the next level! 💪
⚠️ Warning: Don’t worry if some of the tips feel familiar—keep going! There are plenty of interesting insights as you dive deeper! 🤿
One of the most common mistakes in web scraping is forgetting that the Internet isn’t some magical, infallible technology. When you send a request to a site, a whole range of things can (and will, at some point) go wrong. ❌
Let’s look at some common scenarios:
Your Wi-Fi or connection might hiccup momentarily
The server hosting the website may be unavailable
The page you’re looking for may no longer exist
The target site might be experiencing a temporary slowdown, leading to a timeout error
Now, mix in data parsing, preprocessing, and exporting to a database, and you’ve got a perfect recipe for chaos. 💥
So, what’s the solution? Error handling! 🛡️
Error handling is your best friend in web scraping. Your script will likely process dozens (or thousands) of pages, and one single error shouldn’t bring your whole operation crashing down.
Remember that the try ... catch block is your friend. Use it to wrap your requests and processing logic. Also, keep in mind that most HTTP libraries don’t raise exceptions for bad HTTP responses (like 404 or 500). 😲
If you’re not familiar with HTTP status codes, see the video below:
For instance, with Python's requests library, you need to check the response status code manually, as follows:
```python
import requests

response = requests.get("https://example.com")

if response.status_code == 200:
    # handle the successful response...
    print(response.text)
else:
    # handle the error response...
    print(f"Request failed with status {response.status_code}")
```
Or, equivalently, use the raise_for_status() method:
```python
import requests

try:
    response = requests.get("https://example.com")
    # raises an HTTPError for bad responses (4xx or 5xx)
    response.raise_for_status()
    # handle the successful response...
    print(response.text)
except requests.exceptions.HTTPError as http_err:
    # handle an HTTP error...
    print(f"HTTP error: {http_err}")
except requests.exceptions.RequestException as req_err:
    # handle any other request error...
    print(f"Request error: {req_err}")
```
Your advanced web scraping script should not only be able to handle errors but also recover from them. Since most errors related to web scraping are tied to making web requests, you can significantly improve your scraper's effectiveness by implementing retryable requests.
The concept is simple: if a request fails, you try it again—one, two, three, or more times—until it's successful. 🔄
But here's the catch: since one of the most common reasons for a failed request is the target server being temporarily down or slow, you don’t want to overwhelm it by sending the same request repeatedly in a short period of time.
If a request fails now, it's likely to fail again immediately. That’s where exponential backoff comes into play!
Instead of retrying instantly, this technique gradually increases the time between retries, improving your chances of success by giving the target server time to recover. ⏳
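Before reaching for a library, it helps to see the idea in plain code. Here is a minimal sketch of a retry helper with exponential backoff (the function name and parameters are made up for illustration):

```python
import random
import time

import requests

def get_with_backoff(url, max_retries=3):
    # hypothetical helper: retry a GET with exponentially growing delays
    for attempt in range(max_retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException:
            if attempt == max_retries:
                raise
            # wait 1s, 2s, 4s, ... plus random jitter to avoid
            # hammering the server at predictable intervals
            time.sleep(2 ** attempt + random.uniform(0, 1))
```

The jitter is a small but useful touch: if many scraper workers fail at the same moment, it keeps them from all retrying in lockstep.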
While you can manually implement simple retry strategies with custom code, many HTTP clients come with built-in utilities or libraries to handle retries automatically. For example, Axios offers the axios-retry library, which you can use like this:
```javascript
const axios = require("axios");
const axiosRetry = require("axios-retry");

axiosRetry(axios, { retries: 3, retryDelay: axiosRetry.exponentialDelay });

axios.get("https://example.com")
  .then(response => console.log(response.data))
  .catch(error => console.log("Request failed:", error));
```
Similarly, Python’s urllib3 package comes with a Retry class that integrates seamlessly with most Python HTTP clients.
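As a sketch of what that integration can look like with requests, you can mount urllib3’s Retry on a Session through an HTTPAdapter (the retry values below are just reasonable example settings):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# retry up to 3 times with exponential backoff (1s, 2s, 4s)
# on common transient error statuses
retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
)

session = requests.Session()
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("https://", adapter)
session.mount("http://", adapter)

# every request made through this session now retries automatically:
# session.get("https://example.com")
```

The nice part of this approach is that the retry policy lives in one place—any code that uses the session gets it for free.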
When inspecting elements in the DevTools, you might be tempted to right-click and select the "Copy selector" option:
But be warned, the result might look something like this:
#__next > div > main > div.sc-d7dc08c8-0.fGqCtJ > div.sc-93e186d7-0.eROqxA > h1
That's definitely not ideal for web scraping…
The problem? Overly specific selectors like these can break easily when the page structure changes. The more detailed your selector, the more fragile it becomes.
To make your web scraping more resilient, keep your selectors flexible. Instead of relying on style-related classes (which change all the time), focus on attributes that are less likely to change, like id, data-*, or aria-* attributes. Most of those attributes are meant for testing and accessibility, so they tend to remain consistent over time. 💡
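To see the difference in practice, here is a minimal BeautifulSoup sketch on a hypothetical snippet (the class names and data-testid value are made up, echoing the fragile selector above):

```python
from bs4 import BeautifulSoup

html = """
<main>
  <div class="sc-d7dc08c8-0 fGqCtJ">
    <h1 data-testid="product-title">Example Product</h1>
  </div>
</main>
"""
soup = BeautifulSoup(html, "html.parser")

# fragile: tied to auto-generated style classes that change on redeploy
fragile = soup.select_one("div.sc-d7dc08c8-0.fGqCtJ > h1")

# resilient: targets a stable data- attribute instead
resilient = soup.select_one('[data-testid="product-title"]')

print(resilient.get_text())  # Example Product
```

Both selectors match today, but only the second one survives the next time the site regenerates its CSS class names.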
And while CSS selectors are easier to read and understand, XPath offers more power. But don’t worry—you can often achieve the same results with simple CSS selectors, saving you from needing complex XPath code. 😌
For more information on that, take a look at our guide on XPath vs CSS selectors!
Parsing HTML pages takes time and resources, particularly if you're dealing with a large, nested DOM. If your scraper is only parsing a few pages, it's not such a big deal.
Now, what happens when your scraping operation scales up and you have to retrieve data from millions of pages? That small overhead can quickly drain server resources and add hours to your total scraping time. ⏳
For a deeper understanding and a full comparison of the options, read our article on the best HTML parsers.
The good news? Switching from one parser to another isn't that difficult. For example, in BeautifulSoup, it's just a simple parameter change:
```python
from bs4 import BeautifulSoup

# using the built-in html.parser
soup = BeautifulSoup(html_content, "html.parser")

# or using the faster lxml parser
soup = BeautifulSoup(html_content, "lxml")
```
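To get a feel for the difference yourself, here is a rough benchmark sketch on synthetic HTML (the page content is made up, and lxml must be installed via pip install lxml; exact timings will vary by machine):

```python
import time

from bs4 import BeautifulSoup

# hypothetical large page built just for the benchmark
html_content = "<html><body>" + "<div><p>item</p></div>" * 5000 + "</body></html>"

for parser in ("html.parser", "lxml"):
    start = time.perf_counter()
    soup = BeautifulSoup(html_content, parser)
    elapsed = time.perf_counter() - start
    print(f"{parser}: {elapsed:.3f}s, {len(soup.find_all('p'))} paragraphs parsed")
```

On a single page the gap is milliseconds, but multiplied by millions of pages it becomes hours.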
And what about HTML parsers built into browsers like Chrome? 🤔
Find out more in the video below:
HTTP/2 is an updated version of HTTP that allows multiple requests over a single connection. This reduces latency and can improve the overall performance of your scraping task.
To check if a site supports HTTP/2, simply open DevTools in your browser, go to the “Network” tab, and look for the “Protocol” column—if it says h2, the site is using HTTP/2:
Unfortunately, not all HTTP clients and scraping libraries support HTTP/2. However, tools like HTTPX for Python offer full support for HTTP/2.
Web scraping is mostly an I/O-bound task—you send requests to the server, wait for the response, process the data, and repeat. During the wait time, your scraper is basically idle, which is inefficient.
The solution? Parallelism or concurrency!
By sending multiple requests at once, you can minimize those dead times and optimize network usage.
🚨 But be careful! 🚨
Bombarding a server with too many simultaneous requests can lead to rate limiting or getting your IP banned—two popular anti-scraping measures. 😬
Pro tip: You can also parallelize parsing tasks, especially if you're using multiple CPUs, which will speed up the data extraction process. ⚡
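As a sketch, Python’s concurrent.futures lets you fetch several pages at once while capping the number of simultaneous connections (the URLs here are hypothetical, and network failures are swallowed so the example degrades gracefully instead of crashing):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch(url):
    try:
        response = requests.get(url, timeout=10)
        return url, response.status_code
    except requests.exceptions.RequestException:
        # treat network failures as missing data instead of crashing
        return url, None

# hypothetical list of pages to scrape
urls = [f"https://example.com/page/{i}" for i in range(1, 6)]

# cap max_workers to stay polite and avoid rate limiting or IP bans
with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(fetch, urls))

for url, status in results:
    print(url, status)
```

Tuning max_workers is the knob that balances speed against the risk of tripping the target site’s anti-scraping defenses.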
AI-based adaptive algorithms learn from patterns in data and HTML page structures, adjusting their behavior in real-time to stay on top of changes. 😮
That’s a game-changer for web scraping! 🤯
When websites update their layout or deploy anti-bot measures, these algorithms can quickly adapt, ensuring your scraper keeps running smoothly. 🧠
In short, they make scrapers smarter, helping you extract data efficiently—even when the site throws unexpected curveballs. ⚾ With adaptive algorithms, it’s like having a scraper that evolves over time!
Learn more in Chapter 4 of this video by Forrest Knight:
Sure, all the tips and tricks we’ve mentioned so far can make your scraper faster, more reliable, more robust, and more effective. But let’s be real—they also bring a lot of complexity. 😅
The good news is that most of these lessons apply to the great majority of scraping projects. So, instead of coding everything from scratch, you could use pre-built functions to tackle specific tasks. That's exactly what Bright Data’s Scraping Functions offer!
With 73+ ready-made JavaScript functions, users have built over 38K scrapers operating across 195+ countries. That’s a ton of scraping power! 📈
Speed up your development with a runtime environment designed to scrape, unlock, and scale web data collection effortlessly:
Now you know how to level up your scraper with insights from experienced scraping developers!
Remember that this is only Part 3, so we’re just halfway through our six-part journey into advanced web scraping! Keep that seatbelt fastened because we’re about to dive into even more cutting-edge tech, clever solutions, and insider tips.
Next stop? Harnessing the power of AI-driven proxy management! 🌐