Note: This is Part 2 of our six-piece series on Advanced Web Scraping. Want to start from the beginning? Catch up by reading Part 1!
If you’re into web scraping, you’re probably already well-acquainted with most of the usual challenges. But with the Web changing at warp speed—especially thanks to the AI boom—there are tons of new variables in the scraping game. To level up as a web scraping expert, you must get a grip on them all! 🔍
In this guide, you’ll discover advanced web scraping techniques and crack the code on how to scrape today’s modern sites—even with SPAs, PWAs, and AI in the mix! 💪
Back in the day, websites were just a bunch of static pages managed by a web server. Fast-forward to now, and the Web’s more like a bustling metropolis. 🌇
We’ve jumped from server-side to client-side rendering. Why? Because our mobile devices are more powerful than ever, so letting them handle some of the load just makes sense. 📲
Sure, you probably already know all that—but to get where we’re at today, we gotta know where we started. Today, the Internet is a mix of static sites, dynamic server-rendered sites, SPAs, PWAs, AI-driven sites, and more. 🕸️
And don’t worry—SPA, PWA, and AI aren’t secret acronyms for government agencies. Let’s break down this alphabet soup. 🥣
SPA (Single Page Application) doesn’t mean it’s literally one page, but it does handle navigation without reloading everything each time. Think of it like Netflix: click around and watch the content change instantly without that annoying page reload. 🍿
It’s smooth, fast, and lets you stay in the flow.
PWAs are like web apps on steroids. 💊
Technically speaking, a PWA (Progressive Web App) uses cutting-edge web capabilities to give you that native app feel right from your browser.
Offline functionality? ✅
Push notifications? ✅
Near-instant loading through caching? ✅
In most cases, you can also install PWAs directly on your device!
AI-powered sites bring a sprinkle of machine learning magic. From dynamically generated designs and chatbots to personalized recommendations, these sites make you feel like the site knows you. 🤖 ✨
It’s not just browsing. It’s an interactive experience that adapts to you.
These categories? Not mutually exclusive!
You can layer them like a parfait. 🍨 A PWA can also be an SPA, and both can leverage AI to make things smarter and faster. So yeah, it can get a little wild out there!
Long story short, the rise of SPAs, PWAs, and AI-powered sites has made the web a whole lot more complex. And, yep, that means web scraping is more challenging than ever, with a ton of new factors to consider. 😣
And what about Web 3.0? Well, it’s a bit early to say what impact it’ll have on web scraping, but some experts are already speculating…
To get a head start on bypassing today’s most common (and annoying) obstacles in modern site scraping, take a look at this video from our friend Forrest Knight. Chapter 3 covers exactly what you’re looking for. 👇
Let’s now see what you need to consider when performing advanced web scraping on modern sites!
⚠️ Warning: Don’t get discouraged if the first few tips sound familiar—keep going, because there are plenty of fresh insights as we get deeper! 🧠
These days, most sites are either fully rendered on the client side via JavaScript (that’s client-side rendering) or have dynamic sections that load data or change the DOM of the page as you interact with it.
If you’ve used a browser in the last decade, you know what we’re talking about. This dynamic data retrieval isn’t magic—it’s powered by AJAX technology! (And no, not the football club Ajax 🔴⚪—different kind of magic here 😉)
You probably already know what AJAX is, but if not, MDN’s docs are a great place to start. Now, is AJAX a big deal for web scraping?
Not really. With browser automation tools like Playwright, Selenium, or Puppeteer, your script loads the page in a real browser, which fires the AJAX requests just as it would for a human visitor. Just grab one of the best headless browser tools, and you’re set!
For more guidance, read our full tutorial on scraping dynamic sites in Python.
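To make that concrete, here is a minimal sketch of waiting for AJAX-rendered elements before reading them. It assumes `browser` is a Playwright `Browser` (for example, from `chromium.launch()`); the function name, URL, and selector are hypothetical placeholders, not something from a specific site.

```javascript
// Extract text from elements that only show up after the page's AJAX calls finish.
// `browser` is assumed to be a Playwright Browser; it is passed in so the
// function itself has no hard dependency on how the browser was launched.
async function scrapeDynamicList(browser, url, itemSelector) {
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: "domcontentloaded" });
    // Block until at least one AJAX-rendered element is attached and visible
    await page.waitForSelector(itemSelector);
    // Runs inside the page: map every matching element to its trimmed text
    return await page.$$eval(itemSelector, (nodes) =>
      nodes.map((node) => node.textContent.trim())
    );
  } finally {
    // Always release the tab, even if the selector never appears
    await page.close();
  }
}

// Possible usage (requires `npm install playwright`):
// const { chromium } = require("playwright");
// const browser = await chromium.launch();
// const titles = await scrapeDynamicList(browser, "https://example.com", ".product-title");
// await browser.close();
```

Waiting on a selector rather than a fixed delay is a design choice: the script continues as soon as the data is there and fails loudly (with a timeout) if it never arrives.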
🚨 But, wait… There’s a pro tip! 🚨
Most AJAX-based pages pull in dynamic data through API calls. You can catch these requests by opening the Network tab in your browser’s DevTools while loading a page:
You’ll either see:
One or more REST APIs to different endpoints.
One or more GraphQL API calls to a single endpoint, which you can query using GraphQL.
In both cases, this opens the door to scraping by targeting those API calls directly. Just intercept and pull that data—as easy as that! 🎉
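As a rough illustration of the interception idea, the sketch below listens to Playwright's `response` event and keeps the JSON payloads of XHR/fetch calls. The helper name `collectApiResponses` and the URL filter are assumptions made for this example, and some GraphQL servers may use a content type other than `application/json`, so adjust the check to what you see in the Network tab.

```javascript
// Collect JSON payloads from the XHR/fetch calls a page makes.
// `page` is assumed to be a Playwright Page; `urlFilter` is an optional
// predicate to keep only the endpoints you care about.
function collectApiResponses(page, urlFilter = () => true) {
  const captured = [];
  const pending = [];

  page.on("response", (response) => {
    const resourceType = response.request().resourceType();
    const contentType = response.headers()["content-type"] || "";
    const isApiCall = resourceType === "xhr" || resourceType === "fetch";
    if (!isApiCall || !contentType.includes("application/json")) return;
    if (!urlFilter(response.url())) return;

    pending.push(
      response.json().then(
        (data) => { captured.push({ url: response.url(), data }); },
        () => {} // body unavailable or not valid JSON: skip it
      )
    );
  });

  return {
    // Wait for the bodies seen so far to be parsed, then return them
    async done() {
      await Promise.all(pending);
      return captured;
    },
  };
}

// Possible usage with a real Playwright page:
// const collector = collectApiResponses(page, (url) => url.includes("/api/"));
// await page.goto("https://example.com", { waitUntil: "networkidle" });
// const payloads = await collector.done();
```

Once you know which endpoint returns the data, you can often skip the browser and call that endpoint directly with an HTTP client, provided the site does not require browser-only tokens or cookies.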
See the video below for a quick walkthrough:
Web pages are more interactive than ever, with designers constantly experimenting with new ways to keep us engaged. Some interactions, like infinite scrolling, have even become standard. (Ever found yourself endlessly scrolling through Netflix? Make sure to check out the right series!)
So, how do we tackle all those tricky interactions in web scraping? Drumroll... 🥁
With browser automation tools! (Yeah, again! 🎉)
The most modern ones, like Playwright, have built-in methods to handle common interactions. And when something unique pops up that they don’t cover? You can usually add custom JavaScript code to do the trick.
In particular:
Playwright offers the evaluate() method to run custom JS right on the page.
Selenium provides execute_script(), which lets you execute JavaScript in the browser.
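For instance, infinite scrolling can be handled with a few lines of custom JavaScript run through `evaluate()`. The sketch below assumes a Playwright `Page`; the function name and the default values for `maxRounds` and `pauseMs` are arbitrary choices for illustration, and real pages may need a longer pause or a different stop condition.

```javascript
// Keep scrolling until the page height stops growing or maxRounds is reached.
// Returns the number of rounds in which the page height changed.
// `page` is assumed to be a Playwright Page.
async function scrollToEnd(page, { maxRounds = 20, pauseMs = 1000 } = {}) {
  let previousHeight = 0;
  for (let round = 1; round <= maxRounds; round++) {
    // Custom JS executed in the page: jump to the bottom, report the height
    const height = await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
      return document.body.scrollHeight;
    });
    if (height === previousHeight) return round - 1; // nothing new was loaded
    previousHeight = height;
    // Give the site time to fetch and render the next batch of items
    await page.waitForTimeout(pauseMs);
  }
  return maxRounds;
}

// Possible usage:
// await page.goto("https://example.com/feed");
// await scrollToEnd(page, { maxRounds: 10, pauseMs: 1500 });
```

The `maxRounds` cap matters: a truly endless feed would otherwise keep the loop running forever.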
We know, you probably have a handle on these basics already, so no need to dive deep here. But if you want the full scoop, see these complete guides:
Here's where things get spicy! 🌶️
PWAs are built to work offline and rely heavily on caching. While that's great for end users, it creates a headache for web scraping because you want to retrieve fresh data.
So, how do you handle caching when scraping—especially when dealing with a PWA? Well, most of the time, you'll be using a browser automation tool. After all, PWAs are typically client-side rendered and/or rely on dynamic data retrieval.
The good news? Browser automation tools start fresh browser sessions every time you run them. And in the case of Puppeteer and Playwright, they even launch in incognito mode by default. But here’s the catch: Incognito/New sessions aren’t cache- or cookie-free! 🤯
The more you interact with a site in your scraping script, the more likely the browser will start caching requests—even in incognito mode. To tackle the issue, you can restart the headless browser periodically.
Or, with Puppeteer, you can disable caching entirely with a simple command:
await page.setCacheEnabled(false);
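The periodic-restart idea could look something like the sketch below. It assumes `launchBrowser` is any async function that returns a Playwright or Puppeteer `Browser` launched with a default temporary profile, and that `scrapePage` is your own per-page logic; both names, and the batch size of 25, are hypothetical.

```javascript
// Scrape URLs in batches, launching a fresh browser for every batch so that
// whatever one session cached is thrown away before the next batch starts.
async function scrapeWithRestarts(urls, launchBrowser, scrapePage, batchSize = 25) {
  const results = [];
  for (let start = 0; start < urls.length; start += batchSize) {
    const browser = await launchBrowser(); // new session for this batch
    try {
      for (const url of urls.slice(start, start + batchSize)) {
        results.push(await scrapePage(browser, url));
      }
    } finally {
      await browser.close(); // discard this session's cache and cookies
    }
  }
  return results;
}

// Possible usage with Playwright:
// const { chromium } = require("playwright");
// const data = await scrapeWithRestarts(urls, () => chromium.launch(), myScrapePage, 25);
```

A smaller batch size means fresher data but more time spent starting browsers, so the right value depends on how aggressively the target site caches.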
But what if the server behind the PWA is caching data on its end? Well, that’s a whole other beast… 👹
Unfortunately, there’s not much you can do about server-side caching. At the same time, some servers serve cached responses based on the headers in incoming requests. Thus, you can try to change some request headers, like the User-Agent. 🔄
Discover the best User-Agent for web scraping!
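Here is one way the header-tweaking idea might be sketched. The User-Agent strings are sample values, and adding `Cache-Control: no-cache` and `Pragma: no-cache` is an extra assumption on top of the User-Agent tip: they are standard request headers, but whether a given server or CDN honors them is entirely up to that server.

```javascript
// Sample User-Agent strings (illustrative values, swap in your own list)
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
];

// Build headers for the n-th request: rotate the User-Agent round-robin and
// ask intermediaries to revalidate instead of serving a stored copy.
function buildRequestHeaders(requestIndex, userAgents = USER_AGENTS) {
  return {
    "User-Agent": userAgents[requestIndex % userAgents.length],
    "Cache-Control": "no-cache",
    Pragma: "no-cache",
  };
}

// Possible usage:
// await page.setExtraHTTPHeaders(buildRequestHeaders(i)); // Playwright or Puppeteer
// const res = await fetch(url, { headers: buildRequestHeaders(i) }); // Node 18+
```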
Ever wondered why websites seem to show you content you're almost too interested in? It’s not magic—it's machine learning at work. 💡
Today, more and more web pages serve personalized content tailored to your preferences. Based on your searches, site interactions, purchases, views, and other online behaviors, ML algorithms understand what you like and web pages serve content accordingly.
Is it useful? Absolutely—a huge time-saver! ⏱️
Is it ethical? Well, you did agree to those terms of service, so... let’s go with yes. 🤷
But here’s the challenge for web scraping: in the old days, you’d only worry about sites changing their HTML structure occasionally. Now, web pages change continuously, potentially delivering a different experience every single time you visit.
So, how do you handle this? To get consistent results, you can start your browser automation tools with pre-stored sessions, which help ensure the content stays predictable. Tools like Playwright let you create a BrowserContext from a saved storage state for exactly that purpose:
const browserContext = await browser.newContext({
  // load the context storage state from a JSON file
  storageState: "session.json"
});
// open the page in the context that carries the stored session
const page = await browserContext.newPage();
To avoid personalized content, you should also aim to standardize parameters like language and IP location—as these, too, can influence the content displayed. 🗺️
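One way to pin those parameters down is to build every context from the same options object. The sketch below uses Playwright's `locale`, `timezoneId`, `proxy`, and `storageState` context options; the helper name, the default values, and the proxy URL are assumptions for illustration only.

```javascript
// Build one consistent set of Playwright context options so every run sees
// the site with the same language, time zone, exit IP, and stored session.
function buildContextOptions({
  storageStatePath,
  locale = "en-US",
  timezoneId = "America/New_York",
  proxyServer,
} = {}) {
  const options = { locale, timezoneId };
  // Reuse a previously saved session (cookies and localStorage), if any
  if (storageStatePath) options.storageState = storageStatePath;
  // Route traffic through a fixed proxy to keep the IP location stable
  if (proxyServer) options.proxy = { server: proxyServer };
  return options;
}

// Possible usage:
// const context = await browser.newContext(
//   buildContextOptions({ storageStatePath: "session.json", proxyServer: "http://proxy.example:8080" })
// );
// To create session.json in the first place, save it from an existing context:
// await context.storageState({ path: "session.json" });
```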
And here’s a final tip: Always inspect sites in incognito mode before scraping. That way, you get a “blank slate” session, free of personalized data. This helps you better understand the content normally available on the site. 🥷
Now, the hot topic of the moment: AI! 🔥
AI is rewriting the playbook on how we build sites. What used to take months now happens in seconds or minutes! ⏱️
For a quick overview of how AI-based web-building tech is transforming the game, see the following video:
The result? Sites are changing layout, structure, and design faster than ever. Even content is getting the AI treatment, with editors churning out massive amounts of text, images, and videos in a flash. ⚡
And that's only the beginning…
Imagine a future where sites can generate pages dynamically based on what you click or search for. It’s like they’re morphing in real time, adapting to each user.
All that randomness is a nightmare for traditional web scraping scripts. 😱
Here’s the flip side, though. Just as AI speeds up website updates, you can use AI-powered web scraping to adapt your scripts on the fly. Want to dive in deeper? Read our guide on AI for web scraping.
Another possible solution, especially to avoid errors, is to create independent processes that monitor pages for changes, alerting you before your script breaks. For example, through a Telegram message. 📩
See how to build a page change Telegram notification bot.
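A monitoring process like that can be sketched as a structural fingerprint: hash the page's tags, ids, and classes, and raise an alert when the hash changes. Everything here is an assumption for illustration, including the function names and the regex-based tag extraction, which is deliberately crude (a real HTML parser would be more robust). The `notify` callback is left abstract; it could, for example, call the Telegram Bot API's `sendMessage` method.

```javascript
const crypto = require("node:crypto");

// Reduce HTML to its sequence of opening tags plus id/class values, then hash it.
// Text content is ignored, so ordinary data updates do not trigger alerts.
function structureFingerprint(html) {
  const parts = [];
  for (const match of html.matchAll(/<([a-z][a-z0-9-]*)\b([^>]*)>/gi)) {
    const [, tag, attrs] = match;
    const id = /\sid\s*=\s*["']([^"']*)["']/i.exec(attrs);
    const cls = /\sclass\s*=\s*["']([^"']*)["']/i.exec(attrs);
    parts.push(`${tag.toLowerCase()}#${id ? id[1] : ""}.${cls ? cls[1] : ""}`);
  }
  return crypto.createHash("sha256").update(parts.join(">")).digest("hex");
}

// Compare the current page against the last known fingerprint and notify on change.
// Pass `null` as previousFingerprint on the very first run.
async function detectChange(previousFingerprint, html, notify) {
  const fingerprint = structureFingerprint(html);
  const changed = previousFingerprint !== null && previousFingerprint !== fingerprint;
  if (changed) {
    await notify("Page structure changed: review the scraper's selectors.");
  }
  return { changed, fingerprint };
}

// Possible usage inside a scheduled job:
// const html = await (await fetch("https://example.com/products")).text();
// const { fingerprint } = await detectChange(lastFingerprint, html, sendAlert);
```

Hashing only the structure is a trade-off: it stays quiet when prices or titles change but fires when a selector your scraper relies on is likely to break.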
Almost every solution we’ve covered so far assumes that modern sites are highly interactive. That means if you want to scrape them, you must use a browser automation tool. But there’s a weak spot in this approach: the browser itself!
Browsers aren't built for scraping! 😲
Sure, you can tweak them with extensions (like with Puppeteer Extra) or implement all the tweaks mentioned above. But with today’s AI-driven bot detection, traditional browsers are increasingly easy to spot, especially when sites embrace advanced anti-scraping tech like user behavior analysis.
So, what's the solution? A powerful scraping browser that:
Runs in headed mode like a regular browser to blend in with real users.
Scales effortlessly in the cloud, saving you time and infrastructure costs.
Integrates rotating IPs from one of the largest, most reliable proxy networks.
Auto-solves CAPTCHAs, manages browser fingerprinting, and customizes cookies and headers, all while handling retries for you.
Works seamlessly with top automation tools like Playwright, Selenium, and Puppeteer.
This isn’t just a futuristic idea. It’s here, and it’s exactly what Bright Data’s Scraping Browser offers. Want a deeper look? See this video:
Now you know what modern web scraping demands—especially when it comes to taking on AI-driven SPAs and PWAs!
You’ve definitely picked up some pro tips here, but remember, this is just Part 2 of our six-part adventure into advanced web scraping! So, keep that seatbelt fastened because we’re about to dive into even more cutting-edge tech, clever solutions, and insider tips.
Next stop? Optimization secrets for faster, smarter scrapers! 🚀