Note: This is Part 2 of our six-piece series on Advanced Web Scraping. Want to start from the beginning? Catch up by reading Part 1!
If you're into web scraping, you're probably already well-acquainted with most of the usual challenges. But with the Web changing at warp speed, especially thanks to the AI boom, there are tons of new variables in the scraping game. To level up as a web scraping expert, you must get a grip on them all!
In this guide, you'll discover advanced web scraping techniques and crack the code on how to scrape today's modern sites, even with SPAs, PWAs, and AI in the mix! 💪
What's the Deal with SPAs, PWAs, and AI-Powered Sites?
Back in the day, websites were just a bunch of static pages managed by a web server. Fast-forward to now, and the Web's more like a bustling metropolis.
We've jumped from server-side to client-side rendering. Why? Because our mobile devices are more powerful than ever, so letting them handle some of the load just makes sense. 📲
Sure, you probably already know all that, but to understand where we are today, we gotta know where we started. Today, the Internet is a mix of static sites, dynamic server-rendered sites, SPAs, PWAs, AI-driven sites, and more. 🕸️
And don't worry: SPA, PWA, and AI aren't secret acronyms for government agencies. Let's break down this alphabet soup. 🥣
SPA: Single-Page Application
An SPA (Single-Page Application) isn't literally one page. Rather, it handles navigation without reloading the whole page each time. Think of it like Netflix: click around and watch the content change instantly without that annoying page reload. 🍿
It's smooth, fast, and lets you stay in the flow.
PWA: Progressive Web App
PWAs are like web apps on steroids.
Technically speaking, a PWA (Progressive Web App) uses cutting-edge web capabilities to give you that native app feel right from your browser.
- Offline functionality? ✅
- Push notifications? ✅
- Near-instant loading through caching? ✅
In most cases, you can also install PWAs directly on your device!
AI-Powered Sites
AI-powered sites bring a sprinkle of machine learning magic. From dynamically generated designs and chatbots to personalized recommendations, these sites make you feel like the site knows you. 🤖 ✨
It's not just browsing. It's an interactive experience that adapts to you.
Here's the Fun Part
These categories? Not mutually exclusive!
You can layer them like a parfait. 🍨 A PWA can also be an SPA, and both can leverage AI to make things smarter and faster. So yeah, it can get a little wild out there!
Advanced Data Scraping: Navigating Today's Web Jungle
Long story short, the rise of SPAs, PWAs, and AI-powered sites has made the web a whole lot more complex. And, yep, that means web scraping is more challenging than ever, with a ton of new factors to consider. 💣
And what about Web 3.0? Well, it's a bit early to say what impact it'll have on web scraping, but some experts are already speculating…
To get a head start on bypassing today's most common (and annoying) obstacles in modern site scraping, take a look at this video from our friend Forrest Knight. Chapter 3 covers exactly what you're looking for.
https://www.youtube.com/watch?v=vxk6YPRVg_o&embedable=true
Let's now see what you need to consider when performing advanced web scraping on modern sites!
⚠️ Warning: Don't get discouraged if the first few tips sound familiar. Keep going, because there are plenty of fresh insights as we get deeper!
Dynamic Content via AJAX and Client-Side Rendering
These days, most sites are either fully rendered on the client side via JavaScript (that's client-side rendering) or have dynamic sections that load data or change the DOM of the page as you interact with it.
If you've used a browser in the last decade, you know what we're talking about. This dynamic data retrieval isn't magic: it's powered by AJAX technology! (And no, not the football club Ajax 🔴⚪, different kind of magic here.)
You probably already know what AJAX is, but if not, MDN's docs are a great place to start. Now, is AJAX a big deal for web scraping?
Not really! With browser automation tools like Playwright, Selenium, or Puppeteer, you can command your script to load a webpage in a browser, AJAX requests included. Just grab one of the best headless browser tools, and you're set!
For more guidance, read our full tutorial on scraping dynamic sites in Python.
🚨 But, wait… There's a pro tip! 🚨
Most AJAX-based pages pull in dynamic data through API calls. You can catch these requests by opening the Network tab in your browser's DevTools while loading a page.
You'll either see:
- One or more REST APIs to different endpoints.
- One or more GraphQL API calls to a single endpoint, which you can query using GraphQL.
In both cases, this opens the door to scraping by targeting those API calls directly. Just intercept and pull that data. As easy as that!
See the video below for a quick walkthrough:
https://www.youtube.com/watch?v=ji8F8ppY8bs&embedable=true
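Once the Network tab reveals a GraphQL endpoint, calling it directly could look roughly like the sketch below. Treat it as an illustration under assumptions, not a recipe: the function names, the `products` query, and its fields are all hypothetical, and the real query shape depends entirely on the target site's schema. It relies on the global `fetch` that ships with Node.js 18+.

```javascript
// Build the options for a POST request to a GraphQL endpoint.
function buildGraphQLRequest(query, variables = {}) {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Accept: "application/json",
    },
    body: JSON.stringify({ query, variables }),
  };
}

// Hypothetical query: copy the real one from the request you see in DevTools.
const PRODUCTS_QUERY = `
  query Products($first: Int!) {
    products(first: $first) { name price }
  }
`;

// Send the query and return the `data` part of the response.
// `fetchImpl` defaults to the global fetch available in Node.js 18+.
async function scrapeViaGraphQL(endpoint, query, variables, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(endpoint, buildGraphQLRequest(query, variables));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const payload = await response.json();
  // GraphQL servers commonly report failures in an `errors` array
  if (payload.errors) {
    throw new Error(`GraphQL errors: ${JSON.stringify(payload.errors)}`);
  }
  return payload.data;
}
```

A REST endpoint works the same way, minus the query body: request the URL you spotted in DevTools and parse the JSON it returns.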
Lazy Loading, Infinite Scrolling, and Dynamic User Interaction
Web pages are more interactive than ever, with designers constantly experimenting with new ways to keep us engaged. In fact, some interactions, like infinite scrolling, have even become standard. (Ever found yourself endlessly scrolling through Netflix? Make sure to check out the right series!)
So, how do we tackle all those tricky interactions in web scraping? Drumroll... 🥁
With browser automation tools! (Yeah, again!)
The most modern ones, like Playwright, have built-in methods to handle common interactions. And when something unique pops up that they don't cover? You can usually add custom JavaScript code to do the trick.
In particular:
- Playwright offers the evaluate() method to run custom JS right on the page.
- Selenium provides execute_script(), which lets you execute JavaScript in the browser.
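As a rough illustration of the custom-JS approach, here is a minimal sketch of an infinite-scroll helper built on Playwright's `page.evaluate()` and `page.waitForTimeout()`. The helper name `scrollToBottom` and its `maxRounds` and `pauseMs` options are hypothetical, and it assumes `page` is a Playwright `Page` object created elsewhere in your script.

```javascript
// Scroll a Playwright page until its height stops growing.
// Returns the number of scroll rounds performed.
async function scrollToBottom(page, { maxRounds = 20, pauseMs = 1000 } = {}) {
  let previousHeight = 0;
  for (let round = 1; round <= maxRounds; round++) {
    // run custom JS in the page: jump to the bottom and report the height
    const height = await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
      return document.body.scrollHeight;
    });
    // no growth since the last round: assume all content is loaded
    if (height === previousHeight) return round;
    previousHeight = height;
    // give lazy-loaded content time to arrive before checking again
    await page.waitForTimeout(pauseMs);
  }
  // safety cap reached: the feed may be truly endless
  return maxRounds;
}
```

The fixed pause is the simplest possible choice. On slow sites, waiting for a specific network response or element tends to be more reliable than a timer.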
We know, you probably have a handle on these basics already, so no need to dive deep here. But if you want the full scoop, see our complete guides on these tools.
Content Caching in PWAs
Here's where things get spicy! 🌶️
PWAs are built to work offline and rely heavily on caching. While that's great for end users, it creates a headache for web scraping because you want to retrieve fresh data.
So, how do you handle caching when scraping, especially when dealing with a PWA? Well, most of the time, you'll be using a browser automation tool. After all, PWAs are typically client-side rendered and/or rely on dynamic data retrieval.
The good news? Browser automation tools start fresh browser sessions every time you run them. And in the case of Puppeteer and Playwright, they even launch in incognito mode by default. But here's the catch: Incognito/New sessions aren't cache- or cookie-free! 🤯
The more you interact with a site in your scraping script, the more likely the browser will start caching requests, even in incognito mode. To tackle the issue, you can restart the headless browser periodically.
Or, with Puppeteer, you can disable caching entirely with a simple command:
// pass false to turn the browser cache off for this page
await page.setCacheEnabled(false);
But what if the server behind the PWA is caching data on its end? Well, that's a whole other beast… 👹
Unfortunately, there's not much you can do about server-side caching. At the same time, some servers serve cached responses based on the headers in incoming requests. Thus, you can try to change some request headers, like the User-Agent.
Discover the best User-Agent for web scraping!
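One way to experiment with the header idea is sketched below. It combines the User-Agent tweak with two additions that are assumptions on our part rather than anything the target site promises to honor: the standard `Cache-Control: no-cache` request header and a throwaway query parameter. The helper name `freshRequest` and the `_cb` parameter are hypothetical.

```javascript
// Return a copy of `url` with a throwaway query parameter, plus request
// headers that ask caches along the way to revalidate the response.
function freshRequest(url, userAgent) {
  const target = new URL(url);
  // `_cb` is an arbitrary parameter name; servers that key their cache on
  // the full URL will treat each value as a distinct resource
  target.searchParams.set("_cb", Date.now().toString());
  return {
    url: target.toString(),
    headers: {
      "Cache-Control": "no-cache",
      Pragma: "no-cache",
      "User-Agent": userAgent,
    },
  };
}
```

You could then pass the result to an HTTP client, for example `fetch(request.url, { headers: request.headers })`. Whether the response is actually fresh depends on how the server's cache is configured, so compare a few responses before relying on it.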
Context-Specific Content
Ever wondered why websites seem to show you content you're almost too interested in? It's not magic: it's machine learning at work. 💡
Today, more and more web pages serve personalized content tailored to your preferences. Based on your searches, site interactions, purchases, views, and other online behaviors, ML algorithms learn what you like, and web pages serve content accordingly.
Is it useful? Absolutely, and a huge time-saver! ⏱️
Is it ethical? Well, you did agree to those terms of service, so... let's go with yes. 🤷
But here's the challenge for web scraping: in the old days, you'd only worry about sites changing their HTML structure occasionally. Now, web pages change continuously, potentially delivering a different experience every single time you visit.
So, how do you handle this? To get consistent results, you can start your browser automation tools with pre-stored sessions, which help ensure the content stays predictable. Tools like Playwright provide the BrowserContext object for that purpose, among others:
const browserContext = await browser.newContext({
  // load the context storage state from a JSON file
  storageState: "session.json"
});
// open a new page inside the pre-populated context
const page = await browserContext.newPage();
To avoid personalized content, you should also aim to standardize parameters like language and IP location, as these, too, can influence the content displayed. 🗺️
And here's a final tip: Always inspect sites in incognito mode before scraping. That way, you get a "blank slate" session, free of personalized data. This helps you better understand the content normally available on the site. 🥷
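To pin those parameters down in Playwright, a small helper can assemble the options passed to `browser.newContext()`. This is only a sketch: `buildContextOptions` and its defaults are hypothetical, while `storageState`, `locale`, `timezoneId`, and `proxy` are options that `newContext()` accepts.

```javascript
// Assemble browser.newContext() options that keep language, time zone,
// and (optionally) exit IP identical across scraping runs.
function buildContextOptions({
  storageState = "session.json",
  locale = "en-US",
  timezoneId = "America/New_York",
  proxyServer,
} = {}) {
  const options = { storageState, locale, timezoneId };
  // route traffic through a fixed proxy only when one is provided,
  // so the site always sees the same IP location
  if (proxyServer) {
    options.proxy = { server: proxyServer };
  }
  return options;
}

// Usage inside a Playwright script (placeholder proxy address):
// const context = await browser.newContext(
//   buildContextOptions({ proxyServer: "http://proxy.example.com:8080" })
// );
```

Keeping the options in one function makes it harder for two scraping runs to drift apart in a way that silently changes the content you get back.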
AI-Generated Sites and Web Pages
Now, the hot topic of the moment: AI! 🔥
AI is rewriting the playbook on how we build sites. What used to take months now happens in seconds or minutes! ⏱️
For a quick overview of how AI-based web-building tech is transforming the game, see the following video:
https://www.youtube.com/watch?v=z9Asx8vdyP8&embedable=true
The result? Sites are changing layout, structure, and design faster than ever. Even content is getting the AI treatment, with editors churning out massive amounts of text, images, and videos in a flash. ⚡
And that's only the beginning…
Imagine a future where sites can generate pages dynamically based on what you click or search for. It's like they're morphing in real time, adapting to each user.
All that randomness is a nightmare for traditional web scraping scripts. 😱
Here's the flip side, though. Just as AI speeds up website updates, you can use AI-powered web scraping to adapt your scripts on the fly. Want to dive in deeper? Read our guide on AI for web scraping.
Another possible solution, especially to avoid errors, is to create independent processes that monitor pages for changes, alerting you before your script breaks. For example, through a Telegram message. 📩
See how to build a page change Telegram notification bot.
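The core of such a monitor is a check that ignores text changes but reacts to structural ones. Here is a minimal sketch of that check, assuming you already have two HTML snapshots of the same page as strings. The function names are hypothetical, the regex is a crude stand-in for a real HTML parser, and sending the actual alert is left out.

```javascript
// Reduce an HTML string to the sequence of its opening tags and class names.
// Text content is ignored, so only structural edits change the result.
function structuralSignature(html) {
  const tagPattern = /<([a-zA-Z][a-zA-Z0-9-]*)((?:[\s/][^>]*)?)>/g;
  const parts = [];
  for (const match of html.matchAll(tagPattern)) {
    const tag = match[1].toLowerCase();
    const classMatch = /\sclass\s*=\s*["']([^"']*)["']/.exec(match[2]);
    // sort class names so their order in the attribute does not matter
    const classes = classMatch
      ? classMatch[1].trim().split(/\s+/).sort().join(".")
      : "";
    parts.push(classes ? `${tag}.${classes}` : tag);
  }
  return parts.join(">");
}

// True when the two snapshots differ in tags or classes.
function hasLayoutChanged(previousHtml, currentHtml) {
  return structuralSignature(previousHtml) !== structuralSignature(currentHtml);
}
```

In practice, you would likely restrict the comparison to the part of the page your selectors depend on, since ads and banners shift constantly and would trigger false alarms.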
AI-Bot Detection: The Mother of All Bot Protection Technologies
Almost every solution we've covered so far assumes that modern sites are highly interactive. That means if you want to scrape them, you must use a browser automation tool. But there's a weak spot in this approach: the browser itself!
Browsers aren't built for scraping! 😲
Sure, you can tweak them with extensions (like with Puppeteer Extra) or implement all the tweaks mentioned above. But with today's AI-driven bot detection, traditional browsers are increasingly easy to spot, especially when sites embrace advanced anti-scraping tech like user behavior analysis.
So, what's the solution? A powerful scraping browser that:
- Runs in headed mode like a regular browser to blend in with real users.
- Scales effortlessly in the cloud, saving you time and infrastructure costs.
- Integrates rotating IPs from one of the largest, most reliable proxy networks.
- Auto-solves CAPTCHAs, manages browser fingerprinting, and customizes cookies and headers, all while handling retries for you.
- Works seamlessly with top automation tools like Playwright, Selenium, and Puppeteer.
This isn't just a futuristic idea. It's here, and it's exactly what Bright Data's Scraping Browser offers. Want a deeper look? See this video:
https://www.youtube.com/watch?v=kuDuJWvho7Q&embedable=true
Final Thoughts
Now you know what modern web scraping demands, especially when it comes to taking on AI-driven SPAs and PWAs!
You've definitely picked up some pro tips here, but remember, this is just Part 2 of our six-part adventure into advanced web scraping! So, keep that seatbelt fastened because we're about to dive into even more cutting-edge tech, clever solutions, and insider tips.
Next stop? Optimization secrets for faster, smarter scrapers!