Did you know that Playwright allows you to block requests and thus speed up your scraping or testing? You can block certain resource types, block requests by domain, and more.

## Prerequisites

For the code to work, you will need Python 3 installed. Some systems have it pre-installed. After that, install Playwright and the browser binaries for Chromium, Firefox, and WebKit:

```bash
pip install playwright
playwright install
```

## Intro to Playwright

Playwright "is a Python library to automate Chromium, Firefox and WebKit browsers with a single API." It allows us to browse the Internet with a headless browser programmatically.

Playwright is also available for Node.js, and everything shown below can be done with a similar syntax. Check the docs for more details.

Here's how to start a browser (i.e., Chromium) in a few lines, navigate to a page, and get its title:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.zenrows.com/")
    print(page.title())  # Web Scraping API & Data Extraction - ZenRows
    page.context.close()
    browser.close()
```

## Logging Network Events

Subscribe to events such as "request" or "response" and log their content to see what is happening. Since we did not tell Playwright otherwise, it will load the entire page: HTML, CSS, JavaScript, images, and so on. Add these two lines before requesting the page to see what's going on:

```python
page.on("request", lambda request: print(
    ">>", request.method, request.url, request.resource_type))
page.on("response", lambda response: print(
    "<<", response.status, response.url))
page.goto("https://www.zenrows.com/")

# >> GET https://www.zenrows.com/ document
# << 200 https://www.zenrows.com/
# >> GET https://cdn.zenrows.com/images_dash/logo-instagram.svg image
# << 200 https://cdn.zenrows.com/images_dash/logo-instagram.svg
# ...
```
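Before deciding what to block, it can help to tally the logged events by resource type. This is a minimal sketch, not part of the original article: in a real run you would append each request's `resource_type` from the `page.on("request")` handler above; the sample list here is made up for illustration.

```python
from collections import Counter

# In a real run, collect the types from the "request" event handler:
#   page.on("request", lambda r: seen_types.append(r.resource_type))
# This sample list stands in for the output of one page load.
seen_types = [
    "document", "stylesheet", "script", "script",
    "image", "image", "image", "font",
]

# Tally how many requests of each type the page triggered.
tally = Counter(seen_types)
for resource_type, count in tally.most_common():
    print(f"{resource_type}: {count}")
```

A tally like this makes it obvious which types (usually images and scripts) account for most of the traffic.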
The entire output is almost 50 lines long, with 24 resource requests. We probably don't need most of those for scraping, so let's see how to block them and save time and bandwidth.

## Blocking Resources

Why load resources and content that we won't use? Learn how to avoid unnecessary data and network requests with these techniques.

### Block by Glob Pattern

`page` also exposes a `route` method that will execute a handler for each route matching a URL or glob pattern. Let's say that we don't want SVGs to load: a pattern like `"**/*.svg"` will match requests ending with that extension. As for the handler, we need no logic for the moment, only to abort the request. For that, we'll use a lambda and the `route` param's `abort` method:

```python
page.route("**/*.svg", lambda route: route.abort())
page.goto("https://www.zenrows.com/")
```

Note: according to the official documentation, patterns like `"**/*.{png,jpg,jpeg}"` should work, but we found otherwise. Anyway, it's doable with the next blocking strategy.

### Block by Regex

If (for some reason 😜) you are into regular expressions, feel free to use them, but you must compile them first. We will block three image extensions in this case. Regexes are tricky, but they offer a ton of flexibility:

```python
import re

# ...
page.route(re.compile(r"\.(jpg|png|svg)$"),
           lambda route: route.abort())
page.goto("https://www.zenrows.com/")
```

Now there are 23 requests and only 15 responses: we saved 8 images from being downloaded!

### Block by Resource Type

But what happens if a site uses the "jpeg" extension instead of "jpg"? Or avif, gif, webp? Should we maintain an updated list? Luckily for us, the `route` param exposed in the lambda function above includes the original request and its resource type, and one of those types is `image`. Perfect! You can check the whole resource type list in the docs.

We'll now match every request (`"**/*"`) and add conditional logic to the lambda function: if the request is for an image, abort it as before; else, continue with it as usual.
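The intro also mentioned blocking requests by domain, which is useful for third-party trackers or CDNs. One way to do it, sketched here rather than taken from the original article, is a small predicate that checks the request's hostname against a blocklist; it plugs into `page.route` the same way as the lambdas above. The domains listed are hypothetical examples.

```python
from urllib.parse import urlparse

# Hypothetical blocklist; replace with real tracker/CDN domains.
BLOCKED_DOMAINS = {"tracker.example.com", "ads.example.net"}

def is_blocked(url: str) -> bool:
    """Return True when the URL's host is a blocked domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)

# Wiring it into Playwright (inside a page context):
# page.route("**/*", lambda route: route.abort()
#            if is_blocked(route.request.url) else route.continue_())

print(is_blocked("https://tracker.example.com/pixel.gif"))  # True
print(is_blocked("https://www.zenrows.com/"))               # False
```

Matching on the parsed hostname (rather than substring-searching the URL) avoids accidentally blocking URLs that merely mention a domain in their path or query string.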
"**/*" page.route("**/*", lambda route: route.abort() if route.request.resource_type == "image" else route.continue_() ) page.goto("https://www.zenrows.com/") Take into consideration that some trackers use images. It is probably not a big deal when scraping or testing, but just in case. Function handler We can also define functions for the handlers instead of using lambdas. That comes in handy in case we need to reuse it or it grows past a single conditional. Suppose that we want to block aggressively now. Looking at the output from the previous runs will show a list of the used resources. We'll add those to a list and then check if the type is in that list. excluded_resource_types = ["stylesheet", "script", "image", "font"] def block_aggressively(route): if (route.request.resource_type in excluded_resource_types): route.abort() else: route.continue_() # ... page.route("**/*", block_aggressively) page.goto("https://www.zenrows.com/") We are entirely in control now, and the versatility is absolute. From , the original URL, the headers, and several other pieces of info are available. routes.request Being even more strict: block everything that is not type. That will effectively prevent anything but the initial HTML from being loaded. document def block_aggressively(route): if (route.request.resource_type != "document"): route.abort() else: route.continue_() There is a single response now! We got the HTML without any other resource being downloaded. We sure have saved a lot of time and bandwidth, right? But... how much exactly? Measuring Performance Boost We can only say that it got better if we could measure the differences. We will take a look at three approaches. But just running a script with several URLs will do. Spoiler: we did that for 10 URLs, 1.3 seconds VS 8.4. HAR File For those of you used to checking the DevTools Network tab, we have good news! Playwright allows HAR recording by providing an extra parameter in the method. As easy as that. 
```python
page = browser.new_page(record_har_path="playwright_test.har")
page.goto("https://www.zenrows.com/")
```

There are some HAR visualizers out there, but the easiest way is to use Chrome DevTools: open the Network tab and click on the import button, or drag & drop the HAR file.

Check time! Below is the comparison between two different HAR files: the first one without blocking (regular navigation), the second one blocking everything except the initial document. Almost every resource has a "-1" Status and "Pending" Time on the blocking side. That's DevTools's way of telling us that those resources were blocked and not downloaded. We can see clearly on the bottom left that we performed fewer requests, and the transferred data is a fraction of the original: from 524 kB to 17.4 kB, a 96% cut.

### Browser's Performance API

Browsers offer a `performance` interface that shows, among other things, navigation timing. Playwright can evaluate JavaScript, so we'll use it to print those results:

```python
page.goto("https://www.zenrows.com/")
print(page.evaluate("JSON.stringify(window.performance)"))

# {"timing":{"connectStart":1632902378272,"navigationStart":1632902378244, ...
```

The output will be a JSON object with a lot of timestamps. The most straightforward check is to get the difference between `navigationStart` and `loadEventEnd`. When blocking, it should be under half a second (e.g., 346 ms); for regular navigation, above a second or even two (e.g., 1363 ms). As you can see, blocking can make a load a second faster, even more for slower sites. The less you download, the faster you can scrape!

### CDP Session

Going a step further, we can connect directly with the Chrome DevTools Protocol. Playwright creates a CDP session for us, which we can use to extract, for example, performance metrics. We have to create a client from the page context and start communicating with CDP. In our case, we will enable "Performance" before visiting the page and get the metrics after it:
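The `loadEventEnd` minus `navigationStart` difference can be computed back in Python from the evaluated JSON string. A minimal sketch; the timing values below are made-up sample numbers in the same shape as `window.performance.timing`, and in real code the string would come from `page.evaluate(...)` as shown above.

```python
import json

# Made-up sample in the shape of JSON.stringify(window.performance);
# in real code this string comes from page.evaluate(...).
raw = '{"timing": {"navigationStart": 1632902378244, "loadEventEnd": 1632902379607}}'

timing = json.loads(raw)["timing"]
load_ms = timing["loadEventEnd"] - timing["navigationStart"]
print(f"page load took {load_ms} ms")  # page load took 1363 ms
```

Both values are Unix-epoch milliseconds, so the subtraction directly yields the page load time in milliseconds.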
```python
client = page.context.new_cdp_session(page)
client.send("Performance.enable")
page.goto("https://www.zenrows.com/")
print(client.send("Performance.getMetrics"))
```

The output will be a JSON-like structure with interesting values such as node count, process time, JS heap used, and many more.

## Conclusion

We'd like you to part with three main points:

- Load only the necessary resources.
- Save time and bandwidth when possible.
- Measure your efforts and performance before scaling up.

Let's not forget that scraping is a process with multiple steps, one of them being rotating proxies. Those add to the processing time and sometimes charge per bandwidth. By blocking resources, you can achieve precisely the same results while saving time, bandwidth, and money. Many times, images or CSS are only overhead; other times, there is no need for JS if you only want the initial static content.

And if you liked the content, please share it!

Also published on: https://www.zenrows.com/blog/blocking-resources-in-playwright