Improve Early Failure Detection (EFD) in Web Scraping With Benchmark Data

Written by hackerclftuaqw60000356o581zc4bj, CEO and co-founder at databoutique.com | Published 2023/05/11

TL;DR: Web scraping is a process of continuously calibrating a tool that measures a constantly changing object. Quality Assurance aims to detect potential issues as early as possible - *before SLAs are violated*. To do so, we need to increase the Failure Detection Rate (FDR) and reduce the False Alarm Rate (FAR).

The “Completeness” Issue

One of the most common issues in Quality Assurance (QA) for web scraping - disarmingly trivial as it sounds - is ensuring the scraper collects all items from the target website.

It’s a problem of continuously calibrating a tool that measures a constantly changing object.

Why Does It Happen?

From the easiest to detect to the most challenging (which does not mean the easiest to solve), these are the causes of incomplete data collection:

  • the scraper gets blocked by anti-bot systems
  • the scraper gets lost in A/B testing versions of the website
  • the scraper is limited by the paging limits of the website/API
  • the scraper overlooks portions of the website (sometimes sections created after the scraper was designed)

As a result, we have a partial data collection.

Early Failure Detection

Most web scraping use cases have Service Level Agreements (SLAs) that can result in penalty clauses. Quality Assurance aims to detect the potential issue as early as possible - before SLAs are violated.

To do so, we need to increase the Failure Detection Rate (FDR) and reduce the False Alarm Rate (FAR). With a cherry on top: keeping costs low.

How To Detect Failures

Time series analysis

We can monitor the item count over time and trigger an alert when it drops. It’s a good starting point: effective against sudden changes (e.g., a 50% drop), but less useful when variations are incremental, either generating too many false alarms (raising the FAR) or failing to detect errors.

This happens because:

  1. Websites change quickly, especially large ones
  2. We often lack enough historical data to model trends or seasonality, which would allow applying more sophisticated time-series algorithms.

The most critical limitation of this method is that it does not spot missing items if they have never been captured by the scraper.
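To make this concrete, below is a minimal sketch of such a threshold check, in Python. The function name, counts, and 50% threshold are illustrative assumptions, not part of any specific tooling.

```python
# Minimal item-count monitor: alert when the latest count falls well
# below the recent average. Threshold and data are illustrative.

def detect_drop(counts: list[int], threshold: float = 0.5) -> bool:
    """Return True when the newest count is below `threshold` times
    the average of all previous counts."""
    if len(counts) < 2:
        return False  # not enough history to compare against
    baseline = sum(counts[:-1]) / len(counts[:-1])
    return counts[-1] < threshold * baseline

# A sudden ~50% drop triggers the alert...
print(detect_drop([10_500, 10_340, 10_410, 5_100]))  # True
# ...but a slow, incremental leak never does.
print(detect_drop([10_500, 10_290, 10_080, 9_880]))  # False
```

Note how the second series, which loses roughly 2% of its items per day, sails through: exactly the incremental failure mode described above.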

Example

A fashion e-commerce website might have a “sales” section that only appears during official sales periods. If you build your scraper when the section is absent, you might never realize you are missing the sales items.

Manual Inspection (Ground Truth)

Manual inspection gives the highest confidence in the results, as discussed in this post. It provides a so-called Ground Truth: you can benchmark the item count you collected against a count performed manually.

Limitations:

  1. Hardly feasible for large websites (you can reliably tell how many items are on the Allbirds website, but not so reliably on Farfetch).
  2. Hardly scalable: it may work for a few websites inspected rarely, but things get difficult quickly when you need to cover multiple large websites at high frequency (read the Data Boutique approach on this in the article on Ground Truth Testing).

This keeps a good False Alarm Rate (FAR) but does not achieve a reasonable Failure Detection Rate (FDR), as the inspection frequency would be too low.

Independent Benchmarking

An intelligent way to solve this is to benchmark your result, in terms of item count, against an independent collection.

For this approach to properly work, the benchmark data has to be:

  • Independent: to reduce the chance of being affected by the same coding biases
  • Cost-effective: it goes without saying, web scraping is costly enough.

An independent data collection is (almost) uncorrelated with your own: the two are correlated insofar as they observe the same object, so a failure of the observed website would indeed cause a loss in both collections; but they are the results of independent processes, written and maintained by different teams, with different techniques.

Using a highly reliable data source strongly increases the trustworthiness of results.

Let’s assume your current Failure Detection Rate (FDR) is 90%, meaning your system automatically detects 90% of the cases in which a scraper collects only part of the website. In other terms, your dataset, when published, contains a complete collection 90% of the time.

If we assume that the benchmark data is:

a) as capable of detecting errors as the production data, and

b) independent,

then using external data for QA brings the Failure Detection Rate to 99% (the probability of the union of the two detection events).
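A quick worked check of that number, under the stated assumptions (both detection rates at 90%, failures detected independently):

```python
# Probability that at least one of two independent checks catches a
# partial collection (union of two events). Rates are assumptions.
p_own = 0.90        # FDR of your own QA pipeline
p_benchmark = 0.90  # FDR of the independent benchmark

p_combined = 1 - (1 - p_own) * (1 - p_benchmark)
print(p_combined)  # 0.99
```

In practice, the check is simple: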

  1. Monitor the total item count on your data collection
  2. Benchmark it with the total item count from the same website on Data Boutique
  3. When your count is lower than the benchmark, you have your failure detection.
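A minimal sketch of that comparison follows; the function name and the tolerance parameter are hypothetical, and Data Boutique does not prescribe this interface:

```python
# Flag a potential partial collection when our item count falls more
# than `tolerance` below the independent benchmark's count.

def completeness_alert(own_count: int, benchmark_count: int,
                       tolerance: float = 0.02) -> bool:
    return own_count < benchmark_count * (1 - tolerance)

# Example: we scraped 9,300 items; the benchmark reports 10,000.
if completeness_alert(own_count=9_300, benchmark_count=10_000):
    print("Possible partial collection: investigate before publishing.")
```

A small tolerance absorbs normal churn between the two collection runs, keeping the False Alarm Rate low.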

Why Data Boutique is a smart fit

Since Data Boutique’s datasets embed manual inspections in their QA process, it is very likely that they exceed these levels of FDR. This makes Data Boutique’s data a scalable, cost-efficient, and reliable benchmark for improving your Quality Assurance (QA) process, even when you do web scraping internally.

  1. The two data structures do not have to be the same: you are only comparing item counts, so you do not need the same schema, which makes this very easy to implement. Only the granularity has to be comparable.

  2. You can select a frequency for your QA that is lower than the frequency of your acquisition (if you acquire items daily, you can use only weekly benchmarks, which would still go a very long way in improving data quality tests; see the sketch below).

  3. Since Data Boutique’s data is Fractionable (as explained in this post), the cost of buying this data can be very low if compared to all other quality measures.

In other words, even if the data structure of Data Boutique is not a perfect match for your use case, using it for Quality Testing is a very efficient approach.
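As an illustration of the mixed-frequency point above, here is a minimal sketch of checking daily scrape counts against a weekly benchmark. Dates, counts, tolerance, and helper names are all hypothetical:

```python
# Daily scrape counts checked against the most recent weekly benchmark.
from datetime import date

daily_counts = {
    date(2023, 5, 8): 10_120,
    date(2023, 5, 9): 10_150,
    date(2023, 5, 10): 9_400,  # suspicious dip
}
weekly_benchmark = {date(2023, 5, 8): 10_100}  # benchmark run on Monday

def latest_benchmark(day: date) -> int:
    """Return the most recent benchmark count taken on or before `day`."""
    eligible = [d for d in weekly_benchmark if d <= day]
    return weekly_benchmark[max(eligible)]

for day, count in daily_counts.items():
    if count < latest_benchmark(day) * 0.98:  # 2% tolerance, illustrative
        print(f"{day}: count {count} is below benchmark - check the scraper")
```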


Join the Project


Data Boutique is a community for sustainable, ethical, high-quality web data exchanges. You can browse the current catalog and add your request if a website is not listed. Saving datasets to your interest list will allow sellers to correctly size the demand for datasets and onboard the platform.

More on this project can be found on our Discord channels.


Also published on Data Boutique

