One of the most common issues in Quality Assurance (QA) for web scraping, as trivial as it may sound, is ensuring that the scraper collects all items from the target website.
It’s a problem of continuously calibrating a tool that measures a constantly changing object.
From the easiest to detect to the most challenging (which does not mean the easiest to solve), we have the following causes of incomplete data collection:
In all of these cases, the result is a partial data collection.
Most web scraping use cases have Service Level Agreements (SLAs) that can carry penalty clauses. Quality Assurance aims to detect potential issues as early as possible, before SLAs are violated.
To do so, we need to increase the Failure Detection Rate (FDR) and reduce the False Alarm Rate (FAR). With a cherry on top: keeping costs low.
We can monitor the item count over time and trigger an alert when it drops. It’s a good starting point, but while effective with sudden changes (e.g., a 50% drop), it is less effective when variations are incremental, generating either too many false alarms (a high FAR) or missed errors (a low FDR).
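A minimal sketch of this kind of monitoring, using an illustrative trailing-average baseline and drop threshold (the function name and the 50% threshold are assumptions, not a standard API):

```python
def should_alert(history, current_count, window=7, max_drop=0.5):
    """Alert when the current item count falls below (1 - max_drop)
    of the trailing-window average of previous counts."""
    if len(history) < window:
        return False  # not enough history to establish a baseline
    baseline = sum(history[-window:]) / window
    return current_count < baseline * (1 - max_drop)

counts = [1000, 1010, 990, 1005, 998, 1002, 995]  # baseline averages to 1000
print(should_alert(counts, 480))  # sudden ~50% drop -> True
print(should_alert(counts, 940))  # incremental 6% drift -> False (missed)
```

The second call shows the limitation discussed above: an incremental decline stays under the threshold and is never flagged, while tightening the threshold would instead inflate the false alarm rate.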
This happens because:
The most critical limitation of this method is that it does not spot missing items if they have never been captured by the scraper.
Example
A fashion e-commerce website might have a “sales” section that only appears during official sales periods. If you build your scraper while the section is not there, you might never realize you are missing the sales items.
Manual inspection gives the highest confidence in results, as discussed in this post. It provides a so-called Ground Truth: you can benchmark the item count your scraper collected against the count obtained manually.
Limitations:
This would keep a good False Alarm Rate (FAR) but would not achieve a reasonable Failure Detection Rate (FDR), as the inspection frequency would be too low.
An intelligent way to solve this is to benchmark your results, in terms of item count, against an independent data collection.
For this approach to properly work, the benchmark data has to be:
An independent data collection is (almost) uncorrelated with your own. It is correlated insofar as both look at the same object, so a failure of the observed website would indeed cause a loss in both collections; on the other hand, the two are the results of independent processes, written and maintained by different teams, with different techniques.
Using a highly reliable data source strongly increases the trustworthiness of results.
Let’s assume your current Failure Detection Rate (FDR) is 90%, meaning your system automatically detects 90% of the cases in which a scraper collects only part of the website. Or, in other terms, your dataset, when published, contains a complete collection 90% of the time.
If we assume that the benchmark data is
a) as capable of detecting errors as the production data, and
b) independent,
then a failure escapes detection only when both checks miss it, which happens with probability 0.1 × 0.1 = 1%, raising the combined FDR to 99%.
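Because the two checks are independent, their miss rates multiply. A quick sketch of the arithmetic (the function name is illustrative):

```python
def combined_fdr(fdr_production, fdr_benchmark):
    """Probability that at least one of two independent checks
    detects a failure: 1 minus the product of the miss rates."""
    miss_both = (1 - fdr_production) * (1 - fdr_benchmark)
    return 1 - miss_both

print(combined_fdr(0.90, 0.90))  # two 90% checks combine to 0.99
```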
Since Data Boutique’s datasets embed manual inspections in their QA process, datasets published on Data Boutique are very likely to exceed those FDR levels. Using Data Boutique’s data as a benchmark is therefore a scalable, cost-efficient, and reliable way to improve your Quality Assurance (QA) process, even when you do web scraping internally.
The two data structures do not have to be the same: you are only comparing item counts, not full records, which makes this very easy to implement. Only the granularity has to be comparable.
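The comparison can be as simple as matching counts at a shared granularity, such as per category. A sketch under assumed category names and an illustrative 10% tolerance:

```python
def count_gaps(production_counts, benchmark_counts, tolerance=0.10):
    """Return categories where the production count falls short of the
    benchmark count by more than `tolerance` (relative to the benchmark)."""
    gaps = {}
    for category, bench in benchmark_counts.items():
        prod = production_counts.get(category, 0)
        if bench > 0 and (bench - prod) / bench > tolerance:
            gaps[category] = (prod, bench)
    return gaps

production = {"shoes": 480, "bags": 300}               # no "sale" items scraped
benchmark = {"shoes": 500, "bags": 310, "sale": 120}   # independent collection
print(count_gaps(production, benchmark))  # -> {'sale': (0, 120)}
```

Note how this catches the earlier “sales section” example: a category your scraper has never seen shows up as a gap against the benchmark, something pure count-over-time monitoring can never detect.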
You can select a frequency for your QA that is lower than the frequency of your acquisition (if you acquire items daily, weekly benchmarks would still go a very long way in improving data quality tests).
Since Data Boutique’s data is Fractionable (as explained in this post), the cost of buying this data can be very low if compared to all other quality measures.
In other words, even if the data structure of Data Boutique is not a perfect match for your use case, using it for Quality Testing is a very efficient approach.
Data Boutique is a community for sustainable, ethical, high-quality web data exchanges. You can browse the current catalog and add your request if a website is not listed. Saving datasets to your interest list allows sellers to correctly size the demand for datasets and onboard the platform.
More on this project can be found on our Discord channels.
Also published on Data Boutique