Capturing web data from e-commerce websites is very common. Although each website displays its information differently across the various parts of the UI (product-list page, product-detail page, cart, etc.), it is extremely valuable to identify a standard structure.
The advantages of adopting a standard structure are enormous when we need to coordinate multiple extractions from different websites or, more generally, to build a resilient data extraction pipeline that stays stable through all the variations even a single website can go through over time.
We want to be able to decouple the extraction of data from the acquisition in a database or a data warehouse.
At databoutique.com we have gone through several iterations of the ideal data structure for capturing information from websites, and we came to the following conclusions (feel free to comment or, even better, join our conversation on Discord about this):
- Data structures need to be standardized within a given industry (say fashion, pharmaceuticals, groceries, or electronics).
- Data structures can (and should) differ across industries.
- There can be different structures depending on the level at which the information is captured (product-list page (PLP) vs. product-detail page (PDP) vs. cart page). This point is related to the cost of accessing the different parts of the website.
The additional advantage of organizing data structures this way is that it helps address the final user's request transparently: by being explicit about where the information is captured, we align the request with the cost of actually acquiring it.
Accessing a PDP requires first accessing the PLP. This is why keeping the PLP structure distinct from the PDP structure is important.
Let’s imagine we want to crawl the products of a fashion website like Zalando daily.
While some information might change daily (like product prices or availability), other fields are static, like the product description and product image (let’s leave product-size availability out of this example; it will be treated separately).
We could scan all PLP pages daily and visit each product-detail page (PDP) only once, when we need to capture the details of each product. Let’s do some math and see the difference.
If we were to crawl every product-detail page (PDP) daily, we would have to access all product-list (PLP) pages daily (about 15.6k pages = 1M items / 80 items per page / 80% fill rate), and then visit each of the 1M PDPs, for a total of roughly 370M pages per year (1M × 365 + 15.6k × 365).
If we were to crawl the PLP daily and each PDP only once, we would crawl only about 6.7M pages a year (15.6k × 365 + 1M).
That’s about 55 times (!) cheaper. Put differently, if the cost per page were 0.001 USD, scraping Zalando for a year through the PLP pages would cost about 6.7k USD, vs. 370k USD/year with the daily-PDP approach.
It is very important to make this transparent to the final user: if she or he wants daily refreshes of information that is available only on the PDP, the cost of the data alone will be roughly 55 times higher.
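The arithmetic above is easy to rerun with different assumptions. Here is a minimal sketch; the catalogue size, items per page, fill rate, and per-page cost are the illustrative figures from this example, not measured Zalando numbers:

```python
import math

# Illustrative assumptions from the example above (not measured figures).
CATALOGUE_SIZE = 1_000_000   # products listed on the site
ITEMS_PER_PAGE = 80          # products shown on each PLP page
FILL_RATE = 0.80             # fraction of each page actually filled
COST_PER_PAGE = 0.001        # hypothetical USD per scraped page
DAYS = 365

# PLP pages needed to cover the whole catalogue (~15.6k).
plp_pages = math.ceil(CATALOGUE_SIZE / ITEMS_PER_PAGE / FILL_RATE)

# Strategy A: crawl every PDP daily (the PLP must be crawled first to find them).
pdp_daily_pages = (CATALOGUE_SIZE + plp_pages) * DAYS

# Strategy B: crawl the PLP daily, each PDP only once.
plp_daily_pages = plp_pages * DAYS + CATALOGUE_SIZE

print(f"PDP daily: {pdp_daily_pages:,} pages/year, "
      f"~${pdp_daily_pages * COST_PER_PAGE:,.0f}")
print(f"PLP daily: {plp_daily_pages:,} pages/year, "
      f"~${plp_daily_pages * COST_PER_PAGE:,.0f}")
print(f"Ratio: ~{pdp_daily_pages / plp_daily_pages:.0f}x")
```

With these inputs the script reproduces the numbers above: roughly 370M vs. 6.7M pages per year, a ~55x difference.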
A product list page, in the different versions we encountered, can be categorized as:
- Categorized lists. Nike.com is a good example: the store guides our shopping experience through the categories, displaying products in our area of interest. We can of course use the search bar, but it is secondary in the UX, as we can see from the positioning of the search bar itself.
- Search-driven lists. Amazon is a good example: although there is a category tree, the main UX happens via the search bar.
We will consider categorized lists for now, keeping in mind that we want a structure to crawl the PLP without entering the product-detail page (PDP) for the moment.
This is particularly helpful in large websites, with near or above a million products, where the number of pages crawled varies by several orders of magnitude between PLP and PDP.
Now on to the data structure. There is no gold standard; our decisions came from benchmarking many e-commerce websites and extracting the best structure they have in common.
We need different groups of information:
- Technical info
  - the field structure version
  - the website we are capturing
  - the timestamp or date the info was captured
- Context info
  - the country/geography the acquisition refers to
  - the currency the prices are expressed in
- Content
  - category level 1
  - category level 2
  - category level 3
  - brand
  - product code
  - product title
  - full price
  - discounted price
- Reference fields
  - image URL
  - PDP URL
Note: not all websites have three levels of category and, in some cases, not all branches of the category tree reach full depth. In both cases, we accept “n.a.” (not available) in the fields.
For websites whose branches go deeper than three levels (but to a maximum of 4 or 5), we accept concatenating the last levels into category level 3.
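The field groups and the category rules above can be sketched as a flat record. This is only our illustrative rendering in Python (the field names and the `normalize_categories` helper are hypothetical, not a published Data Boutique spec):

```python
from dataclasses import dataclass

NA = "n.a."  # placeholder for categories that are missing or not applicable


@dataclass
class PlpRecord:
    # Technical info
    field_version: str       # version of this field structure
    website: str             # website we are capturing
    captured_at: str         # timestamp or date the info was captured
    # Context info
    country: str             # geography the acquisition refers to
    currency: str            # currency the prices are expressed in
    # Content
    category_level_1: str
    category_level_2: str
    category_level_3: str
    brand: str
    product_code: str
    product_title: str
    full_price: float
    discounted_price: float
    # Reference fields
    image_url: str
    pdp_url: str


def normalize_categories(levels: list[str]) -> tuple[str, str, str]:
    """Pad short category paths with 'n.a.'; concatenate anything
    deeper than three levels into category level 3."""
    levels = [lvl for lvl in levels if lvl]
    if len(levels) > 3:
        levels = levels[:2] + [" > ".join(levels[2:])]
    while len(levels) < 3:
        levels.append(NA)
    return (levels[0], levels[1], levels[2])
```

For example, `normalize_categories(["Women", "Shoes", "Sneakers", "Running"])` returns `("Women", "Shoes", "Sneakers > Running")`, while `normalize_categories(["Men"])` returns `("Men", "n.a.", "n.a.")`.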
In the Data Boutique version, we use additional fields related to our business model. Check the complete version of this field list for future updates.