Almost all web pages on the Internet contain some noisy blocks like navigation, sidebars, copyright information, privacy notices, and advertisements, which are not directly related to the topic of the web page.
It is important to distinguish the main content blocks from the noisy blocks. Let’s call the main content blocks informative blocks. Extracting information from the informative blocks is the most important task for a web scraper.
All child HTML elements inside each informative block are usually grouped together into similar structures.
1. Challenge. Extraction of informative blocks from the Hacker News (HN) web site.
- Enter the URL https://news.ycombinator.com/ into the address bar at the top of the left panel and click the button next to it to load the web page.
- The scraping process is based on the data patterns you select. Click the “Add Selector” button and start selecting elements on the web page to define patterns for data extraction. For now, it is enough to choose only two CSS selectors from the blocks on the page. When you select one element, all similar elements with the same class in sibling informative blocks are added to the corresponding selector automatically. As a result, 30 “Story Links” with the .storylink class and 29 “Scores” with the .score class are highlighted.
- Pressing the “Preview” button sends a request to the Dataflow kit backend, which generates and returns a sample of the output.
Unfortunately, we’ve got two independent lists of “Scores” and “Story Links”. The resulting fields are not grouped as expected!
We actually expected each “Story Link” to be paired with its corresponding “Score”.
So what went wrong?
Let’s look at the following HTML code describing a block containing the fields mentioned above. I’ve omitted some HTML elements from the real code for brevity.
<a href="http://example.com" class="storylink">Recreating the Death Star Trench Run Scene with Lego</a>
<span class="score">10 points</span>
<a href="http://example2.com" class="storylink">Show HN: JournalBook – Privacy centric, offline first, personal journal app</a>
<span class="score">36 points</span>
In this particular case, the common parent of all these sibling elements is `<tbody>`, and there is no parent element joining together the elements inside each informative block.
The problem is that our scraping algorithm groups fields into a block based on their common parent node in the DOM tree. All of these “Story Links” and “Scores” HTML elements are actually sibling nodes, even though visually they appear to be grouped together inside similar blocks.
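To make the grouping problem concrete, here is a minimal Python sketch. It assumes a simplified rule (collect matching elements per direct parent) and is only an illustration of the idea, not Dataflow kit's actual implementation. The function name `group_by_parent`, the shortened sample titles, and the `WRAPPED` markup with a per-block `<div>` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Flat layout: every field is a direct child of <tbody> (sample data).
FLAT = """
<tbody>
  <a href="http://example.com" class="storylink">Story A</a>
  <span class="score">10 points</span>
  <a href="http://example2.com" class="storylink">Story B</a>
  <span class="score">36 points</span>
</tbody>
"""

# Hypothetical layout in which each block has its own wrapper element.
WRAPPED = """
<tbody>
  <div>
    <a href="http://example.com" class="storylink">Story A</a>
    <span class="score">10 points</span>
  </div>
  <div>
    <a href="http://example2.com" class="storylink">Story B</a>
    <span class="score">36 points</span>
  </div>
</tbody>
"""


def group_by_parent(markup, classes):
    """Group the text of matching elements by their direct parent."""
    root = ET.fromstring(markup)
    groups = []
    for parent in root.iter():
        record = defaultdict(list)
        for child in parent:  # direct children only
            cls = child.get("class")
            if cls in classes:
                record[cls].append((child.text or "").strip())
        if record:
            groups.append(dict(record))
    return groups


if __name__ == "__main__":
    fields = {"storylink", "score"}
    # Flat layout: a single group holding two independent lists.
    print(group_by_parent(FLAT, fields))
    # Wrapped layout: one group per block, link paired with its score.
    print(group_by_parent(WRAPPED, fields))
```

With the flat layout the only common parent is `<tbody>`, so the sketch returns one group in which links and scores are separate lists. With a wrapper per block, each link stays paired with its score.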
The output for the Hacker News main page produced by competing scraping services looks approximately the same. Almost all of them use the same method of finding a common parent for elements in order to group them together.
The problem can be easily fixed with another approach.
Dataflow kit has a special `Path` option for selectors of the Link extractor type, intended for navigation purposes only. When the `Path` option is specified, no results from the current page are returned. Instead, all web pages behind the `Path` links are visited to extract detailed information.
In our case, we can choose the “Comments” field as the `Path` selector, as shown in the pictures below.
1. Add a new selector for the “comments” field with the corresponding CSS selector `.subtext a+ a`
2. Click “+” on the right to show additional control elements.
3. Check the “Path” option and click “Details”.
4. A detail page is shown, where you can specify all the CSS selectors needed to extract data. As you can see, the same information as on the main page, such as “Story Link”, “Score”, and “User”, plus the extra “comments” fields, can be found here.
5. Return to the main page by pressing the “Top-Left Arrow” and click the “Preview” button.
6. The Table view now shows some rows containing the extracted data. If the data has detailed fields, as in this case, it is better represented as a JSON structure in the Tree view.
Now we have a proper structure, with related elements tied together inside informative blocks, and no more mess.
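The `Path` idea from the steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Dataflow kit's code: the in-memory `PAGES` dictionary stands in for the web site, the `comments` class replaces the real `.subtext a+ a` selector, and the helper names `find_by_class` and `crawl_with_path` are hypothetical.

```python
import xml.etree.ElementTree as ET

# In-memory stand-in for the site (sample data, not real HN markup).
PAGES = {
    "/": (
        "<tbody>"
        '<a class="storylink" href="http://example.com">Story A</a>'
        '<span class="score">10 points</span>'
        '<a class="comments" href="/item1">discuss</a>'
        '<a class="storylink" href="http://example2.com">Story B</a>'
        '<span class="score">36 points</span>'
        '<a class="comments" href="/item2">discuss</a>'
        "</tbody>"
    ),
    "/item1": (
        '<div><a class="storylink" href="http://example.com">Story A</a>'
        '<span class="score">10 points</span></div>'
    ),
    "/item2": (
        '<div><a class="storylink" href="http://example2.com">Story B</a>'
        '<span class="score">36 points</span></div>'
    ),
}


def find_by_class(root, cls):
    """Return all elements under root whose class attribute equals cls."""
    return [el for el in root.iter() if el.get("class") == cls]


def crawl_with_path(fetch, start_url, path_class, fields):
    """Follow every path link on the start page; build one record per page."""
    main = ET.fromstring(fetch(start_url))
    records = []
    for link in find_by_class(main, path_class):
        detail = ET.fromstring(fetch(link.get("href")))
        record = {}
        for field in fields:
            matches = find_by_class(detail, field)
            record[field] = (matches[0].text or "").strip() if matches else None
        records.append(record)
    return records


if __name__ == "__main__":
    rows = crawl_with_path(PAGES.__getitem__, "/", "comments",
                           ["storylink", "score"])
    for row in rows:
        print(row)
```

Because each detail page describes exactly one story, every record is built from a single page, and the link and its score can no longer drift apart.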
2. Challenge. Extraction of informative blocks from the Hacker News web site.
As described above, we have to crawl through all 30 `Path` links found on the main page and extract information from the linked pages.
Unfortunately, our first attempt failed to return all 30 rows as expected, even though every fetch request came back with a successful 200 OK response.
So, what happened?
After investigating, we discovered that the Hacker News website always returns a 200 status code, even when something went wrong along the way.
In our case, some pages came back with something like this.
Web APIs usually return status code 429, which means too many requests were sent to the server. Hacker News, however, reports that it is rate limiting requests with status code 200.
Experimentally, we determined that 3 is the optimal number of concurrent requests to the HN web server from one IP address.
After reducing the number of concurrent fetchers, we successfully crawled all 30 detail pages and extracted all the needed information from them.
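The two lessons above (do not trust a 200 status on its own, and cap the number of concurrent fetchers) can be sketched like this. It is a minimal sketch, not the Dataflow kit fetcher: the `fetch` callable, the `required_marker` check, and the function names are assumptions made for illustration. The limit of 3 is the value reported above for HN from a single IP.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 3  # value found experimentally in the article for HN


def is_valid_page(status, body, required_marker):
    """Accept a page only if it is 200 AND contains the expected content."""
    return status == 200 and required_marker in body


def crawl(urls, fetch, required_marker, max_workers=MAX_CONCURRENT):
    """Fetch urls with at most max_workers requests in flight.

    fetch(url) must return a (status, body) tuple (assumed interface).
    Returns (good, suspect): good maps url -> body, suspect lists urls
    whose response did not contain the expected content.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        responses = list(pool.map(fetch, urls))  # keeps input order
    good, suspect = {}, []
    for url, (status, body) in zip(urls, responses):
        if is_valid_page(status, body, required_marker):
            good[url] = body
        else:
            suspect.append(url)
    return good, suspect


if __name__ == "__main__":
    # Fake responses: every page answers 200, but one lacks the content.
    fake = {
        "/item1": (200, '<a class="storylink">Story A</a>'),
        "/item2": (200, "unexpected body"),
        "/item3": (200, '<a class="storylink">Story B</a>'),
    }
    good, suspect = crawl(list(fake), fake.__getitem__, 'class="storylink"')
    print(sorted(good), suspect)
```

Checking for content that every detail page should contain avoids depending on the exact wording of the rate-limit notice. Suspect URLs can then be retried later at a lower rate.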
Choose the JSON format and click the `Launch` button to start data scraping. After the data extraction job finishes, press `Download` to fetch the results in the chosen format.
As you can see, 31 requests (1 main page + 30 `Path` links) were made to the Hacker News web site, and the job took about 33 seconds.
3. D.I.Y Challenge :)
Here is the link to the final Hacker News collection profile we’ve prepared for you to try.
Extract structured data from web sites. Web sites scraping. - slotix/dataflowkit (github.com)
You can easily customize settings such as pagination, or add and change the data field selectors to be scraped.
Every web site has its own unique structure. The scraping methods are the same for all resources, although some sites require an individual approach to data extraction.
We appreciate your feedback and comments.