Almost all web pages on the Internet contain some noisy blocks, such as navigation, sidebars, copyright information, privacy notices, and advertisements, which are not directly related to the topic of the page. It is important to distinguish the main content blocks from the noisy blocks. Let’s call them informative blocks. Extracting information from the informative blocks is the most important task for a web scraper. The child HTML elements inside each informative block are usually grouped into similar structures.

1. Challenge. Extraction of informative blocks from the Hacker News (HN) web site.

We will use the Dataflow Kit service to scrape data from the HN web site.

Open Hacker News main page

Enter the URL https://news.ycombinator.com/ into the address bar at the top of the left panel and click the button next to it to load the web page.

The scraping process is based on the data patterns you select. Click the “Add Selector” button and start selecting elements on the web page to define patterns for data extraction. For now, it is enough to choose only two CSS selectors from the blocks on the page. When you select one element, all similar elements with the same class in sibling informative blocks are added to the corresponding selector automatically. As a result, 30 “Story Links” with class .storylink and 29 “Scores” with class .score are highlighted (not every story on the page has a score).

Pressing the Preview button sends a request to the Dataflow Kit backend, which returns a sample of the output.

Figure: Parsed results returned by Dataflow Kit in CSV and JSON formats.

Unfortunately, we’ve got two independent lists of “Scores” and “Story Links”. The resulting fields are not grouped as expected! We expected each “Story Link” to be paired with its corresponding “Score”. So what’s wrong? Let’s look at the following code describing a block that contains the fields mentioned above. I’ve omitted some HTML elements from the real code for brevity.
    <tbody>
      <tr> ...
        <td>
          <a href="http://example.com" class="storylink">Recreating the Death Star Trench Run Scene with Lego</a>
        </td>
      </tr>
      <tr> ...
        <td>
          <span class="score">10 points</span>
        </td>
      </tr>
      <tr> ...
        <td>
          <a href="http://example2.com" class="storylink">Show HN: JournalBook – Privacy centric, offline first, personal journal app</a>
        </td>
      </tr>
      <tr> ...
        <td>
          <span class="score">36 points</span>
        </td>
      </tr>
    </tbody>

In this particular case the common parent of all these elements is <tbody>, and there is no parent element that joins together the elements of a single informative block.

The problem is that our scraping algorithm combines fields into a block according to their common parent node in the DOM tree. Here every “Story Link” and every “Score” sits in its own <tr> row, and all of those rows are sibling nodes under <tbody>, although visually they seem to be grouped together inside similar blocks.

Output from the Hacker News main page provided by competing scraping services looks approximately the same. Almost all of them use the same method of finding a common parent for elements in order to group them.

The problem can be easily fixed with another approach. Dataflow Kit has a special `Path` option for selectors of the Link extractor type, which is intended for navigation purposes only. When the `Path` option is specified, no results from the current page are returned. Instead, all web pages behind the `Path` links are visited to extract detailed information.

In our case we can choose the “comments” field as the `Path` selector.

Figure: Select “Path”.

Selectors on the main page

1. Add a new selector for the field “comments” with the corresponding CSS selector `.subtext a+ a`.
2. Click “+” on the right to show additional control elements.
3. Check the “Path” option and click on “Details”.

Add selectors on detailed page

4. The detailed page is shown, where you can specify all the CSS selectors needed to extract data. As you can see, the same information as on the main page, like “Story Link”, “Score”, and “User”, plus the extra “comments” fields, can be found here.
5. Return to the main page by pressing the “Top-Left Arrow” and click the “Preview” button.
6. You can see here some rows of extracted data. If the data has detailed fields, as in this case, it can be represented even better in a JSON structure.
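To make the grouping problem from the first challenge concrete, here is a minimal Python sketch. It is not how Dataflow Kit works internally and it is not the `Path` approach; it only illustrates that, when story links and scores live in sibling rows under one <tbody>, they can still be paired by document order instead of by common parent. The names `RowOrderPairer` and `pair_stories` are made up for this example, and the third sample row is invented to show a story without a score.

```python
from html.parser import HTMLParser

# Simplified markup modelled on the snippet above.
# The third row is invented for illustration: a story with no score.
SAMPLE = """
<table><tbody>
<tr><td><a href="http://example.com" class="storylink">Recreating the Death Star Trench Run Scene with Lego</a></td></tr>
<tr><td><span class="score">10 points</span></td></tr>
<tr><td><a href="http://example2.com" class="storylink">Show HN: JournalBook</a></td></tr>
<tr><td><span class="score">36 points</span></td></tr>
<tr><td><a href="http://example3.com" class="storylink">A story without a score</a></td></tr>
</tbody></table>
"""


class RowOrderPairer(HTMLParser):
    """Attach each .score to the most recent .storylink in document order."""

    def __init__(self):
        super().__init__()
        self.stories = []   # one dict per story, in page order
        self._field = None  # "title" or "score" while inside a matching tag

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "a" and "storylink" in classes:
            # A new story starts here; its score (if any) follows later.
            self.stories.append(
                {"title": "", "url": attr_map.get("href"), "score": None}
            )
            self._field = "title"
        elif tag == "span" and "score" in classes and self.stories:
            self._field = "score"

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None

    def handle_data(self, data):
        if self._field == "title":
            self.stories[-1]["title"] += data
        elif self._field == "score":
            current = self.stories[-1]["score"] or ""
            self.stories[-1]["score"] = current + data


def pair_stories(html):
    """Return a list of {"title", "url", "score"} dicts parsed from html."""
    parser = RowOrderPairer()
    parser.feed(html)
    parser.close()
    for story in parser.stories:
        story["title"] = story["title"].strip()
        if story["score"] is not None:
            story["score"] = story["score"].strip()
    return parser.stories


if __name__ == "__main__":
    for story in pair_stories(SAMPLE):
        print(story)
```

Pairing by order relies on the assumption that a score always follows the story it belongs to, which holds for the simplified markup above but is fragile if the page layout changes. Visiting the detail pages, as described in the steps above, avoids that assumption.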
Figures: Table view and Tree view.

So we’ve received a proper structure with the elements of each informative block tied together, and no more mess.

2. Challenge. Extraction of informative blocks from the Hacker News web site.

As described above, we have to crawl through all 30 `Path` links found on the main page and extract information from the linked pages.

Unfortunately, we failed to get all 30 rows in our first attempt, although all fetch requests returned successful 200 OK responses. So, what happened?

After investigation, we discovered that the Hacker News web site returned a 200 status code even when something went wrong on the way. In our case some pages came back with a notice instead of the expected content.

Figure: Message returned by HN.

Usually Web APIs return status code 429, which means that too many requests were sent to the server. Hacker News, however, reports its request rate limit with status code 200.

Experimentally we have determined that 3 is an optimal number of concurrent requests to the HN web server from one IP. After reducing the number of concurrent fetchers, we successfully crawled all 30 detailed pages and extracted all the needed information from them.

Choose the JSON format and click the `Launch` button to start data scraping. After the data extraction job has finished, press `Download` to fetch the results in the chosen format.

Figure: Launch data extraction and Download results.

As you can see, 31 requests (1 main page + 30 `Path` links) were made to the Hacker News web site, and the job took about 33 seconds.

3. D.I.Y Challenge :)

Here is the link to the final Hacker News collection profile we’ve prepared for you to try: news.ycombinator.collection.json. Dataflow Kit itself is available as slotix/dataflowkit on github.com.

Just download this collection and import it. Look at https://dataflowkit.com/collections for more details about the export/import feature.

You can easily customize all settings, such as pagination, or add and change the data field selectors to be scraped.

Summary.

Every web site has its own unique structure. The general methods of scraping are the same for all resources, although some sites require an individual approach to data extraction.

We appreciate your feedback and comments. Happy Scraping!
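As a footnote to the second challenge, here is a rough Python sketch of one way to cope with a site that signals rate limiting with a 200 status code. It is not Dataflow Kit's implementation. It assumes that a properly served page can be recognized by a caller-supplied check (`looks_valid`), caps concurrency at the value of 3 reported above, and retries with a growing delay. All function names here are hypothetical, and the fetch function is passed in so the logic can be exercised without touching the network.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# The article found 3 concurrent requests from one IP to work for HN.
MAX_CONCURRENT = 3


def http_fetch(url):
    """Fetch url with the standard library; returns (status, body text)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def fetch_with_check(url, fetch, looks_valid, retries=3, delay=1.0):
    """Fetch url, retrying when a response lacks the expected content.

    A 200 status alone is not trusted: looks_valid(body) must also hold.
    """
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status == 200 and looks_valid(body):
            return body
        if attempt < retries:
            # Wait longer after each failed attempt before retrying.
            time.sleep(delay * (2 ** attempt))
    raise RuntimeError(f"no valid page for {url} after {retries + 1} attempts")


def crawl(urls, fetch, looks_valid, delay=1.0):
    """Fetch all urls with at most MAX_CONCURRENT requests in flight."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        bodies = pool.map(
            lambda u: fetch_with_check(u, fetch, looks_valid, delay=delay),
            urls,
        )
        return dict(zip(urls, bodies))
```

A call might look like `crawl(detail_urls, http_fetch, lambda body: 'class="storylink"' in body)`, where `detail_urls` is your own list of links and the validity check is an assumption based on the selectors used in this article; pick whatever marker reliably appears on the pages you expect.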
