
How To Scrape Wikipedia By Using Puppeteer and Nodejs

by Tyler Joseph, January 6th, 2021

Too Long; Didn't Read

In this article, we'll go through scraping a Wikipedia table with COVID-19 data using Puppeteer and Node.js. The original article that I used for this project is located here. After scraping the data, I created an HTML file and displayed the data using Google Charts. The project was created by a front-end developer and owner of a nonprofit called STEM Effect. The code is very hacky, but at the end of the day, I just wanted an easier way to consume the data that I had just mined.


In this article, we'll go through scraping a Wikipedia table with COVID-19 data using Puppeteer and Node.js. The original article that I used for this project is located here.

I have never scraped a website before. I've always seen it as a hacky thing to do. But after going through this little project, I can see the value of something like this. Data is hard to find, and if you can scrape a website for it then, in my opinion, by all means do it.

Setup

Setting up this project was extremely easy. All you have to do is install Puppeteer with the command:

npm install puppeteer

There was one confusing issue I had during setup, however. The puppeteer package was not unzipped correctly when I initially installed it. I found this out while running the initial example in the article. If you get an error that states "Failed to launch browser process" or something similar, follow these steps:

  1. Unzip chrome-win from node_modules/puppeteer/.local-chromium/
  2. Then add that folder to the win64 folder in that same .local-chromium folder.
  3. Make sure chrome.exe is in this path: node_modules/puppeteer/.local-chromium/win64-818858/chrome-win/chrome.exe
  4. This is for Windows specifically. Mac might be similar, but I'm not sure.

Here is the link that led me to the answer. It might be a good idea to do this no matter what to make sure everything is functioning properly.
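To confirm the fix took, a small check like the one below can report whether chrome.exe sits where step 3 expects it. This is only a sketch: the helper names chromePath and chromeIsInstalled are made up for illustration, and the revision folder win64-818858 is simply the one quoted in the steps above, so yours may differ.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: builds the path from step 3 for a given project root.
// The revision folder name is an assumption; check what your install created.
const chromePath = (projectRoot, revision = 'win64-818858') =>
    path.join(projectRoot, 'node_modules', 'puppeteer', '.local-chromium',
        revision, 'chrome-win', 'chrome.exe');

// Returns true only if chrome.exe exists at the expected location.
const chromeIsInstalled = (projectRoot, revision) =>
    fs.existsSync(chromePath(projectRoot, revision));

// Report the result for the folder the script is run from.
console.log(chromePath(process.cwd()), '->',
    chromeIsInstalled(process.cwd()) ? 'found' : 'missing');
```

Run it from the project root (the folder that contains node_modules). If it prints "missing", the unzip steps above are worth repeating.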

The code

I had to make a couple of small changes to the existing code.

First example

The first example didn't work for me. To fix the problem, I assigned the async function to a variable and then invoked it after the definition. I'm not sure this is the best way to handle the issue, but hey, it works. Here is the code:

const puppeteer = require('puppeteer');

const takeScreenShot = async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.stem-effect.com/');
    await page.screenshot({path: 'output.png'});

    await browser.close();
};

takeScreenShot();
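One thing the version above leaves open is what happens if page.goto fails: the browser would never be closed. A possible variation, sketched below, wraps the work in try/finally. So that the sketch runs without Puppeteer installed, the launcher is passed in as a parameter; launch, takeScreenShotSafely and outputPath are hypothetical names, and in the real script you would pass () => puppeteer.launch().

```javascript
// `launch` is a hypothetical injected function that resolves to a browser,
// e.g. () => require('puppeteer').launch() in the real script.
const takeScreenShotSafely = async (launch, url, outputPath) => {
    const browser = await launch();
    try {
        const page = await browser.newPage();
        await page.goto(url);
        await page.screenshot({ path: outputPath });
    } finally {
        // Runs whether or not the steps above threw.
        await browser.close();
    }
};

// Usage with Puppeteer (commented out so the sketch has no dependencies):
// const puppeteer = require('puppeteer');
// takeScreenShotSafely(() => puppeteer.launch(), 'https://www.stem-effect.com/', 'output.png')
//     .catch(console.error);
```

The .catch at the call site matters too: without it, a rejected promise from the async function only shows up as an unhandled rejection warning.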

Wikipedia scraper

I also had an issue with the Wikipedia scraper code. For some reason, I was getting null values for the country names. This screwed up all of my data in the JSON file I was creating.

Also, the scraper was 'scraping' every table on the Wikipedia page. I didn't want that. I only wanted the first table with the total number of cases and deaths caused by COVID-19. Here is the modified code I used:

const puppeteer = require('puppeteer');
const fs = require('fs')

const scrape = async () =>{
    const browser = await puppeteer.launch({headless : false}); //browser initiate
    const page = await browser.newPage();  // opening a new blank page
    await page.goto('https://en.wikipedia.org/wiki/2019%E2%80%9320_coronavirus_pandemic_by_country_and_territory', {waitUntil : 'domcontentloaded'}); // navigate to the URL and wait for the DOMContentLoaded event (the DOM is parsed; images and other resources may still be loading)

    // Selected table by aria-label instead of div id
    const recordList = await page.$$eval('[aria-label="COVID-19 pandemic by country and territory table"] table#thetable tbody tr',(trows)=>{
        let rowList = []    
        trows.forEach(row => {
                let record = {'country' : '','cases' :'', 'death' : '', 'recovered':''}
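                const anchor = row.querySelector('a'); // (tr > th > a) anchor tag text contains country name
                record.country = anchor ? anchor.innerText : ''; // guard: a row without a link would otherwise throw on .innerText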
                const tdList = Array.from(row.querySelectorAll('td'), column => column.innerText); // getting textvalue of each column of a row and adding them to a list.
                record.cases = tdList[0];        
                record.death = tdList[1];       
                record.recovered = tdList[2];   
                if(tdList.length >= 3){         
                    rowList.push(record)
                }
            });
        return rowList;
    })
    console.log(recordList)
    // Commented out screen shot here
    // await page.screenshot({ path: 'screenshots/wikipedia.png' }); //screenshot 
    await browser.close(); // wait for the browser to shut down before writing the file

    // Store output
    fs.writeFile('covid-19.json',JSON.stringify(recordList, null, 2),(err)=>{
        if(err){console.log(err)}
        else{console.log('Saved Successfully!')}
    })
};
scrape();

I wrote comments on the subtle changes I made, but I'll also explain them here.

First, instead of identifying the table I wanted by the div#covid19-container, I pinpointed it with its aria-label. This was a little more precise. Originally, the code was scraping all of the tables on the page because the IDs were the same (I know, not a good practice. That's what classes are for, right?). Identifying the table via aria-label helped ensure that I only scraped the exact table I wanted, at least in this scenario.
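The row handling inside page.$$eval can also be pulled out into a named function, which makes the selector and a guard against missing country names easier to see and to test without a browser. This is a sketch, not the code from the original article: TABLE_ROWS, rowsToRecords and scrapeTable are names made up here, and skipping rows that have no link is an assumption about what the table's header and footer rows look like.

```javascript
// Hypothetical constant: the aria-label selector quoted from the scraper above.
const TABLE_ROWS =
    '[aria-label="COVID-19 pandemic by country and territory table"] table#thetable tbody tr';

// Runs in the page context via page.$$eval, so it must not use outside variables.
const rowsToRecords = (trows) => {
    const records = [];
    for (const row of trows) {
        const anchor = row.querySelector('a');
        const cells = Array.from(row.querySelectorAll('td'), (td) => td.innerText.trim());
        // Assumption: rows without a country link or with fewer than
        // three data cells are headers or footers, so they are skipped.
        if (!anchor || cells.length < 3) continue;
        records.push({
            country: anchor.innerText.trim(),
            cases: cells[0],
            death: cells[1],
            recovered: cells[2],
        });
    }
    return records;
};

// `page` is any object exposing Puppeteer's page.$$eval(selector, fn).
const scrapeTable = (page) => page.$$eval(TABLE_ROWS, rowsToRecords);
```

Inside the scraper, `const recordList = await scrapeTable(page);` would then stand in for the inline callback.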

Second, I commented out the screenshot command. It broke the code for some reason and I didn't see the need for it if we were just trying to create a JSON object from table data.
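One possible cause of the screenshot failure (a guess, not something confirmed here) is that the screenshots/ folder did not exist yet, since writing a file into a missing folder fails in Node. If you want to keep the screenshot, a sketch like the following creates the folder first. The helper name ensureParentDir is hypothetical.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: creates the parent folder of `filePath` if needed
// and returns `filePath` so it can be passed straight to page.screenshot.
const ensureParentDir = (filePath) => {
    fs.mkdirSync(path.dirname(filePath), { recursive: true });
    return filePath;
};

// Usage inside the scraper (assumes `page` is a Puppeteer page):
// await page.screenshot({ path: ensureParentDir('screenshots/wikipedia.png') });
```

With recursive set to true, mkdirSync does nothing if the folder is already there, so the call is safe to repeat on every run.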

Lastly, after I obtained the data from the correct table, I wanted to actually use it in a chart. I created an HTML file and displayed the data using Google Charts. You can see the full project on my GitHub if you are curious. Fair warning, I got down and dirty (very hacky) putting this part together, but at the end of the day, I just wanted an easier way to consume the data that I had just mined. There could be a whole separate article on the amount of refactoring that can be done on my HTML page.
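The scraped values are strings such as "1,234", while a chart needs numbers. A conversion step could look roughly like this. It is a sketch under assumptions: parseCount and toChartRows are hypothetical names, the counts are assumed to be whole numbers, and this is not necessarily how the chart page in the GitHub project does it.

```javascript
// Hypothetical helper: turns a scraped string such as "1,234[a]" into 1234.
// Returns null when the cell holds no digits (e.g. "N/A" or an empty string).
// Assumes whole-number counts; a decimal point would be stripped as well.
const parseCount = (text) => {
    const digits = String(text)
        .replace(/\[[^\]]*\]/g, '') // drop footnote markers like [a] or [12]
        .replace(/[^0-9]/g, '');    // drop commas, spaces and other symbols
    return digits === '' ? null : Number(digits);
};

// Builds a header row plus one [country, cases, deaths] row per record.
const toChartRows = (records) => [
    ['Country', 'Cases', 'Deaths'],
    ...records.map((r) => [r.country, parseCount(r.cases), parseCount(r.death)]),
];
```

The result is a header row followed by data rows, which is the array-of-arrays shape that Google Charts' arrayToDataTable takes as input.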

Conclusion

This project was really fun. Thank you to the author, Mohit Maithani, for putting it together. It opened my eyes to the world of web scraping and a whole new realm of possibilities! At a high level, web scraping enables you to grab data from anywhere you want.

As one of my favorite YouTubers, Ben Sullins, likes to say, "When you free the data, your mind will follow."

Love y'all. Happy coding!

Also published at https://dev.to/tyry327/scraping-wikipedia-for-data-using-puppeteer-and-node-1f0l