From a technical marketer’s perspective, scraping and automation libraries are extremely important to learn. Here’s an introduction to two of the most widely used web scraping libraries in Node.js.

When I talk to developers, they always find it weird that I love web scraping. This is mainly because of two reasons:

* Scraping is an unstable and unreliable way to pull data from a source, compared to APIs.
* In terms of code, building scrapers means writing code that doesn’t necessarily follow best practices such as reusability. The code is usually tightly coupled to a specific use case.

But the thing is, when a marketer starts learning to code, a ton of scraping use cases immediately come to mind. Much of the work that marketers dream of automating can’t be achieved with official APIs. Getting rid of some of the manual work of extracting information from the web is very tempting. I would argue that for a marketer, scraping and automation are among the most common use cases for coding skills.

Recently I had the chance to work with Puppeteer and Cheerio, and to switch between the two, so here’s a marketer’s perspective on when to use each of them.

Puppeteer

Puppeteer is an open source Node library developed by Google. It is basically a way to launch a browser from Node and automate actions in Chrome. The main use case of Puppeteer is automation.

It’s not always simple to scrape data. Take, for example, my Product Hunt scraper, Hunt. In Product Hunt, upvoter information is not readily available in the page’s HTML when you first load it. Before you can access the full upvoter list, you have to:

* Click the upvoters panel.
* Scroll all the way to the end of the list.

To do so, you need a tool that can automate actions in the browser. That’s what Puppeteer is for. Use Puppeteer when you need to log in to get data, or when you need to perform automated actions in the browser.

Cheerio

Cheerio is another NPM library, also called “jQuery for Node”.
It lets you scrape data with a lightweight, simple and quick framework. Cheerio works with the raw HTML you feed into it, similar to Python’s Beautiful Soup, if you’re familiar with it. That means that if the data you need can be extracted from the HTML returned by a URL, it is very simple to work with in Cheerio.

Below is code that can be used to extract information from Twitter about a list of users (by Twitter tag).

    const axios = require("axios");
    const cheerio = require("cheerio");

    // This function uses axios to get the HTML data on a given URL.
    // It's also possible to do the same using the fetch method.
    const getHtml = async url => {
      const link = await axios.get(url);
      return link.data;
    };

    // This is a Node module that uses the above function and Cheerio to extract
    // Twitter data from a list of user tags (used in the backend of Hunt).
    module.exports = async function run(userList) {
      const enrichedUsers = [];
      for (const user of userList) {
        try {
          const $ = cheerio.load(await getHtml(`https://twitter.com/${user.tag}`));
          // Here we extract the relevant information from each Twitter page:
          // followers number, description, and Twitter URL.
          const followers = $(".ProfileNav-item.ProfileNav-item--followers > a > span.ProfileNav-value").text();
          const description = $(".ProfileHeaderCard-bio").text();
          const url = `https://twitter.com/${user.tag}`;
          // Create a user object with existing info and the new info we've extracted from Twitter.
          const enrichedUser = {
            tag: user.tag,
            name: user.name,
            profile: user.profile,
            twitterDescription: description,
            twitterFollowers: followers,
            pageUrl: url,
            messagedAndFollowed: false
          };
          // Push the new user object into the enrichedUsers array.
          enrichedUsers.push(enrichedUser);
        } catch (e) {
          // This is not a good way to handle errors, but I didn't want to get into
          // error handling here and it works for the sake of this tutorial.
          continue;
        }
      }
      return enrichedUsers;
    };

Cheerio VS Puppeteer

The two libraries have different use cases but are often seen as the two main options for JS scraping. If I had to choose, I would argue that if there’s no need for Puppeteer’s automation capabilities, it is more efficient and better practice to use Cheerio.

While working on Hunt, I built two scrapers: one for Product Hunt and one for Twitter. I initially built both with Puppeteer, and I noticed a lot of performance issues when trying to scrape a large list of users from Twitter (including memory errors on the Heroku server). It took Puppeteer about 10 minutes to finish scraping 1,000 upvoters. I then rewrote the Twitter bot in Cheerio (as described above) and saw a performance boost of around 5x or more: the new code took about 2 minutes (or less) to finish scraping.
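One loose end from the Twitter scraper above: it skips any profile that fails with a bare `continue`, which its own comment admits is not great error handling. Here is a minimal sketch of one alternative, assuming you are happy to keep a list of the failures and look at them later. The names `enrichAll` and `enrichOne` are hypothetical and not part of Hunt; `enrichOne` stands in for whatever async function fetches and parses a single profile.

```javascript
// Hypothetical helper (not from Hunt): runs an async `enrichOne(user)` for each
// user, one at a time, and records failures instead of silently skipping them.
async function enrichAll(users, enrichOne) {
  const enriched = [];
  const failed = [];
  for (const user of users) {
    try {
      enriched.push(await enrichOne(user));
    } catch (e) {
      // Keep the tag and the error message so failed profiles can be retried or inspected.
      failed.push({ tag: user.tag, reason: e.message });
    }
  }
  return { enriched, failed };
}
```

With something like this, the scraper could return both lists, so a failed request or a changed page layout shows up in `failed` instead of quietly shrinking the result.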
Summary

Both tools let you use Node for automation and scraping in ways that marketers usually associate with Python. These tools are another example of how learning JavaScript might be a pain in the ass, but can eventually give you a deeper and more holistic knowledge of web development. As a marketer, you can probably think of many ways to use both, and I recommend that you go for it. If you’re learning something new, you might as well create something useful!
