Data Scraping in Node.js 101

by Danny Denenberg, October 2nd, 2019
How to gather data without those pesky databases.

Web scraping is a great way to create dynamic websites without having to contact a database for information.

To get started with web scraping, you should know how a website is structured. If you right-click on a page and click Inspect (in Chrome), you can open the developer tools.

This shows you the structure of the website’s HTML/CSS/JavaScript code, as well as network performance, errors, security, and much more.

Now, let’s say I want to grab the first image that you see on Twitter programmatically in the JavaScript console.

Well, I could right-click on the image, click inspect, right-click on the element in the dev tools, and copy the CSS selector.

Then, I could run

document.querySelector(<<SELECTOR>>).src

and that would give me the URL of the image I want, which I could then use on a web page, for example:
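(In this sketch, the my-image id is a hypothetical element on your own page.)

const url = document.querySelector(<<SELECTOR>>).src; // the scraped image URL
document.querySelector("#my-image").src = url; // display it in an element on your own page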

This is web scraping! I was able to gather data (an image) from a website without having access to the database. But this is super tedious and slow, so to scrape more efficiently, I use Node.js and Puppeteer.

If you don’t already know, Node.js is a runtime environment that allows JavaScript to be run on the server side. And Puppeteer is a ‘headless Chrome Node API’ written by Google (basically, it lets you write DOM JavaScript code on a server).

Just an FYI: because I love TypeScript, I will be using that language for this project. If you want to use TypeScript, please install it on your system. If running

tsc -v

yields the TypeScript version in the terminal, you're good to go!
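If you don't have it yet, one common way to install it (assuming npm is already on your system) is globally via npm:

npm i -g typescript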

Okay, to start off, make sure you have Node.js and npm (Node Package Manager) installed on your system. If you get a

command not found

error (or something similar) when running either of the following commands, I suggest that you look at this article on how to install Node.

$ npm -v  # should be 6.0.0 or higher
$ node -v # should be 9.0.0 or higher

Great! Let’s start a new project and install the dependencies:

$ mkdir Web-Scraping-101 && cd Web-Scraping-101 
$ npm init # go through all defaults 
$ npm i puppeteer # the google npm scraping package 
$ tsc --init # initialize typescript 
$ npm i @types/puppeteer # type declarations

Now, open the folder in the text editor of your choice. Edit the outDir option in the tsconfig.json file to be ./build and uncomment that line, so it looks like this:
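The relevant line in tsconfig.json should end up looking roughly like this (the exact comment text may differ between TypeScript versions):

"outDir": "./build", /* Redirect output structure to the directory. */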

Create a new file in the root of the folder:

touch app.ts

In app.ts add:

console.log("Twitter, here we come");

To run this, in the terminal, write:

tsc && node build/app.js

Note: tsc builds all of the TypeScript files into the outDir directory defined in the config file, and node runs a single JavaScript file.

If you see "Twitter, here we come" appear in the terminal, you’ve got it working!

Now, we will start to actually scrape using Puppeteer. Add this boilerplate Puppeteer code to the app.ts file:

import puppeteer from "puppeteer"; // import the npm package that we installed

(async () => {
  // the rest of the code must be enclosed in an `async` function to be able to `await` for results
  const browser = await puppeteer.launch(); // launches an "invisible" chromium browser
  const page = await browser.newPage(); // takes the browser to a new tab (page)
  await page.goto("https://example.com"); // takes the page to a specific url

  // Get the "viewport" of the page,
  // as reported by the page.
  // NOTE: Anything inside of the `evaluate` function is DOM manipulation.
  // No variables outside of the evaluate function can go in, and none can come out without being returned inside of the return object.
  const dimensions = await page.evaluate(() => {
    return {
      // use DOM manipulation to access the width and height of the page
    // if you want to get elements out of the DOM and into the Node.js code, return them here
      width: document.documentElement.clientWidth,
      height: document.documentElement.clientHeight,
      deviceScaleFactor: window.devicePixelRatio
    };
  });

  // print out the DOM data
  console.log("Dimensions:", dimensions);

  // remember to close the browser (the invisible Chromium)
  await browser.close();
})();

Please read through the commented code above to get a feel for what is going on.

Now that you can see how we can travel to a web page, gather info using DOM manipulation, and bring that info back to the Node.js program, we are ready to scrape Twitter.

First, edit the await page.goto("https://example.com") to be await page.goto("https://twitter.com").

Next, we need to be able to get the posts from the middle column (the actual Twitter feed). After some investigating, I found that this selector is the one that actually selects the div for the middle-column feed:

document.querySelector("#react-root > div > div > div > main > div > div.css-1dbjc4n.r-aqfbo4.r-1niwhzg.r-16y2uox > div > div.css-1dbjc4n.r-14lw9ot.r-1tlfku8.r-1ljd8xs.r-13l2t4g.r-1phboty.r-1jgb5lz.r-1ye8kvj.r-13qz1uu.r-184en5c > div > div > div.css-1dbjc4n.r-1jgb5lz.r-1ye8kvj.r-6337vo.r-13qz1uu > div > section > div > div > div");

// the above returns the div for the middle column twitter feed

To get all of the images from the middle column, I ended up doing this for the page.evaluate() function:

const dimensions = await page.evaluate(() => {
  const sources: string[] = []; // an array of the links to each image
  document
    .querySelectorAll<HTMLImageElement>(
      "#react-root > div > div > div > main > div > div.css-1dbjc4n.r-aqfbo4.r-1niwhzg.r-16y2uox > div > div.css-1dbjc4n.r-14lw9ot.r-1tlfku8.r-1ljd8xs.r-13l2t4g.r-1phboty.r-1jgb5lz.r-1ye8kvj.r-13qz1uu.r-184en5c > div > div > div.css-1dbjc4n.r-1jgb5lz.r-1ye8kvj.r-6337vo.r-13qz1uu > div > section > div > div > div img"
    )
    .forEach(img => {
      if (img.src) {
        sources.push(img.src); // push the image URL, not the element itself
      }
    });

  return {
    sources
  };
});

Now, if I want to compile a list of all of the image sources and print them out to the console, all I have to do is write this outside of the page.evaluate() function:

console.log(dimensions.sources);

There you go! You’ve just scraped image data from a Twitter feed.

A final challenge would be to take this data and integrate it into an Express.js server so that, when a user goes to the root site, they are presented with all of these scraped images.
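Here is a minimal sketch of what that could look like (assuming you add Express with npm i express @types/express; the scrapeImages helper is hypothetical and simply wraps the Puppeteer code from above, with the long feed selector shortened to a plain img query for brevity):

import express from "express";
import puppeteer from "puppeteer";

// Hypothetical helper wrapping the Puppeteer scraping code from above.
// For brevity it grabs every <img> on the page; swap in the feed selector if you only want the middle column.
async function scrapeImages(): Promise<string[]> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://twitter.com");
  const sources = await page.evaluate(() =>
    Array.from(document.querySelectorAll("img"), img => img.src).filter(src => src)
  );
  await browser.close();
  return sources;
}

const app = express();

app.get("/", async (_req, res) => {
  const sources = await scrapeImages(); // scrape on every request -- fine for a demo
  // present each scraped image to the visitor as a plain <img> tag
  res.send(sources.map(src => `<img src="${src}" />`).join("\n"));
});

app.listen(3000, () => console.log("Listening on http://localhost:3000"));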

Thanks for reading!