My Journey Building a Scraper with Ruby

Salvador Olvera (@Salvador-ON)

Last week I finished the Ruby curriculum at Microverse, so I was ready to build my Capstone Project, a solo project that comes at the end of each of Microverse's technical curriculum sections.
Building this project is very important because it is a real-world-like project with business specifications, and after we submit it we get feedback on the technical and soft skills we gained during the section.
My assignment was to develop a web scraper in Ruby for a website of my choice.

The first question I had was:

What is a Web Scraper?

A web scraper is a program that retrieves data from a website (this process is called “scraping”).
So if you’ve ever copied and pasted information from a website, you’ve performed the same function as a web scraper, only manually.

How does a Scraper Bot work?

The first thing the scraper needs is a URL. Then the scraper loads the entire HTML code for the page in question. Some scrapers can also render the entire website, including CSS and JavaScript elements.
Next, the scraper extracts either all the data on the page or only the specific data that the user selected before running it.
Finally, the web scraper outputs all the collected data in a format that is more useful to the user.

Building Time

Let's talk about how we can build our scraper with Ruby in 3 steps.

Parsing the HTML

The first thing we need to do is parse the HTML. Sometimes this task can be a little difficult if we don’t have the right tools, but Ruby has an amazing gem called Nokogiri that helps us parse the HTML of a web page.
For my project, I decided to build my bot to scrape the Hackernoon coding section.
When I started targeting the different CSS selectors, I found a problem: I was not retrieving any information for the selectors I targeted. I started looking for the cause and found that the Hackernoon website uses JavaScript to load some of its content, and that content only appears a few seconds after the page opens. My bot was trying to get the information before it was loaded.
In my case, I needed two more gems, Watir and Webdrivers, to drive a real browser and get the information behind my CSS selectors.
require 'nokogiri'
require 'open-uri'
require 'webdrivers'
require 'watir'

browser = Watir::Browser.new
browser.goto 'https://hackernoon.com/tagged/ruby'

Targeting and Getting the Information

After the HTML is parsed, we need to get the information, and for that we need to tell our bot what information we want so it can search for it. To accomplish that we can again use the Nokogiri gem, which allows us to target any specific piece of information using CSS selectors or XPath expressions that match the data we want to get.
# Wait until the element that JavaScript loads is present, then parse the rendered HTML.
browser.element(css: 'div#stats').wait_until(&:present?)
parsed_page = Nokogiri::HTML(browser.html)
To get the information, I use the wait_until method to wait until the CSS selector appears on the web page.
Once everything is loaded on the web page, we can start getting all the target information we want.
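To make the idea behind wait_until more concrete, here is a rough sketch of a polling loop in plain Ruby. It only illustrates the concept and is not Watir's actual implementation; the helper name wait_for and its timeout and interval parameters are assumptions made up for this example.

```ruby
# Hypothetical helper (not part of Watir): calls the block repeatedly until
# it returns a truthy value, or raises once the timeout (in seconds) runs out.
def wait_for(timeout: 5, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end

    sleep interval
  end
end

# Simulated page: the content only "appears" on the third check.
checks = 0
content = wait_for(timeout: 2, interval: 0.01) do
  checks += 1
  checks >= 3 ? 'stats loaded' : nil
end
puts content
```

With a real browser, the block would check whether the element is present instead of counting attempts.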
titles = parsed_page.css('div.stories-item')
art_in_page = titles.count
With this code we get the count of all the stories, that is, all the div elements that have the stories-item class.
# s_arr (the search keywords), titles_arr and ref_arr are defined elsewhere in the project.
parsed_page.css('div.stories-item').each do |title|
  list_title = title.css('h2 a').text
  url = 'https://hackernoon.com'
  list_ref = url + title.css('h2 a')[0].attributes['href'].text
  list = { title: list_title, ref: list_ref }
  # Keep the story only if every keyword appears as a word in its title.
  if s_arr.all? { |i| list[:title].downcase.split.include?(i) }
    titles_arr.push(list[:title])
    ref_arr.push(list[:ref])
  end
end
Here is another example, where I target the title and the link of each story and save that information in arrays when the title matches the keywords of the search.
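The keyword check can be tried on its own, without any scraping. This is a minimal sketch with hypothetical sample data; the names stories, keywords, and matches are only for illustration and do not come from my project.

```ruby
# Hypothetical sample data standing in for scraped stories.
stories = [
  { title: 'Building a Scraper with Ruby', ref: 'https://example.com/scraper' },
  { title: 'Ruby on Rails Tips', ref: 'https://example.com/rails' },
  { title: 'Learning JavaScript', ref: 'https://example.com/js' }
]

# Keywords typed by the user, already downcased.
keywords = %w[ruby scraper]

# Keep a story only if every keyword appears as a whole word in its title.
matches = stories.select do |story|
  words = story[:title].downcase.split
  keywords.all? { |keyword| words.include?(keyword) }
end

matches.each { |story| puts "#{story[:title]} -> #{story[:ref]}" }
```

Because the title is split on whitespace, a word followed by punctuation (for example "Ruby:") would not match the keyword "ruby". Stripping punctuation first would be one way to make the match more forgiving.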

Storing the information

Once we have all the information in our local environment, we are ready to store it in the file format of our preference, such as HTML, XML, CSV, or a database. You can see how I built this part, which creates a very simple web page with the results, in my repository.
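As an illustration of the HTML option, here is a minimal sketch that renders a list of results with ERB from Ruby's standard library. It is not the code from my repository; the sample results and the results.html file name are assumptions for this example.

```ruby
require 'erb'

# Hypothetical results, in the same shape as the title and link pairs collected earlier.
results = [
  { title: 'Ruby & Scrapers', ref: 'https://example.com/ruby-scrapers' },
  { title: 'Parsing HTML', ref: 'https://example.com/parsing-html' }
]

# A tiny template: one list item with a link per story.
template = ERB.new(<<~HTML)
  <ul>
  <% results.each do |story| %>
    <li><a href="<%= ERB::Util.html_escape(story[:ref]) %>"><%= ERB::Util.html_escape(story[:title]) %></a></li>
  <% end %>
  </ul>
HTML

# Render the template with the local variables above and save the page.
page = template.result(binding)
File.write('results.html', page)
puts page
```

Escaping the scraped text with ERB::Util.html_escape keeps characters such as & or < in a title from breaking the generated page.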

Navigating to other web pages

browser.link(aria_label: 'Next').click if page < last_page
On the Hackernoon website it was not possible to navigate to the other pages by changing the URL. So, to change pages, I used the link and click methods to tell the bot where it needs to click to go to the next page and continue scraping.
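The line above usually lives inside a loop over the pages. Here is a rough sketch of such a loop; the scrape_pages helper and the FakeBrowser class are hypothetical and exist only so the example runs without opening a real browser. With Watir you would pass in the real browser object instead.

```ruby
# Hypothetical helper: visits each page in turn and collects whatever the
# block extracts from its HTML. `browser` must respond to #html and #link
# the way a Watir::Browser does; `last_page` is assumed to be known up front.
def scrape_pages(browser, last_page)
  (1..last_page).map do |page|
    data = yield(browser.html)
    browser.link(aria_label: 'Next').click if page < last_page
    data
  end
end

# Stand-in for a real browser, for demonstration only.
class FakeBrowser
  def initialize(pages)
    @pages = pages
    @index = 0
  end

  def html
    @pages[@index]
  end

  # Returns self so that .click can be chained, like browser.link(...).click.
  def link(**_locator)
    self
  end

  def click
    @index += 1
  end
end

browser = FakeBrowser.new(['<p>page 1</p>', '<p>page 2</p>', '<p>page 3</p>'])
pages = scrape_pages(browser, 3) { |html| html }
puts pages.length # prints 3
```

With a real browser, you would probably also wait for the new content to load after each click before reading the HTML again.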

Fun Project

For me, this project was a lot of fun to build. I learned a lot about how scrapers work and how useful they are in real life. I hope this article helps you build your own scraper.
Catch me on Twitter, GitHub, and LinkedIn.
