Last week I finished my Ruby curriculum at Microverse, so I was ready to build my Capstone Project, a solo project that comes at the end of each of the Microverse technical curriculum sections.
Building this project is very important because it's a real-world-like project with business specifications, and after we submit it we get feedback on the technical and soft skills we gained during the section.
I received the assignment to develop a web scraper in Ruby and use it on a website of my choice.
A web scraper is a program that retrieves data from a website (a process called "scraping"). If you've ever copied and pasted information from a website, you've performed the same function as a web scraper, just manually.
The first thing a scraper needs is a URL. The scraper then loads the entire HTML code for the page in question; some scrapers can even render the whole website, including CSS and JavaScript elements. Next, the scraper extracts either all the data on the page or specific data the user selected before running the program, so the user goes through the process of choosing exactly which data they want from the page. Finally, the web scraper outputs all the collected data in a format that is more useful to the user.
Let's talk about how we can build our scraper using Ruby in three steps.
The first thing we need to do is parse the HTML. Sometimes this task can be difficult without the right tools, but Ruby has an amazing gem, Nokogiri, that helps us parse an HTML web page.
Here you can find more specific information about Nokogiri.
For my project, I decided to build my bot to scrape the Hacker Noon coding section.
When I started targeting the different CSS selectors, I ran into a problem: I was not retrieving any information for the selectors I targeted. I started looking for the cause and found that the Hacker Noon website uses JavaScript to load some of its content, which appears a few seconds after the page loads. My bot was trying to get the information before it was loaded.
In my case, I needed to use two other gems, Watir and Webdrivers, to be able to get the information from my CSS selectors.
require 'nokogiri'
require 'open-uri'
require 'webdrivers'
require 'watir'

# Watir drives a real browser, so JavaScript-rendered content gets loaded
browser = Watir::Browser.new
browser.goto 'https://hackernoon.com/tagged/ruby'
Here you can find more information about the Watir gem.
After we parse the HTML, we need to get the information, and for that we have to tell our bot what to search for. To accomplish this we can again use the Nokogiri gem, which allows us to target any specific piece of information using the CSS selectors or XPath expressions that match the data we want.
Here you can find more specific information about CSS selectors and XPath in Nokogiri.
browser = Watir::Browser.new
browser.goto 'https://hackernoon.com/tagged/ruby'
browser.element(css: 'div#stats').wait_until(&:present?)
parsed_page = Nokogiri::HTML(browser.html)
To get the information, I used the wait_until method to wait until the CSS selector appeared on the webpage.
Once everything has loaded on the webpage, we can start getting all the information we want.
titles = parsed_page.css('div.stories-item')
art_in_page = titles.count
With this code we get the count of all the stories, that is, all the div elements that have the stories-item class.
s_arr = ['ruby'] # search keywords entered by the user (example value)
titles_arr = []
ref_arr = []

parsed_page.css('div.stories-item').each do |title|
  list_title = title.css('h2 a').text
  url = 'https://hackernoon.com'
  list_ref = url + title.css('h2 a')[0].attributes['href'].text
  list = { title: list_title, ref: list_ref }
  # keep only the stories whose title contains every search keyword
  if s_arr.all? { |i| list[:title].downcase.split.include?(i) }
    titles_arr.push(list[:title])
    ref_arr.push(list[:ref])
  end
end
Here is another example where I target the title and the link of each story and save that information in arrays when the title matches the keywords of the search.
Once we have all the information in our local environment, we are ready to store it in the file format of our preference, such as HTML, XML, CSV, or a database. You can see how I built this part to create a very simple web page with the results in my repository.
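As one possible sketch of the storage step (assuming the titles and links were collected into two parallel arrays, as in the filtering example above), the results can be written out as CSV with Ruby's standard library:

```ruby
require 'csv'

# Hypothetical results gathered by the scraper
titles_arr = ['How to Build a Scraper', 'Ruby Tips']
ref_arr    = ['https://hackernoon.com/a', 'https://hackernoon.com/b']

# Build a CSV string with a header row followed by one row per story
csv_string = CSV.generate do |csv|
  csv << %w[title url]
  titles_arr.zip(ref_arr) { |row| csv << row }
end

puts csv_string
```

To save the data to disk instead, `CSV.open('stories.csv', 'w')` takes the same block; swapping the format for XML or a database insert only changes this last step, not the scraping code.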
browser.link(aria_label: 'Next').click if page < last_page
In the case of the Hacker Noon website, it was not possible to navigate to another page just by changing the URL. So, to make it possible to change pages, I used the link and click methods to tell the bot where to click to go to the next page and continue scraping the remaining pages.
For me, this project was a lot of fun to build. I learned a lot about how scrapers work and how useful they are in real life. I hope this article helps you build your own scraper.