Effective Ways To Get More Out Of Vessel Framework

by Iurii Gurzhii, October 4th, 2020

Vessel is a fast, open source, high-level, web crawling and scraping framework for Ruby, built on top of Ferrum — a minimal-dependency pure-Ruby driver for running headless Google Chrome instances.

Why would you need a web crawler? Perhaps you're building a search engine for an intranet or a group of public-facing websites, or just need to mirror a website with finer-grained control than tools such as wget offer.

Crawl, walk, run

The best way to demonstrate Vessel's capabilities is with an example. Don't worry: being capable doesn't make Vessel hard to use.

To get started, add Vessel to your Gemfile:

gem "vessel"

Next, let's build the crawler class. Create a file named spider.rb and define a Spider class that derives from Vessel::Cargo. In it, configure the crawling parameters and provide a parse callback method, which is invoked for each page that's retrieved. If you don't provide one, Vessel::Cargo raises a NotImplementedError when a page is retrieved. The code is below:

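require "vessel"

class Spider < Vessel::Cargo
  # The domain to crawl and the URL to start from.
  domain "blog.scrapinghub.com"
  start_urls "https://blog.scrapinghub.com"

  # Called for the start URL and for every listing page requested below.
  def parse
    # Request each article linked from the listing page.
    css(".post-header>h2>a").each do |a|
      yield request(url: a.attribute(:href), method: :parse_article)
    end

    # Follow the pagination link to the next listing page.
    css("a.next-posts-link").each do |a|
      yield request(url: a.attribute(:href), method: :parse)
    end
  end

  # Called for each article page; emits its title to the block given to run.
  def parse_article
    yield page.title
  end
end

# Start the crawl and print each title as it is emitted.
Spider.run { |title| puts title }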

Most of this should be fairly self-explanatory. Behind the scenes, Vessel will employ a thread pool to perform the requests, defaulting to one thread per core (you can change this by adding threads max: n to the class definition).
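To make the thread pool idea concrete, here is a minimal sketch of the general pattern using only Ruby's standard library. It is not Vessel's implementation: crawl_with_pool and the fetch block are hypothetical names, and the block merely stands in for loading a page in the browser. The default of one worker per core mirrors the behaviour described above.

```ruby
require "etc"

# Toy worker pool: NOT Vessel's implementation, just the general pattern.
# `fetch` is a hypothetical stand-in for loading a page in a browser.
def crawl_with_pool(urls, max_threads: Etc.nprocessors, &fetch)
  jobs = Queue.new
  urls.each { |url| jobs << url }
  jobs.close # once closed and drained, pop returns nil and workers stop

  results = Queue.new
  workers = Array.new(max_threads) do
    Thread.new do
      while (url = jobs.pop)
        results << [url, fetch.call(url)]
      end
    end
  end
  workers.each(&:join)

  # Collect the [url, result] pairs into a hash keyed by URL.
  Array.new(results.size) { results.pop }.to_h
end

pages = crawl_with_pool(%w[/a /b /c], max_threads: 2) { |url| "title of #{url}" }
```

Because the workers pull from a shared queue, results arrive in no particular order, which is why they are gathered into a hash rather than an array.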

You can run the crawler with:

bundle exec ruby spider.rb

The output will be the title of each page as it's crawled and parsed by Chrome, and passed back to your Ruby class.
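If the way values travel from a parse method back to the block given to run seems opaque, the toy classes below illustrate the plain-Ruby mechanism. This is a sketch of the callback pattern only, not Vessel's internals: MiniCargo and TitleSpider are hypothetical, and no pages are fetched.

```ruby
# Toy base class: illustrates the callback pattern only, not Vessel's internals.
class MiniCargo
  # Hypothetical entry point: builds the crawler and forwards the caller's block.
  def self.run(&block)
    new.parse(&block)
  end

  # Subclasses are expected to override this.
  def parse
    raise NotImplementedError, "#{self.class} must implement #parse"
  end
end

class TitleSpider < MiniCargo
  def parse
    # Each yield hands one item to the block given to .run
    yield "First post"
    yield "Second post"
  end
end

titles = []
TitleSpider.run { |title| titles << title }
```

Each yield inside parse invokes the block once, so the caller decides what happens to every item: print it, collect it, or write it to a file.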

Fast as Chrome, dead simple and yet extendable

You can see from the example how easy it is to scrape — extract structured data from typically-unstructured web pages — using Ferrum's DOM methods.

The example code above simply follows (via the request method) two kinds of links, identified by their CSS selectors, and ignores everything else except the page title, which is ultimately emitted as output. You can perform any kind of information extraction you like here.
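When deciding which links to follow, keep in mind that href values scraped from a page are often relative or point off-site. The sketch below shows one way to normalise them with Ruby's standard URI library before requesting them. The helper same_site_url is hypothetical, and whether Vessel already performs this resolution for you is not something this article covers, so treat it as an optional safeguard.

```ruby
require "uri"

# Resolve an href against the page it was found on and keep it only if it
# stays on the allowed host. Returns nil for off-site or unparsable links.
# (Hypothetical helper, not part of Vessel or Ferrum.)
def same_site_url(page_url, href, allowed_host)
  absolute = URI.join(page_url, href)
  absolute.host == allowed_host ? absolute.to_s : nil
rescue URI::Error
  nil
end

on_site  = same_site_url("https://blog.scrapinghub.com/page/2", "/post-1", "blog.scrapinghub.com")
off_site = same_site_url("https://blog.scrapinghub.com/page/2", "https://example.com/x", "blog.scrapinghub.com")
```

Here on_site is the absolute URL of the post, while off_site is nil and can simply be skipped.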

And whilst scraping is powerful, scraping with a crawler gives you a lot more power. Rather than being confined to individual pages, Vessel lets you extract data across a whole site, or a set of sites, with complete control over which links are followed, what data is returned along the way, and what you do with it afterwards. Generate a CSV with collated tabular data? Sure, no problem. Or output JSON that you can feed into something else? That's straightforward, too.
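As a rough illustration of that last step, the sketch below turns a list of records into CSV and JSON using Ruby's csv and json libraries. The records are made-up sample data shaped like hashes a parse callback might yield; in a real crawler you would collect them in the block passed to run.

```ruby
require "csv"
require "json"

# Hypothetical records, shaped like hashes a parse callback might yield.
records = [
  { title: "First post",  url: "https://example.com/first" },
  { title: "Second post", url: "https://example.com/second" }
]

# CSV: header row from the first record's keys, then one row per record.
csv_text = CSV.generate do |csv|
  csv << records.first.keys.map(&:to_s)
  records.each { |record| csv << record.values }
end

# JSON: a single document containing every record.
json_text = JSON.pretty_generate(records)
```

On recent Ruby versions csv is packaged as a bundled gem, so you may need to add it to your Gemfile alongside vessel.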

https://github.com/rubycdp/vessel

In fact, with Vessel and Ferrum, you can crawl, parse, extract, and transform web content with so little effort, you'll wonder why you ever had to do it any other way before!