Vessel is a fast, open source, high-level, web crawling and scraping framework for Ruby, built on top of Ferrum — a minimal-dependency pure-Ruby driver for running headless Google Chrome instances.
Why would you need a web crawler? Perhaps you're building a search engine for an intranet or a group of public-facing websites, or just need to mirror a website with finer-grained control than tools such as wget offer.
The best way to demonstrate Vessel's capabilities is with an example. Don't worry: being this capable doesn't make Vessel hard to use.
To get started, add Vessel to your Gemfile:
gem "vessel"
Next, let's build the crawler class. Create a file named spider.rb and define a Spider class that derives from Vessel::Cargo. The class configures the crawling parameters and provides a parse callback method, which is invoked for each page that's retrieved. (If you don't provide one, Vessel::Cargo will raise a NotImplementedError when a page has been retrieved.) The code for that is below:
require "vessel"

class Spider < Vessel::Cargo
  domain "blog.scrapinghub.com"
  start_urls "https://blog.scrapinghub.com"

  def parse
    css(".post-header>h2>a").each do |a|
      yield request(url: a.attribute(:href), method: :parse_article)
    end

    css("a.next-posts-link").each do |a|
      yield request(url: a.attribute(:href), method: :parse)
    end
  end

  def parse_article
    yield page.title
  end
end

Spider.run { |title| puts title }
Most of this should be fairly self-explanatory. Behind the scenes, Vessel will employ a thread pool to perform the requests, defaulting to one thread per core (you can change this by adding threads max: n to the class definition).
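To make the thread-pool idea concrete, here is a minimal sketch that uses only Ruby's standard library. It is not Vessel's implementation: `fetch_all`, its `max:` parameter and the stand-in "fetch" block are hypothetical names chosen for illustration, and defaulting to `Etc.nprocessors` simply mirrors the one-thread-per-core default described above.

```ruby
require "etc"

# Hypothetical helper, not part of Vessel: run the given block once per URL
# on a fixed-size pool of worker threads and collect the results in a hash.
def fetch_all(urls, max: Etc.nprocessors, &fetch)
  queue = Queue.new
  urls.each { |url| queue << url }
  queue.close # once drained, a closed queue returns nil from #pop

  results = Queue.new
  workers = Array.new(max) do
    Thread.new do
      # Each worker keeps pulling URLs until the queue is empty.
      while (url = queue.pop)
        results << [url, fetch.call(url)]
      end
    end
  end
  workers.each(&:join)

  # All workers have finished, so the result count is final.
  Array.new(results.size) { results.pop }.to_h
end

# Stand-in for a real page fetch: just upcase the URL.
pages = fetch_all(%w[https://example.com/a https://example.com/b], max: 2) do |url|
  url.upcase
end
puts pages.inspect
```

Because workers finish in no particular order, the sketch keys each result by its URL rather than relying on position.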
You can run the crawler with:
bundle exec ruby spider.rb
The output will be the title of each page as it's crawled and parsed by Chrome, and passed back to your Ruby class.
You can see from the example how easy it is to scrape — extract structured data from typically-unstructured web pages — using Ferrum's DOM methods.
The example code above follows (via the request method) two kinds of links, identified by their CSS selectors: links to individual posts and the link to the next page of posts. It ignores everything else except the page title, which is emitted as output. You can perform any kind of information extraction of your choosing here.
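The mechanics of following some links and ignoring the rest can be sketched without a browser. The example below is an illustration only, not Vessel's API: `SITE` is a made-up in-memory site, `crawl` is a hypothetical helper, and links are filtered by host rather than by CSS selector to keep the sketch within the standard library. It resolves each href against the page it was found on, skips off-site links, and visits every URL at most once.

```ruby
require "set"
require "uri"

# Hypothetical in-memory "site": each URL maps to the hrefs found on that page.
SITE = {
  "https://blog.example.com/" => ["/post/1", "/page/2", "https://other.example.org/ad"],
  "https://blog.example.com/page/2" => ["/post/2", "/"],
  "https://blog.example.com/post/1" => [],
  "https://blog.example.com/post/2" => []
}.freeze

# Hypothetical helper, not part of Vessel: breadth-first crawl from start_url.
def crawl(start_url, site)
  host = URI(start_url).host
  seen = Set[start_url]
  queue = [start_url]
  visited = []

  until queue.empty?
    url = queue.shift
    visited << url
    site.fetch(url, []).each do |href|
      target = URI.join(url, href) # turn a relative href into an absolute URL
      next unless target.host == host # ignore links that leave the site
      # Set#add? returns nil when the URL was already seen, so nothing is queued twice.
      queue << target.to_s if seen.add?(target.to_s)
    end
  end
  visited
end

puts crawl("https://blog.example.com/", SITE)
```

In a real spider the filtering step is where the selectors come in: only the anchors matched by your `css` calls are ever turned into requests.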
And whilst scraping is powerful, scraping with a crawler gives you a lot more power. Rather than being confined to scraping individual pages, Vessel gives you the ability to extract data across a whole site, or set of sites, with complete control over exactly which links are followed, what data is returned along the way, and what you do with it afterwards. Generate a CSV with collated tabular data? Sure, no problem. Or output JSON that you can feed into something else? That's straightforward, too.
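As a rough sketch of the CSV and JSON ideas, the helpers below turn a list of scraped items into either format using Ruby's standard csv and json libraries. The `ITEMS` data and the `to_csv` and `to_json_lines` helpers are hypothetical and not part of Vessel; they assume each item is a flat hash with the same keys, such as a parse callback might yield.

```ruby
require "csv"
require "json"

# Hypothetical scraped items, shaped like hashes a parse callback might yield.
ITEMS = [
  { title: "First post", url: "https://blog.example.com/post/1" },
  { title: "Second post, with a comma", url: "https://blog.example.com/post/2" }
].freeze

# Collate items into one CSV document, using the first item's keys as the header row.
def to_csv(items)
  return "" if items.empty?

  CSV.generate do |csv|
    csv << items.first.keys.map(&:to_s)
    items.each { |item| csv << item.values }
  end
end

# Emit one JSON object per line (JSON Lines), which is easy to pipe into other tools.
def to_json_lines(items)
  items.map { |item| JSON.generate(item) }.join("\n")
end

puts to_csv(ITEMS)
puts to_json_lines(ITEMS)
```

In the spider shown earlier, the block passed to `Spider.run` receives each yielded value, so that block is a natural place to collect items before handing them to helpers like these.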
https://github.com/rubycdp/vessel
In fact, with Vessel and Ferrum, you can crawl, parse, extract, and transform web content with so little effort, you'll wonder why you ever had to do it any other way before!