Vessel is a fast, open source, high-level web crawling and scraping framework for Ruby, built on top of Ferrum, a minimal-dependency pure-Ruby driver for running headless Google Chrome instances.

Why would you need a web crawler? Perhaps you're building a search engine for an intranet or a group of public-facing websites, or you just need to mirror a website with finer-grained control than tools such as wget offer.

Crawl, walk, run

The best way to demonstrate Vessel's capabilities is with an example. Don't worry: being capable doesn't make Vessel hard to use. To get started, add Vessel to your Gemfile:

    gem "vessel"

Next, let's build the crawler class. Create a spider.rb, in which we'll define a Spider class that derives from Vessel::Cargo, configure the crawling parameters, and provide a parse callback method that will be invoked for each page that's retrieved (if you don't provide one, Vessel::Cargo will raise a NotImplementedError when a page has been retrieved). The code for that is below:

    require "vessel"

    class Spider < Vessel::Cargo
      # Restrict the crawl to this domain and start from the blog index.
      domain "blog.scrapinghub.com"
      start_urls "https://blog.scrapinghub.com"

      # Called for each index page: queue every article, then the next index page.
      def parse
        css(".post-header>h2>a").each do |a|
          yield request(url: a.attribute(:href), method: :parse_article)
        end

        css("a.next-posts-link").each do |a|
          yield request(url: a.attribute(:href), method: :parse)
        end
      end

      # Called for each article page: emit its title.
      def parse_article
        yield page.title
      end
    end

    Spider.run { |title| puts title }

Most of this should be fairly self-explanatory. Behind the scenes, Vessel employs a thread pool to perform the requests, defaulting to one thread per core (you can change this by adding threads max: n to the class definition). You can run the crawler with:

    bundle exec ruby spider.rb

The output will be the title of each page as it's crawled and parsed by Chrome, and passed back to your Ruby class.
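To make the thread-pool idea mentioned above a little more concrete, here is a minimal sketch in plain Ruby. It is not Vessel's actual implementation: crawl_in_pool and fetch_title are hypothetical names invented for this illustration, and fetch_title only fakes a page fetch so that the example runs without Chrome or a network connection. It simply shows a fixed number of workers (one per core by default) draining a shared queue of URLs.

```ruby
require "etc"

# Hypothetical stand-in for "request the page and extract its title".
# A real crawler would drive a browser here; this just builds a string.
def fetch_title(url)
  "Title of #{url}"
end

# Illustrative worker pool (an assumption, not Vessel's internals):
# max_threads workers pull URLs from a shared queue until it is empty.
def crawl_in_pool(urls, max_threads: Etc.nprocessors)
  jobs = Queue.new
  urls.each { |url| jobs << url }
  jobs.close # once drained, a closed queue makes pop return nil

  results = Queue.new
  workers = Array.new(max_threads) do
    Thread.new do
      while (url = jobs.pop)
        results << fetch_title(url)
      end
    end
  end
  workers.each(&:join)

  # Results arrive in completion order, which is not guaranteed to match input order.
  Array.new(results.size) { results.pop }
end

titles = crawl_in_pool(["https://example.com/a", "https://example.com/b"], max_threads: 2)
titles.each { |title| puts title }
```

Capping the pool size, as threads max: n does in Vessel, is what keeps a large crawl from opening an unbounded number of pages at once.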
Fast as Chrome, dead simple and yet extendable

You can see from the example how easy it is to scrape (that is, to extract structured data from typically unstructured web pages) using Ferrum's DOM methods. The example code above simply follows (via the request method) two different kinds of links, identified by their CSS-style selectors, and ignores everything else except the page title, which is ultimately emitted as output. You can perform any kind of information extraction of your choosing here.

And whilst scraping is powerful, scraping with a crawler gives you a lot more power: rather than being confined to individual pages, Vessel lets you extract data across a whole site, or a set of sites, with complete control over exactly which links are followed, what data is returned along the way, and what you do with it afterwards. Generate a CSV with collated tabular data? Sure, no problem. Or output JSON that you can feed into something else? That's straightforward, too. In fact, with Vessel (https://github.com/rubycdp/vessel) and Ferrum, you can crawl, parse, extract, and transform web content with so little effort, you'll wonder why you ever had to do it any other way before!
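As a rough illustration of the CSV and JSON output mentioned above, the sketch below uses only Ruby's standard csv and json libraries. The records array, its title and url fields, and the to_csv and to_json_lines helpers are all hypothetical, made up for this example; in a real crawler the hashes would be whatever your parse callbacks yield.

```ruby
require "csv"
require "json"

# Hypothetical records, shaped like hashes a parse callback might yield.
records = [
  { title: "First post", url: "https://example.com/first" },
  { title: "Second, with a comma", url: "https://example.com/second" }
]

# Collate records into CSV text: one header row, then one row per record.
def to_csv(records)
  return "" if records.empty?

  CSV.generate do |csv|
    csv << records.first.keys.map(&:to_s)
    records.each { |record| csv << record.values }
  end
end

# Emit one JSON object per line, convenient for piping into other tools.
def to_json_lines(records)
  records.map { |record| JSON.generate(record) }.join("\n")
end

puts to_csv(records)
puts to_json_lines(records)
```

Using a CSV library rather than joining fields by hand means values containing commas or quotes are escaped correctly.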


Effective Ways To Get More Out Of Vessel Framework
