Scrape any Website/Service/API with a single SQL Select Statement

Written by mikeptweet | Published 2017/01/18
Tech Story Tags: javascript | phantomjs | sql | web-scraping

TLDRvia the TL;DR App

I love SQL and it never ceases to amaze me what can be accomplished via the power of SQL syntax. In this example, we are going to create a simple wrapper that treats any Web page or HTTP endpoint as a SQL table that can be queried via a single SQL select statement. Furthermore we will demonstrate how we can takes the results of our query and export it into a variety of useful formats : (JSON,TEXT, TAB/CSV Delimited, EXCEL, etc).

This example leverages one of my favourite applications, PhantomJS.

PhantomJS | PhantomJS_Edit description_phantomjs.org

PhantomJS is a headless WebKit scriptable with a JavaScript API that enables such functionality as:

HEADLESS WEBSITE TESTING

Run functional tests with frameworks such as Jasmine, QUnit, Mocha, Capybara, WebDriver, and many others.Learn more

SCREEN CAPTURE

Programmatically capture web contents, including SVG and Canvas. Create web site screenshots with thumbnail preview. Learn more

PAGE AUTOMATION

Access and manipulate webpages with the standard DOM API, or with usual libraries like jQuery. Learn more

NETWORK MONITORING

Monitor page loading and export as standard HAR files. Automate performance analysis using YSlow and Jenkins. Learn more

In conjunction with PhantomJS, we are also going to leverage the awesome JavaScript SQL library- ALASQL from:

AlaSQL - javascript SQL database library_AlaSQL was written with pure JavaScript and does not use browser WebSQL database._alasql.org

AlaSQL — ‘à la SQL’ — is a lightweight client-side in-memory SQL database designed to work in browser and Node.js.

AlaSQL was written with pure JavaScript and does not use browser WebSQL database.

AlaSQL is fully functional compact sql server with JOINs, GROUPs, UNIONs, ANY, ALL, IN, subqueries and very limited transactions support.

AlaSQL supports ROLLUP(), CUBE() and GROUPING SETS() functions

AlaSQL works with all modern versions of browsers (Chrome, Firefox, IE, Safari), Node.js, and mobile iOS and Android.

AlaSQL is fast, because it uses some optimization methods.

The code to perform our task is as follows:

To make this work, you first need to install PhantomJS. The instructions are here:

Download | PhantomJS_Note: For this static build, the binary is self-contained. There is no requirement to install Qt, WebKit, or any other…_phantomjs.org

Once you’ve successfully downloaded PhantomJS, create a new text file called scrapepage.js using the above code. From your command prompt, you can then start using the application by simply typing in a command like the following:

phantomjs scrapepage.js "https://news.ycombinator.com/news" "select nodeName,href,innerText from document where className='storylink' limit 5"

this should produce a list something like the following:

or you could try something like the following:

phantomjs scrapePage.js "https://news.ycombinator.com/news" "select nodeName,count(nodeName) as Nodes INTO CSV() from document group by nodeName order by Nodes desc"

Notice that you can take advantage of most of the options of ALASQL, described here:

agershun/alasql_alasql - AlaSQL.js - JavaScript SQL database for browser and Node.js. Handles both traditional relational tables and…_github.com

You can obviously pipe the results of your query into a file as follows:

phantomjs scrapepage.js "url" "sql" > somefile.json

For example, let’s create an Excel spreadsheet by typing:

phantomjs scrapepage.js "https://news.ycombinator.com/news" "select nodeName,href,innerText INTO XLSXML({headers:true}) from document where className='storylink'" > mydata.xls

As you can see, you can pretty much takes the contents of any webpage and convert it into a useful format by executing a single command!

Each row in the documents table represents a single node in the HTML Dom (Document Object Model) Tree of the web page. To see all the fields on a particular node, open up the development tools in the browser and take a peek at the nodes in the console :

To access child properties on the nodes, use the syntax described here:

agershun/alasql_alasql - AlaSQL.js - JavaScript SQL database for browser and Node.js. Handles both traditional relational tables and…_github.com

For example, to access properties on the style object, you could specify:

select nodeName,style->border as border from document

To call methods,

select nodeName,attributes->getNamedItem("href") as href from document

This article just scratches the surface of what is possible using PhantomJS in conjunction with ALASQL. I highly recommend reading the documentation on both products to master the art of web page scrapping.

Please recommend this article if you liked it!


Published by HackerNoon on 2017/01/18