Rethinking headless browser automation with Chrome

by Joel Griffith, November 15th, 2017

After seeing numerous Dockerfiles, GitHub repositories, too many blog posts to count and a whole lot of Medium, I decided that I wanted to have an open discussion on how folks manage their headless browser work. Especially now that there are new libraries out there (headless Chrome, puppeteer and chromeless), I feel we’ve been given a second chance at using these tools in a better way.

This is something I’ve personally spent many days and months working on at browserless, as I knew that getting it right was the best chance of success for both it and its users (especially folks just wanting a way to run headless “jobs”, so to speak). First and foremost I considered what I, as a consumer, would want, but then the story progressed from there to “how did we even end up here?”

How we got here

Not so long ago developers were left with few choices in terms of browser-based automation. PhantomJS, nightmare, and a handful of others ruled the realm, and few questioned whether or not one of the big three (Google, Mozilla or Microsoft) would release a headless variant of their browsers. As great as those browsers are, there’s one thing most of them kind of suck at: working well on Linux. That Achilles’ heel is exactly where PhantomJS and a few others got things somewhat right: they installed and worked right away. As a matter of fact, you could pretty reliably install phantom and deploy an app with it in a matter of seconds. The issue was, and still is, that this is exactly what most folks did.

It was this mentality and means of deployment that got phantom more flak than it deserved. In most application environments you wouldn’t deploy a database alongside your server’s app, nor would you interact with it directly. Instead, you’d operate it separately (even if it is physically located on the same box) and talk with it over some socket connection. But more often than not developers didn’t do this. Instead they’d just invoke phantom, run some work, and call it a day. But what happens when your app starts getting hammered with requests? What happens when you want to massively parallelize work?

Back to the Future

I want to think back to the database as an example of how we should be doing distributed and parallelized browser work. Databases have all these cool things that come with them: connection pools and sockets, and they’re pretty easy to install and run. I’d say, without much hesitation, that databases have a decent service layer around their functionality. That is: they have a strict protocol governing how and when they can be interacted with.

To think about it in a different way, imagine this: writing and saving data manually to a flat file on each request. This is essentially what we’re asking headless browsers to do by the nature of how we’re interacting with them. Of course they’re going to consume a ton of resources and be flaky. Of course they are going to crash and be generally unavailable. You’re explicitly running a massive binary alongside your core application; what do you expect?

There’s a certain class of problems that are going to be tough or impossible to solve any way you spin them. You can only do so much to limit the amount of resources Chrome will consume, and you can only parallelize insofar as you have resources available to do so. But you can still use things like queues and load-balancing to make things more scalable. However, it’s hard to load-balance a binary when there’s no service boundary.
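
To make the load-balancing point concrete, here is a minimal sketch of what a service boundary buys you: once browsers sit behind endpoints, spreading jobs across them is a few lines of code. The `RoundRobinBalancer` class and the `ws://chrome-*.internal` URLs are hypothetical names made up for illustration, not part of any library mentioned here.

```typescript
// Hypothetical balancer over browser-service endpoints (placeholder URLs).
class RoundRobinBalancer {
  private endpoints: string[];
  private next = 0;

  constructor(endpoints: string[]) {
    if (endpoints.length === 0) {
      throw new Error("at least one endpoint is required");
    }
    this.endpoints = endpoints;
  }

  // Returns the next endpoint in rotation, wrapping around at the end.
  pick(): string {
    const endpoint = this.endpoints[this.next];
    this.next = (this.next + 1) % this.endpoints.length;
    return endpoint;
  }
}

const balancer = new RoundRobinBalancer([
  "ws://chrome-1.internal:3000",
  "ws://chrome-2.internal:3000",
]);

// Each job asks where to connect instead of spawning Chrome itself.
const assignments = ["job-a", "job-b", "job-c"].map(
  (job) => `${job} -> ${balancer.pick()}`
);
```

With two endpoints, `job-c` wraps back around to the first one. None of this is possible when every app process invokes its own browser binary, because there is nothing to route between.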

Where we should be

Browserless tries to solve some of these problems mostly by treating a web browser like a database, more or less. With a database you can’t use fancy workflows like lambdas or functions-as-a-service, otherwise your data would disappear after each invocation. Similarly, with a web browser, an invocation could last minutes or longer, which makes stateless lambdas tough. You also need a lot of assets for a browser to operate properly, which runs up against a lot of the limits inside serverless-type workflows. If you don’t believe me, go sort any headless library by date and witness folks fighting to get this working. I think it’s a novel way to do browser work, but it’s somewhat flawed fundamentally and fairly brittle. The fact is: if you want to remain flexible on how you interact with a browser, lambdas might not be the right choice.

I’d like to suggest that, instead, you treat the browser like any other service with some small differences. Instead of relying on stateless HTTP, you’ll have to let WebSockets run the show (which is how every remote Chrome library works at some level). This means that your headless service will need to proxy HTTP and WebSockets as well as babysit Chrome and concurrent connections. Sound like a pain? It kind of is. But you get pretty much everything you’d need as a consequence:
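
As a rough illustration of the "babysit concurrent connections" part, the sketch below shows one way a headless service could decide what to do with each incoming session: run it, queue it, or turn it away. `SessionGate`, its limits, and the session ids are all hypothetical; this is not how browserless is implemented, just the general shape of the idea.

```typescript
type Admission = "accepted" | "queued" | "rejected";

// Hypothetical gate a headless service might place in front of Chrome.
class SessionGate {
  private active: string[] = [];
  private waiting: string[] = [];
  private maxConcurrent: number;
  private maxQueued: number;

  constructor(maxConcurrent: number, maxQueued: number) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueued = maxQueued;
  }

  // Decide what happens to an incoming session.
  request(sessionId: string): Admission {
    if (this.active.length < this.maxConcurrent) {
      this.active.push(sessionId);
      return "accepted";
    }
    if (this.waiting.length < this.maxQueued) {
      this.waiting.push(sessionId);
      return "queued";
    }
    return "rejected";
  }

  // Called when a session closes; promotes the oldest waiting session, if any.
  release(sessionId: string): string | undefined {
    const index = this.active.indexOf(sessionId);
    if (index === -1) return undefined;
    this.active.splice(index, 1);
    const next = this.waiting.shift();
    if (next !== undefined) this.active.push(next);
    return next;
  }
}

// Allow two concurrent sessions and one waiting in line.
const gate = new SessionGate(2, 1);
const decisions = ["s1", "s2", "s3", "s4"].map((id) => gate.request(id));
const promoted = gate.release("s1");
```

Here `s1` and `s2` run, `s3` waits, and `s4` is turned away; when `s1` finishes, `s3` takes the freed slot. A real service would tie `request` and `release` to WebSocket open and close events.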

A clear path to scaling your app/tests. Easy to load balance and parallelize.
Ability to change your headless jobs without deploying a new version. Since you are driving remotely you’ll hardly ever need to deploy.
Sandboxing Chrome on its own infrastructure so it won’t bring everything down.
Easily switch between running scripts locally and running them against a headless farm. In browserless this is a single line of code to toggle.
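
The local-versus-remote toggle in the last point can be sketched as follows. The helper `resolveBrowserTarget` and the `ws://localhost:3000` URL are assumptions made up for this example; the commented-out calls use Puppeteer's `launch()` and `connect({ browserWSEndpoint })`, and are left as comments so the snippet runs without Puppeteer installed.

```typescript
// Describes how a script should obtain its browser.
type BrowserTarget =
  | { mode: "launch" }
  | { mode: "connect"; browserWSEndpoint: string };

// Pass an endpoint to drive a remote browser; omit it to launch one locally.
function resolveBrowserTarget(remoteEndpoint?: string): BrowserTarget {
  return remoteEndpoint
    ? { mode: "connect", browserWSEndpoint: remoteEndpoint }
    : { mode: "launch" };
}

// The single line that changes between environments (placeholder URL):
const target = resolveBrowserTarget("ws://localhost:3000");

// With Puppeteer installed, the result would be used roughly like this:
//   const browser = target.mode === "connect"
//     ? await puppeteer.connect({ browserWSEndpoint: target.browserWSEndpoint })
//     : await puppeteer.launch();
```

The rest of the script (opening pages, navigating, taking screenshots) stays the same either way, which is what makes driving the browser remotely cheap to adopt.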

You also can build some really cool tooling alongside this. For instance, browserless comes bundled with a nifty debugger that allows you to test and see what the browser is up to during execution.

The browserless debugger in action

What are you doing with Chrome?

I’m very curious to hear how you operate and run headless work in your application. Are you still deploying it alongside your app? Running it in lambdas? Not convinced of my position? I ask that you consider giving browserless a look over, as it solves a lot of problems I experienced personally. There are also hosted plans I’ve set up specifically so you can hit the ground running on your next project.