
7 Tips to Making Your Puppeteer Scripts More Bulletproof

by Joel Griffith, August 18th, 2022



Puppeteer comes out of the box with a lot of great features and makes it dead simple to get started automating your web workflows. In fact, you can likely get everything up and running locally within minutes. Because of its fairly simple API, you can automate whatever it is you're doing with relatively few lines of code.

But, like any other piece of software, the devil is in the details. Since puppeteer (and playwright) are both heavily asynchronous, there's a lot of opportunity for things to go wrong. If you're aggregating data or automating sites without APIs, then the ever-changing nature of the web can easily show up to ruin the fun. To be clear: it's just as easy for things to go wrong with puppeteer as it is for them to go right. A lot of times it's even easier 😢

I'm no stranger to puppeteer, or even web automation, and have even written my own headless web drivers in the past. It's been the longest standing passion of mine since I started programming over 10 years ago. With that said, today I'd like to go over a list of things every developer should do to make their puppeteer scripts more ironclad and graceful. It won't solve all your problems, but it'll give you enough situational awareness to make the issue(s) a lot easier to diagnose and fix. With that, let's jump in!

Change your page.goto calls

By default, a page.goto call times out after 30 seconds if the event it is waiting for ("load" unless you specify otherwise) never fires. Even though there are some great loading events you can key off of (like networkidle2), we generally recommend the domcontentloaded option when you know what you're doing, especially if you're after a particular page element or network call to wait for:

// DON'T do this:
// This waits until there have been no more than 2 network connections for at least 500 ms before proceeding.
await page.goto('https://www.example.com', { waitUntil: 'networkidle2' });
await page.waitForSelector('h1');

// Do this instead:
// This will navigate to the page and *immediately* begin waiting for the h1 selector
// once the page's initial HTML is returned. Often much quicker.
await page.goto('https://www.example.com', { waitUntil: 'domcontentloaded' });
await page.waitForSelector('h1');

This is essentially just optimizing your puppeteer code to run as fast as possible and not let slow network requests get in the way. Oftentimes we (browserless.io) will see a network request NOT respond, which breaks your whole script because goto never reaches networkidle2 and times out instead.

When shouldn't you use this? Well, the only time you really shouldn't is when you're dealing with the unknown. If you're not keenly aware of the page's layout, network calls, or DOM selectors, then networkidle0 or networkidle2 is about as close as you can get to knowing when the page is "interaction ready."
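
To make the "navigate, then wait for what you actually need" pattern reusable, here's a minimal sketch. The helper name gotoAndWaitFor, the PageLike type, and the 15-second budget are all assumptions made for illustration; PageLike only mirrors the two Page methods used here so the snippet stands alone without importing puppeteer.

```typescript
// Minimal structural type covering only the two Puppeteer Page methods used below.
// A real puppeteer Page should satisfy it, but treat that as an assumption to verify.
interface PageLike {
  goto(
    url: string,
    options?: {
      waitUntil?: 'load' | 'domcontentloaded' | 'networkidle0' | 'networkidle2';
      timeout?: number;
    },
  ): Promise<unknown>;
  waitForSelector(selector: string, options?: { timeout?: number }): Promise<unknown>;
}

// Hypothetical helper (not part of Puppeteer): return as soon as the initial HTML
// is parsed, then wait only for the one element the script cares about.
async function gotoAndWaitFor(
  page: PageLike,
  url: string,
  selector: string,
  timeoutMs = 15000, // assumed budget; Puppeteer's own default is 30 seconds
): Promise<void> {
  await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs });
  await page.waitForSelector(selector, { timeout: timeoutMs });
}
```

With a real page this would be called as await gotoAndWaitFor(page, 'https://www.example.com', 'h1'). Passing an explicit timeout to both calls means a stalled page fails on your schedule rather than on the 30-second default.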

Log your failed network calls

Lots of times we see that it's a simple network call that fails an entire workflow. This can be differences between running puppeteer locally versus on a cloud server, a site being down, or even networking issues elsewhere. Though the internet generally works, it still relies on physical devices being online and running, which isn't always the case.

The best thing you can do is to simply log failed requests and attempt to proceed:

// DON'T do this: there's nothing here catching _any_ network errors!
await page.goto('https://www.example.com', { waitUntil: 'domcontentloaded' });
await page.waitForSelector('h1');

// Do this instead:
page.on('response', (res) => {
  if (!res.ok()) {
    console.error(`Non-2xx response from this request: [${res.status()}] "${res.url()}"`);
  }
});

await page.goto('https://www.example.com', { waitUntil: 'domcontentloaded' });
await page.waitForSelector('h1');

While we're at it, we might as well wire up some kind of page error handler as well and get a utility in place for these kinds of logs/warnings.

// Drop in your favorite logging technology here. We're using console.error to illustrate.
const logWarning = (message) => console.error(message);

// Page errors here might be deal-breakers
page.on('pageerror', (err) => {
  logWarning(`Page error emitted: "${err.message}"`);
});

page.on('response', (res) => {
  if (!res.ok()) {
    logWarning(`Non-2xx response from this request: [${res.status()}] "${res.url()}"`);
  }
});

await page.goto('https://www.example.com', { waitUntil: 'domcontentloaded' });
await page.waitForSelector('h1');
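
One gap worth noting: a 'response' listener only fires when a response actually arrives. Requests that fail outright (DNS errors, refused connections, aborted requests) surface through Puppeteer's 'requestfailed' event instead. Here's a small sketch of a formatter for those; the FailedRequestLike type and the describeFailedRequest name are made up for this example and mirror only the request methods it reads.

```typescript
// Structural stand-in for the parts of Puppeteer's HTTPRequest that we read here.
interface FailedRequestLike {
  url(): string;
  method(): string;
  failure(): { errorText: string } | null;
}

// Hypothetical formatter: turns a failed request into a single log line.
function describeFailedRequest(req: FailedRequestLike): string {
  // failure() is null when the request did not fail, so fall back to a placeholder.
  const reason = req.failure()?.errorText ?? 'unknown error';
  return `Request failed: [${reason}] ${req.method()} "${req.url()}"`;
}

// Assumed wiring, using the `page` and `logWarning` from the snippets above:
// page.on('requestfailed', (req) => logWarning(describeFailedRequest(req)));
```

Keeping the formatting in a plain function makes it easy to unit test without launching a browser.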

Monitor the browser process

Depending on how you're connecting to the browser, there are a few things here that can help us make sure our scripts are resilient. The first is adding a listener for the disconnected event, as it indicates that either the browser itself has crashed or we've been disconnected. Doing this is pretty straightforward:

browser.once('disconnected', () => logWarning(`Browser has closed or crashed and we've been disconnected!`));

Now combine that with our prior code snippets for the following:

const logWarning = (message) => console.error(message);

browser.once('disconnected', () => logWarning(`Browser has closed or crashed and we've been disconnected!`));

const page = await browser.newPage();

page.on('pageerror', (err) => {
  logWarning(`Page error emitted: "${err.message}"`);
});

page.on('response', (res) => {
  if (!res.ok()) {
    logWarning(`Non-2xx response from this request: [${res.status()}] "${res.url()}"`);
  }
});

await page.goto('https://www.example.com', { waitUntil: 'domcontentloaded' });
await page.waitForSelector('h1');

Intermission

As you might have noticed, we've started to add a lot of listeners and other handlers! It's starting to become a bit unwieldy from a best-practices standpoint, and for the most part, all of this code is essentially just boilerplate. To say this another way: all of our code here has almost nothing to do with the automation side of puppeteer. Because of this, we should start thinking about making a higher-order class that abstracts all of this logic away and lets our business-oriented code (the code that's doing the actual automation) do its thing.

Let's define two modules: one to handle the events and logging/monitoring, and the second that just holds our business logic and consumes the first module. This also makes it easier for us later if we need to add more scripts as we've abstracted away all the management aspects of puppeteer into a reusable module. I'll illustrate this in TypeScript since it better aligns with our goal of having a more deterministic puppeteer experience.

// puppeteer-helper.ts
import { Browser, Page } from 'puppeteer';

export class PuppeteerHelper {
  private page?: Page;

  static log(...messages: string[]) {
    // Use your logging library here
    console.warn(...messages);
  }

  private disconnectListener = () => {
    PuppeteerHelper.log(
      `Browser has closed or crashed and we've been disconnected!`
    );
  };

  constructor(private browser: Browser) {
    browser.once('disconnected', this.disconnectListener);
  }

  public close = async () => {
    // The event is named 'disconnected', matching the listener registered in the constructor
    this.browser.off('disconnected', this.disconnectListener);
    await this.browser.close();
  };

  public newPage = async () => {
    const page = await this.browser.newPage();

    page.on('pageerror', (err) => {
      PuppeteerHelper.log(`Page error emitted: "${err.message}"`);
    });

    page.on('response', (res) => {
      if (!res.ok()) {
        PuppeteerHelper.log(`[${res.status()}]: "${res.url()}"`);
      }
    });

    // Capture the page so we can track stuff later
    this.page = page;

    return page;
  }
}

Now, with that out of the way, we can easily incorporate this into our simple script:

import puppeteer from 'puppeteer';
import { PuppeteerHelper } from './puppeteer-helper';

(async() => {
  const browser = new PuppeteerHelper(await puppeteer.launch());
  const page = await browser.newPage();

  await page.goto('https://www.example.com', { waitUntil: 'domcontentloaded' });
  await page.waitForSelector('h1');
  await browser.close();
})();

Perfect! Now that we've got a bit of a management layer in place, let's move on.

Avoiding 'file://' requests

More of a security precaution here, but we really (ideally) want to avoid Chrome making calls to the local file system if we can help it. On top of that, there's a slew of IP addresses that most cloud providers use to serve hosting metadata, and we want to firewall those away from Chrome as well.

I want to reiterate how important it is that we NEVER allow file-system access as it's crucial that things like /etc/passwd remain hidden. Unless you want someone taking over your cloud machines, it's best to add this right away.

static urlDeny: string[] = [
  'file://',
  'http://169.254',
  'https://169.254',
  '169.254'
];

static ipDeny: string[] = [
  '169.254',
  '192.168',
  '0.0.0.0',
];

// Later, let's use these and make sure they get rejected right away:
public newPage = async () => {
  const page = await this.browser.newPage();

  page.on('request', (req) => {
    if (PuppeteerHelper.urlDeny.some((url) => req.url().startsWith(url))) {
      PuppeteerHelper.log(`Blocking request and exiting: "${req.url()}"`);
      this.close();
    }
  });

  page.on('response', (res) => {
    const responseUrl = res.url();
    const remoteAddressIP = res.remoteAddress().ip;

    if (!res.ok()) {
      PuppeteerHelper.log(`[${res.status()}]: "${res.url()}"`);
    }

    if (responseUrl && PuppeteerHelper.urlDeny.some((url) => responseUrl.startsWith(url))) {
      PuppeteerHelper.log(`Blocking request URL and exiting: "${responseUrl}"`);
      this.close();
    }

    if (remoteAddressIP && PuppeteerHelper.ipDeny.some((ip) => remoteAddressIP.startsWith(ip))) {
      PuppeteerHelper.log(`Blocking request IP and exiting: "${responseUrl}"`);
      this.close();
    }
  })

  page.on('pageerror', (err) => {
    PuppeteerHelper.log(`Page error emitted: "${err.message}"`);
  });

  this.page = page;

  return page;
};

With any project, security concerns will come up along the way, so it's best to always treat any technology's security as a permanent work-in-progress. Here we have some assurance that the browser gets killed as soon as a request or response involves a sensitive URL or IP address, which limits what a bad actor can read.
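
Since the deny checks are plain prefix matching, they can be pulled into a standalone predicate and tested without a browser. The sketch below copies the two lists from the class; the isDenied name and its signature are assumptions for illustration, not anything Puppeteer provides.

```typescript
// Same entries as PuppeteerHelper.urlDeny and PuppeteerHelper.ipDeny.
const urlDeny: string[] = ['file://', 'http://169.254', 'https://169.254', '169.254'];
const ipDeny: string[] = ['169.254', '192.168', '0.0.0.0'];

// Hypothetical predicate: true when the URL, or the remote IP if one is known,
// starts with any denied prefix.
function isDenied(url: string, remoteIp?: string): boolean {
  const urlBlocked = urlDeny.some((prefix) => url.startsWith(prefix));
  const ipBlocked =
    remoteIp !== undefined && ipDeny.some((prefix) => remoteIp.startsWith(prefix));
  return urlBlocked || ipBlocked;
}
```

For example, isDenied('file:///etc/passwd') and isDenied('http://169.254.169.254/latest/meta-data/') both return true, while isDenied('https://www.example.com', '93.184.216.34') returns false. Keep in mind that prefix matching is coarse: it only covers the ranges you list, so extend the lists for any other private ranges your network uses.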

While we're on the subject of security, let's do one more thing...

Separate Chrome from your application

It's a common best practice to separate concerns in software. A good example is ensuring your application code runs on separate hardware from, say, your database. This gives the programmer a lot of flexibility on how to scale but also allows tighter controls on things like firewalls and connection pools for your database. Since we're dealing with a black box (Chrome) we should also exercise caution in working with it.

Instead of launching Chrome locally with puppeteer let's start Chrome elsewhere and connect to it. browserless makes this trivial as it's a sandboxing layer of sorts for Chrome, meaning you can run it in an ephemeral fashion totally segregated from your app code and data. This, combined with the network helper up above, makes it easy to have more confidence in your deployment.

I'll leave the docker aspects of running browserless as an exercise to you (we've got a lot of resources on that), but once you have it running (we assume localhost and port 3000 below), simply connect to it:

// DON'T launch locally!
const browser = new PuppeteerHelper(await puppeteer.launch());

// DO replace launch with connect:
const browser = new PuppeteerHelper(await puppeteer.connect({ browserWSEndpoint: 'ws://localhost:3000' }));

Now that we've got Chrome all situated and away from our codebase, let's do one last thing in the case of failures.

Take a picture, it lasts longer!

One of the MOST helpful things you can do is screenshot the webpage in the case of a failure. This, combined with network logging, can more easily give you clues as to what's going on. Scraping and automating the web isn't as deterministic as we'd like; however, giving yourself as much context and evidence as possible can make the experience a whole lot better.

In order to do that and save on some network bandwidth, let's just do a quick JPEG image at 50% quality. Setting the viewport to 1024x768 (for example with page.setViewport) should also keep the image size down, but you can do whatever you'd like. We'll adjust our 'close' method to capture a last screenshot so we can also keep track of page changes over time. This will help you immensely if you get notified at 2AM that something's wrong, since you'll have a picture in your mind of the page as it should be.

public close = async () => {
  // The event is named 'disconnected', matching the listener registered in the constructor
  this.browser.off('disconnected', this.disconnectListener);
  if (this.page) {
    // With `encoding: 'base64'`, screenshot here is a base64-encoded string of the jpeg file.
    // We don't do anything here with it since this greatly depends on your tools and tooling.
    // We'll catch issues here and just return `null` in cases of total failure
    const screenshot = await this.page.screenshot({ type: 'jpeg', quality: 50, fullPage: true, encoding: 'base64' }).catch(() => null);
  }
  await this.browser.close();
};

In order for this to take effect, we'll need to update our business-level code to catch errors and attempt this screenshot. This, too, is pretty straightforward:

import puppeteer from 'puppeteer';
import { PuppeteerHelper } from './puppeteer-helper';

(async() => {
  const browser = new PuppeteerHelper(await puppeteer.connect({ browserWSEndpoint: 'ws://localhost:3000' }));

  try {
    const page = await browser.newPage();
    await page.goto('https://www.example.com', { waitUntil: 'domcontentloaded' });
    await page.waitForSelector('h1');
  } catch (err) {
    PuppeteerHelper.log(`Error running script: ${(err as Error).message}`);
  } finally {
    // Close (and screenshot) exactly once, whether the script succeeded or failed
    await browser.close();
  }
})();
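
If you'd rather capture a screenshot only when something actually fails, the same idea can be expressed as a small wrapper. This is a sketch under assumptions: runWithScreenshot, RunResult, and ScreenshotPage are names invented for this example, and ScreenshotPage mirrors only the screenshot call used here so the snippet stands alone without puppeteer.

```typescript
// Structural stand-in for the one Page method we call.
interface ScreenshotPage {
  screenshot(options: {
    type: 'jpeg';
    quality: number;
    fullPage: boolean;
    encoding: 'base64';
  }): Promise<string>;
}

interface RunResult<T> {
  value?: T;
  error?: Error;
  screenshot: string | null; // base64 JPEG, or null if none was taken
}

// Hypothetical wrapper: run the task and, if it throws, try to grab a screenshot
// of the page as it looked at the moment of failure.
async function runWithScreenshot<T>(
  page: ScreenshotPage,
  task: () => Promise<T>,
): Promise<RunResult<T>> {
  try {
    return { value: await task(), screenshot: null };
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    // The screenshot itself can fail (for example if the browser crashed), so swallow that.
    const screenshot = await page
      .screenshot({ type: 'jpeg', quality: 50, fullPage: true, encoding: 'base64' })
      .catch(() => null);
    return { error, screenshot };
  }
}
```

The caller decides what to do with the result: log result.error, ship result.screenshot to whatever storage you use, and still close the browser in a finally block.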

Now that there's a visual history of what the page looks like, we can build a better understanding of how things change over time and feel more confident in our automation. Putting it all together, here's the module we've written:

// puppeteer-helper.ts
import { Browser, Page } from 'puppeteer-core';

export class PuppeteerHelper {
  private page?: Page;

  constructor(private browser: Browser) {
    browser.once('disconnected', this.disconnectListener);
  }

  // Add some static URLs and IPs to our class:
  static urlDeny: string[] = [
    'file://',
    'http://169.254',
    'https://169.254',
    '169.254'
  ];

  static ipDeny: string[] = [
    '169.254',
    '192.168',
    '0.0.0.0',
  ];

  static log(...messages: string[]) {
    // Use your logging library here
    console.warn(...messages);
  }

  private disconnectListener = () => {
    PuppeteerHelper.log(
      `Browser has closed or crashed and we've been disconnected!`
    );
  };

  public close = async () => {
    // The event is named 'disconnected', matching the listener registered in the constructor
    this.browser.off('disconnected', this.disconnectListener);
    if (this.page) {
      // With `encoding: 'base64'`, screenshot here is a base64-encoded string of the jpeg file.
      // We don't do anything here with it since this greatly depends on your tools and tooling.
      // We also catch and return null here in case something fails
      const screenshot = await this.page.screenshot({ type: 'jpeg', quality: 50, fullPage: true, encoding: 'base64' }).catch(() => null);
    }
    await this.browser.close();
  };

  public newPage = async () => {
    const page = await this.browser.newPage();

    page.on('request', (req) => {
      if (PuppeteerHelper.urlDeny.some((url) => req.url().startsWith(url))) {
        PuppeteerHelper.log(`Blocking request and exiting: "${req.url()}"`);
        this.close();
      }
    });

    page.on('response', (res) => {
      const responseUrl = res.url();
      const remoteAddressIP = res.remoteAddress().ip;

      if (responseUrl && PuppeteerHelper.urlDeny.some((url) => responseUrl.startsWith(url))) {
        PuppeteerHelper.log(`Blocking request URL and exiting: "${responseUrl}"`);
        this.close();
      }

      if (remoteAddressIP && PuppeteerHelper.ipDeny.some((ip) => remoteAddressIP.startsWith(ip))) {
        PuppeteerHelper.log(`Blocking request IP and exiting: "${responseUrl}"`);
        this.close();
      }
    })

    page.on('pageerror', (err) => {
      PuppeteerHelper.log(`Page error emitted: "${err.message}"`);
    });

    page.on('response', (res) => {
      if (!res.ok()) {
        PuppeteerHelper.log(`[${res.status()}]: "${res.url()}"`);
      }
    });

    this.page = page;

    return page;
  };
}

Feel free to treat this as a starting place of sorts to contain your management code for puppeteer. There's a lot you can do here, for instance tracking DOM changes over time or even all network requests. While these are all suggestions we'd make, it's entirely up to your use case what is or isn't interesting to note.

Where do we go from here?

This is just the starting point! We haven't talked about a lot of things you'll likely need to consider when running puppeteer:

How do we ensure we're not running too much traffic?
Can we make our scripts more reliable?
Tools and other things to consider?!

Be sure to follow us on our blog to get the best practices and more. If you're thinking about running your own Chrome cluster of instances, we'd love to hear your thoughts and offer some help!