The Chrome team made waves last year when it released Puppeteer, a NodeJS API for running headless Chrome instances. It represents a marked improvement in both speed and stability over existing solutions like PhantomJS and Selenium, and was named one of the ten best web scraping tools of 2018. However, it is not without its own set of warts, and getting Puppeteer running smoothly for large jobs brings its own complexities (at Scraper API, we use Puppeteer to scrape and render JavaScript from millions of web pages each month). Here are a few lessons we’ve learned.

Using Browserless

Running headless Chrome instances on the same server as your application code is generally a bad idea, as CPU and RAM usage can be unpredictable. To prevent a spike in Chrome’s resource usage from taking down your application server as well, it is best to run headless Chrome on its own server. Luckily, this is incredibly easy with the Browserless library. Here are the settings we use in production:

```shell
docker pull browserless/chrome
docker run -p 3000:3000 \
  -e "MAX_CONCURRENT_SESSIONS=5" \
  -e "MAX_QUEUE_LENGTH=0" \
  -e "PREBOOT_CHROME=true" \
  -e "TOKEN=YOURTOKEN" \
  -e "ENABLE_DEBUGGER=false" \
  -e "CONNECTION_TIMEOUT=300000" \
  --restart always \
  browserless/chrome
```

These settings time out Chrome sessions after 5 minutes (to prevent stray sessions from running indefinitely and eventually crashing your server) and allow up to 5 sessions at any given time. 5 concurrent sessions seems to be a sweet spot that runs comfortably on a $5 Digital Ocean VPS.
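Once the container is up, your scraper connects to it over a WebSocket URL that carries the `TOKEN` (and any Chrome flags) as query parameters. Here is a minimal sketch of a helper that assembles that URL; the host, port, and token are placeholder values matching the docker config above, and `buildBrowserlessEndpoint` is our own name, not part of any library:

```javascript
// Build the browserless WebSocket endpoint from its parts.
// Host, port, and token here are placeholders, not real credentials.
function buildBrowserlessEndpoint({ host, port, token, flags = {} }) {
  const params = new URLSearchParams({ TOKEN: token });
  // Any extra Chrome flags are passed through as additional query params.
  for (const [flag, value] of Object.entries(flags)) {
    params.append(flag, value);
  }
  return `ws://${host}:${port}/?${params.toString()}`;
}

const endpoint = buildBrowserlessEndpoint({
  host: 'localhost',
  port: 3000,
  token: 'YOURTOKEN',
});
console.log(endpoint); // ws://localhost:3000/?TOKEN=YOURTOKEN
```

The resulting string is what you pass as `browserWSEndpoint` when connecting, as shown in the next section.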
Browser Settings

There are a few browser-level Puppeteer settings you should know about to speed up your browser instances:

```javascript
// with browserless
const browser = await puppeteer.connect({
  browserWSEndpoint:
    'ws://' + browserless.ip + ':' + browserless.port +
    '?TOKEN=' + browserless.token +
    '&--proxy-server=' + proxy +
    '&--window-size=1920x1080' +
    '&--no-sandbox=true' +
    '&--disable-setuid-sandbox=true' +
    '&--disable-dev-shm-usage=true' +
    '&--disable-accelerated-2d-canvas=true' +
    '&--disable-gpu=true',
});

// without browserless
const browser = await puppeteer.launch({
  args: [
    '--proxy-server=' + proxy,
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
    '--disable-accelerated-2d-canvas',
    '--disable-gpu',
    '--window-size=1920x1080',
  ],
});
```

Because the Puppeteer library is still quite young and very actively developed, some of these flags may already be on by default by the time you read this. These are sensible defaults that we found in GitHub issues (like this and this) while debugging errors, and they will help you avoid the same cross-platform and hard-to-debug memory errors that we ran into.

Page Settings

Scraping a web page requires creating a new Page (this is what Puppeteer calls a new browser tab), navigating to the correct URL, and returning the HTML. Here are the Page-level settings we are using.
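The same flag list appears in both connection modes above, so it can help to keep it in one place. A small sketch of that idea; `defaultScrapeArgs` is our own helper name, not part of Puppeteer's API:

```javascript
// Shared list of Chrome flags used in both the connect and launch examples.
// defaultScrapeArgs is our own helper name, not a Puppeteer API.
function defaultScrapeArgs(proxy) {
  const args = [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
    '--disable-accelerated-2d-canvas',
    '--disable-gpu',
    '--window-size=1920x1080',
  ];
  // Only add a proxy flag when a proxy was actually supplied.
  if (proxy) {
    args.push('--proxy-server=' + proxy);
  }
  return args;
}
```

You would then pass `defaultScrapeArgs(proxy)` as the `args` option to `puppeteer.launch`, or join the entries into the browserless query string.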
```javascript
const blockedResourceTypes = [
  'image',
  'media',
  'font',
  'texttrack',
  'object',
  'beacon',
  'csp_report',
  'imageset',
];

const skippedResources = [
  'quantserve',
  'adzerk',
  'doubleclick',
  'adition',
  'exelator',
  'sharethrough',
  'cdn.api.twitter',
  'google-analytics',
  'googletagmanager',
  'google',
  'fontawesome',
  'facebook',
  'analytics',
  'optimizely',
  'clicktale',
  'mixpanel',
  'zedo',
  'clicksor',
  'tiqcdn',
];

const page = await browser.newPage();
await page.setRequestInterception(true);
await page.setUserAgent(userAgent);

page.on('request', request => {
  const requestUrl = request._url.split('?')[0].split('#')[0];
  if (
    blockedResourceTypes.indexOf(request.resourceType()) !== -1 ||
    skippedResources.some(resource => requestUrl.indexOf(resource) !== -1)
  ) {
    request.abort();
  } else {
    request.continue();
  }
});

const response = await page.goto(url, {
  timeout: 25000,
  waitUntil: 'networkidle2',
});

if (response._status < 400) {
  await page.waitFor(3000);
  let html = await page.content();
  return html;
}
```

There are a few things to notice here. Puppeteer has a waitUntil option that lets you define when a page is considered finished loading. ‘networkidle2’ means that there are no more than 2 active network connections open. This is a good setting because some websites (e.g. websites using websockets) always keep connections open, so with ‘networkidle0’ those pages would time out every time. Here is the full documentation for waitUntil. After navigation, we wait an additional 3 seconds to let the last couple of requests finish, and then return the HTML (after checking that the response status code is not an error).

When scraping at scale, you may not want to download every file on each web page, especially larger files like images. You can intercept requests by using the setRequestInterception command, and abort any requests that you don’t need to make.
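The interception callback above boils down to a pure decision on the request’s resource type and URL, so it is easy to pull out and unit-test without a browser. A sketch of that, using an abbreviated version of the blocklists above (`shouldBlockRequest` is our own name, not a Puppeteer API):

```javascript
// Abbreviated versions of the blocklists from the page-settings example.
const blockedResourceTypes = ['image', 'media', 'font', 'texttrack', 'object', 'beacon', 'csp_report', 'imageset'];
const skippedResources = ['doubleclick', 'google-analytics', 'googletagmanager', 'mixpanel'];

// Decide whether a request should be aborted, given its resource type and URL.
// Query strings and fragments are stripped before matching, as in the handler above.
function shouldBlockRequest(resourceType, url) {
  const requestUrl = url.split('?')[0].split('#')[0];
  return (
    blockedResourceTypes.indexOf(resourceType) !== -1 ||
    skippedResources.some(resource => requestUrl.indexOf(resource) !== -1)
  );
}
```

Note that because the query string is stripped first, a blocked domain appearing only in a URL parameter (e.g. `?ref=doubleclick`) will not trigger a block, which is usually what you want.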
You can see the documentation for Puppeteer resource types here. You can block any domain or subdomain just by adding it to the skippedResources list.

Using Proxies

When scraping a large number of pages on a single website, it may be necessary to use a proxy service to avoid blocks. One common issue with Puppeteer is that proxies can only be set at the Browser level, not the Page level, so each Page (browser tab) must use the same proxy. To use a different proxy with each page, you will need to use the proxy-chain module. Because Puppeteer/Chromium have some issues with stripping headers, it is safest to use the User-Agent header, which is reliably set on each request. Simply set up your proxy server to read the User-Agent from the request, and use a different proxy for each User-Agent. Here is a sample proxy server:

```javascript
const proxies = {
  'useragent1': 'http://proxyusername1:proxypassword1@proxyhost1:proxyport1',
  'useragent2': 'http://proxyusername2:proxypassword2@proxyhost2:proxyport2',
  'useragent3': 'http://proxyusername3:proxypassword3@proxyhost3:proxyport3',
};

const server = new ProxyChain.Server({
  port: 8000,
  prepareRequestFunction: ({request}) => {
    const userAgent = request.headers['user-agent'];
    const proxy = proxies[userAgent];
    return {
      upstreamProxyUrl: proxy,
    };
  },
});

server.listen(() => console.log('proxy server started'));
```

You can connect to this proxy server by following the example in the Browser Settings section above. This will allow you to set a different proxy server for each new Page based on the Page’s User-Agent, and will also allow you to connect to proxies that require password authentication (which Puppeteer does not currently support).

Hopefully this helps some of you avoid the painful edge cases we’ve encountered with Puppeteer. Happy scraping!
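The lookup inside prepareRequestFunction is just a dictionary read keyed on the User-Agent header. Here is that selection logic on its own, with placeholder proxy URLs; the fallback to a direct connection for unknown agents is our own addition, not part of the original server:

```javascript
// Placeholder User-Agent -> upstream proxy mapping (not real credentials).
const proxies = {
  'useragent1': 'http://user1:pass1@proxyhost1:8001',
  'useragent2': 'http://user2:pass2@proxyhost2:8002',
};

// Pick the upstream proxy for a request based on its User-Agent header.
// Returning null for an unknown agent (a direct connection) is our own choice.
function upstreamProxyFor(headers) {
  const userAgent = headers['user-agent'];
  return proxies[userAgent] || null;
}
```

In the ProxyChain server above, you would return `{ upstreamProxyUrl: upstreamProxyFor(request.headers) }` from prepareRequestFunction.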