Streamlining Workflow with Puppeteer

Written by altynberg | Published 2023/09/12
Tech Story Tags: javascript | puppeteer | nodejs | web-scraping | puppeteer-tutorial | workflow-automation | problem-solving | web-development

About the issue

I am a co-founder of a product called Alto's POS & Inventory. Every day, I review charts, logs, and numbers from multiple platforms to gain a complete understanding of whether everything is okay.

On average, this process takes about 15 minutes per day, but sometimes it extends to 30 minutes. After some basic calculations, I realized that over the course of one year it amounts to more than 12 workdays, a substantial amount of time to spend on repetitive tasks. Consequently, I decided to address this issue and free up that time for more productive work.
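
As a rough check of that figure, here is a minimal sketch of the arithmetic. It assumes the review happens every calendar day and that one workday is 8 hours; neither number is stated above, so treat both as assumptions.

```javascript
// Assumptions (not from the article): the review happens every calendar day
// and one workday is 8 hours long.
const DAYS_PER_YEAR = 365;
const HOURS_PER_WORKDAY = 8;

// Converts a daily time cost in minutes into 8-hour workdays per year.
function workdaysPerYear(minutesPerDay) {
  const hoursPerYear = (minutesPerDay * DAYS_PER_YEAR) / 60;
  return hoursPerYear / HOURS_PER_WORKDAY;
}

console.log(workdaysPerYear(15).toFixed(1)); // 11.4
console.log(workdaysPerYear(30).toFixed(1)); // 22.8
```

Under these assumptions, a flat 15 minutes comes to about 11.4 workdays, so a daily average of roughly 16 minutes is already enough to pass 12 workdays per year.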

Requirements

Now that we've identified the problem, let's outline the requirements for the solution:

  • A single-page dashboard that displays all the crucial information.
  • The ability to collect data from multiple platforms.
  • Processing of the collected data, combining sources if necessary, to extract concise insights.
  • An immediate warning on the dashboard, plus a notification, whenever something requires my attention.
  • Easy access to the dashboard for me and my co-founders, on any device and from anywhere.
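
To make the "warning when something requires attention" requirement more concrete, here is a minimal sketch of a rule check over collected numbers. The metric names (`errorCount`, `signupsToday`) and the rules themselves are hypothetical and only serve as an illustration.

```javascript
// Hypothetical rules: each one names a metric and a predicate that returns
// true when the value needs attention. The metric names are made up.
const rules = [
  { metric: 'errorCount', message: 'Errors in the logs', needsAttention: v => v > 0 },
  { metric: 'signupsToday', message: 'No sign-ups today', needsAttention: v => v === 0 },
];

// Returns one human-readable warning per rule that fires.
function findAlerts(metrics, rules) {
  return rules
    .filter(rule => rule.metric in metrics && rule.needsAttention(metrics[rule.metric]))
    .map(rule => `${rule.message} (${rule.metric} = ${metrics[rule.metric]})`);
}

console.log(findAlerts({ errorCount: 3, signupsToday: 12 }, rules));
// [ 'Errors in the logs (errorCount = 3)' ]
```

The returned strings could be rendered at the top of the dashboard and also sent out as notifications.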

About my solution

The solution should be worth the effort: if solving this problem takes a significant amount of time, it might be better to leave things as they are. So, let's strive to find a solution that is easy to implement. The parts of the solution are listed from the easiest to the most challenging:

Notifications

In my case, the easiest way to receive notifications is a Telegram bot. A bot is quick to set up, so I can start using it right away.
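
As an illustration, sending a message through the Telegram Bot API's `sendMessage` method can look like the sketch below. It assumes Node.js 18 or newer (where `fetch` is global) and a bot token and chat ID that you already have; the function names and the environment variable names are hypothetical.

```javascript
// Builds the HTTP request for the Telegram Bot API `sendMessage` method.
function buildSendMessageRequest(botToken, chatId, text) {
  return {
    url: `https://api.telegram.org/bot${botToken}/sendMessage`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ chat_id: chatId, text }),
    },
  };
}

// Sends the message; assumes Node.js 18+ where `fetch` is available globally.
async function notify(botToken, chatId, text) {
  const { url, options } = buildSendMessageRequest(botToken, chatId, text);
  const response = await fetch(url, options);
  const payload = await response.json();
  if (!payload.ok) {
    throw new Error(`Telegram API error: ${payload.description}`);
  }
  return payload.result;
}

// Usage (TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID are hypothetical variable names):
// await notify(process.env.TELEGRAM_BOT_TOKEN, process.env.TELEGRAM_CHAT_ID, 'Errors in the logs');
```

Keeping the request builder separate from the network call makes the payload easy to check without contacting Telegram.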

Hosting

To make the dashboard accessible on any device, at any time and from anywhere, it should be hosted on services like AWS or IONOS.

Data collection

Initially, I tried to find existing services that would allow me to easily gather data, but I couldn't find anything that suited my needs. So, I decided to explore the possibility of creating a custom solution.

My approach involved collecting data from the platforms we use through their APIs, processing that data, and presenting it on a dashboard. However, I encountered a challenge as many of these platforms didn't have available APIs for data access. This forced me to explore alternative methods, and I found that Puppeteer was the most suitable choice for my specific requirements.

Puppeteer

Puppeteer is a Node.js library which provides a high-level API to control Chrome/Chromium over the DevTools Protocol. Most things that you can do manually in the browser can be done using Puppeteer! Here are a few examples:

  • Generate screenshots and PDFs of pages.
  • Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. "SSR" (Server-Side Rendering)).
  • Automate form submission, UI testing, keyboard input, etc.
  • Create an automated testing environment using the latest JavaScript and browser features.
  • Capture a timeline trace of your site to help diagnose performance issues.
  • Test Chrome Extensions.

Here's a simple example of Puppeteer in action, which performs the following steps:

  1. Navigates to Google.com.
  2. Executes a search for "HackerNoon".
  3. Clicks on the first search result.
  4. Takes a screenshot.

import puppeteer from 'puppeteer';

(async (searchValue) => {
  // Launch the browser and open a new blank page
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();

    // Search: type the query and submit the form
    await page.goto('https://google.com');
    await page.locator('textarea').fill(searchValue);
    // Start waiting for the navigation before triggering it to avoid a race
    await Promise.all([
      page.waitForNavigation(),
      page.$eval('form', form => form.submit()),
    ]);

    // Go to the first link
    await Promise.all([
      page.waitForNavigation(),
      page.click(`div[data-async-context^="query:"] a`),
    ]);

    // Take a screenshot
    await page.screenshot({path: './screenshot.png'});
  } finally {
    // Close the browser even if one of the steps above fails
    await browser.close();
  }
})("HackerNoon");

Idea validation

Let's validate the idea with a simple implementation that closely resembles a real scenario.

Since we need to gather data from multiple platforms, let's choose a simpler scenario that is similar to this task, such as building a page that displays HackerNoon top stories and software engineering jobs.

The following code collects data from HackerNoon Top Stories and HackerNoon Jobs every hour, generates simple HTML content from this data, and then serves this HTML content when we receive an HTTP request. It's quite straightforward.

index.js:

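import http from 'http';
import * as scraper from './scraper.js';

(async () => {
  // Served until the first scrape finishes
  let scrapedHtml = 'Try again later...';
  http.createServer((req, res) => {
    res.writeHead(200, {'Content-Type': 'text/html; charset=utf-8'});
    res.end(scrapedHtml);
  }).listen(8080);

  // Keep the last good page if a scrape fails
  const refresh = async () => {
    try {
      scrapedHtml = await scrapeAll();
    } catch (err) {
      console.error('Scrape failed:', err);
    }
  };

  await refresh();
  setInterval(refresh, 60*60*1000); // every hour
})();

async function scrapeAll() {
  const browser = await scraper.launchBrowser();
  try {
    const [stories, jobs] = await Promise.all([
      scraper.getTopStories(browser),
      scraper.getJobs('Software Engineer', browser)
    ]);
    return `
      <h2>Top Stories</h2>
      <ul>${stories.map(e => linkToHtml(e.title, e.url)).join('')}</ul>

      <h2>Jobs</h2>
      <ul>${jobs.map(e => linkToHtml(e.title, e.url)).join('')}</ul>
    `;
  } finally {
    // Always release the browser, even when a scraper throws
    await browser.close();
  }
}

// Escape scraped text before putting it into HTML
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

function linkToHtml(title, url) {
  return `<li>
    <a target="_blank" rel="noopener" href="${escapeHtml(url)}">
      ${escapeHtml(title)}
    </a>
  </li>`;
}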

scraper.js:

import puppeteer, {Browser} from 'puppeteer';

/**
 * Launches the browser instance shared by all scrapers.
 * @returns {Promise<Browser>}
 */
export async function launchBrowser() {
  return await puppeteer.launch();
}

/**
 * Collects the title and URL of each HackerNoon top story.
 * @param {Browser} browser
 * @returns {Promise<Array<{title: string, url: string}>>}
 */
export async function getTopStories(browser) {
  const page = await browser.newPage();
  await page.goto('https://hackernoon.com/tagged/hackernoon-top-story');

  // Wait for articles
  await page.waitForSelector('main .story-card');

  // Get articles: read the title and URL of every link in a single call
  return page.$$eval('main .story-card h2 a', links =>
    links.map(el => ({
      title: el.textContent.trim(),
      url: el.href,
    }))
  );
}

/**
 * Searches HackerNoon Jobs and collects the title and URL of each result.
 * @param {string} keyword
 * @param {Browser} browser
 * @returns {Promise<Array<{title: string, url: string}>>}
 */
export async function getJobs(keyword, browser) {
  const page = await browser.newPage();
  await page.goto('https://jobs.hackernoon.com');

  // Search
  await page.locator('#search-jobkeyword input').fill(keyword);
  await page.click('button[type=submit]');

  // Wait for result
  await page.waitForSelector('.job-list-item');

  // Get jobs: the title combines the job name with the extra details.
  // Reading el.href assumes each .job-list-item is itself a link element.
  return page.$$eval('.job-list-item', items =>
    items.map(el => ({
      title: [
        el.querySelector('.job-name'),
        ...el.querySelectorAll('.desktop-view span')
      ].filter(Boolean).map(e => e.textContent.trim()).join(', '),
      url: el.href,
    }))
  );
}

The result is a plain HTML page with two lists of links: Top Stories and Jobs.

Advantages

From my perspective, this solution offers the following advantages:

  • Relatively quick implementation
  • A single method of data extraction
  • Ease of extension if I need to add an additional platform

Disadvantages

From my point of view, the disadvantages of this solution are as follows:

  • Adjustments are required whenever the user interface of any platform changes.
  • It may be challenging or impossible to extract data from certain platforms, particularly those with enhanced security measures.
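
One way to soften the first disadvantage is to make a broken scraper visible instead of letting it take down the whole refresh. The sketch below wraps a scraper so that a failure (for example, `page.waitForSelector` timing out because a selector no longer matches) or an empty result turns into a warning. The helper name `scrapeSafely` and the shape of its return value are hypothetical.

```javascript
// Runs one scraper and never throws: on failure it returns an empty list
// plus a warning that the dashboard can display.
async function scrapeSafely(name, scrape) {
  try {
    const items = await scrape();
    if (items.length === 0) {
      // An empty result often means the page layout changed
      return { name, items, warning: `${name}: no items found, selectors may be outdated` };
    }
    return { name, items, warning: null };
  } catch (error) {
    return { name, items: [], warning: `${name}: ${error.message}` };
  }
}

// Usage with the scrapers from scraper.js:
// const stories = await scrapeSafely('Top Stories', () => scraper.getTopStories(browser));
// if (stories.warning) console.warn(stories.warning);
```

With a wrapper like this, one platform changing its markup only blanks out its own section, and the warning shows which scraper needs attention.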

Conclusion

This solution proved effective for my situation, allowing me to resolve my problem quickly. If you believe there's a better approach to solving it, I would be delighted if you could share your insights.

You can access the source code for this example on GitHub.


Written by altynberg | Software Engineer
Published by HackerNoon on 2023/09/12