How To Convert HTML to Markdown with a Serverless Function

Outlined below is the setup for an AWS Lambda function which combines
fetching the HTML for a URL, stripping it back to just the essential
article content, and then converting it to Markdown. To deploy it you’ll
need an AWS account, and to have the Serverless Framework installed.

Step 1 - Download the full HTML for the URL

First, get the full HTML of the URL being converted. As this is
running in a Lambda function, I decided to try out an ultra-lightweight
Node HTTP client called phin (which is 95% smaller than my usual favourite, Axios):

const phin = require('phin');

// fetch the raw HTML for the given URL
const fetchPageHtml = async (fetchUrl) => {
  const response = await phin(fetchUrl);
  return response.body;
};
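The function above returns whatever body comes back, even for a 404 or 500 error page. A minimal sketch of a stricter version, assuming the client resolves to an object with `statusCode` and `body` (as phin does, as far as I know). The name `fetchHtmlOrThrow` is hypothetical, and the client is passed in as a parameter so it can be stubbed out:

```javascript
// Hypothetical wrapper: `client` is any phin-style function that resolves to
// an object with `statusCode` and `body`. It is injected so it can be stubbed.
const fetchHtmlOrThrow = async (client, url) => {
  const response = await client(url);
  const { statusCode } = response;
  // treat anything outside the 2xx range as a failed fetch
  if (statusCode < 200 || statusCode >= 300) {
    throw new Error(`Fetching ${url} failed with status ${statusCode}`);
  }
  // body may be a Buffer or a string; toString handles both
  return response.body.toString('utf8');
};
```

In the lambda this could be called as `await fetchHtmlOrThrow(phin, url)`, so an error page is never converted to Markdown by mistake.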

Step 2 - Convert to readable HTML

Converting to readable HTML is a feature originally offered by Instapaper (going back to 2008) as part of the core experience of a "read it later" service, but is now built into most browsers. Before converting to markdown it's a good idea to strip out the unnecessary parts of the HTML (adverts, menus, images, etc.) and keep just the text of the main article, in a clean and less distracting form.

This process won't work for every web page - it is designed for blog posts, news articles etc which have a clear "body content" section which can be the focus of the output.

Mozilla have open sourced their code for doing this in a Readability library, which can be reused here:

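// assumes the package exports the Readability constructor directly
const Readability = require("readability");
const JSDOM = require("jsdom").JSDOM;

// parse the page and return the extracted article (title, content, etc.)
const extractMainContent = (pageHtml, url) => {
  // passing the url lets jsdom resolve relative links in the page
  const doc = new JSDOM(pageHtml, {
    url,
  });
  const reader = new Readability(doc.window.document);
  const article = reader.parse();
  // return the whole article so the handler can use both title and content
  return article;
};

This returns the parsed article. Its content property holds the HTML for just the article in a more readable form, and its title property holds the page title.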
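As mentioned, this process won't work for every page, and as far as I know Readability's `parse()` returns null when it can't find any article content. A minimal sketch of a guard for that case; the name `requireArticle` is hypothetical and not part of the Readability library:

```javascript
// Hypothetical guard: `article` is whatever reader.parse() returned.
// Throws a descriptive error instead of failing later on a null value.
const requireArticle = (article, url) => {
  if (!article || !article.content) {
    throw new Error(`No readable article content found at ${url}`);
  }
  return article;
};
```

Calling this straight after `reader.parse()` gives a clear error message for pages without an obvious "body content" section.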

Step 3 - Convert readable HTML to markdown

There is a CLI tool called pandoc which converts HTML to markdown. The elevator pitch for pandoc is:

If you need to convert files from one markup format into another, pandoc is your swiss-army knife.

To try this out locally before running it from the lambda function, you can follow one of their installation methods, and then test it from the command line by piping an HTML file as the input:

cat sample.html | pandoc -f html -t commonmark-raw_html+backtick_code_blocks --wrap none

The options used here are:

`-f html` is the input format.

`-t commonmark` is the output format (a particular markdown flavour).

You can add extra configuration options to the output by adding them to the output name. A `-` prefix disables an extension and a `+` prefix enables one, so `commonmark-raw_html+backtick_code_blocks` disables the `raw_html` extension, so no plain HTML is included in the output, and enables the `backtick_code_blocks` extension, so that any code blocks are fenced with backticks rather than being indented.
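To make the format string less error-prone, it can be assembled from a base format plus lists of extensions. This is only a sketch; `buildPandocFormat` is a hypothetical helper of my own, not something pandoc provides:

```javascript
// Hypothetical helper: builds a pandoc format string such as
// "commonmark-raw_html+backtick_code_blocks" from its parts.
const buildPandocFormat = (base, { enable = [], disable = [] } = {}) =>
  base +
  disable.map((ext) => `-${ext}`).join('') + // "-" switches an extension off
  enable.map((ext) => `+${ext}`).join(''); // "+" switches an extension on

const outputFormat = buildPandocFormat('commonmark', {
  disable: ['raw_html'],
  enable: ['backtick_code_blocks'],
});

// the same arguments as the command line above, as an array
const pandocArgs = ['-f', 'html', '-t', outputFormat, '--wrap', 'none'];
console.log(pandocArgs.join(' '));
```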

The pandoc tool needs to be executed from within the node script, which involves spawning it in a child process, writing the HTML to the child's `stdin`, and then collecting the markdown output via the child's `stdout`.

Most of these functions have been taken from this very helpful blog post on working with stdout and stdin in nodejs.

First off, this is the generic streamWrite function, which allows you to pipe the HTML to the pandoc process by writing to the `stdin` stream of the child process.

const streamWrite = async (stream, chunk, encoding = 'utf8') =>
  new Promise((resolve, reject) => {
    const errListener = (err) => {
      stream.removeListener('error', errListener);
      reject(err);
    };
    stream.addListener('error', errListener);
    const callback = () => {
      stream.removeListener('error', errListener);
      resolve(undefined);
    };
    stream.write(chunk, encoding, callback);
  });

This similar function reads from the `stdout` stream of the child process, so you can collect the markdown that is output:

const { chunksToLinesAsync, chomp } = require('@rauschma/stringio');

// read the stream line by line, stripping the trailing newline from each line
const collectFromReadable = async (readable) => {
  const lines = [];
  for await (const line of chunksToLinesAsync(readable)) {
    lines.push(chomp(line));
  }
  return lines;
};

Finally this helper function converts the callback events for the child process into an “awaitable” async function:

const onExit = async (childProcess) =>
  new Promise((resolve, reject) => {
    childProcess.once('exit', (code) => {
      if (code === 0) {
        resolve(undefined);
      } else {
        reject(new Error('Exit with error code: '+code));
      }
    });
    childProcess.once('error', (err) => {
      reject(err);
    });
  });

To make the API a bit cleaner, here is that all wrapped up in a single helper function:

const { spawn } = require('child_process');

// spawns a child process, supplying stdin to the child STDIN, then reads from the child STDOUT and
// returns this as a string
const spawnHelper = async (command, stdin) => {
  // note: splitting on spaces means arguments containing spaces are not supported
  const commandParts = command.split(" ");
  const childProcess = spawn(commandParts[0], commandParts.slice(1));
  await streamWrite(childProcess.stdin, stdin);
  childProcess.stdin.end();
  // read the output and wait for the exit together, so the exit event can't be missed
  const [outputLines] = await Promise.all([
    collectFromReadable(childProcess.stdout),
    onExit(childProcess),
  ]);
  return outputLines.join("\n");
}
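For comparison, here is a rough sketch of the same idea using only Node's built-in modules. The name `runWithStdin` is hypothetical, and it takes the arguments as an array rather than splitting a command string, so arguments containing spaces are not a problem:

```javascript
const { spawn } = require('child_process');

// Hypothetical stdlib-only variant: writes `input` to the child's stdin and
// resolves with everything the child wrote to stdout.
const runWithStdin = (command, args, input) =>
  new Promise((resolve, reject) => {
    const child = spawn(command, args);
    const outChunks = [];
    const errChunks = [];
    child.stdout.on('data', (chunk) => outChunks.push(chunk));
    child.stderr.on('data', (chunk) => errChunks.push(chunk));
    // 'error' fires if the process could not be spawned at all
    child.once('error', reject);
    // 'close' fires once the process has ended and its stdio streams are closed
    child.once('close', (code) => {
      if (code === 0) {
        resolve(Buffer.concat(outChunks).toString('utf8'));
      } else {
        const stderrText = Buffer.concat(errChunks).toString('utf8');
        reject(new Error(`Exit with error code: ${code} ${stderrText}`));
      }
    });
    // writing can fail (for example EPIPE) if the child exits early
    child.stdin.on('error', reject);
    child.stdin.end(input);
  });
```

With this, the pandoc call would look something like `runWithStdin('/opt/bin/pandoc', ['-f', 'html', '-t', 'commonmark-raw_html+backtick_code_blocks', '--wrap', 'none'], html)`.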

This makes calling pandoc from the node script much simpler:

const convertToMarkdown = async (html) => {
  const convertedOutput = await spawnHelper('/opt/bin/pandoc -f html -t commonmark-raw_html+backtick_code_blocks --wrap none', html)
  return convertedOutput;
}

To run this as an AWS Lambda you need to include the pandoc binary. This is achieved by adding a shared Lambda layer which includes a
precompiled pandoc binary. Layers are extracted under /opt in the Lambda environment, which is why the script calls /opt/bin/pandoc. You can build this yourself, or just include the publicly published layer in your serverless config.

# function config
layers:
  - arn:aws:lambda:us-east-1:145266761615:layer:pandoc:1
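If the layer is missing or misconfigured, the first sign is usually a spawn error at request time. A small sketch of a start-up check, assuming the binary lives at /opt/bin/pandoc as in the script above; `isExecutable` is a hypothetical helper name:

```javascript
const fs = require('fs');

// Hypothetical check: true if the path exists and the current user
// has execute permission on it.
const isExecutable = (filePath) => {
  try {
    fs.accessSync(filePath, fs.constants.X_OK);
    return true;
  } catch (err) {
    return false;
  }
};

if (!isExecutable('/opt/bin/pandoc')) {
  console.warn('pandoc not found at /opt/bin/pandoc, is the layer attached?');
}
```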

Step 4 - Wrapping this up in the lambda handler function

Export a function from this module which has been configured as the
handler. This is the function AWS will run every time the lambda
receives a request.

module.exports.endpoint = async (event) => {
  const url = event.body
  const pageHtml = await fetchPageHtml(url);
  const article = await extractMainContent(pageHtml, url);
  const bodyMarkdown = await convertToMarkdown(article.content);
  // add the title and source url to the top of the markdown
  const markdown = `# ${article.title}\n\nSource: ${url}\n\n${bodyMarkdown}`
  return {
    statusCode: 200,
    body: markdown,
    headers: {
      'Content-type': 'text/markdown'
    }
  }
}
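The handler above trusts that the request body is a valid URL. A minimal sketch of validating it first, assuming the standard API Gateway proxy event, where `isBase64Encoded` signals an encoded body; `parseUrlFromEvent` is a hypothetical helper name:

```javascript
// Hypothetical helper: returns a normalised http(s) URL from the event body,
// or null if the body is not a usable URL.
const parseUrlFromEvent = (event) => {
  const raw = event.isBase64Encoded
    ? Buffer.from(event.body || '', 'base64').toString('utf8')
    : event.body || '';
  let parsed;
  try {
    parsed = new URL(raw.trim());
  } catch (err) {
    return null; // not parseable as a URL at all
  }
  // only allow web pages, not file:, ftp:, etc.
  return ['http:', 'https:'].includes(parsed.protocol) ? parsed.href : null;
};
```

When this returns null, the handler could respond with `{ statusCode: 400, body: 'Expected a URL in the request body' }` instead of attempting the fetch.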

This is the full `serverless.yml` configuration that is needed for serverless to deploy everything:

service: url-to-markdown

frameworkVersion: ">=1.1.0 <2.0.0"

provider:
  name: aws
  runtime: nodejs12.x
  region: us-east-1

functions:
  downloadAndConvert:
    handler: handler.endpoint
    timeout: 10
    layers:
      - arn:aws:lambda:us-east-1:145266761615:layer:pandoc:1
    events:
      - http:
          path: convert
          method: post

Wrap Up

The full source code is available on GitHub. Once deployed, you can test it from the command line like so:

curl -X POST -d 'https://www.atlasobscura.com/articles/actual-1950s-proposal-nuke-alaska' https://zm13c3gpzh.execute-api.us-east-1.amazonaws.com/dev/convert

Previously published at https://michaelvigor.dev/serverless-url-to-markdown/
