Outlined below is the setup for an AWS Lambda function which combines fetching the HTML for a URL, stripping it back to just the essential article content, and then converting it to Markdown. To deploy it you'll need an AWS account, and to have the serverless framework installed.

Step 1 - Download the full HTML for the URL

First get the full HTML of the URL being converted. As this is running in a lambda function I decided to try out an ultra-lightweight node http client called phin (which is 95% smaller than my usual favourite Axios):

```
const phin = require('phin')

const fetchPageHtml = async fetchUrl => {
  const response = await phin(fetchUrl)
  // by default phin resolves with the response body as a Buffer
  return response.body;
};
```

Step 2 - Convert to readable HTML

Converting to readable HTML is a feature originally offered by Instapaper (going back to 2008) as part of the core experience of a "read it later" service, but is now built into most browsers.

Before converting to markdown it's a good idea to strip out the unnecessary parts of the HTML (adverts, menus, images, etc), and just display the text of the main article in a clean and less distracting way. This process won't work for every web page. It is designed for blog posts, news articles and similar pages which have a clear "body content" section that can be the focus of the output.

Mozilla have open sourced their code for doing this in a library called Readability, which can be reused here:

```
const Readability = require("readability");
const JSDOM = require("jsdom").JSDOM;

const extractMainContent = (pageHtml, url) => {
  const doc = new JSDOM(pageHtml, {
    url,
  });
  const reader = new Readability(doc.window.document);
  const article = reader.parse();
  // return the whole article so the handler can use both article.title and article.content
  return article;
};
```

This returns the parsed article. Its `content` property holds the HTML for just the article in a more readable form, and its `title` property holds the article title.

Step 3 - Convert readable HTML to markdown

There is a CLI tool called pandoc which converts HTML to markdown. The elevator pitch for pandoc is:

> If you need to convert files from one markup format into another, pandoc is your swiss-army knife.
To try this out locally before running it from the lambda function, you can follow one of pandoc's installation methods, and then test it from the command line by piping an HTML file as the input:

```
cat sample.html | pandoc -f html -t commonmark-raw_html+backtick_code_blocks --wrap none
```

The options used here are:

- `-f html` is the input format
- `-t commonmark` is the output format (a particular markdown flavour)
- `--wrap none` stops pandoc from hard-wrapping lines in the output

You can add extra configuration options to the output by adding them to the output name. `commonmark-raw_html+backtick_code_blocks` sets the converter to disable the `raw_html` extension, so no plain HTML is included in the output. It also enables the `backtick_code_blocks` extension, so that any code blocks are fenced with backticks rather than being indented.

The pandoc tool needs to be executed from within the node script, which involves spawning it in a child process, writing the HTML to the child's `stdin` and then collecting the markdown output via the child's `stdout`. Most of these functions have been taken from this very helpful blog post on working with stdout and stdin in nodejs.

First off this is the generic streamWrite function, which allows you to pipe the HTML to the pandoc process by writing to the `stdin` stream of the child process:

```
const streamWrite = async (stream, chunk, encoding = 'utf8') =>
  new Promise((resolve, reject) => {
    const errListener = (err) => {
      stream.removeListener('error', errListener);
      reject(err);
    };
    stream.addListener('error', errListener);
    const callback = () => {
      stream.removeListener('error', errListener);
      resolve(undefined);
    };
    stream.write(chunk, encoding, callback);
  });
```

This similar function reads from the `stdout` stream of the child process, so you can collect the markdown that is output:

```
const {chunksToLinesAsync, chomp} = require('@rauschma/stringio');

const collectFromReadable = async (readable) => {
  let lines = [];
  for await (const line of chunksToLinesAsync(readable)) {
    lines.push(chomp(line));
  }
  return lines;
}
```

Finally this helper function converts the callback events for the child process into an "awaitable" async function:

```
const onExit = async (childProcess) =>
  new Promise((resolve, reject) => {
    childProcess.once('exit', (code) => {
      if (code === 0) {
        resolve(undefined);
      } else {
        reject(new Error('Exit with error code: ' + code));
      }
    });
    childProcess.once('error', (err) => {
      reject(err);
    });
  });
```

To make the API a bit cleaner, here is that all wrapped up in a single helper function:

```
const { spawn } = require('child_process');

// spawns a child process, supplying stdin to the child STDIN, then reads from the child STDOUT and
// returns this as a string
const spawnHelper = async (command, stdin) => {
  const commandParts = command.split(" ");
  const childProcess = spawn(commandParts[0], commandParts.slice(1))
  await streamWrite(childProcess.stdin, stdin);
  childProcess.stdin.end();
  // wait for the output and the exit together, so an 'exit' event that fires
  // while stdout is still being read is not missed
  const [outputLines] = await Promise.all([
    collectFromReadable(childProcess.stdout),
    onExit(childProcess),
  ]);
  return outputLines.join("\n");
}
```

This makes calling pandoc from the node script much simpler:

```
const convertToMarkdown = async (html) => {
  const convertedOutput = await spawnHelper('/opt/bin/pandoc -f html -t commonmark-raw_html+backtick_code_blocks --wrap none', html)
  return convertedOutput;
}
```
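For comparison, the same write-then-collect flow can be sketched using only Node's built-in `child_process` module. This is a minimal sketch rather than the code used in this post: `runWithStdin` and `upperCaseScript` are hypothetical names, the arguments are passed as an array instead of splitting a command string on spaces, and the demo uses Node itself as a stand-in child process so that it runs without pandoc installed.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper (not from the original post): runs `command` with an
// explicit argument array, feeds `input` to its stdin, resolves with its stdout.
const runWithStdin = (command, args, input) =>
  new Promise((resolve, reject) => {
    const child = spawn(command, args);
    const chunks = [];
    let stderr = '';

    child.stdout.on('data', (chunk) => chunks.push(chunk));
    child.stderr.on('data', (chunk) => { stderr += chunk; });
    child.on('error', reject);       // e.g. the binary could not be started
    child.stdin.on('error', reject); // e.g. the child closed stdin early
    // 'close' fires once the process has ended and its stdio streams are closed
    child.on('close', (code) => {
      if (code === 0) {
        resolve(Buffer.concat(chunks).toString('utf8'));
      } else {
        reject(new Error(`Exit with error code: ${code}\n${stderr}`));
      }
    });

    // write the input and signal end-of-input in one call
    child.stdin.end(input, 'utf8');
  });

// Stand-in for pandoc: a tiny Node script that upper-cases whatever it reads on stdin.
const upperCaseScript =
  "let s = ''; process.stdin.setEncoding('utf8');" +
  "process.stdin.on('data', (d) => { s += d; });" +
  "process.stdin.on('end', () => process.stdout.write(s.toUpperCase()));";

runWithStdin(process.execPath, ['-e', upperCaseScript], '<p>hello</p>')
  .then(console.log); // prints "<P>HELLO</P>"
```

Passing the arguments as an array means an argument containing a space is not split apart, and attaching all the listeners before writing to `stdin` means no event can fire before something is listening for it. The pandoc call would then look something like `runWithStdin('/opt/bin/pandoc', ['-f', 'html', '-t', 'commonmark-raw_html+backtick_code_blocks', '--wrap', 'none'], html)`.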
To run this as an AWS lambda you need to include the pandoc binary. This is achieved by adding a shared lambda layer which includes a precompiled pandoc binary. You can build this yourself, or just include the public published layer in your serverless config.

```
# function config
layers:
  - arn:aws:lambda:us-east-1:145266761615:layer:pandoc:1
```

Step 4 - Wrapping this up in the lambda handler function

Export a function from this module which has been configured as the handler. This is the function AWS will run every time the lambda receives a request.

```
module.exports.endpoint = async (event) => {
  const url = event.body
  const pageHtml = await fetchPageHtml(url);
  const article = await extractMainContent(pageHtml, url);
  const bodyMarkdown = await convertToMarkdown(article.content);

  // add the title and source url to the top of the markdown
  const markdown = `# ${article.title}\n\nSource: ${url}\n\n${bodyMarkdown}`

  return {
    statusCode: 200,
    body: markdown,
    headers: {
      'Content-type': 'text/markdown'
    }
  }
}
```

This is the full configuration (`serverless.yml`) that is needed for serverless to deploy everything:

```
service: url-to-markdown

frameworkVersion: ">=1.1.0 <2.0.0"

provider:
  name: aws
  runtime: nodejs12.x
  region: us-east-1

functions:
  downloadAndConvert:
    handler: handler.endpoint
    timeout: 10
    layers:
      - arn:aws:lambda:us-east-1:145266761615:layer:pandoc:1
    events:
      - http:
          path: convert
          method: post
```

Wrap Up

The full source code is available on github. Once deployed you can test it from the command line like so:

```
curl -X POST -d 'https://www.atlasobscura.com/articles/actual-1950s-proposal-nuke-alaska' https://zm13c3gpzh.execute-api.us-east-1.amazonaws.com/dev/convert
```

Previously published at https://michaelvigor.dev/serverless-url-to-markdown/