Maciej Cieślar

@maciejcieslar

How I Used My Programming Skills to Save Over 8 Hours of Writing Work

Photo by Kevin Ku on Unsplash

Recently on Facebook, David Smooke (the CEO of Hackernoon) posted an article in which he listed 2018’s Top Tech Stories. He also mentioned that if someone wished to make a similar list about, say, JavaScript, he would be happy to feature it on the front page of Hackernoon.

In a constant struggle to get more people to read my work I could not miss this opportunity, so I immediately started to plan how to approach making such a list.

Since the year was coming to an end and I had limited time, I decided not to search for the posts by hand but to use my web-scraping skills instead.

I believe learning how to make such a scraper can be a useful exercise and serve as an interesting case study.

If you have read my article about How I created an Instagram bot, then you know that the best way to interact with websites from Node.js is to use the Puppeteer library, which controls a Chromium instance. This way we can do everything a potential user could do on a website.

Here is the link to the repository.

Creating a scraper

Let’s abstract away the creation of Puppeteer’s browser and pages with this simple helper:

import * as puppeteer from 'puppeteer'

const createBrowser = async () => {
  const browser = await puppeteer.launch({ headless: true })

  // Opens `url` in a new page, runs `callback` with it and always closes the page.
  return async function getPage<T>(url: string, callback: (page: puppeteer.Page) => Promise<T>) {
    const page = await browser.newPage()

    try {
      await page.goto(url, { waitUntil: 'domcontentloaded' })

      // Forward console.log calls made inside page.evaluate to the Node.js console.
      page.on('console', (msg) => console.log(msg.text()))

      return await callback(page)
    } finally {
      // Runs whether the callback succeeded or threw.
      await page.close()
    }
  }
}

We use the page inside a callback so that we can avoid repeating the same code over and over again. Thanks to this helper we don’t need to worry about navigating to a given URL, forwarding console.log output from inside page.evaluate, or closing the page after everything is done. The helper returns a promise that resolves to whatever the callback returns, so we can simply await it and use the result outside the callback.

Let’s talk about the data

There is a page where we can find all the articles with the JavaScript tag published by Hackernoon. They are sorted by date, but occasionally an article published much earlier (in 2016, for example) shows up among them, so we have to watch out for this.

We can extract all the needed information from this post preview alone — without actually opening the post in a new tab which makes our work much easier.

The box shown above contains all the data we want:

  1. Author’s name and the URL of their profile
  2. Title of the article and its URL
  3. Number of claps
  4. Read time
  5. Date

Here’s the interface of an article:

interface Article {
  articleUrl: string
  date: string
  claps: number
  articleTitle: string
  authorName: string
  authorUrl: string
  minRead: string
}

Medium uses infinite scroll, which means that more articles are loaded as we scroll down. If we were to fetch the static HTML with GET requests and parse it with a library such as JSDOM, getting those articles would be impossible because static HTML cannot be scrolled. That is why Puppeteer is a life-saver when it comes to any kind of interaction with a website.

To get all the loaded posts we can use:

Array.from(document.querySelectorAll('.postArticle'))
  .slice(offset)
  .map((post) => {})

Now we can use each post as a context for the selectors — instead of writing document.querySelector we are now going to write post.querySelector. This way we can restrict the search only to a given post element.

Also, notice the .slice(offset) snippet: since we are scrolling down and not opening a new page, the already parsed articles are still there. Of course we could parse them again, but that would be wasteful. The offset starts at 0, and every time we scrape some articles we add the length of the collection to the offset.

offset += scrapedArticles.length
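
To see why the offset keeps every post from being parsed twice, here is a small sketch that needs no browser. The `loadedPosts` array is only a stand-in for the growing result of document.querySelectorAll('.postArticle'), and the batches are made-up data:

```typescript
// Stand-in for the list of post elements currently present in the DOM.
const loadedPosts: string[] = []

// Made-up batches: what each "scroll" adds to the page.
const batches: string[][] = [['a', 'b', 'c'], ['d', 'e'], ['f']]

let offset = 0
const scrapedPerRound: string[][] = []

for (const batch of batches) {
  // Scrolling appends new posts; the old ones stay where they are.
  loadedPosts.push(...batch)

  // Only posts at index >= offset have not been parsed yet.
  const scrapedArticles = loadedPosts.slice(offset)
  scrapedPerRound.push(scrapedArticles)

  offset += scrapedArticles.length
}
```

After the loop each round has seen only its own batch, and `offset` equals the total number of posts on the page.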

Scraping the data of a post

The most common error when it comes to scraping data is “Cannot read property ‘textContent’ of null”. We are going to create a simple helper function that prevents us from ever trying to get a property of a non-existing element.

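function safeGet<T extends Element, K>(
  element: T | null,
  callback: (element: T) => K,
  // typed loosely so that callers can pass '', 0 or {} as a fallback
  fallbackValue: any = null,
): K {
  // querySelector returns null when nothing matches
  if (!element) {
    return fallbackValue
  }

  return callback(element)
}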

safeGet will only execute the callback if the element exists. Now let’s use it to access the properties of elements holding the data we are interested in.
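
Before that, here is a DOM-free sketch of the same idea, so the behaviour can be checked outside a browser. `safeGetValue`, `FakeElement`, `fakeDom` and `query` are hypothetical names used only for this illustration; they are not part of the scraper:

```typescript
// Variant of safeGet that accepts any nullable value, not only DOM elements.
function safeGetValue<T, K>(
  value: T | null | undefined,
  callback: (value: T) => K,
  fallbackValue: K,
): K {
  if (value === null || value === undefined) {
    return fallbackValue
  }

  return callback(value)
}

// Hypothetical stand-in for a DOM element.
interface FakeElement {
  textContent: string
}

// Hypothetical stand-in for post.querySelector: returns null when nothing matches.
const fakeDom: Record<string, FakeElement> = { h3: { textContent: 'Hello' } }
const query = (selector: string): FakeElement | null => fakeDom[selector] || null

const foundText = safeGetValue(query('h3'), (el) => el.textContent, 'n/a')
const missingText = safeGetValue(query('time'), (el) => el.textContent, 'n/a')
```

Here `foundText` is 'Hello', while `missingText` falls back to 'n/a' because the callback is never called for a missing element.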

Date when an article was published

const dateElement = post.querySelector('time')
const date = safeGet(
  dateElement,
  (el) => new Date(el.dateTime).toUTCString(),
  '',
)

If dateElement cannot be found for some reason, safeGet prevents an error and returns the fallback value. The <time> element has a dateTime property (mirroring its datetime attribute) that holds a string representation of the date when the article was published.

// <a> elements are HTMLAnchorElement (HTMLLinkElement is the <link> tag)
const authorDataElement = post.querySelector<HTMLAnchorElement>(
  '.postMetaInline-authorLockup a[data-action="show-user-card"]',
)

const { authorUrl, authorName } = safeGet(
  authorDataElement,
  (el) => {
    return {
      authorUrl: removeQueryFromURL(el.href),
      authorName: el.textContent,
    }
  },
  {},
)

Inside this <a> element we can find both the user’s profile URL and their name.

Also, here we use removeQueryFromURL because both the author’s profile URL and post’s URL have this weird source parameter in the query that we would like to remove:

https://hackernoon.com/javascript-2018-top-20-hackernoon-articles-of-the-year-9975563216d1?source=———1———————

The ? character in a URL denotes the start of query parameters, so let’s simply remove everything after it.

const removeQueryFromURL = (url: string) => url.split('?').shift()

We split the string at ? and return only the first part.
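
As a quick check of that behaviour, the sketch below uses an equivalent indexOf-based version (chosen only to keep the return type a plain string) on two made-up URLs. Note that a #fragment that follows the query is dropped as well:

```typescript
// Equivalent to url.split('?').shift(): keep everything before the first "?".
const removeQueryFromURL = (url: string): string => {
  const queryStart = url.indexOf('?')

  return queryStart === -1 ? url : url.slice(0, queryStart)
}

// Made-up example URLs, not real articles.
const urlWithQuery = 'https://hackernoon.com/some-post-123abc?source=tag_page'
const urlWithoutQuery = 'https://hackernoon.com/@someone'

const cleanedUrl = removeQueryFromURL(urlWithQuery)
const untouchedUrl = removeQueryFromURL(urlWithoutQuery)
```

`cleanedUrl` is 'https://hackernoon.com/some-post-123abc', and a URL without a query comes back unchanged.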

Claps

In the example post above we see that the number of “claps” is 204, which is exact. However, once the number reaches 1,000 it is displayed in a rounded form such as 1K, 2K or 2.5K. This could be a problem if we needed the exact number of claps. In our use case this rounding works just fine.

const clapsElement = post.querySelector('span > button')

const claps = safeGet(
  clapsElement,
  (el) => {
    const clapsString = el.textContent

    if (clapsString.endsWith('K')) {
      return Number(clapsString.slice(0, -1)) * 1000
    }

    return Number(clapsString)
  },
  0,
)

If the string representation of claps ends with K, we just remove the letter K and multiply the remaining number by 1000. Pretty straightforward stuff.
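
The conversion can also be pulled out into a standalone function and tried on a few labels. This is a sketch with two small additions that are not in the original code: the label is trimmed, and Math.round guards against floating-point noise from the multiplication. `parseClaps` is a hypothetical name:

```typescript
// Converts clap labels such as "204", "1K" or "2.5K" to numbers.
const parseClaps = (label: string): number => {
  const trimmed = label.trim()

  // Same check as endsWith('K'), written with slice.
  if (trimmed.slice(-1) === 'K') {
    // Round so that the result is always a whole number of claps.
    return Math.round(Number(trimmed.slice(0, -1)) * 1000)
  }

  return Number(trimmed)
}

const exactClaps = parseClaps('204')
const roundedClaps = parseClaps('2.5K')
```

`exactClaps` is 204 and `roundedClaps` is 2500. An empty label yields 0, because Number('') is 0.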

Article’s url and title

const articleTitleElement = post.querySelector('h3')
const articleTitle = safeGet(
  articleTitleElement,
  (el) => el.textContent
)
// <a> elements are HTMLAnchorElement (HTMLLinkElement is the <link> tag)
const articleUrlElement = post.querySelector<HTMLAnchorElement>(
  '.postArticle-readMore a',
)
const articleUrl = safeGet(
  articleUrlElement,
  (el) => removeQueryFromURL(el.href)
)

Again, since the selectors are used inside the post context we don’t need to get overly specific with their structure.

“Min read”

const minReadElement = post.querySelector<HTMLSpanElement>('span[title]')
const minRead = safeGet(minReadElement, (el) => el.title)

Here we use a somewhat different selector: we look for a <span> that has a title attribute.

Note: the selector matches on the title attribute, while inside the callback we read the same value through the element’s .title property, so it is worth keeping the two apart.

Ok, we have now scraped all the articles currently displayed on the page but how do we scroll to load more articles?

Scroll to load more articles

// scroll to the bottom of the page
await page.evaluate(() => {
  window.scrollTo(0, document.body.scrollHeight)
})

// wait to fetch the new articles
await page.waitFor(7500)

We scroll the page to the bottom and wait for 7.5 seconds. This is a “safe” time: the articles could load in 2 seconds, but we would rather be sure that all posts are loaded than miss some. If time were an important factor, we would probably intercept the request that fetches the posts and move on as soon as it completes.

When to end the scraping

If the posts were strictly sorted by date, we could stop the scraping the moment we came across an article from 2017. However, since old articles occasionally show up in between the articles from 2018, we cannot do this. What we can do instead is filter each batch of scraped articles for those published in 2018 or later. If the resulting array is empty, we can safely assume that there are no more articles we are interested in. In matchingArticles we keep the articles that were posted in 2018 or later, and in parsedArticles we keep only the articles that were posted in 2018.

const matchingArticles = scrapedArticles.filter((article) => {
  return article && new Date(article.date).getFullYear() >= 2018
})

if (!matchingArticles.length) {
  return articles
}

const parsedArticles = matchingArticles.filter((article) => {
  return new Date(article.date).getFullYear() === 2018
})

articles = [...articles, ...parsedArticles]

If matchingArticles is empty we return all articles and thus end the scraping.
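
The stop condition can be tried in isolation on made-up data. The sketch below differs from the code above in one labelled way: it reads the year with getUTCFullYear, on the assumption that the dates are UTC strings, so an article published close to New Year is not moved into another year by the local time zone. `DatedArticle`, `publishedYear` and `splitBatch` are hypothetical names:

```typescript
// Minimal shape for this sketch; the real Article interface has more fields.
interface DatedArticle {
  articleTitle: string
  date: string
}

// Year of publication in UTC.
const publishedYear = (article: DatedArticle): number =>
  new Date(article.date).getUTCFullYear()

// Returns the articles to keep from one batch and whether scraping should stop.
function splitBatch(batch: Array<DatedArticle | null>, year: number) {
  const matching = batch.filter(
    (article): article is DatedArticle =>
      article !== null && publishedYear(article) >= year,
  )

  return {
    shouldStop: matching.length === 0,
    parsed: matching.filter((article) => publishedYear(article) === year),
  }
}

const mixedBatch = splitBatch(
  [
    { articleTitle: 'From 2018', date: '2018-06-15T12:00:00Z' },
    { articleTitle: 'Stray old post', date: '2016-03-01T12:00:00Z' },
    { articleTitle: 'From 2019', date: '2019-02-01T12:00:00Z' },
    null,
  ],
  2018,
)
const oldBatch = splitBatch([{ articleTitle: 'Old', date: '2017-11-20T12:00:00Z' }], 2018)
```

For `mixedBatch` scraping continues and only the 2018 article is kept; the stray 2016 post is skipped without ending the loop. For `oldBatch` nothing matches, so `shouldStop` is true.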

Putting it all together

Here is the entire code needed to get the articles:

const scrapArticles = async () => {
  const createPage = await createBrowser()

  return createPage<Article[]>('https://hackernoon.com/tagged/javascript', async (page) => {
    let articles: Article[] = []
    let offset = 0

    while (true) {
      console.log({ offset })

      // Everything inside page.evaluate runs in the browser,
      // so the helpers have to be defined there as well.
      const scrapedArticles: Article[] = await page.evaluate((offset) => {
        function safeGet<T extends Element, K>(
          element: T | null,
          callback: (element: T) => K,
          // typed loosely so that callers can pass '', 0 or {} as a fallback
          fallbackValue: any = null,
        ): K {
          if (!element) {
            return fallbackValue
          }

          return callback(element)
        }

        const removeQueryFromURL = (url: string) => url.split('?').shift()

        return Array.from(document.querySelectorAll('.postArticle'))
          .slice(offset)
          .map((post) => {
            try {
              const dateElement = post.querySelector('time')
              const date = safeGet(dateElement, (el) => new Date(el.dateTime).toUTCString(), '')

              const authorDataElement = post.querySelector<HTMLAnchorElement>(
                '.postMetaInline-authorLockup a[data-action="show-user-card"]',
              )

              const { authorUrl, authorName } = safeGet(
                authorDataElement,
                (el) => {
                  return {
                    authorUrl: removeQueryFromURL(el.href),
                    authorName: el.textContent,
                  }
                },
                {},
              )

              const clapsElement = post.querySelector('span > button')

              const claps = safeGet(
                clapsElement,
                (el) => {
                  const clapsString = el.textContent

                  if (clapsString.endsWith('K')) {
                    return Number(clapsString.slice(0, -1)) * 1000
                  }

                  return Number(clapsString)
                },
                0,
              )

              const articleTitleElement = post.querySelector('h3')
              const articleTitle = safeGet(articleTitleElement, (el) => el.textContent)

              const articleUrlElement = post.querySelector<HTMLAnchorElement>(
                '.postArticle-readMore a',
              )
              const articleUrl = safeGet(articleUrlElement, (el) => removeQueryFromURL(el.href))

              const minReadElement = post.querySelector<HTMLSpanElement>('span[title]')
              const minRead = safeGet(minReadElement, (el) => el.title)

              return {
                claps,
                articleTitle,
                articleUrl,
                date,
                authorUrl,
                authorName,
                minRead,
              } as Article
            } catch (e) {
              // a post that fails to parse becomes null and is filtered out below
              console.log(e.message)
              return null
            }
          })
      }, offset)

      offset += scrapedArticles.length

      // scroll to the bottom of the page
      await page.evaluate(() => {
        window.scrollTo(0, document.body.scrollHeight)
      })

      // wait to fetch the new articles
      await page.waitFor(7500)

      const matchingArticles = scrapedArticles.filter((article) => {
        return article && new Date(article.date).getFullYear() >= 2018
      })

      // nothing from 2018 or later in this batch: we are done
      if (!matchingArticles.length) {
        return articles
      }

      const parsedArticles = matchingArticles.filter((article) => {
        return new Date(article.date).getFullYear() === 2018
      })

      articles = [...articles, ...parsedArticles]

      console.log(articles[articles.length - 1])
    }
  })
}

Before we save the data in a proper format let’s sort the articles by claps in descending order:

const sortArticlesByClaps = (articles: Article[]) => {
  return articles.sort(
    (fArticle, sArticle) => sArticle.claps - fArticle.claps
  )
}
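
One thing worth knowing is that Array.prototype.sort sorts in place, so the function above also reorders the array it receives. That is fine in this script. If the original order had to be preserved, a non-mutating variant could look like the sketch below (`sortedByClapsDesc` and `Clapped` are hypothetical names, and the data is made up):

```typescript
// Minimal shape for this sketch.
interface Clapped {
  articleTitle: string
  claps: number
}

// Copies the array first, so the caller's array keeps its order.
const sortedByClapsDesc = <T extends { claps: number }>(items: T[]): T[] =>
  items.slice().sort((first, second) => second.claps - first.claps)

const unsorted: Clapped[] = [
  { articleTitle: 'B', claps: 332 },
  { articleTitle: 'A', claps: 222000 },
  { articleTitle: 'C', claps: 18300 },
]

const ranked = sortedByClapsDesc(unsorted)
```

`ranked` lists the articles as A, C, B, while `unsorted` still starts with B.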

Now let’s output the articles to a readable format because so far they only exist inside the memory of our computer.

Output formats

JSON

We can use the JSON format to dump all the data into a single file. Having all the articles stored this way may come in handy sometime in the future.

Converting to the JSON format comes down to typing:

const jsonRepresentation = JSON.stringify(articles)
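
If the file is meant to be read by people as well, JSON.stringify accepts an indentation width as its third argument. A small sketch with a made-up article:

```typescript
// Made-up sample data.
const sampleArticles = [{ articleTitle: 'Example post', claps: 204 }]

// One long line: compact, but hard to read.
const compactJson = JSON.stringify(sampleArticles)

// Indented with 2 spaces: larger, but easy to scan.
const prettyJson = JSON.stringify(sampleArticles, null, 2)
```

Both strings parse back to the same data, so the choice only affects file size and readability.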

We could stop right now with the JSON representation of articles and just copy and paste into our list the articles we believe belong there. But, as you can imagine, this can also be automated.

HTML

The HTML format will surely make it easier to just copy and paste an item from the list than to manually copy everything from the JSON format.

David in his article listed the articles in the following manner:

David’s list format

We would like our list to be in a format like this. We could, again, use Puppeteer to create and operate on HTML elements but, since we are working with HTML, we can just embed the values inside a string. The browser is going to parse them anyway.

const createHTMLRepresentation = async (articles: Article[]) => {
  const list = articles
    .map((article) => {
      return `
        <li>
          <a href="${article.articleUrl}">${article.articleTitle}</a> by
          <a href="${article.authorUrl}">${article.authorName}</a>
          [${article.minRead}] (${article.claps})
        </li>
      `
    })
    .join('')

  return `
    <!DOCTYPE html>
    <html lang="en">
      <head>
        <meta charset="UTF-8" />
        <meta name="viewport" content="width=device-width, initial-scale=1.0" />
        <meta http-equiv="X-UA-Compatible" content="ie=edge" />
        <title>Articles</title>
      </head>
      <body>
        <ol>
          ${list}
        </ol>
      </body>
    </html>
  `
}

As you can see we just .map() over articles and return a string containing the data formatted the way we like. We now have an array with <li> elements - each representing an article. Now we just have to .join() them to create a string and embed it inside a simple HTML5 template.
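
One caveat with embedding values straight into an HTML string: a title that contains characters such as < or & (not unlikely for JavaScript articles) would be interpreted as markup. A possible safeguard, sketched here with the hypothetical helpers `escapeHtml` and `renderListItem` and a made-up title and URL, is to escape the values first:

```typescript
// Replaces the characters that have a special meaning in HTML.
const escapeHtml = (text: string): string =>
  text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;')

// Simplified list item: only the article link, escaped.
const renderListItem = (title: string, url: string): string =>
  `<li><a href="${escapeHtml(url)}">${escapeHtml(title)}</a></li>`

const listItem = renderListItem('Why <b> is bad', 'https://example.com/?a=1&b=2')
```

The ampersand is replaced first so that the entities produced by the later replacements are not escaped a second time.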

Saving the files

The last thing left to do is to save the representations in separate files.

const scrapedArticles = await scrapArticles()
const articles = sortArticlesByClaps(scrapedArticles)

console.log(`Scraped ${articles.length} articles.`)

const jsonRepresentation = JSON.stringify(articles)
// createHTMLRepresentation is async, so its result has to be awaited
const htmlRepresentation = await createHTMLRepresentation(articles)

// fs.writeFileAsync is not built into Node.js; it presumably comes from a promisified fs
await Promise.all([
  fs.writeFileAsync(jsonFilepath, jsonRepresentation),
  fs.writeFileAsync(htmlFilepath, htmlRepresentation),
])

The results

According to the scraper, 894 articles with the JavaScript tag were published on Hackernoon in 2018, which averages about 2.45 articles a day.

Here’s what the HTML file looks like:

<li>
<a href="https://hackernoon.com/im-harvesting-credit-card-numbers-and-passwords-from-your-site-here-s-how-9a8cb347c5b5">I’m harvesting credit card numbers and passwords from your site. Here’s how.</a> by
<a href="https://hackernoon.com/@david.gilbertson">David Gilbertson</a>
[10 min read] (222000)
</li>

<li>
<a href="https://hackernoon.com/part-2-how-to-stop-me-harvesting-credit-card-numbers-and-passwords-from-your-site-844f739659b9">Part 2: How to stop me harvesting credit card numbers and passwords from your site</a> by
<a href="https://hackernoon.com/@david.gilbertson">David Gilbertson</a>
[16 min read] (18300)
</li>

<li>
<a href="https://hackernoon.com/javascript-2018-top-20-hackernoon-articles-of-the-year-9975563216d1">JAVASCRIPT 2018 — TOP 20 HACKERNOON ARTICLES OF THE YEAR</a> by
<a href="https://hackernoon.com/@maciejcieslar">Maciej Cieślar</a>
[2 min read] (332)
</li>

And now the JSON file:

[
  {
    "claps": 222000,
    "articleTitle": "I’m harvesting credit card numbers and passwords from your site. Here’s how.",
    "articleUrl": "https://hackernoon.com/im-harvesting-credit-card-numbers-and-passwords-from-your-site-here-s-how-9a8cb347c5b5",
    "date": "Sat, 06 Jan 2018 08:48:50 GMT",
    "authorUrl": "https://hackernoon.com/@david.gilbertson",
    "authorName": "David Gilbertson",
    "minRead": "10 min read"
  },
  {
    "claps": 18300,
    "articleTitle": "Part 2: How to stop me harvesting credit card numbers and passwords from your site",
    "articleUrl": "https://hackernoon.com/part-2-how-to-stop-me-harvesting-credit-card-numbers-and-passwords-from-your-site-844f739659b9",
    "date": "Sat, 27 Jan 2018 08:38:33 GMT",
    "authorUrl": "https://hackernoon.com/@david.gilbertson",
    "authorName": "David Gilbertson",
    "minRead": "16 min read"
  },
  {
    "claps": 218,
    "articleTitle": "JAVASCRIPT 2018 -- TOP 20 HACKERNOON ARTICLES OF THE YEAR",
    "articleUrl": "https://hackernoon.com/javascript-2018-top-20-hackernoon-articles-of-the-year-9975563216d1",
    "date": "Sat, 29 Dec 2018 16:26:36 GMT",
    "authorUrl": "https://hackernoon.com/@maciejcieslar",
    "authorName": "Maciej Cieślar",
    "minRead": "2 min read"
  }
]

I have probably saved myself a good 7–8 hours by creating a scraper which did all of the tedious, mind-numbing work for me. Once it was done all that was left to do was to review the top articles and choose what to put in the article. The code took about an hour to create whereas copying and pasting all the data by hand (let alone saving in both HTML and JSON formats) would easily take a lot more.

Here is the article, if you are interested in seeing what I chose to put in the list.

Originally published at www.mcieslar.com on January 7, 2019.
