Replicating data from Hacker News (Firebase) to RethinkDB

Image courtesy of the amon.cx blog

A couple of days ago, I posted about my fun side project, called TopHN: basically, a real time display of the Hacker News top stories. That article proved to be really popular amongst the hacker community, and I got plenty of messages and tweets asking me to expand upon the project.

I am more than happy to do this, so I will publish two or three more blog posts that delve a bit deeper into the code and how I built the project.

The Components

There are two main sections to my TopHN project, and they run on two separate virtual servers. One is my RethinkDB server, which I set up on Digital Ocean’s $10/month VPS running Ubuntu Linux. This server also runs a short (~100 lines) Node.js app whose only job is to read the data from the Hacker News Firebase feed and store a replicated copy in RethinkDB.

The other server is the ‘presentation layer’, and also runs a small Node.js app which is built to serve up the front page of the site, and also to listen to the RethinkDB server in real time and push the changed information to the web browsers that are connected.

Today, I want to talk about the first server, which hosts my RethinkDB database, and the ‘feeder’ Node.js app which grabs data from Hacker News.

Installing RethinkDB

I won’t go into the actual installation steps here too much, because there are already great instructions on the RethinkDB site. Basically, once you have a Digital Ocean (or any other) Linux VPS set up, you simply need to get root console access and follow the step by step instructions on the RethinkDB site to get up and running. It is really easy, which is one of the reasons I went with this database for this project.

Once you have installed and started the RethinkDB server, you can access the control panel by going to

http://<your VPS IP Address>:8081

Tip: I would secure this management portal behind a reverse proxy, and set an admin password so that random people on the internet who discover your RethinkDB server can’t just log in and manipulate your data. Once again, excellent instructions for doing so are on their site. I highly recommend that you do this before carrying on, but if time is short, you can continue with the instructions in this post and come back to this later.

Setting Up The Data Tables

There are a couple of ways you can do this: via code, or manually in the management console. I am going to do it in the console because it is really a ‘once off’ exercise, and saves a few lines of code which may confuse people later.

In the management console, click on the ‘Tables’ menu along the top, then click on ‘+ Add Database’. Give the database a name. I will call it hn_data for Hacker News data, but you can call it anything you want (as long as you remember it for later in the code).

Once you have created the empty database, click on ‘+ Add Table’ several times and create three tables, called:

hn_feed — this will contain the actual feed of articles and comments.

hn_lists — this will contain the latest lists of ‘top stories’, ‘best stories’, ‘ask HN’, ‘jobs’ etc. stories from Hacker News.

hn_users — this will contain the user profiles that are read from Hacker News.

No need to populate these tables with data now — we can do this with pure code, in a future step.
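If you would rather go the code route after all, a minimal sketch might look like the following. Treat it as an illustration only: ensureSchema is a hypothetical helper name of my own, and r (the rethinkdb driver module) and conn (an open connection whose user is allowed to create databases and tables) are assumed to be passed in by the caller.

```javascript
// Sketch only: creates the database and tables if they are missing.
// `r` is the rethinkdb driver module, `conn` an already open connection.
// ensureSchema is a hypothetical helper, not part of any library.
async function ensureSchema(r, conn, dbName, tableNames) {
  // dbList() resolves to a plain array of database names
  const databases = await r.dbList().run(conn);
  if (!databases.includes(dbName)) {
    await r.dbCreate(dbName).run(conn);
  }

  // tableList() resolves to a plain array of table names in that database
  const existing = await r.db(dbName).tableList().run(conn);
  for (const name of tableNames) {
    if (!existing.includes(name)) {
      await r.db(dbName).tableCreate(name).run(conn);
    }
  }
}

// Example call (assuming `r` and `conn` exist):
// ensureSchema(r, conn, 'hn_data', ['hn_feed', 'hn_lists', 'hn_users']);
```

Because the helper checks what already exists first, it should be safe to run more than once.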

Creating A RethinkDB User

Before we go much further, we need to create a unique user for RethinkDB which will have read/write access to this database we just created. This is the user we will be invoking from the Node.js app later to push the Hacker News data into our database.

It is a good idea NOT to use the default admin user for this, but instead create a new user which ONLY has access to this database. That way, if the username is compromised, you can easily change their password to re-secure your feed again.

To create a user, we can use a ReQL query from right within the RethinkDB console. Click on the ‘Data Explorer’ menu option along the top, and enter in the following ReQL command:

r.db('rethinkdb').table('users').insert({id: 'hnfeeder', password: 'verysecretpassword'})

Don’t forget to hit ‘Run’ to execute this command after you type it in. And don’t forget to replace the id and password with ones of your own choosing (and remember them for later).

Next, we want to give this new user full read/write permissions on the hn_data database that we just created, with the following ReQL command:

r.db('hn_data').grant('hnfeeder', {read: true, write: true, config: true});

Don’t forget to hit ‘Run’ again. Now your user hnfeeder has read, write and config rights in the database. Config rights basically mean the ability to create new tables etc., so you can actually set that to false for now because we won’t be doing anything like that via the app at the moment.
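For completeness, the same two steps can also be scripted from Node.js instead of the Data Explorer. This is only a sketch under a few assumptions: createFeederUser is a hypothetical helper name, r is the rethinkdb driver module, and conn is a connection opened as the admin user. It grants config: false, following the note above.

```javascript
// Sketch only: creates a restricted user and grants it access to hn_data.
// `r` is the rethinkdb driver module, `conn` an admin connection.
// createFeederUser is a hypothetical helper, not part of any library.
async function createFeederUser(r, conn, userId, password) {
  // Users live in the `users` table of the system database `rethinkdb`
  await r.db('rethinkdb').table('users')
    .insert({ id: userId, password: password })
    .run(conn);

  // Read and write only; config stays false because the app creates no tables
  return r.db('hn_data')
    .grant(userId, { read: true, write: true, config: false })
    .run(conn);
}

// Example call (assuming `r` and an admin `conn` exist):
// createFeederUser(r, conn, 'hnfeeder', 'verysecretpassword');
```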

That is basically it for the database console at this point.

Installing Node.js

Digital Ocean actually have some great instructions on installing Node.js on Ubuntu in this article. Follow it step by step for best results, and then come back to this post.

The only things we have to add here are the node modules that we need for this project. There are only two: (a) the RethinkDB module and (b) the Firebase module.

First, change to the folder where you will be creating the actual application. I simply created mine in /root but for better security, you might want to create it in /var/app or similar. Let’s stick with /root for now:

cd /root

npm install --save rethinkdb firebase

Creating The Application

Now we are ready to create the Node.js app itself. In the /root folder (or wherever you will be creating the app), create a file called feeder.js, and using your favourite editor, type in (or copy and paste) the following code:

Let’s go through this code and see what is happening.

The first two lines are simply activating the node modules we installed earlier.

Lines 4 to 7 are initiating the Firebase connection to the Hacker News API. You can set the appName to whatever you like, but the databaseURL HAS to be exactly as is.

Line 9 is simply the placeholder for the RethinkDB connection that will be stored in rdbconn. This is used later all over the app for conversations with our RethinkDB server.

Lines 11 to 14 initiate the link to our RethinkDB server. Remember to replace host, user, and password with whatever you set earlier. If you are running this app on the same server as you installed RethinkDB on (as we are), then host can simply have the value localhost.
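To make the connection steps concrete, here is a rough sketch of how they could be wired together. It is not the exact code from feeder.js. It assumes the namespaced Firebase API (initializeApp / database, as in firebase v8 and earlier or the later "compat" builds), and the names openConnections and 'hn-feeder' are my own placeholders. The firebase and rethinkdb modules are passed in as parameters.

```javascript
// Sketch only: opens both connections the feeder needs.
// `firebase` and `r` are the modules loaded via require('firebase') and
// require('rethinkdb'); openConnections and 'hn-feeder' are hypothetical names.
const HN_DATABASE_URL = 'https://hacker-news.firebaseio.com';

async function openConnections(firebase, r, rethinkOptions) {
  // The app name is arbitrary, but the databaseURL must be the HN one
  const app = firebase.initializeApp({ databaseURL: HN_DATABASE_URL }, 'hn-feeder');
  const hn = app.database(); // used later to build feed references

  // r.connect() returns a promise when no callback is supplied
  const rdbconn = await r.connect(rethinkOptions);
  return { hn, rdbconn };
}

// Example call (values are the ones chosen earlier in this post):
// openConnections(require('firebase'), require('rethinkdb'),
//   { host: 'localhost', user: 'hnfeeder', password: 'verysecretpassword' });
```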

That is all the preliminary connection stuff out of the way. Now we are getting to the nitty gritty of the app.

Lines 16 to 23 are where we set up the references to various Firebase feeds for the HN API. We need to set up a reference for each individual feed we want. More information on the feed locations is on the Hacker News API Documentation site, but basically, there is a unique endpoint for each feed they publish.

For instance, the ‘New Stories’ feed is published at /v0/newstories.

In our example, we are really only setting up the New Stories feed, but if you want more feeds, simply add extra lines with the unique feed endpoints that you need.

Line 23 is a pretty important one. This is a special feed from Firebase/HN which contains a list of stories, comments and users that have changed since the last push update. This Updates feed gets pushed out every 20 seconds or so. As you can imagine, with the level of activity on Hacker News, each push of this Updates feed can contain hundreds of article & comment IDs that have changed, and dozens of user profiles that have changed.

This is the feed that we will be mainly listening to, in order to see what is new on Hacker News, and update our local database accordingly.

Lines 26 to 43 are the pullUser() function. This is the function we call to read a single user from Firebase using their user ID and then insert it into our local database. We call this function for every user ID that is pushed out to us in the Updates feed that I spoke about above.

Line 28 here is what calls the Firebase once() function, to read a specific user ID just once. If the user is successfully found, then line 31 does an insert() call to save it to the hn_users table in our local database. Notice the {conflict: "update"} qualifier in this command. This basically means that if the record doesn’t exist in the database, then create it, but if the user ID already exists, then update the existing record instead of throwing an error. It is essentially an “insert or update if exists” command, which makes things so easy to manage and is one of the reasons I have grown to love RethinkDB.

The rest of this function is essentially housekeeping that outputs a console.log message with the user ID that has been added or updated.
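A stripped down sketch of this read-then-upsert pattern could look like the following. It is my own approximation rather than the exact function from feeder.js, so its line numbers will not match the ones quoted above. It assumes usersRef is a Firebase reference to the /v0/user endpoint, r is the rethinkdb module, and conn is the open connection; all three are passed in as parameters.

```javascript
// Sketch only: read one HN user from Firebase and upsert it into RethinkDB.
// `usersRef` is assumed to be a Firebase reference to /v0/user,
// `r` the rethinkdb module and `conn` the open connection.
async function pullUser(usersRef, r, conn, userId) {
  // once('value') reads the node a single time and resolves to a snapshot
  const snapshot = await usersRef.child(userId).once('value');
  const user = snapshot.val();
  if (user === null) {
    return null; // nothing stored under this ID
  }

  // HN user objects carry their own `id`, which becomes the primary key.
  // conflict: 'update' turns the insert into "insert or update if exists".
  const result = await r.db('hn_data').table('hn_users')
    .insert(user, { conflict: 'update' })
    .run(conn);

  console.log('user', userId, 'inserted:', result.inserted, 'replaced:', result.replaced);
  return result;
}
```

The JSON object from Firebase goes into insert() untouched, with no field mapping in between.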

Lines 46 to 67 are the pullItem() function, which does exactly the same as the pullUser() function, but for Hacker News articles and comments. (Note: Articles and comments are stored in the same table in HN.)

You may be thinking at this point — How come we aren’t saving or mapping individual fields when saving to our local database? Well, that is actually really simple to answer. Firebase returns data as JSON structures, and RethinkDB, being a NoSQL system, expects data to be sent to it as JSON structures.

So there is really no extra manipulation to be done. We are simply handballing the JSON data that Firebase returns to us straight to RethinkDB. All fields and their values are sent across ‘as is’.

Lines 70 to 82 are the busiest lines in the whole app. This is the function that waits for the Updates event to be pushed from Firebase. The on("value" ...) function basically waits for a change of values event from Firebase, then runs.

This function checks the incoming Updates feed for two arrays. The items[] array contains a list of article and comment IDs that have changed, and the profiles[] array contains a list of user IDs that have changed. We simply cycle through these arrays and then call pullItem() and pullUser() to read the individual IDs and import them to our local database.
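The shape of that listener can be sketched as below. Again, this is an approximation and not the original code: watchUpdates is a hypothetical name, updatesRef is assumed to be a Firebase reference to /v0/updates, and the two pull functions are passed in as callbacks that take a single ID. The Hacker News API documents the two keys of the Updates object as items and profiles.

```javascript
// Sketch only: fan out every Updates push to the two pull functions.
// `updatesRef` is assumed to be a Firebase reference to /v0/updates;
// `pullItem` and `pullUser` are callbacks taking one ID each.
function watchUpdates(updatesRef, pullItem, pullUser) {
  // Run one pull and log failures so a single bad ID cannot stop the feeder
  const safely = (fn, id) =>
    Promise.resolve()
      .then(() => fn(id))
      .catch((err) => console.error('pull failed for', id, '-', err.message));

  // Fires once with the current value, then again on every change
  updatesRef.on('value', (snapshot) => {
    const updates = snapshot.val() || {};
    (updates.items || []).forEach((id) => safely(pullItem, id));
    (updates.profiles || []).forEach((id) => safely(pullUser, id));
  });
}
```

All pulls for a push are started at once here. With hundreds of IDs per push, you may prefer to queue them instead, which is a design choice this sketch leaves open.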

Lines 87 to 93 are similar, but here the on("value" ...) function listens for an array of new stories that are being pushed out from Firebase. All we do here is grab the array of new story IDs and save them to the hn_lists table for later use.

Important: You need to set up a separate listener function for each feed that you want to save data from. Basically, for every reference that you set up earlier in the app (lines 17 to 23), you need to create an on("value" ...) function here against that reference to read and process the data.

That is IT! A short application of only 100 lines or so, but it does a lot. Let’s run it. Save the file and back at the command line, type in:
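Since every feed needs its own listener, one way to avoid repeating yourself is a small helper like the one below. This is a sketch of my own, not code from feeder.js: watchList is a hypothetical name, and the document shape {id, ids, updated} stored in hn_lists is an assumption, so adapt it to whatever your front end expects.

```javascript
// Sketch only: save the latest ID array of one feed into hn_lists.
// `listRef` is a Firebase reference such as /v0/newstories, `r` the
// rethinkdb module, `conn` the open connection. The stored document
// shape ({id, ids, updated}) is an assumption for illustration.
function watchList(listRef, r, conn, listName) {
  listRef.on('value', (snapshot) => {
    const ids = snapshot.val() || [];
    r.db('hn_data').table('hn_lists')
      // conflict: 'replace' overwrites the previous list for this feed
      .insert({ id: listName, ids: ids, updated: Date.now() }, { conflict: 'replace' })
      .run(conn)
      .then(() => console.log(listName, 'saved with', ids.length, 'ids'))
      .catch((err) => console.error(listName, 'save failed:', err.message));
  });
}

// One call per feed reference, for example:
// watchList(hn.ref('/v0/newstories'), r, rdbconn, 'newstories');
// watchList(hn.ref('/v0/topstories'), r, rdbconn, 'topstories');
```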

node feeder.js

You should see console messages displaying the users, articles and comments being transferred over.

If you still have your RethinkDB console window open in your browser, you should see the activity graph spike every 15 to 20 seconds as a stream of data is read and saved after being pushed from Firebase.

Tip: You can now set up your Node.js app to run as a service so that it will auto start and run even if your server is rebooted. The details on setting up PM2 on Node.js to run as a service are in this detailed Digital Ocean guide.

Once you have done this, you can essentially ‘fire and forget’ your feed refresh server. Caution: Over time your RethinkDB will fill up, so please ensure your virtual server has enough disk space. You may want to point your RethinkDB data to a separate block storage device. I found that my database increases by at least 100–150MB per day. That is at least 1GB of data every 10 days!
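To get a feel for how long your disk will last, a back-of-the-envelope calculation like this one may help. The function name and the 20,000MB of free space are made up for illustration; only the 100–150MB per day growth range comes from my own server.

```javascript
// Sketch only: estimate how many whole days until the disk is full.
// daysUntilFull is a hypothetical helper; the inputs are in megabytes.
function daysUntilFull(freeMb, growthMbPerDay) {
  if (!(growthMbPerDay > 0)) {
    throw new RangeError('growthMbPerDay must be a positive number');
  }
  return Math.floor(freeMb / growthMbPerDay);
}

// With a hypothetical 20,000MB of free space:
console.log(daysUntilFull(20000, 150)); // worst case of the observed range
console.log(daysUntilFull(20000, 100)); // best case of the observed range
```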

Check The Data

You might be a bit skeptical that everything is working as it should be, and I don’t blame you. This all seemed too easy, didn’t it? ;)

Well, there is an easy way to check what is happening, and this is via the Data Explorer tab in the RethinkDB console again. You can run simple ReQL queries to check the tables. For example, to read the first 40 or so records from the hn_feed table, you can run the following:

r.db('hn_data').table('hn_feed')

which should show you something like the following:

You can even do fancy queries, like return the last 10 articles by reverse date order by querying:

r.db('hn_data').table('hn_feed').orderBy({index: r.desc('time')}).limit(10)

(Note: For this to work, you have to create a secondary index on the time field in RethinkDB.) You can create a secondary index directly in ReQL via this command:

r.db('hn_data').table('hn_feed').indexCreate('time')

Please be patient — it can take a few minutes to fully index the table (you can check the reindexing progress from the dashboard), but after that, the above sorted ReQL query should work.
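If you want to run the same check from Node.js rather than the Data Explorer, a sketch along these lines should do it. The helper name latestItems is hypothetical, and r and conn are assumed to be the rethinkdb module and an open connection. Note that creating the index needs config rights on the table, so the restricted feeder user may not be able to do it.

```javascript
// Sketch only: make sure the `time` index exists, wait until it is
// ready, then fetch the newest items. latestItems is a hypothetical
// helper; `r` is the rethinkdb module, `conn` an open connection.
async function latestItems(r, conn, count) {
  const table = r.db('hn_data').table('hn_feed');

  const indexes = await table.indexList().run(conn);
  if (!indexes.includes('time')) {
    await table.indexCreate('time').run(conn);
  }
  // indexWait() only resolves once the index has finished building
  await table.indexWait('time').run(conn);

  const cursor = await table
    .orderBy({ index: r.desc('time') })
    .limit(count)
    .run(conn);
  return cursor.toArray();
}

// Example call (assuming `r` and `conn` exist):
// latestItems(r, conn, 10).then((items) => console.log(items.length));
```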

Conclusion

That is it! You now have a fully working RethinkDB server which is busy replicating the data from Hacker News in semi real time. The next article I post will discuss what we actually DO with this data, i.e. display it in a real time web page using Vue.js as the front end framework. Catch you then!