Even though the bear market has been quite strong lately and there may not be much interest in the space, this is still arguably a great time to build new and cool things that may or may not prove useful in the future.
This summer we were contracted by a client who wanted to build an Ethereum wallet for Android and iOS that would enable their users to safely store and send cryptocurrency. An integral part of the application was the ability to view a history of transactions. As it turns out, that is not a simple thing to do without third parties, since the web3 API does not allow querying an account for its past transactions. We explored the option of building our own Ethereum indexer, but after taking time and budget restrictions into consideration we opted for Etherscan.io's set of free APIs to fulfill the task at hand.
But as you can suspect from the title of this article, I've decided to build my own explorer that would hopefully, in the future, solve some of the problems we had with the Etherscan APIs, and, well, teach me a thing or two. First I'll explain the tools I used, the infrastructure needed to kick things off, and what it would take to scale should the APIs become extremely popular. I won't be walking through the code step by step, as I believe most of you are not here for that kind of tutorial.
To kick things off I needed access to an Ethereum node so that I could fetch the transaction data, which would be processed and stored in a regular SQL database server. By saving the data into a good old SQL server I was able to perform various queries on it; mainly, it provided the ability to filter past transactions and display results in a way that is useful for projects like wallets or explorers.
I could scrape data from one of the freely available Infura nodes, and that is exactly what I did in my first attempt. But looking at the console output and timing the results, the Infura node seemed quite slow. I can't really complain, since it's free to use, but after calculating the time it would take me to process all the blocks I decided to set up my own Parity node and see if it improved things. It did. It sped up the whole process dramatically, so it was well worth investing the time into running my own node. You don't need much to set one up: just make sure you have a machine with enough SSD storage and you're good to go. I'm currently running mine on a cloud node with a 160 GB disk; it's slowly running out of space, so I'd recommend getting a server with at least 200 GB if you don't want to run out in the next year.
With the Parity node ready and firewall rules set (I'd recommend you protect your node), it was time to write the code that would query the blocks and process the data. I chose to work with Python, specifically with Django, since it's what I use daily and I wanted to move fast. The whole application is packaged and run inside Docker containers, so scaling the project shouldn't be much of an issue; I'll talk about scaling in more depth a bit later.
The very first step was actually planning how to represent the data in Django. At first glance it seems quite simple: we have a Transaction that contains some data, and that's it. But looking a bit deeper into the actual data stored within an Ethereum transaction, you'll quickly notice that contract calls also output logs, which contain ERC20 token transfers as well as other events.
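To illustrate: ERC20 transfers don't appear as top-level transactions at all. They live in the receipt logs as `Transfer(address,address,uint256)` events, identified by the keccak256 hash of that signature. A minimal sketch of picking them out of a receipt's logs (the sample log below is a made-up illustration shaped like what a node returns, not real data):

```python
# keccak256("Transfer(address,address,uint256)") -- the standard ERC20 event signature
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"

def extract_erc20_transfers(logs):
    """Pull ERC20 Transfer events out of a transaction receipt's logs."""
    transfers = []
    for log in logs:
        topics = log.get("topics", [])
        # A standard Transfer has the signature topic plus two indexed addresses.
        if len(topics) == 3 and topics[0] == TRANSFER_TOPIC:
            transfers.append({
                "token": log["address"],           # the ERC20 contract that emitted it
                "from": "0x" + topics[1][-40:],    # indexed sender (last 20 bytes)
                "to": "0x" + topics[2][-40:],      # indexed recipient (last 20 bytes)
                "value": int(log["data"], 16),     # raw token amount
            })
    return transfers

# Made-up log entry for illustration (a USDT-style Transfer of 100,000,000 units):
sample_logs = [{
    "address": "0xdac17f958d2ee523a2206206994597c13d831ec7",
    "topics": [
        TRANSFER_TOPIC,
        "0x000000000000000000000000" + "11" * 20,
        "0x000000000000000000000000" + "22" * 20,
    ],
    "data": "0x0000000000000000000000000000000000000000000000000000000005f5e100",
}]
```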
The main issue, if I can even call it an issue, with the Etherscan approach is that in order to display a transaction history for a specific wallet one has to make 3 separate requests: one for normal transactions, one for internal transactions, and one for ERC20 token transfers.
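Those three calls map to three separate actions on Etherscan's account API: `txlist` for normal transactions, `txlistinternal` for internal ones, and `tokentx` for ERC20 transfers. A quick sketch of how the three request URLs get built (no network call here, just the URL construction; the API key is a placeholder):

```python
from urllib.parse import urlencode

ETHERSCAN = "https://api.etherscan.io/api"

def history_urls(address, api_key="YOUR_KEY"):
    """Build the three separate requests Etherscan needs for one wallet's history."""
    actions = ["txlist", "txlistinternal", "tokentx"]  # normal, internal, ERC20
    return [
        ETHERSCAN + "?" + urlencode({
            "module": "account",
            "action": action,
            "address": address,
            "apikey": api_key,
        })
        for action in actions
    ]
```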
You can see this on their website as well: each wallet has 3 tabs, and you have to inspect all three of them to see what was really going on with the wallet.
So what I did was model a database that allows me to do just this, with a handful of Django models covering the blocks, the transactions, and the operations extracted from them.
Looking back I could also add Account model that would represent an Ethereum wallet account or Contract account, but for my needs this was more than enough. The above structure would allow me to quickly extend the list of known operations by simply adding a proper processor for it and iterating over the Transaction objects that have already been stored. This way I wouldn’t need to re-query the Parity node again as that is much slower than simply querying the existing SQL database.
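The shape of that schema can be sketched roughly like this. I'm using plain dataclasses here to keep the sketch self-contained; in the project these would be Django models, and the field names are illustrative assumptions rather than the actual definitions:

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    number: int
    hash: str
    timestamp: int

@dataclass
class Transaction:
    hash: str
    block_number: int
    from_address: str
    to_address: str
    value: int          # amount in wei
    gas_used: int
    logs: list = field(default_factory=list)  # raw receipt logs, kept for re-processing

@dataclass
class Operation:
    """One extracted, human-meaningful action (ETH transfer, token transfer, ...)."""
    tx_hash: str
    kind: str           # e.g. "transfer" or "token_transfer"
    from_address: str
    to_address: str
    value: int
    token: str = ""     # ERC20 contract address; empty for plain ETH
```

Keeping the raw logs on the Transaction is what makes it possible to add a new Operation processor later and re-run it over the stored rows instead of re-querying the node.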
The code that is needed to process the blockchain transactions is pretty straightforward. It consists of the following steps: fetch a block from the Parity node, process its transactions and receipts, store them in the database, and extract the higher-level operations from the stored data.
The fetching, processing, storing and extracting operations were all done within 250 lines of code (don't be fooled by the number; it still takes quite some time to get right, even with experience). Because this is a long-running process, I split mine into small tasks that could execute in parallel. Specifically, I used Celery with RabbitMQ and ran the processing with about 10–20 workers. On average I was able to process around 200 messages per second.
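The parallelization boils down to slicing the block range into small chunks and handing each chunk to a worker. The chunking itself is plain Python; the Celery wiring is only hinted at in the comment, and the task name there is hypothetical:

```python
def block_chunks(start, end, size=100):
    """Split an inclusive block range into (first, last) chunks, one per worker task."""
    return [(n, min(n + size - 1, end)) for n in range(start, end + 1, size)]

# In the real project each chunk would be dispatched as a Celery task, e.g.:
#   for first, last in block_chunks(1, latest_block):
#       process_blocks.delay(first, last)   # hypothetical task name
```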
Now, since the data stored in Postgres would probably take up quite a lot of space, I needed to calculate exactly how much, because I didn't want to start the process only to run out of disk space and then have to go through the trouble of dealing with that. After running my scraper locally and analyzing the disk usage, I calculated that all of the 350M+ transactions would need well over 500 GB of disk space. Whoa, that's quite a lot for a database server, but it's not really unmanageable; PostgreSQL can handle it. For this project I'm using a machine with 2 TB of SSD storage, so it's able to hold all that data.
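As a sanity check on that estimate, 500 GB spread across 350M transactions works out to roughly 1.5 KB per transaction, which is plausible once you count the receipt logs and the database indexes on top of the row itself:

```python
transactions = 350_000_000
disk_bytes = 500 * 1024**3           # the ~500 GB estimate from the local run
per_tx = disk_bytes / transactions   # average storage footprint per transaction
print(round(per_tx))                 # roughly 1.5 KB each
```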
As of the time of this article, December 6th, 2018, the disk usage on the Postgres node is 700 GB.
Does it scale? It does. A service like this is mostly storage-heavy and, if it becomes highly popular, possibly read-heavy as well. Writes are all performed by the scraper process, so we can spin up several database replica instances to serve the read requests. In that regard a service like this could scale quite nicely.
The API service could also be scaled easily by simply spinning up more API containers to handle the traffic, and since most of the requests are reads we can lean heavily on caching. The past data will never change, so we can optimize quite well in that regard.
What about the costs? Currently I'm running the whole indexer on 3 hosted machines that cost me around $200 per month, the database node (the one with 2 TB of SSD storage) being the most expensive. So generally it's not *that* expensive to keep it running.
In the end, what I built was an API service that enables me to get the history of a wallet in a structured JSON format with a single API call. Additionally, I can query a wallet for its ERC20 token balances and trigger webhooks when someone sends tokens to or from a specific address. These could be used for push notifications or any other service integration, which is actually quite a powerful feature on its own.
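The webhook part essentially reduces to matching each extracted transfer against the set of subscribed addresses. A simplified sketch of that matching step, with the subscription store as a plain dict and the actual HTTP delivery left out:

```python
def match_webhooks(transfers, subscriptions):
    """Return (url, transfer) pairs to deliver, for any transfer that
    touches a subscribed address on either side."""
    hits = []
    for t in transfers:
        for addr in (t["from"], t["to"]):
            url = subscriptions.get(addr.lower())  # addresses stored lowercased
            if url:
                hits.append((url, t))
    return hits
```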
To give you a bit of perspective on how long you would need if you decided to run a similar experiment on your own: it took me over a month of continuous scraping to process all the data from block 1 to the latest block.
PS: I like building stuff, and not just in the blockchain space; I mainly work on custom business software, and I plan (hope) to write more about it in the future. If you'd like to find out more about future projects, make sure to follow me here ;)