Tales From The Blockchain: Our Mainnet Stalled; This is How We Rescued It

Written by nodle | Published 2020/11/01

TLDR: The Nodle Global IoT Wireless Network's main network stalled on October 19, 2020, due to a generalized problem on our Google Cloud infrastructure. After restarting the validators, block production still did not resume and we ran into consensus errors. We ended up forking the network and named this patched version of our node “Phoenix,” in reference to the process of destroying the previous network and then restarting it from what was left behind. The network is now running smoothly on top of the old data.

Early morning CEST on Monday, October 19, 2020, we detected an issue on Nodle’s main network. It appeared that all our validators were offline and block production had stopped.
This seemed to be due to a generalized issue with our infrastructure hosted on Google Cloud. After restarting the validators, block production still did not resume, and we encountered consensus errors. This story covers what happened and how we solved it.
Just like a Phoenix, our chain had to come back from its own ashes 🔥

First, There is a Consensus

A typical production chain built with Parity Technologies' Substrate framework produces and finalizes blocks by combining two consensus algorithms: BABE and GRANDPA. BABE produces blocks, and GRANDPA finalizes them. In our case, block production had halted, so the issue was related to BABE.
Understanding BABE
For the uninitiated, BABE produces blocks in epochs. It expects at least one block to be produced during every epoch, though usually many more are. These epochs depend on a universal parameter we all know and use: time.
In our case, since all our validators had been stopped for too long, we had skipped one or more epochs and thus broken the assumptions of the consensus algorithm.
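To make the role of time concrete, here is a minimal Rust sketch of how a BABE-style node derives its current slot and epoch purely from wall-clock time. The slot duration, epoch length, and example values below are illustrative, not Nodle's actual chain parameters:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Illustrative parameters; real values come from the chain's BABE configuration.
const SLOT_DURATION_MS: u64 = 6_000;  // one slot every 6 seconds
const EPOCH_LENGTH_SLOTS: u64 = 600;  // slots per epoch (about one hour here)

/// BABE derives the current slot purely from wall-clock time, so a node that
/// was offline for long enough "wakes up" several epochs ahead of the chain.
fn current_slot() -> u64 {
    let now_ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set after the Unix epoch")
        .as_millis() as u64;
    now_ms / SLOT_DURATION_MS
}

fn current_epoch() -> u64 {
    current_slot() / EPOCH_LENGTH_SLOTS
}

fn main() {
    // If the last authored block was in epoch N and the clock now says we are
    // in epoch N + k with k > 1, the epoch transitions in between were never
    // announced on-chain, which is the situation that stalled our network.
    let last_authored_epoch: u64 = 42; // illustrative value
    let epochs_skipped = current_epoch().saturating_sub(last_authored_epoch + 1);
    println!(
        "slot: {}, epoch: {}, epochs skipped: {}",
        current_slot(),
        current_epoch(),
        epochs_skipped
    );
}
```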

Then, From the Ashes, the Chain Returned

We were in fact able to find one prior occurrence of a similar issue, on the Kusama Network. It happened in January 2020 and was detailed in a blog post by Gavin Wood. The option Parity chose was to revert the chain by a few blocks and create a sort of ‘time machine’ on the servers.
Unfortunately, we were unable to understand exactly what this required and the best way of doing it.
Another solution could have been to revert the chain by a few blocks and configure all the validators’ clocks to make them think they were running in the past. This was possible in our case since the Nodle Chain still runs as a ‘Proof of Authority’ network whose nodes we control. However, we deemed this option impractical, as we still could not work out all of its requirements, and we were not confident the chain would restart even after these efforts.
After talking to a few people from Parity and other teams in the Substrate Builders Program, we came up with an alternative, simpler plan: we would fork the network. This involved the following:
  1. We would write some runtime migration code to make sure that any scheduled item was updated to the correct block numbers (for instance, we had to recompute some vesting grant schedules; a simplified sketch of this kind of rebasing follows the list).
  2. We would duplicate the state of the main network and create a new chain spec for the new network.
  3. We would test the new chain spec to make sure it was identical to the stalled network’s state.
  4. We would issue an upgrade to the Nodle Node software to include the chain spec for the new network.
  5. We would stop our validators and restart them on the forked network.
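To give a feel for step 1, here is a simplified, self-contained Rust sketch of the kind of rebasing such a migration has to perform on a vesting grant. The VestingSchedule struct, its fields, and the rebasing policy are simplified illustrations, not Nodle's actual pallet code:

```rust
/// Simplified model of a vesting grant: `locked` tokens start unlocking at
/// `start_block`, releasing `per_block` tokens every block.
#[derive(Debug, Clone, PartialEq)]
struct VestingSchedule {
    locked: u128,
    per_block: u128,
    start_block: u32,
}

/// Rebase a schedule expressed in the old chain's block numbers so that it
/// behaves the same on the forked chain, which restarts counting from block 0.
/// `fork_block` is the height at which the old chain stalled.
fn rebase_schedule(old: &VestingSchedule, fork_block: u32) -> VestingSchedule {
    if old.start_block <= fork_block {
        // Vesting had already begun: subtract what was already unlocked and
        // continue immediately on the new chain.
        let elapsed = u128::from(fork_block - old.start_block);
        let already_unlocked = old.per_block.saturating_mul(elapsed).min(old.locked);
        VestingSchedule {
            locked: old.locked - already_unlocked,
            per_block: old.per_block,
            start_block: 0,
        }
    } else {
        // Vesting had not started yet: preserve the same relative delay.
        VestingSchedule {
            locked: old.locked,
            per_block: old.per_block,
            start_block: old.start_block - fork_block,
        }
    }
}

fn main() {
    let on_old_chain = VestingSchedule { locked: 1_000, per_block: 1, start_block: 500 };
    let fork_block = 800; // 300 blocks of vesting had already elapsed
    println!("{:?}", rebase_schedule(&on_old_chain, fork_block));
}
```

In a real runtime upgrade, logic of this kind would typically run inside a storage migration that walks every stored schedule; the arithmetic, however, stays the same.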

Finally—Phoenix is (Re)born

This is exactly what we did. We duplicated our chain’s state using a script called fork-off-substrate (which we had to slightly modify), pushed a few pull requests and Docker images, and kept a branch with our working changes for reference.
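Conceptually, the fork-off step exports every storage key/value pair from the stalled chain and injects it into the genesis.raw.top section of the new network's raw chain spec. The real fork-off-substrate tool is a Node.js script and handles many more details; the Rust sketch below only shows the general idea, with hypothetical file names (old-state.json, phoenix-spec.json) and an illustrative list of skipped keys:

```rust
// Assumed dependency in Cargo.toml: serde_json = "1"
use serde_json::Value;
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Storage dump of the stalled chain, e.g. the result of a state RPC call,
    // stored as {"0x<key>": "0x<value>", ...}. File name is hypothetical.
    let old_state: Value = serde_json::from_str(&fs::read_to_string("old-state.json")?)?;
    // Raw chain spec generated for the new (forked) network. Also hypothetical.
    let mut spec: Value = serde_json::from_str(&fs::read_to_string("phoenix-spec.json")?)?;

    let top = spec["genesis"]["raw"]["top"]
        .as_object_mut()
        .ok_or("chain spec is missing genesis.raw.top")?;

    // Keys we deliberately do NOT copy over (illustrative): ":code" would
    // overwrite the new runtime, and consensus storage has to be rebuilt from
    // the new spec's own genesis configuration.
    let skipped_prefixes = ["0x3a636f6465"]; // hex encoding of ":code"

    for (key, value) in old_state.as_object().ok_or("bad state dump")? {
        if skipped_prefixes.iter().any(|p| key.starts_with(*p)) {
            continue;
        }
        top.insert(key.clone(), value.clone());
    }

    fs::write("phoenix-spec.forked.json", serde_json::to_string_pretty(&spec)?)?;
    Ok(())
}
```

The important design point is that the old state becomes the genesis of the new chain, while the runtime code and the consensus configuration come from the freshly generated spec.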
We then restarted our validators with the updated node while keeping the old chain data on a few servers, in case we ever need to check it again. The network was now running smoothly on top of the old data… We simply had to register the validators on the new network again, and we were done.
We decided to name this patched version of our node “Phoenix,” in reference to the process of destroying the previous network and then restarting it from what was left behind.

What Does This Mean for Nodle Cash Holders?

Since we duplicated the previous chain’s data, all balances and transactions have been preserved; we made sure they are not affected by the changes. If anything, Nodle Cash holders should keep an eye out for upcoming updates to the wallet software that make it even more stable by using reinforced nodes.

What Does This Mean for Nodle Chain Node Operators?

Nodle node operators will need to update their nodes to the latest version available on GitHub so that they can synchronize with the new fork. Command-line options have been preserved and do not need to be changed, so this should be a simple update for most operators. One thing to keep in mind is that the network ID has changed, so the chain data will be stored in a new subfolder named after the new ID.
Follow us on Twitter at @NodleNetwork and @nodlecash for news, updates, and any questions you may have, and join the Nodle community on Telegram.

Written by nodle | Nodle connects the physical world to Web3 by using smartphones as edge nodes.
Published by HackerNoon on 2020/11/01