Node monitoring is more than one single thing.
It is both a strategy and a philosophy, and not just a bunch of fancy technology.
If you guarantee that the services you provide are stable and reliable, it isn’t enough to know when things aren’t working; you need to be able to catch them before they stop working.
Certainly, part of that is designing stable and reliable systems, but without monitoring to go with it, there is no way to guarantee that your systems are robust enough to weather events outside your control, which are inevitable.
On top of that, you have to make your systems scalable so you can double, triple, quadruple (or more!) the number of nodes you are running at any given time.
You have to make your architecture extensible so you can readily onboard other blockchain protocols and technologies.
This can’t just be done with high tech telemetry systems and flashy dashboards; you need a coherent strategy to go with it.
One of the hardest parts of designing a monitoring system is predicting everything that “might” go wrong. That only comes from experience, not just over time, but over a very broad range of platforms and technologies.
When it comes to implementation, it gets even harder; you have to triage failures against a matrix of likelihood and impact. What is the most likely source of failure? Is there a catastrophic but very unlikely failure that is arguably more important to detect, or even predict?
These are the sorts of things that are very difficult for an individual to design for and implement on their own. An individual rarely has that breadth of experience; sometimes it’s best to outsource it to a diverse team that has seen it before and has the background to inform decisions around extremely challenging tradeoffs with no convenient solution.
To make matters worse, the developers of the blockchain protocol themselves may not have put in the effort required to get even minimal monitoring working; they have other things to worry about, and ops (let alone monitoring) generally isn’t at the top of their list.
This means an individual often has to fill in the gaps themselves. Upstream monitoring/ops tooling may not exist, or may be buggy, incomplete, or outright unusable. The onus is then on the node maintainer to close those gaps, file a bug with the upstream blockchain team, or even submit a merge request!
This has its own challenges, not just from a time/expertise/effort perspective, but due to political difficulties in getting a change formally made upstream if you don’t want to be forced into constantly maintaining your own fork of things.
When you work with an established node provider, they likely already have relationships in place with core blockchain developer teams to make those sorts of things happen without having to maintain a local fork.
A great example of this is Blockdaemon’s relationship with the Solana team, who help us improve our monitoring by working with us directly to constantly refine their API. We’ve built great relationships with other teams as well; Hyperledger Fabric, Mobilecoin, and Harmony are a few examples.
There are things all blockchains have in common, but there are always just as many things (or more) that are completely different.
Even things as simple as disk or CPU utilization may seem completely blockchain independent, but they are not.
They often turn out to be impossible to monitor the same way across different blockchains, since one level of CPU utilization, for example, is expected for one blockchain but not another. A high load may indicate something is wrong on one blockchain, while a low load might be a sign of failure on another.
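To make that concrete, here’s a rough sketch of per-protocol alert rules; the protocol names, thresholds, and directions are placeholders for illustration, not real values from our systems:

```python
# Sketch: the same CPU metric, different alert rules per protocol.
# The names, thresholds, and directions below are illustrative placeholders.

CPU_ALERT_RULES = {
    # (low_watermark, high_watermark): alert if load falls outside this band
    "chain_a": (0.10, 0.95),  # normally busy; near-idle CPU suggests a stalled node
    "chain_b": (0.00, 0.60),  # normally quiet; sustained high CPU suggests trouble
}

def cpu_alerts(protocol: str, load: float) -> list[str]:
    low, high = CPU_ALERT_RULES[protocol]
    alerts = []
    if load < low:
        alerts.append(f"{protocol}: CPU {load:.0%} unusually low, node may be stalled")
    if load > high:
        alerts.append(f"{protocol}: CPU {load:.0%} unusually high, node may be struggling")
    return alerts

if __name__ == "__main__":
    print(cpu_alerts("chain_a", 0.05))  # low load is the anomaly here
    print(cpu_alerts("chain_b", 0.85))  # high load is the anomaly here
```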
And even when several blockchains have some protocol feature in common, they don’t always implement it the same way. Still, it makes sense to find common ground where it exists and design in a way that unifies how those things “look” from a monitoring perspective, so you aren’t constantly reinventing the wheel, even if it takes quite a bit of glue or translation layers to make that happen.
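One way to picture that glue is a thin adapter per protocol behind a common interface. This is only a sketch, and the class and method names are hypothetical rather than any existing library:

```python
# Sketch: one common monitoring "shape", with per-protocol glue behind it.
from abc import ABC, abstractmethod

class NodeAdapter(ABC):
    """Common view of a node, regardless of the underlying protocol."""

    @abstractmethod
    def block_height(self) -> int: ...

    @abstractmethod
    def peer_count(self) -> int: ...

class ChainAAdapter(NodeAdapter):
    # Imagine this wraps chain A's JSON-RPC API; values are hardcoded stand-ins.
    def block_height(self) -> int:
        return 1_234_567

    def peer_count(self) -> int:
        return 42

class ChainBAdapter(NodeAdapter):
    # Imagine this wraps chain B's CLI or REST API instead.
    def block_height(self) -> int:
        return 9_876_543

    def peer_count(self) -> int:
        return 17

def collect(nodes: dict[str, NodeAdapter]) -> dict[str, dict[str, int]]:
    # The collector and dashboards only ever see the unified shape.
    return {name: {"height": n.block_height(), "peers": n.peer_count()}
            for name, n in nodes.items()}

if __name__ == "__main__":
    print(collect({"chain-a-1": ChainAAdapter(), "chain-b-1": ChainBAdapter()}))
```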
This varies considerably. Some nodes are truly indistinguishable from others, and have literally no sensitive information on them. Even if they are compromised, the damage an attacker can do is no different than compromising any other non-blockchain related machine; particularly if there are no credentials or secrets kept on-node, only the public ledger and maybe some associated public keys.
On the other hand, some nodes might literally have the keys to the kingdom on them. Those have to be locked down as best you can. They may have credentials stored on them that, if compromised, give an attacker direct access to significant cryptocurrency funds, or let them impersonate a validator or permissioned participant and collude to cause a fork or even take over a network.
There is also everything in between, where, for example, you might have cold keys stored off-node, and slightly less sensitive keys stored on-node that can be easily rotated out or revoked by taking those cold keys temporarily out of storage and using them to rotate, revoke, or re-sign. Or, perhaps, the keys on a node can only be used to access a small, insignificant cryptocurrency account which is only needed for transaction fees and the like.
In any case, of course, no matter how sensitive the keys on a given node might be, you want to treat security across your entire system with equal attention, since you never know what damage can be done if a node you thought was low risk becomes compromised. For example, if a node holds only a list of public “trusted” keys and no private keys, an attacker can still rewrite that list and convince your node to trust something it shouldn’t.
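One cheap mitigation for exactly that scenario is to monitor the integrity of those on-node files. Here’s a minimal sketch, with the file path and known-good digest as placeholders:

```python
# Sketch: alert if a "trusted keys" file on the node no longer matches
# a digest recorded out of band. Path and digest are placeholders.
import hashlib
from pathlib import Path

EXPECTED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"
TRUSTED_KEYS_PATH = Path("/etc/mynode/trusted_keys.json")

def trusted_keys_tampered() -> bool:
    digest = hashlib.sha256(TRUSTED_KEYS_PATH.read_bytes()).hexdigest()
    return digest != EXPECTED_SHA256

if __name__ == "__main__":
    if trusted_keys_tampered():
        print("ALERT: trusted key list changed unexpectedly")
```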
Worst case, you have actual custody of private customer keys. This is something one should never try to do without serious consideration, since it poses the highest of all risks. In general, an independent node operator should ideally only have custody of their own keys, never customer keys.
Be wary of any organization or individual that would like a copy of your private keys. Keep them to yourself!
From an engineering perspective, it helps you plan for scaling events, optimize resource usage, plan timely upgrades, and find bugs that would be of interest to the upstream developers.
From a customer service perspective, it seeds the knowledge base that your customer service team uses to stay informed, and provides valuable insights that inform all communications with customers.
From a business perspective, it helps you estimate costs, identify trends, make predictions about customer demand and the market in general. It can also provide insight into which blockchains are going to be problematic to make into a viable product because they require constant care and feeding, or ones that can be relied on to provide solid, reliable, predictable revenue with minimal ongoing maintenance effort.
But probably the single most important thing mining your monitoring data (both historical and real time) can provide is ample warning. The best failures are the ones you avoided entirely because you predicted they could happen, and took steps to prevent. The best successes are not happy accidents, but the ones you invested in up front and now pay proportional dividends as predicted.
Absolutely. Proof of Stake has made monitoring considerably more complex; there are many more things that can go wrong with a staked validator compared to a simple non-mining watcher node on a PoW network that is just relaying blocks and submitting transactions. Just keeping a PoS validator “mostly” running isn’t good enough; your revenue is directly related to not only uptime but also raw performance, since every block you miss or fail to sign is money left on the table.
Two main categories: generic instance monitoring, and protocol-specific monitoring. CPU usage, disk IO utilization, disk space, and RAM/RSS usage are the bare minimum when monitoring any instance (which applies to all protocols). However, alert thresholds may vary based on the type of node you are running.
For example, if you are running an archive node, you are pretty much guaranteed you are going to keep adding disk space in perpetuity, so an alert that tells you roughly “time until full” is much more useful than an alert that says “you have 10G left” or “you’re at 90% disk space”. On the other hand, for a node configured to keep its history pruned, you know the disk usage should stay more or less constant, so you just need to know if it goes over that expected usage, and if it does, likely something went wrong.
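A rough “time until full” estimate can be as simple as extrapolating from the recent growth rate; the numbers and sampling interval in this sketch are made up for illustration:

```python
# Sketch: estimate time-until-full for an ever-growing archive node disk
# from recent usage samples. The numbers below are made-up illustrations.

def hours_until_full(samples_gb: list[float], interval_hours: float, capacity_gb: float) -> float:
    """Linear extrapolation from the average growth rate across the samples."""
    growth_per_hour = (samples_gb[-1] - samples_gb[0]) / ((len(samples_gb) - 1) * interval_hours)
    if growth_per_hour <= 0:
        return float("inf")  # not growing; a plain threshold alert is enough
    return (capacity_gb - samples_gb[-1]) / growth_per_hour

if __name__ == "__main__":
    usage = [880.0, 892.5, 905.0, 918.0]        # GB used, sampled every 6 hours
    eta = hours_until_full(usage, interval_hours=6, capacity_gb=1000.0)
    if eta < 24 * 14:                           # alert with two weeks of runway left
        print(f"ALERT: disk projected to fill in {eta:.0f} hours")
```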
For protocol specific monitoring, you want to know if the node is synced or not, which can be tricky because ideally you need a block height (or chain length) reference that isn’t in your own infrastructure, and probably more than one to boot. Monitoring the number of peers connected is also a must.
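In sketch form, the sync check might compare your node’s height against the median of a few independent references; the reference heights here stand in for whatever external sources (public explorers, peers outside your own infrastructure) you actually poll:

```python
# Sketch: flag a node as out of sync if it lags the median of several
# external block-height references. The values are hypothetical stand-ins.
from statistics import median

MAX_LAG_BLOCKS = 5  # tolerance depends heavily on the protocol's block time

def is_out_of_sync(local_height: int, reference_heights: list[int]) -> bool:
    # Use the median so one bad or lagging reference can't skew the check.
    return median(reference_heights) - local_height > MAX_LAG_BLOCKS

if __name__ == "__main__":
    local = 1_000_000
    references = [1_000_012, 1_000_011, 1_000_009]  # e.g. public explorers / remote peers
    if is_out_of_sync(local, references):
        print(f"ALERT: node is {median(references) - local} blocks behind")
```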
If you are providing the node as an API endpoint, you need to be watching performance (requests per second, latency and success/failure histograms, etc., which may inform rate limiting or DDoS mitigation parameters). For Ethereum, you need to be watching your Tx queue carefully to make sure transactions aren’t getting stuck in it. For an HA cluster setup, you need to set some ground rules for whether or not a node should be considered eligible for requests. Furthermore, since some transaction sequences aren’t atomic, you may need to set up some sort of stickiness to ensure that requests from the same client go to the same node in the HA cluster.
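For the Ethereum case specifically, a Geth-style node exposes a txpool_status RPC (with the txpool API namespace enabled) that you can poll; the endpoint URL and threshold in this sketch are placeholders:

```python
# Sketch: poll a Geth-style node's txpool_status and alert if the queue
# keeps growing. The URL and threshold are placeholders for illustration.
import json
import urllib.request

NODE_URL = "http://localhost:8545"
MAX_QUEUED = 500  # illustrative threshold; tune per workload

def txpool_status(url: str) -> dict[str, int]:
    payload = json.dumps({"jsonrpc": "2.0", "method": "txpool_status",
                          "params": [], "id": 1}).encode()
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)["result"]
    # Geth returns hex-encoded counts, e.g. {"pending": "0x10", "queued": "0x2"}
    return {k: int(v, 16) for k, v in result.items()}

if __name__ == "__main__":
    status = txpool_status(NODE_URL)
    if status["queued"] > MAX_QUEUED:
        print(f"ALERT: {status['queued']} transactions stuck in the queue")
```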
Minimally, you’ll want to monitor your validator’s own balance (if it’s needed to pay transaction fees for participating in consensus) and its stake level, to ensure, for example, that it stays elected (if there is a stake minimum) and/or stays in the leader schedule, if that applies.
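Those checks boil down to a couple of simple thresholds on values you pull from the node or from chain state; the numbers in this sketch are hypothetical:

```python
# Sketch: minimal fee-balance and stake checks for a validator.
# The numbers and hard minimums are illustrative; real values come from the
# protocol's rules and your own risk tolerance.

MIN_FEE_BALANCE = 2.0        # enough native tokens to keep voting/signing
MIN_ACTIVE_STAKE = 50_000.0  # e.g. the threshold to stay in the active set

def validator_alerts(fee_balance: float, active_stake: float) -> list[str]:
    alerts = []
    if fee_balance < MIN_FEE_BALANCE:
        alerts.append(f"fee balance low: {fee_balance}")
    if active_stake < MIN_ACTIVE_STAKE:
        alerts.append(f"active stake below election threshold: {active_stake}")
    return alerts

if __name__ == "__main__":
    # In practice these two values would be polled from the node's RPC.
    print(validator_alerts(fee_balance=1.4, active_stake=62_000.0))
```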
More importantly, to go back to an earlier question: predicting outages (or pool/staking shortfalls) before they happen becomes even more critical. You can’t afford any down time or APY slippage at all.
If you are an individual node operator, you may need to game an uptime/APY leaderboard to attract (and keep) delegators. Even a 0.1% difference can determine who delegates to you and who doesn’t, and thus impacts baseline return viability. Expenses can be significant, depending on the protocol. Losing half your delegators (or doubling them) can make or break you, and delegators can be fickle!
On the other hand, if you aren’t an independent node operator, you may be operating a large number of nodes, or a few very well staked validators for large clients. Likely they came to you in the first place with the expectation of a certain assured APY from their validators, and any shortfall may fall to you to directly rebate or compensate!
You also may have to constantly rebalance stakes or pool participation to stay competitive or meet an APY target.
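A toy version of that kind of check compares realized performance against the committed target and flags a shortfall early; the reward math here is deliberately simplified and the numbers are placeholders, not any real protocol’s formula:

```python
# Sketch: compare realized validator performance against a committed APY
# target. The reward model (rewards scale linearly with the fraction of
# assigned blocks actually signed) is a deliberate simplification.

TARGET_APY = 0.065   # APY committed to the client (illustrative)
TOLERANCE = 0.002    # acceptable slippage before we intervene

def projected_apy(blocks_signed: int, blocks_assigned: int, ideal_apy: float) -> float:
    if blocks_assigned == 0:
        return ideal_apy
    return ideal_apy * (blocks_signed / blocks_assigned)

if __name__ == "__main__":
    apy = projected_apy(blocks_signed=9_200, blocks_assigned=10_000, ideal_apy=0.068)
    if apy < TARGET_APY - TOLERANCE:
        print(f"ALERT: projected APY {apy:.3%} below target {TARGET_APY:.1%}")
```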
Finally, as you say, you may be outright penalized (slashed) for other sorts of mistakes or failures. For this, early detection is a hard requirement.
Not long ago, OVH had a large-scale fire in one of their DCs (data centers). We caught it almost immediately, had several backup nodes ready to go, and were able to fail over quite quickly.
There are also constant day to day issues that crop up - routers fail, quotas impact scaling, hardware fails, disks become full, blockchain developers release flawed updates that have memory leaks, or cause nodes to have trouble keeping up with network traffic.
The list is never ending, and without monitoring, any and all of these issues can erode customer confidence and possibly even end your very business. We have managed to weather quite a few of these events, keep our partners and customers happy, and minimize direct losses (particularly in Proof of Stake networks) from downtime. All of this thanks to extensive monitoring, alerting, and the ability to do very deep post mortems and data mining of historical data.
It is already extremely challenging, and I do not expect it to get any easier. Not just because of blockchain networks’ sheer size (in terms of chain length, ledger size and plain node count), but also their complexity, especially as incentives to participate and decentralize become more complex.
Monitoring isn’t only about mitigating and outright preventing damage. It is also absolutely required to optimize performance (incentivized by proportional returns) to ensure long term economic viability. To succeed and for the blockchain ecosystem to flourish, we’ll all need to stay on top of our game! I truly see this as a cooperative effort, since the health of the blockchain community is strongly tied to the dedication and commitment of its participants to deploying and maintaining diverse and robust systems.
This article was written by Nye Liu, Chief Engineer at Blockdaemon.