Lying to the Blockchain: Applying the Garbage In, Garbage Out Problem to Decentralized Networks

In this article, we address a question that is often overlooked (mostly intentionally): how real-world data interacts with blockchains.

Like any other system, blockchain suffers from the classic “garbage in, garbage out” problem. Blockchain infrastructures cannot guarantee the veracity of data that was not natively generated on-chain and is not publicly available, which unfortunately describes the vast majority of the data in this world. Hence, if a person (or a device) commits fraudulent data to the blockchain, there’s no way to detect it, and the fraudulent data ends up permanently committed to the blockchain’s history. If you put garbage into the blockchain, you get garbage out of the blockchain.

Purported applications that ignore this problem are rampant today, often wrapped in additional layers of technology that give a facade of correctness. Here are a few examples:

Decentralized data markets: where companies are incentivized with tokens to put their data up for sale. But how do you know the data being bought is real?

Privacy-preserving queries: a service that counts a bank’s high-net-worth customers via a zero-knowledge range proof, so you get a count without the bank revealing any of its customers’ information. But how do you know the bank isn’t fabricating its entire customer database?

For publicly available data, you can design a game in which players with financial stakes at risk challenge one another on the veracity of the data provided, as Chainlink and some other blockchain projects do. But again, the vast majority of the world’s data is not publicly available.

So what can be done to address this? The key is to secure the data at the source.

Securing the Source: A Very Practical Approach

If data is acquired not at the source but through a third-party intermediary, its veracity can no longer be trusted without also trusting the intermediary. The more intermediaries handle the data, the more parties you have to trust, until at some point so many are involved that the data might as well have come from a random number generator.
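To make the compounding concrete, here is a toy calculation. It assumes, purely for illustration, that each intermediary independently passes the data along unaltered with the same probability. Neither the independence assumption nor the 0.95 figure comes from any real measurement, and the function name is hypothetical.

```python
def end_to_end_trust(per_hop_trust: float, hops: int) -> float:
    """Probability that data survives `hops` intermediaries unaltered.

    Assumes (hypothetically) that hops are independent and equally reliable.
    """
    if not 0.0 <= per_hop_trust <= 1.0:
        raise ValueError("per_hop_trust must be between 0 and 1")
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return per_hop_trust ** hops


if __name__ == "__main__":
    # 0.95 is an illustrative per-intermediary reliability, not a measured one.
    for hops in (0, 1, 5, 10, 50):
        print(hops, round(end_to_end_trust(0.95, hops), 3))
```

Under these assumptions, ten intermediaries that are each 95% reliable leave you with roughly 60% confidence in the data, and fifty leave you with less than 8%.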

The goal, then, is to capture the data as close to the source as possible. For example, instead of obtaining sales data from a retailer’s database, get it from the point-of-sale hardware. Instead of subscribing to a feed from a weather website, get it from the weather sensors that collected the data. Instead of reading a PDF report from a bridge operations company, try to get raw data from the video cameras and sensors installed on the bridge.

But how do you secure data at the source? Since most data in this world are either generated or captured by devices, let’s describe how to secure device-generated data. Here we face three potential points of failure:

Identity: how do you know what is generating the data? Is it a temperature sensor, as you expected, or a random number generator run by a malicious player?

Processing & Transmission: even if the data source is real and identifiable, how do you know the data wasn’t altered, corrupted, or outright switched during processing and transmission on the device, e.g., while moving from the sensor to the communication module?

Digital / Analog Interface: even if identity, processing, and transmission are secured, how do you prevent someone from altering the way the device collects data by physically feeding it a fake input signal?

Let’s tackle these one by one and see what can be done.

Identity

To establish a data-generating device’s identity, a public/private key pair can be embedded in the device. Publishing the public key and allowing onsite inspection of the actual hardware’s output are practical, and practiced, ways to ensure that the hardware is what it says it is. But that’s the easy part.

The tricky part is making sure that this identity cannot be stolen, which means the private key must be known only to the device. You can use a secure element (SE), a highly tamper-resistant piece of hardware that generates public/private key pairs within the chip. An SE typically does just one thing: sign messages, which is a fancy way of saying it provides proof of identity. If you’ve ever owned a credit card or a modern smartphone, you’ve benefited from a secure element.
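As a rough illustration of what "signing a message" means, the sketch below implements a Lamport one-time signature using only Python's standard library. This scheme is a stand-in chosen because it needs no third-party packages. Real secure elements typically use elliptic-curve or RSA signatures implemented in hardware, and each Lamport key pair may safely sign only one message. All names and the sensor payload are hypothetical.

```python
import hashlib
import secrets

HASH_BITS = 256  # SHA-256 digest size in bits


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def generate_keypair():
    """Return (private_key, public_key), valid for ONE signature only.

    The private key is 256 pairs of random 32-byte secrets; the public key
    holds the SHA-256 hash of each secret.
    """
    private_key = [(secrets.token_bytes(32), secrets.token_bytes(32))
                   for _ in range(HASH_BITS)]
    public_key = [(_h(a), _h(b)) for a, b in private_key]
    return private_key, public_key


def _message_bits(message: bytes):
    """Bits of the message's SHA-256 digest, most significant bit first."""
    digest = int.from_bytes(_h(message), "big")
    return [(digest >> (HASH_BITS - 1 - i)) & 1 for i in range(HASH_BITS)]


def sign(private_key, message: bytes):
    """Reveal one secret per digest bit; the revealed secrets are the signature."""
    return [pair[bit] for pair, bit in zip(private_key, _message_bits(message))]


def verify(public_key, message: bytes, signature) -> bool:
    """Check that each revealed secret hashes to the matching public value."""
    if len(signature) != HASH_BITS:
        return False
    bits = _message_bits(message)
    return all(_h(sig) == pair[bit]
               for sig, pair, bit in zip(signature, public_key, bits))


if __name__ == "__main__":
    # In a real device the private key would never leave the secure element.
    device_private, device_public = generate_keypair()
    reading = b'{"sensor": "temp-01", "celsius": 2.5}'  # hypothetical payload
    signature = sign(device_private, reading)

    altered = b'{"sensor": "temp-01", "celsius": 9.0}'
    print("genuine reading verifies:", verify(device_public, reading, signature))
    print("altered reading verifies:", verify(device_public, altered, signature))
```

Anyone holding only the public key can check that a reading came from the device and was not changed afterwards, while producing a valid signature requires the private key that stays inside the chip.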

Processing & Transmission

To ensure that the data processing and transmission logic is secure, we make use of a microcontroller (MCU) with secure boot (SB). You can think of a microcontroller as a very simple computer. SB ensures that only an entity with the right private key is able to load applications into the MCU. The application logic and associated checksums could be shared ahead of time with relevant stakeholders, or simply open-sourced, so they can be verified after loading.
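The stakeholder-side check could look something like the sketch below, which compares the SHA-256 digest of a firmware image against a checksum published ahead of time. It covers only the checksum comparison, not the signature verification that secure boot performs on the MCU itself. The choice of SHA-256, the function names, and the sample firmware bytes are all assumptions made for illustration.

```python
import hashlib
import hmac


def firmware_digest(image: bytes) -> str:
    """SHA-256 digest of a firmware image, as lowercase hex."""
    return hashlib.sha256(image).hexdigest()


def matches_published_checksum(image: bytes, published_hex: str) -> bool:
    """Compare the image's digest with the checksum shared ahead of time."""
    expected = published_hex.strip().lower().encode()
    actual = firmware_digest(image).encode()
    # compare_digest avoids leaking where the two values first differ.
    return hmac.compare_digest(actual, expected)


if __name__ == "__main__":
    # Hypothetical firmware bytes; a real image would be read from a file
    # or dumped from the device.
    released_image = b"\x7fFIRMWARE v1.0 application logic"
    published = firmware_digest(released_image)  # what would be published

    print("released image matches:",
          matches_published_checksum(released_image, published))
    print("modified image matches:",
          matches_published_checksum(released_image + b"\x00", published))
```

Because any single-byte change to the image produces a different digest, a stakeholder who trusts the published checksum can detect a swapped or patched application.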

More critically, once the application has been thoroughly tested, we need to disable all modification functionality in the application and the MCU, including firmware upgrades. This ensures that the application logic is immutable and cannot be changed even by the manufacturer.

This creates obvious disadvantages, such as the fact that the application can no longer be upgraded. But in return, we have gained true device independence (in conjunction with the SE) from outside interference, with perfectly deterministic and unalterable behavior that could be trusted.

Digital / Analog Interface

This problem is tricky, and cannot be solved using hardware embedded in the data collection and relay device. Creative mechanisms must often be devised to ensure that the interface is not disrupted, and they are highly application-specific. Let’s use an example.

Suppose you have a refrigeration truck that’s part of a fleet from a cold chain logistics company, tasked with delivering fresh fish to local supermarkets. To ensure quality, the fish must remain within a certain temperature range. If the temperature is too high, the fish could spoil. If the temperature is too low, the fish could end up with inferior taste and texture. To ensure that the logistics company adheres to the contractual temperature range, the supermarket puts a temperature sensor in the truck.

But what if the truck driver takes the sensor and puts it inside an ice cooler at the front of the truck, while he dials up the temperature in the refrigeration unit to save energy costs? The sensor has no idea it has been moved and keeps collecting and reporting data that are within the contractually agreed-upon temperature range. The sensor has been duped.

One way to mitigate this risk is to hardwire the sensor into the refrigeration unit so it is nearly impossible to remove. But even this could be circumvented by, say, wrapping a bag of ice around the sensor while keeping the rest of the truck above the contractual temperature range.

Another, potentially better (but far more expensive) way is to put a tamper-resistant seal on each package of fish, with a separate temperature sensor in every package. If the driver tries to take out the temperature sensor, they would need to break the seal, something that’s easily detectable and likely to breach key terms of the contract.
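On the receiving end, the supermarket’s check could be as simple as the sketch below, which flags any package whose seal was broken or whose reading fell outside the contractual range. The record layout, the function name, and the 0 to 4 °C range are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackageReading:
    package_id: str
    celsius: float
    seal_intact: bool


def find_violations(readings, min_celsius: float, max_celsius: float):
    """Return (package_id, reason) for every reading that breaks the contract."""
    violations = []
    for r in readings:
        if not r.seal_intact:
            # With a broken seal, the reading can no longer be tied to the fish.
            violations.append((r.package_id, "seal broken"))
        elif not (min_celsius <= r.celsius <= max_celsius):
            violations.append((r.package_id, "temperature out of range"))
    return violations


if __name__ == "__main__":
    readings = [
        PackageReading("pkg-001", 1.5, True),
        PackageReading("pkg-002", 1.2, False),  # in range, but seal was broken
        PackageReading("pkg-003", 7.8, True),   # too warm
    ]
    # The 0.0 to 4.0 range is an assumed contractual range, not a standard.
    for package_id, reason in find_violations(readings, 0.0, 4.0):
        print(package_id, reason)
```

A broken seal is reported ahead of the temperature check because an in-range reading from a sensor that has left its package proves nothing about the fish.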

In short, resolving the digital/analog interface problem takes a lot of creativity, and the solutions tend to be highly application-specific.

The bottom line? When presenting a blockchain solution to enterprise clients, it's vital to remember that blockchain alone is useless when applied to IoT if the broader issue of device independence has not been addressed. We need to ensure that the data generated by these edge devices can be trusted and is completely free from outside influence.