Eric Wang is a co-founder at the Archon Cloud, a blockchain-based file storage system, where he leads research and other efforts.
TL;DR: This article describes the emergence and potential impact of file storage systems based on blockchain technology. Blockchain-based File Storage (BFS) is a promising alternative to both centralized storage and non-incentivized P2P file systems. If BFS can overcome the many usability and technological challenges it faces, it can potentially become the de facto storage infrastructure, catalyzing a decentralized internet.
There is recent focus on decentralized systems because they promise new classes of applications free from intermediaries, create new economies of scale and offer unprecedented user control of data. Smart contract functionality offers users the ability to create applications that grant these advantages. With this tool in hand, people increasingly romanticized a new internet known as the decentralized internet (or web3), where apps will be even more powerful and spectacular than they are now. Decentralized applications would be developed over an infrastructure of economically secured blockchain systems.
It soon became apparent that, as “decentralized operating systems”, blockchains cannot themselves handle applications more resource-hungry than Cryptokitties without improving many things. These events created a public frenzy to innovate and improve on decentralized infrastructure, so that useful decentralized applications could be produced.
For example, the blockchain by itself is a very poor storage apparatus. One of the important realizations was that one cannot store files economically on-chain [1]. Indeed, a ledger shared by thousands of users, where each piece of data is replicated for every user, is not practically capable of hosting data beyond the megabyte scale.
Therefore, many important developments within decentralized systems should and do happen outside the blockchain — certain layer 2 solutions, private P2P networking solutions, storing files etc. The actual blockchain and other separate components sum up to form the decentralized internet.
The web3 stack
Although decentralized systems are potentially revolutionary, not many people will join the decentralized internet if the technology sucks, which means that the decentralized tech stack can’t be that much worse (barring the interesting property of decentralization) than existing solutions.
Currently, no part of the stack is complete, with the DNS, storage and computing layer development being especially early. A typical mantra is “we will build a DApp using a blockchain to take advantage of blockchain immutability etc., and even if we have to store our files in some centralized location it’s OK” [2].
I argue that though it’s possible to build an “application” on top of blockchain that receives solid adoption, these applications aren’t “decentralized applications” without their data being loaded and exchanged in a trustless, decentralized manner. In other words, these are interesting applications on other merits, but not necessarily because they’re truly decentralized.
This brings me to my core thesis:
Blockchain-based File Storage -> Decentralized Data -> Decentralized Internet
BFS will be the candidate to serve as a backbone to web3. I will reinforce my claim that BFS catalyzes a decentralized internet through decentralizing data. Because without truly decentralized data, there aren’t true Dapps, and without true Dapps there is no decentralized internet.
A secure, fair, economically-driven BFS has many potential benefits over other storage solutions, both in general and for web3 infrastructure. I will also analyze the substantial technological and business barriers that stand between BFS and widespread popularity.
In the 1990s, files were stored in and served from local servers and people would simply request data directly from other distinct computers. Although people were able to fully control and serve files on their own, setting up these servers well required time as well as extensive experience in networking and cybersecurity.
This initially didn’t seem to be much of a problem, as the entire Internet was only about 1.5TB by 1997 and the combined value of the Internet was not nearly as large as it is today. A popular belief at the time can be summarized well by a well known computer scientist: “There may be a few thousand petabytes of information all told; and the production of tape and disk will reach that level by the year 2000” [3].
The paradigm changed over the following years, as computer-to-computer interactions generated far more data than humans themselves could and much of this data was very useful for consumers as well as enterprises who found new ways to generate results from data such as in artificial intelligence, ultra high definition video and financial modeling.
Storing, handling and analyzing this data became increasingly important to users, and increasingly difficult for them to manage on their own. As of 2018, roughly 32 ZB (that’s 32,000,000,000 TB) of data is generated annually.
Here’s where Amazon comes in. As Amazon began to establish itself as the major player in the e-Commerce business, they realized that they had developed a huge set of in-house APIs and infrastructure to handle huge amounts of data for their business. Soon, Amazon teams built an entire internal collection of software that saved their various divisions months of time, since they no longer needed to worry about infrastructure [4]. In 2006, Amazon S3 and EC2 were unveiled, heralding the age of the centralized cloud.
The cloud gave users easy access to Amazon’s powerful collection of tools and began to dominate the storage needs of many businesses due to its simplicity and power. Amazon (followed by Microsoft, Google and “private” clouds like Facebook) gained increased control of the data on the internet.
Individual users lost sovereignty and the ability to control their own data. With ~1 point of failure, various large data holders fell victim to a growing number of data breaches and data losses, and high-profile server downtimes cost businesses billions of dollars in lost revenue, along with lost human knowledge and liberty. This, coupled with an increased understanding that large centralized cloud holders might be able to violate personal and business privacy, sparked an ideological and practical movement against data centralization.
As a component of the decentralized internet, then, Amazon S3 is an enticing choice at first glance due to its simplicity and power. Numerous DApps currently use such services to bootstrap their products, promising decentralized data in the future, or they may argue that one can have a decentralized application without decentralized data.
However, I argue that data centralization is a fundamentally crippling problem: applications or infrastructure hosted on the centralized cloud (~1 point of failure) are not decentralized. The internet is simply a bunch of computers transferring files to each other. Centralized storage enables centralized data, which enables a centralized internet.
As a corollary, good decentralized storage ensures the legacy of the decentralized web. Even if many other components of the web3 stack are attacked, data can always eventually be retrieved/communicated later in a trustless manner. Even the public chains themselves are not that decentralized without decentralized storage, since it is very likely that the majority of blockchain full nodes are hosted on very few centralized cloud storage solutions.
Decentralized storage enables decentralized data.
Peer-To-Peer file storage emerged as an alternative to centralized clouds without the ideological and practical dangers of centralization. Five years before Amazon S3, BitTorrent already paved the way by allowing files to be shared amongst peers efficiently. Peer-to-peer applications eventually accounted for ~50% of internet traffic in 2009. Although BitTorrent allowed people to share files with one another, it was not suited to store and discover these files in the same way as Amazon S3 or Dropbox would allow you to; it does not function well as a file storage solution.
IPFS builds on the milestones of BitTorrent to develop a real peer-to-peer, decentralized file storage system. In IPFS, all files are united within one swarm, where there is a common language in a united file system, and all peers are shared within the entire system which allows people to discover and transfer files to each other.
Organizations such as the Internet Archive and many DApps began to experiment with IPFS and use it to store their files, so that they could advertise more of their stack as decentralized. For many initial experimental use cases, IPFS was more than sufficient.
IPFS unites peers in one system, where peers can discover each other through what’s called a distributed hash table (DHT). There is a common language of communication through the IPFS protocol and there is no central point of failure. For these reasons, IPFS is a truly decentralized candidate for the storage backbone of a new decentralized internet. Sure enough, many well known DApps such as OpenBazaar and Augur use IPFS.
There, unfortunately, are fundamental problems that prevent IPFS from expanding significantly beyond the realms of community projects and open source enthusiasts. Here are the most pressing ones:
Many projects have patched the first issue by wrapping files in IPFS nodes hosted by centralized Amazon S3 computers. This means that there are some Amazon S3 nodes that you will host yourself over time to guarantee the existence of your file on IPFS (as long as Amazon itself works as intended). However, there once again is the problem of centralization, defeating the purpose of using IPFS in the first place. To make decentralized data good, what we need is a way to take inspiration from such systems but add an incentive layer as well as stronger security guarantees, so that decentralized data becomes feasible at the same scale as centralized data.
Public blockchains use cryptographic incentives and punishments to steer untrusted user behavior towards a desired consensus. Therefore, an ideal BFS with a strong cryptographic incentive system underpinned by other parts of a decentralized tech stack, such as a secure alternative to DHTs, seems to solve both problems that prevent IPFS from being the de-facto decentralized internet infrastructure.
For a decentralized internet, an ideal file storage solution is equal to if not better than a centralized one, but also decentralized.
In a good scenario, each storage provider in the network has large amounts of storage space to provide and their storage and bandwidth can be efficiently cryptographically guaranteed. BFS enabled innovations and completely new uses of technologies such as erasure coding, proofs of storage and proofs of space. There are many innovative players coming into the scene and dozens of projects with various approaches to innovate both on the technology and product fronts.
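To make the idea of erasure coding concrete, here is a minimal sketch of the simplest possible scheme: k data shards plus one XOR parity shard, so that any single lost shard can be rebuilt from the others. This is an illustration only. The function names are made up for this example, and real BFS projects would use stronger codes (Reed-Solomon style codes that survive many lost shards), not this single-parity toy.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal-length shards and append one XOR parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")  # zero-pad the tail
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = reduce(xor_bytes, shards)
    return shards + [parity]


def recover(shards: List[Optional[bytes]], original_len: int) -> bytes:
    """Rebuild the file when at most one of the k + 1 shards is missing (None)."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("a single-parity code tolerates only one lost shard")
    repaired = list(shards)
    if missing:
        present = [s for s in shards if s is not None]
        # XOR of all surviving shards equals the missing one.
        repaired[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity shard and the padding.
    return b"".join(repaired[:-1])[:original_len]


if __name__ == "__main__":
    original = b"decentralized storage"
    stored = encode(original, k=4)   # 5 shards, each held by a different host
    stored[2] = None                 # one host disappears
    print(recover(stored, len(original)) == original)
```

The storage overhead here is one extra shard for every k, which is far cheaper than keeping full copies of the file on several hosts.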
The collection of individuals and participating professional storage providers adhering to the rules of a blockchain can eclipse the reach and power of any centralized company, even a behemoth like Amazon. Beyond just data exchange without intermediaries, blockchain-based solutions introduce many advantages:
With cryptographic incentives, nodes will be economically punished if they do not store and serve data. Corporations and even governments will be hard pressed to take down files when the system has such high fault tolerance. Because of decentralization, there is no middleman (like Google/AWS) who manages the data on your behalf.
2. Are very resistant to crippling black swan events and file downtime
By storing shards of the file, either through traditional slicing or erasure codes, there are many hosts who hold a file. If there are enough nodes, then the system should be very hard to bring down through natural disasters, human/computer error, or coordinated attacks.
3. Have potential performance benefits over centralized systems
Because many nodes store smaller parts of the file, downloading a file can be parallelized. This is similar to BitTorrent, where parallelized downloads can be many times faster than normal downloads from a centralized cloud.
4. Are likely very cheap and create new economies.
Storage and data are considered by many to be a commodity, and large amounts of hard disk space sit empty. Storage providers can monetize their hard drive assets for useful purposes. Because the cost to store files is very small beyond the initial purchase cost of the hard drive itself, there is very little overhead in hosting files, so storage providers are essentially generating pure profit. Existing solutions have demonstrated the wild cost savings to the user: Sia costs approx. <$2/TB/month compared to S3, which costs $23/TB/month with its standard offering.
For a decentralized internet, an ideal file storage solution is equal to if not better than a centralized one, but is also decentralized. BFS stands to have performance users expect from centralized entities, and the decentralization people adored from IPFS. The main problems that centralized solutions had all stemmed from the fact that they’re centralized. In other words, an ideal BFS is an ideal file storage solution; it should move the substantial mass of people who care about decentralization away from centralized solutions towards using decentralized data since there is relatively little sacrifice to be made in the migration.
Previous sections have built an argument regarding the potential benefits that BFS might have over IPFS and centralized solutions. In reality, the two most popular production-level storage projects combined literally store thousands of times fewer petabytes of data in 2018 than large cloud providers did in 2016, and the total stored capacity in cloud storage is expected to grow significantly over the coming years. After discussing with many individuals within blockchain as well as traditional enterprises and businesses, I reached the conclusion that there is much to do before a decentralized internet can put centralized solutions out of business. Amazon S3 and others have features and optimizations, as well as usability that currently cannot be matched by blockchain-based solutions or IPFS. We will examine both technology and usability issues.
Blockchain-based File Storage systems are still in their youth
Amazon S3 currently has significant advantages in both upload/download performance as well as having a much wider feature set.
As for uploads, decentralized solutions might always have some inefficiencies compared to centralized solutions. Generally, uploads are handled through decentralized marketplaces, where storage providers and storage “buyers” need to be matched in some manner. This provider/buyer matching and communication process, as well as the slower speed of individual nodes compared to enterprise-level centralized computers, is the upload bottleneck. To upload files to a particular person, one often needs a long initialization time (upload latency) the first time a file is uploaded to that person (in the form of some sort of storage contract that is verified and posted on-chain); alternatively, one might first post a file upload transaction (where buyers and sellers are matched) and wait for it to be included in a verified block, once per file, which can take a few seconds to minutes.
Several solutions are underway: parallelized uploading, where different shards or pieces of files are uploaded to different nodes at once to maximize connection bandwidth; long-term buyer-provider contracts; batched off-chain buyer/seller matching and storage negotiations that can be reclaimed later on-chain (layer 2 solutions); and faster consensus and more efficient block propagation.
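As a rough illustration of parallelized uploading, the sketch below slices a file into shards and sends them to several providers concurrently. Everything here is hypothetical: `InMemoryNode` is a stand-in for a remote storage provider, the round-robin placement is an arbitrary choice, and a real client would also have to handle matching, payment and network failures.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class InMemoryNode:
    """Hypothetical stand-in for a remote storage provider."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.shards: Dict[int, bytes] = {}

    def store(self, index: int, shard: bytes) -> str:
        # A real node would receive this over the network.
        self.shards[index] = shard
        return self.name


def split(data: bytes, shard_size: int) -> List[bytes]:
    """Slice data into consecutive shards of at most shard_size bytes."""
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]


def upload_parallel(data: bytes, nodes: List[InMemoryNode],
                    shard_size: int) -> Dict[int, str]:
    """Send shard i to node i % len(nodes), with up to one worker per node.

    Returns a map from shard index to the name of the node holding it.
    """
    shards = split(data, shard_size)
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = {
            i: pool.submit(nodes[i % len(nodes)].store, i, shard)
            for i, shard in enumerate(shards)
        }
        return {i: future.result() for i, future in futures.items()}


if __name__ == "__main__":
    providers = [InMemoryNode(f"node-{i}") for i in range(3)]
    placement = upload_parallel(bytes(range(10)), providers, shard_size=2)
    print(placement)
```

The returned placement map is what a downloader would later need in order to fetch the shards, again in parallel, and reassemble the file.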
Blockchain scalability is also an issue that prevents blockchain systems from increasing their capacity. A back-of-the-envelope calculation shows that if each Tx specifies a 50MB file to be stored and there are 25 storage transactions per block, with a block generated every 30 seconds, then the total system can store only ~1.3 petabytes of data annually (about 1.05 million blocks × 25 transactions × 50MB), which pales in comparison to what large cloud providers are currently storing. In addition, there are many other bottlenecks like slow Proofs of Storage that would prevent the system from achieving the calculated maximum capacity. Layer 2 and other scalability solutions can help with the problem, but cryptographic proofs need to also be more efficient.
Download speeds suffer from similar problems as uploads: download speed and latency issues are caused by buyer/provider matching and communication, and the speed of individual nodes is called into question. Downloaders can pre-pay for downloads (Sia, Storj), or pay for them on demand, per-download (Filecoin). The per-download method requires buyer-seller matching and payment each time, which can take longer than centralized solutions even if the process is done off-chain. Solutions to these problems are similar to solutions for uploads.
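The back-of-the-envelope figure is easy to reproduce, and the unit conversion is the step most worth double-checking. The sketch below uses only the three parameters from the paragraph above and assumes decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes) and a 365-day year; the function and constant names are made up for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year
BYTES_PER_MB = 10**6                   # decimal megabyte
BYTES_PER_PB = 10**15                  # decimal petabyte


def annual_capacity_bytes(file_size_mb: int = 50,
                          txs_per_block: int = 25,
                          block_interval_s: int = 30) -> int:
    """Upper bound on bytes stored per year if every block is full."""
    blocks_per_year = SECONDS_PER_YEAR // block_interval_s
    return blocks_per_year * txs_per_block * file_size_mb * BYTES_PER_MB


if __name__ == "__main__":
    capacity = annual_capacity_bytes()
    print(f"{capacity / BYTES_PER_PB:.2f} PB per year")  # prints "1.31 PB per year"
```

With 1,051,200 blocks per year, the ceiling is 1.314 × 10^15 bytes, and that is before accounting for any of the other bottlenecks.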
Finally, there is the wealth of features that blockchain-based solutions do not have.
For example, currently in all solutions I’ve seen, every downloader must be a registered user within the blockchain and generally has to own tokens, whereas in the centralized cloud, everybody can browse content on their browser or app without any knowledge that they are using the cloud (this is also a usability issue, of course). Current solutions give users the ability to encrypt a file client side but since transactions are public, onlookers can see that pseudonymous addresses are sending hashes of particular files to each other. This can be very tricky for certain companies like genomics companies who might not want others to know whom they are sending data to and any public info about the data (not even the hash).
Also, it is incredibly difficult to design efficient proofs that storage providers have the files they’re supposed to store (Proofs of Storage), or that they are uploading files as they should (Proofs of Upload). In addition, professional services that enterprises have come to expect, like guaranteed Service-Level Agreements (SLAs) and file permissioning (who can view which file), are difficult to implement. Most features are at a very early stage of development (we are ~20% of the way there before feature sets are comparable to centralized).
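To see why efficient Proofs of Storage are hard, it helps to look at the most naive version. In the sketch below, the file owner precomputes a handful of salted hashes before uploading, then later sends one nonce at a time and checks the provider's answer. This is not the scheme used by any particular project, and the function names are invented for illustration. Its weaknesses are exactly the inefficiencies mentioned above: the provider must read the whole file for every challenge, and the owner runs out of challenges after a fixed number of audits.

```python
import hashlib
import hmac
import os
from typing import List, Tuple


def make_challenges(file_bytes: bytes, count: int) -> List[Tuple[bytes, bytes]]:
    """Owner side, before upload: precompute (nonce, expected_digest) pairs."""
    challenges = []
    for _ in range(count):
        nonce = os.urandom(16)
        expected = hashlib.sha256(nonce + file_bytes).digest()
        challenges.append((nonce, expected))
    return challenges


def respond(nonce: bytes, stored_bytes: bytes) -> bytes:
    """Provider side: a correct answer requires the full file contents."""
    return hashlib.sha256(nonce + stored_bytes).digest()


def verify(expected: bytes, response: bytes) -> bool:
    """Owner side: constant-time comparison of the digests."""
    return hmac.compare_digest(expected, response)


if __name__ == "__main__":
    file_bytes = b"genome-data" * 1000
    audits = make_challenges(file_bytes, count=3)  # owner may now delete the file

    nonce, expected = audits.pop()
    print(verify(expected, respond(nonce, file_bytes)))        # honest provider
    print(verify(expected, respond(nonce, file_bytes[:-1])))   # provider lost a byte
```

Because the nonce is unpredictable, a provider cannot precompute the answer and discard the file, but each audit costs a full pass over the data, which does not scale to large files or frequent checks.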
2. Usability
Usability, both of BFS and of blockchains themselves, is a larger barrier to entry than any other issue.
The lack of integration into blockchains and the lack of additional payment choices are big problems. BFS are generally siloed away from the DApp user’s public chain of choice. For example, Filecoin, 0Chain and Sia each have their own blockchains. A DApp user does not want to navigate the complex rules of a completely new chain and learn about staking and behaviors etc. to upload a file. Cross-chain integration and cross-chain payments can be used to make this much easier. For example, a NEO DApp user (who probably has some NEO and Gas tokens) on the NEO blockchain can pay Gas tokens to upload a file through an easy-to-plug-in upload API, which works fine whether the BFS is native to NEO itself or simply has some oracle or cross-chain integration to NEO. Ideally, all of the dynamics of token payments are made to be as intuitive as possible.
Secondly, the generally bad user experience when trying to use a file comes into play. For example, whether you’re an uploader or a downloader, in Filecoin and Sia you must currently download an entire blockchain, which takes hours. Then, you generally have to have an account on an exchange and be able to understand cryptocurrencies and wallets just to view a file. This is in stark contrast to Amazon S3, where you can manage all your uploads through a web interface, and all downloads are so abstracted away from the end user that they don’t even realize where the file in their browser is coming from until Amazon crashes and takes down massive parts of Facebook and other parts of the internet with it. Clearly, solutions that abstract the user experience further away from cryptocurrencies and blockchain are needed to make the user experience comparable to a centralized cloud. One solution is to put the complexities of crypto and payments all on the uploader’s side, so that the downloader can use a simple JS module in their browser to view decentralized data without installation.
Finally, the lack of tools to migrate data from centralized cloud solutions encumbers the switching process.
The decentralized internet promises types of data exchange without intermediaries. This enables new uses of web applications that improve on existing applications, or completely new applications that were never before possible. As Olaf Carlson-Wee of Polychain Capital said, “I think we’re going to be comparing, or port rather, web 2 things on to this web 3 or decentralize user-owned web, but that overtime, we’ll find those web 3 native applications, those are the things I really care about even though right now they do sound somewhat sci-fi and I think it’s very unclear what they look like” [5]. Having a tool to store and share these pieces of data, whether it’s archived blockchains, front end data, metadata or large multimedia files, is crucial to this type of data exchange and to the very idea of decentralization. Though BFS systems are works in progress, let us lavish on them the attention they deserve and build on them to make them the products they can be. Blockchain-based file storage is no longer just a buzzword, but represents a large piece of a solution to a problem many in society desperately want solved, IF it is also fairly convenient to do so. This convenience comes in the form of a future decentralized solution as powerful and easy to use as centralized ones.
Blockchain-based File Storage catalyzes adoption of the decentralized internet
Decentralized data makes the decentralized internet. As mentioned previously, the internet is simply a connection of computers that store and transfer data to each other through some sort of communication protocol. Decentralized data is then the trustless, decentralized way of storing and sharing the data. As of this writing, there are 32 million blockchain wallets; these are millions of users already taking advantage of decentralized communication protocols (gossip protocol, Tor etc.) and trustless ways to verify data (blockchains themselves). However, there are barely any truly working+adopted+decentralized apps. A powerful storage layer is the missing component.
Whether the decentralized internet will completely replace the centralized internet depends on whether adoption of BFS can cause an exodus from centralized services (figure above, B), and the results remain to be seen. Since increased participation in decentralized storage is in essence increased participation in a new decentralized internet, in the process of trying to solve this problem, we may have inadvertently created a storage system that is simply better than existing solutions for society’s evolving needs, resulting in a virtuous cycle (figure above, A) that allows even more people to join the decentralized internet. Nonetheless, there are many problems that remain to be solved before the virtuous cycle can occur, which is what projects like the Archon Cloud are focused on. I hope this article serves as a solid introduction to Blockchain-based File Storage and its importance. Stay tuned for future articles that will describe specific BFS solutions people like us are working on.
Thanks to Sam Suh from Archon for helping me with my writing style and editing the document and Andrew Lee of Web3Journal for reviewing my work.
Disclaimer: I am a co-founder at a Blockchain-based File Storage company.
Sources:
[1] https://medium.com/@didil/off-chain-data-storage-ethereum-ipfs-570e030432cf
[2] https://blog.wavesplatform.com/web3-0-the-road-ahead-for-waves-9bd8a51f63ce
[3] https://courses.cs.washington.edu/courses/cse590s/03au/lesk.pdf
[4] https://techcrunch.com/2016/07/02/andy-jassys-brief-history-of-the-genesis-of-aws/