Why Chia Plots Don't Contain Real User Data

With the announcement that we are exploring a new plot format for Chia, there are many open questions about exactly what the plots will look like. We’ll discuss the details once we have a proposal ready, but there’s one thing we can say with certainty right now: the new plots will not contain real user data.

At first, this might sound like a strange concept. Isn’t the data stored in plots already “real”? And aren’t some other blockchain projects already storing user data? Does Chia compete with those projects? Why shouldn’t Chia store real user data?

Before we answer these questions, let’s define “real” user data.

Real versus cryptographic data

Most of the data stored on computers is unique or valuable in some way. For example, digital photos have sentimental value – they capture your precious memories. The bytes that compose the photos are not random. This is what we mean when we talk about “real” data.

Plots created with Chia’s current format consist of seven tables of cryptographic data organized using the Proof of Space Construction. Naively, you can think of a plot as storing billions of bingo cards. If you delete a plot, you can create a new one in a few minutes. It will contain different data, but it will be equally valuable. In addition, the plots are only useful in Chia’s Proof of Space and Time (PoST) consensus. They don’t contain “real” data.

User data in blockchains

A handful of blockchain-like networks store user data as part of their consensus. While each project has its own idiosyncrasies, they fall into one of two camps: those that store user data off-chain, and those that store it on-chain.

Off-chain storage

Some popular storage-based blockchains work on a simple subscription model: You pay a “storage provider” to store data on your behalf. Depending on the network and service tier, the storage provider might only store the data locally, or they might add it to a network such as the InterPlanetary File System (IPFS).

Typically, these blockchains require storage providers to put up collateral proportional to the amount of data they agree to store. The provider must prove two things: that they stored the data to begin with, and that they are continuing to store it at a given time. If the provider cannot prove that they are storing the data, their collateral might be slashed (confiscated).
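The challenge-and-slash idea described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the names `Provider`, `commit`, and `audit` are hypothetical, the client keeps a plain list of per-chunk SHA-256 digests, and a single failed challenge costs a fixed penalty. Real storage networks use far more elaborate proofs and penalty schedules.

```python
import hashlib
import secrets
from typing import Optional

CHUNK_SIZE = 4  # bytes per chunk; tiny so the example is easy to follow


def split_chunks(data: bytes) -> list[bytes]:
    """Break the data into fixed-size pieces."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


def commit(data: bytes) -> list[bytes]:
    """The client records one SHA-256 digest per chunk before uploading."""
    return [hashlib.sha256(chunk).digest() for chunk in split_chunks(data)]


class Provider:
    """Hypothetical storage provider with collateral at stake."""

    def __init__(self, data: bytes, collateral: int) -> None:
        self.chunks = split_chunks(data)
        self.collateral = collateral

    def respond(self, index: int) -> Optional[bytes]:
        """Return the requested chunk, or None if it is no longer available."""
        return self.chunks[index] if index < len(self.chunks) else None


def audit(provider: Provider, commitments: list[bytes], penalty: int) -> bool:
    """Challenge one random chunk; slash the provider if the answer is wrong."""
    index = secrets.randbelow(len(commitments))
    answer = provider.respond(index)
    ok = answer is not None and hashlib.sha256(answer).digest() == commitments[index]
    if not ok:
        # Assumed penalty rule: subtract a fixed amount, never going below zero.
        provider.collateral = max(0, provider.collateral - penalty)
    return ok


data = b"precious family photos"
commitments = commit(data)
provider = Provider(data, collateral=100)

passed_first = audit(provider, commitments, penalty=40)   # data is intact
provider.chunks = []                                      # simulate a dead drive
passed_second = audit(provider, commitments, penalty=40)  # challenge now fails
```

In this sketch the first audit passes and leaves the collateral untouched, while the second fails and costs the provider 40 of its 100 units. The point is that a hardware failure, not just dishonesty, is enough to trigger the penalty.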

Slashing

This brings up the first obvious issue with storing user data: the threat of being slashed. Hard drives have a maximum lifespan. If a provider’s drive stops functioning, their contractual obligation to store a user’s data will be broken. Depending on the network’s setup, some (or possibly all) of the provider’s collateral will be slashed.

Participants in cryptocurrency networks that use slashing must stay vigilant. Storage providers, in particular, should use enterprise hardware to minimize the likelihood of being slashed. Redundancy is also essential when storing valuable or sensitive data.

In Filecoin (the most popular decentralized storage network), the requirement for redundancy helps to explain the discrepancy between the network’s raw capacity (7.4 EiB) and the amount of data actually being stored (1.8 EiB).

“Random” leader selection

Another reason why Chia doesn’t store user data has to do with the way in which valid blocks are created.

Chia uses a Nakamoto consensus called Proof of Space and Time (PoST) to secure its blockchain. One key aspect of Nakamoto consensus is that users can join and leave the network at any time without obtaining permission. In fact, if a user is solo farming, the network isn’t even aware of that farmer until they submit a valid proof to create a block.

Every Nakamoto consensus requires a resource external to the blockchain to generate proofs. In Proof-of-Work systems such as Bitcoin’s, this resource is compute power. In PoST, it is mostly disk storage, along with some compute. In both systems, if a farmer/miner finds a valid proof, it can create a new block. There is no way to determine which node will find the next valid proof, so selecting a block creator is a random process.

By definition, user data cannot be used in a Nakamoto consensus because the data doesn’t contain cryptographic proofs. As discussed previously, it is possible for providers to prove that they are storing the data they claim to be storing. However, there are two reasons why this data can’t be used to select a block creator. First, the data itself is not random, so it cannot generate a random proof. Second, even if the data were random, the fact that it is encrypted means that the provider could not read it to generate random proofs.

In networks that rely on real user data, it is therefore not possible to use this data for randomized selection of the next block’s creator. Instead, these networks fall back on using a pseudo-random process to select a “leader” to create each block. This is similar to what happens in a Proof-of-Stake consensus.

One of the most popular leader-selection techniques is to use an independent randomness beacon such as Drand. When a node is selected, it first must prove that it is storing the correct data, and then it is allowed to create a new block. The probability of a provider being selected to produce a block is directly proportional to the space they contribute to the network.
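To make the “proportional to space” idea concrete, here is a minimal sketch of beacon-driven leader selection. It is not the election algorithm of Filecoin or any other specific network. The function `select_leader`, the provider names, and the use of SHA-256 over the beacon output are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter


def select_leader(beacon_output: bytes, space_by_provider: dict[str, int]) -> str:
    """Pick one provider, weighted by the space each one contributes."""
    total = sum(space_by_provider.values())
    if total <= 0:
        raise ValueError("at least one provider must contribute space")
    # Hash the public beacon value into a "ticket" in the range [0, total).
    digest = hashlib.sha256(beacon_output).digest()
    ticket = int.from_bytes(digest, "big") % total
    cumulative = 0
    for name in sorted(space_by_provider):  # fixed order so every node agrees
        cumulative += space_by_provider[name]
        if ticket < cumulative:
            return name
    raise AssertionError("unreachable: ticket is always below total")


# Hypothetical providers contributing 10%, 30%, and 60% of the space.
space = {"alice": 10, "bob": 30, "carol": 60}
rounds = 10_000
# Stand-in for a real beacon: one distinct byte string per round.
wins = Counter(select_leader(r.to_bytes(8, "big"), space) for r in range(rounds))
```

Over many rounds, each provider’s share of wins should track its share of the space. Note that the randomness here comes entirely from the beacon, not from the stored data, which is the distinction the surrounding paragraphs draw.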

Filecoin is organized as a “time constrained series-parallel” Directed Acyclic Graph (DAG). This structure enables the network to select multiple simultaneous leaders, which increases the probability of at least one block being created at a given height, reduces network latency, and makes confirmation times more predictable.

However, the DAG structure also leaves Filecoin’s network open to certain attacks. For example, researchers recently published an article in which they outlined three different block-withholding attacks against the Filecoin network. In these attacks, the adversary manipulates the consensus algorithm to create multiple blocks in a row, and to orphan blocks created by honest miners.

In Nakamoto consensus blockchains such as Bitcoin and Chia, there can only be one block at each height. This removes the need for a DAG and its complexities, and it closes off the block-withholding attacks described above. Note that in Bitcoin, a similar attack called “selfish mining” is possible, but it only results in wasted energy. In Chia, selfish mining isn’t even possible because proof submission isn’t a zero-sum game. A proof only needs to be “good enough”; it doesn’t need to be submitted first.

Cloud competition

Another reason Chia chose not to store user data was the intense competition with cloud storage providers such as AWS. Jeff Bezos, one of the wealthiest people on Earth, has made much of his money from Amazon’s enterprise-class storage solutions in the cloud. Competing against providers such as AWS seems like a fool’s errand.

The statistics (Filecoin, AWS) support this thesis. Enterprises that need exabyte-scale storage have thus far gone with AWS or one of its centralized competitors. On the other end of the scale, decentralized storage networks are gaining some traction among users who only need to store small amounts of data. However, the medium-sized providers (those with a few hundred tebibytes) have a difficult time competing with both the enterprise providers (due to economies of scale) and the smallest providers (due to their minimal capital expenditures).

Redundancy difficulties

Finally, redundancy is critical when storing user data. This is a difficult endeavor on decentralized storage networks such as Filecoin, which encrypts the data and “chunks” it (breaks it into pieces). Chunking makes it harder to track how many copies of each chunk are being stored globally. We believe that relying on pseudonymous providers to store this data on dubious hardware without the security and redundancy guarantees of a centralized cloud storage solution is not viable for most enterprises, even if it is a bit cheaper.

Blockchains have many use cases that could disrupt existing industries. We do not include “a consensus mechanism which relies on storing user data” on this list. This is a case of “sprinkling blockchain” on a manufactured problem.

On-chain storage

On-chain storage networks allow for the permanent storage of user data in exchange for a one-time fee. Permanent storage is accomplished by adding user data to a given network’s database.

The most popular network for permanent storage of user data is Arweave. Its database is around 161 TiB, which has multiple profound implications.

Node requirements

If a blockchain’s database grows faster than a home network’s bandwidth, then – out of necessity – most nodes will live in data centers (see Gene’s recent Trilemma blog post for more info).

Chia’s database grows by roughly 250 MiB per day. A 30 Mbps connection (the average bandwidth of a US home) could download an entire day’s worth of data in 1-2 minutes. This low-bandwidth requirement allows users to run a node from their home with ease.

Arweave’s database is currently growing by about 100 GiB per day. A 30 Mbps connection would require about 8 hours to download this much data. That’s ⅓ of a home’s total bandwidth consumed just for running an Arweave node. For this reason, most of the network’s 61 nodes are hosted in data centers with much higher bandwidths. In addition, Arweave only requires nodes to store a subset of the blocks in order to be considered “synced.” It is unclear how many nodes are storing each block.
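The download times quoted for the two networks can be checked with a few lines of arithmetic. This sketch assumes an ideal link with no protocol overhead, binary units for MiB and GiB, and decimal megabits per second. The helper name `download_seconds` is ours.

```python
MIB = 2**20  # bytes in a mebibyte
GIB = 2**30  # bytes in a gibibyte


def download_seconds(num_bytes: int, link_mbps: float) -> float:
    """Time to transfer num_bytes over an ideal link of link_mbps megabits/s."""
    return num_bytes * 8 / (link_mbps * 1_000_000)


# Daily database growth figures taken from the text above.
chia_minutes = download_seconds(250 * MIB, 30) / 60
arweave_hours = download_seconds(100 * GIB, 30) / 3600
arweave_share_of_day = arweave_hours / 24
```

Under these assumptions, Chia’s daily growth takes a little over one minute to download, while Arweave’s takes just under eight hours, or about a third of the day. Reading 100 GiB as 100 decimal gigabytes instead brings the Arweave figure down to roughly 7.4 hours, so the conclusion does not depend on the choice of units.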

Today, Chia’s database is about 1/1000 the size of Arweave’s. And while Chia’s database is growing, SSD capacity and network bandwidth are growing at a faster rate. You will likely be able to run a Chia node at home with low-end hardware for the foreseeable future. The same cannot be said of Arweave.

Security budget

Blockchains with a storage-based consensus derive their security budget from the total amount of data being stored. The more bytes stored, the more difficult it becomes for an adversary to launch a “majority attack” against the network. By this metric, Chia’s security budget is derived from its 15+ EiB of raw storage. This is roughly 100,000x that of Arweave.
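The size of that gap follows directly from the two figures quoted in this article. The sketch below assumes exactly 15 EiB for Chia and uses Arweave’s database size of around 161 TiB as its stored data. Both inputs are approximations, so the result is an order-of-magnitude estimate.

```python
TIB = 2**40  # bytes in a tebibyte
EIB = 2**60  # bytes in an exbibyte

chia_bytes = 15 * EIB       # lower bound; the text says "15+ EiB"
arweave_bytes = 161 * TIB   # database size quoted earlier in the article

ratio = chia_bytes / arweave_bytes

# Amount of Chia space at which the ratio reaches exactly 100,000x.
breakeven_eib = 100_000 * arweave_bytes / EIB
```

With exactly 15 EiB, the ratio comes out just under 98,000x, and it passes 100,000x once Chia’s space exceeds about 15.4 EiB.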

In addition, Chia remains committed to supporting the Raspberry Pi as its min spec node hardware. Bandwidth requirements are also minimal (1 Mbps is ample), and a $30 SSD will suffice for storing the blockchain database. For these reasons, Chia has over 90,000 nodes, each of which maintains a complete copy of the blockchain’s database.

Conclusion

Several decentralized networks have made the design decision to store user data as part of their consensus. While these networks have their own use cases, they add unnecessary complexity, which we believe results in poor security tradeoffs.

Chia’s blockchain was designed to be as simple as possible, with low requirements for running a node and no disincentives such as slashing. Because of its PoST consensus (the first new Nakamoto consensus since Proof of Work), no permission is needed to join, and the system doesn’t rely on an algorithm to select a “leader” to create the next block.

Finally, Chia does have a decentralized system for storing real user data. Chia’s DataLayer™ is the technology that enables the World Bank’s CAD Trust to store metadata about the world’s voluntary carbon credit markets. However, even though DataLayer offers decentralized storage, it is a fundamentally different system from the ones discussed in this article. DataLayer is not used with Chia’s consensus, and for each DataLayer store, only a 32-byte hash of the data is kept on-chain.
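To see how a data store of any size can be represented by just 32 bytes, consider the generic sketch below. It is not DataLayer’s actual tree layout or hashing scheme. The function `merkle_root`, the choice of SHA-256, the rule for odd-sized levels, and the sample records are all assumptions made for illustration.

```python
import hashlib


def merkle_root(leaves: list[bytes]) -> bytes:
    """Reduce any number of records to a single 32-byte SHA-256 digest."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumed rule: duplicate the last node
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Hypothetical records kept off-chain by whoever hosts the store.
records = [b"project=1;credits=500", b"project=2;credits=120", b"project=3;credits=75"]
root = merkle_root(records)

# Changing any record produces a different root.
tampered = merkle_root([b"project=1;credits=999"] + records[1:])
```

Only `root` would need to go on-chain in a design like this. The records themselves stay off-chain, and anyone holding them can recompute the hash to confirm that nothing has been altered.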

Chia provides the best of both worlds – a blockchain secured by a Nakamoto consensus, as well as a decentralized platform for storing data off-chain. While we do plan to make some adjustments to Chia’s plot format, we will not be adding user data to the plots themselves.