How Filecoin providers prove they’re storing data

https://mobile.twitter.com/Whyrusleeping/status/1470504592001622027

Filecoin is a ‘proof of storage’ network built on IPFS. Traditional PoW consensus mechanisms incentivize miners to buy ASICs or GPUs. Filecoin incentivizes miners to buy more hard drives by basing consensus on storage proofs.

There’s a few different ways to prove storage capacity. I’ll describe Filecoin’s storage proofs as if they use an interactive protocol because it’s easier to reason about it that way, but they’re actually zkSNARK proofs on-chain.

Filecoin’s storage proof has the ‘prover’ make a big merkle tree from the data being stored, then tell the ‘verifier’ what the root hash is. The verifier asks for a random leaf node of the tree, and the prover needs to respond with a merkle proof linking it to the root.

If the prover actually has all the data, they will be able to respond with valid proofs. If they don’t have the data, they won’t be able to produce the proof, and will be caught and penalized. They repeat this until the verifier is convinced the prover isn’t cheating.

This iterative ‘prover and verifier’ back and forth is turned into a single non-interactive proof with some crypto magic involving zkSNARKs, randomness, Fiat-Shamir, and special hash functions. If there’s interest I’ll do a future thread on these.

Sidenote: We actually built the highest volume deployment of zkSNARK circuits in existence to date, thanks in large part to the engineering efforts of @dignifiedquire and @porcuquine2 . Stats: https://research.protocol.ai/sites/snarks

But wait – what happens if the miner just says all their data is zeros? Or if their data can be easily generated by a function? This would allow them to respond to the proof challenges by generating the data on the fly, without actually storing the data, right?

We prevent this by using an encrypted version of our raw data as the leaves of the merkle tree. This encrypted version is time-intensive to create, because the encrypted version of each chunk of data depends on every other chunk of input data. (via a “depth robust graph”)

A faker can’t generate data on-the-fly, because they would still have a time-consuming encryption algorithm to run before they could prove a connection between their generated data and the merkle root.

A number of other networks have figured out general proofs of space, but Filecoin is the only one that has managed to do them over arbitrarily chosen input data, incentivizing the storage required for consensus to actually be useful.

Storage proofs allow miners to verifiably prove they have data on disk, but to ensure data is kept around, there is also a “Proof of SpaceTime” – a periodic submission of the merkle tree proof I explained earlier, except on data the miner has already committed to the chain.

For Proof of SpaceTime, the network picks randomly across each prover’s entire set of storage, and asks them to submit proofs for the selected data. If they don’t have the data anymore, they won’t be able to respond with their proof in time, and will be penalized.

That’s the core of how we built an incentivized storage network. Of course, there’s a lot more to how Filecoin works. Let me know if you have questions or want to know more about specific topics.