Going off chain for storage

It turns out that building a distributed database with ACID behavior (aka a block chain) isn’t easy. It requires a lot of code and a lot of processing. As a result block chains like Bitcoin can process around 4 transactions/second, a pretty slow pace for a globe spanning system. To work around this and other issues I explore below, we hear more and more about off block chain storage. But it turns out that if you store off the chain then you lose the chain’s ACID guarantees. In many cases that loss is fine but it does call into question whether a use case that can leverage off chain storage really needs the chain at all.
[Update: Thank you to Shawn Cicoria for pointing out that my original price quote for storing a Gigabyte of data in Ethereum was off by a factor of 32. My mistake was that I did the calculations forgetting that the gas cost is per Ethereum word, which is 32 bytes.]

1 Why go off chain for storage?

Talk to an enterprise customer about storing serious data related to a block chain transaction (say all the details of a purchase order) and the first thing the educated customer will say is “But of course we won’t keep all the data in the block chain, we will keep it off chain and just store a URL pointing to the data and hash of the data on the chain”.
The reason for storing data off chain is that if everyone stuck all of their data on the chain then the chain would grow ridiculously huge and it would be difficult to get people to keep complete copies (which is critical for the guarantees the chain makes). Keeping all this data would also slow the chain down, mostly because there are block size limits derived from data transfer rate limits (the Internet is not infinitely fast). With a fixed block size, the more data each transaction carries, the fewer transactions fit in each block.
To see what this means in practice look at what it costs to store data on the Ethereum block chain. I use Ethereum because its data storage costs are explicit. In Bitcoin the transaction is what costs and one sneaks a tiny bit of data in. In Ethereum there is a reasonably complicated set of operations that let one store data on the chain. I say complicated because the price you pay depends on what you are storing.
For example, the unit of storage in Ethereum is a 256 bit word which works out to 32 bytes. To store a single non-zero 32 byte word costs 20k gas. At the time I write this gas costs 20 gwei. So to calculate the cost of storing one byte I need to calculate 20 gwei/gas * 0.000000001 ether/gwei * 20,000 gas/word / 32 bytes/word = 0.0000125 ether/byte.
Also at the time I’m writing this 1 ether goes for roughly $12.90. So each byte on the block chain at the moment costs 0.0000125 ether/byte * $12.90 / 1 ether = $0.00016125/byte.
To make this more manageable let’s translate that into cost per GB of non-zero bytes which would be $0.00016125/byte * 1024 bytes/kilobyte * 1024 kilobytes/megabyte * 1024 megabytes/gigabyte = $173,140/GB.
To be fair the actual cost is likely to be less since zero-valued words cost less than non-zero words. But it gives you a sense of what storage costs in Ethereum if you are on chain. The point here isn’t the exact number but the order of magnitude.
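The arithmetic above can be replayed in a few lines. This is just a sketch using the figures quoted in the text (20k gas per non-zero word, 20 gwei per gas, $12.90 per ether); the constant and function names are my own, and the prices are a snapshot from when this was written, not current values.

```python
# Figures quoted in the text. They are a snapshot, not current prices.
GAS_PER_WORD = 20_000      # gas to store one non-zero 32 byte word
BYTES_PER_WORD = 32        # an Ethereum word is 256 bits
GAS_PRICE_GWEI = 20        # gwei paid per unit of gas
ETHER_PER_GWEI = 1e-9
USD_PER_ETHER = 12.90
BYTES_PER_GB = 1024 ** 3


def ether_per_byte() -> float:
    """Ether needed to store one byte, assuming every word is non-zero."""
    return GAS_PRICE_GWEI * ETHER_PER_GWEI * GAS_PER_WORD / BYTES_PER_WORD


def usd_per_byte() -> float:
    """Dollar cost of one stored byte at the quoted exchange rate."""
    return ether_per_byte() * USD_PER_ETHER


def usd_per_gb() -> float:
    """Dollar cost of one gigabyte of non-zero data stored on chain."""
    return usd_per_byte() * BYTES_PER_GB
```

Plug in today's gas price and exchange rate and the number moves around a lot, but it stays enormous compared to ordinary storage.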
Now let’s put some context on this. This storage is theoretically forever. As long as Ethereum exists it promises (more or less) to store that data and at least in theory (there are limits here that Ethereum worries about) to have that data available on demand for a smart contract to use.
Amazon’s S3 “hot” storage costs around $0.021/GB/month for accounts storing over 500 TB/month (the highest tier). S3 keeps multiple copies in multiple data centers, so from a durability perspective I think it is realistically comparable to Ethereum.
But just to be paranoid let’s throw in Azure with geographically redundant storage (i.e. not just multiple copies in multiple DCs, but multiple DCs in different geographic regions) based on blob storage for central US. Hot data goes for $0.0423/GB/month.
And let’s assume that to store the data at the equivalent “quality” of Ethereum we need to guarantee the storage up front for 100 years. And let’s assume we don’t get a big honking discount for paying up front. And let’s assume we have to store in both S3 and Azure.
So ($0.021 + $0.0423)*12*100 = $75.96/GB.
So $173,140 per GB versus $75.96 per GB.
And now we know why people go off chain to store things.
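For anyone who wants to poke at the assumptions, here is the same comparison as a small sketch. The prices are the ones quoted above, the 100 year horizon and the "pay for both S3 and Azure" rule are the assumptions from the text, and the names are mine.

```python
# Prices quoted in the text (per GB per month), a snapshot rather than current rates.
S3_USD_PER_GB_MONTH = 0.021
AZURE_GRS_USD_PER_GB_MONTH = 0.0423
ON_CHAIN_USD_PER_GB = 173_140   # the Ethereum figure worked out earlier


def off_chain_usd_per_gb(years: int = 100) -> float:
    """Cost of keeping 1 GB in both S3 and Azure for `years`, with no discount."""
    monthly = S3_USD_PER_GB_MONTH + AZURE_GRS_USD_PER_GB_MONTH
    return monthly * 12 * years


def on_chain_to_off_chain_ratio(years: int = 100) -> float:
    """How many times more expensive on chain storage is than the cloud option."""
    return ON_CHAIN_USD_PER_GB / off_chain_usd_per_gb(years)
```

With these inputs the on chain option comes out a bit over 2,000 times more expensive. Change the horizon or the prices and the ratio shifts, but it takes fairly wild assumptions to close a gap that size.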

2 Off chain means no ACID

But something interesting happens when we go “off chain”. We lose our ACID guarantees. At a minimum we lose the block chain’s guarantees of durability. With the block chain the logic of the system requires a large group of people to keep copies of everything. This is what provides durability. But once we take the data off the chain then that durability is lost.
Now this isn’t to say that the off chain data can’t be made durable, but whatever mechanism is being used to provide that durability isn’t guaranteed by the chain. [A]
[A] No, smart contracts can’t fix this. A smart contract can at best incentivize people, through penalties or payments, to keep a copy of a piece of data around. But that isn’t a guarantee. If the value of losing the data exceeds the penalty/payment then the data will vanish.
Instead what happens is that we have devolved from ACID to non-repudiation. Even with the hash of the data on the chain, it is still possible for the actual data to be lost or altered. The hash at least lets us prove that an alteration (including deletion) occurred. But all we can do is point to the bad behavior; we can’t prevent it.
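To make the "detect but not prevent" point concrete, here is a minimal sketch of the URL plus hash pattern. The choice of SHA-256, the `OnChainAnchor` record and the function names are all assumptions of mine for illustration; nothing here talks to a real chain.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OnChainAnchor:
    """What actually lives on the chain: a pointer and a fingerprint."""
    url: str
    sha256_hex: str


def make_anchor(url: str, document: bytes) -> OnChainAnchor:
    """Build the record that would be posted to the chain for a document."""
    return OnChainAnchor(url=url, sha256_hex=hashlib.sha256(document).hexdigest())


def check_document(anchor: OnChainAnchor, fetched: Optional[bytes]) -> str:
    """Classify what we got back from the URL (None means nothing was there)."""
    if fetched is None:
        return "missing"   # the chain cannot bring the data back
    if hashlib.sha256(fetched).hexdigest() != anchor.sha256_hex:
        return "altered"   # the chain can only tell us it changed
    return "intact"
```

Notice that two of the three outcomes are bad and the anchor can do nothing about either of them except report it.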
The whole point of the block chain is that it provides an ACID DB even in the face of bad behavior. By taking the data off chain, we lose that guarantee. Rather than requiring a lot of bad behavior, now one player (the holder of the off chain data) can act badly and the chain guarantees go bye bye.

2.1 Smart contracts and off block chain content

It’s also worth pointing out that in general smart contracts can’t validate off block chain content. There are, however, workarounds. For example, oracles can be used to publish statements about off block chain data onto the block chain. And then smart contracts can validate the oracle’s output. But of course at that point we are having to completely trust the oracle and again, the guarantees of the block chain are broken. Yes, I know about enclaves and such but that’s a whole other article.

3 So does it really matter?

For the most part folks don’t seem to care that their data is off chain and therefore they are losing the guarantees of the block chain. For example, let’s say we use the traditional supply chain scenario. Manufacturer A creates a statement describing their component and its input parts and they post that to some public HTTP server and then post a URL and a hash of the document to the block chain. If two years later there is an audit and Manufacturer A decides to disappear that statement (because they realize it contained something embarrassing or maybe illegal) there is really nothing that anybody can do. The data is gone. It’s not on the chain so it doesn’t have the durability guarantees of the chain.
Now maybe people don’t care because there is a pretty straightforward workaround. For example, when Manufacturer B receives delivery of the component from Manufacturer A they can grab a copy of the statement from the manufacturer’s public HTTP server (i.e. follow the URL in the block chain), validate the hash and then store their own copy. Now Manufacturer A can disappear their public copy (the one the URL points to) any time they want. It won’t matter. Manufacturer B has their own copy and can prove, using the hash, that it is the correct copy.
In fact, the above behavior, pulling the public copy, is actually required for a properly functioning block chain with off chain storage if we are going to maintain the A, C and I from ACID (we already know we lost the D).
For example, imagine that Manufacturer A isn’t playing nice and that the content in the manifest is wrong or Manufacturer A posted an intentionally incorrect hash (to provide plausible deniability for any errors later discovered). If the data is off chain then before Manufacturer B can post acceptance to the chain, Manufacturer B has to pull down the data from the public URL, validate its correctness, validate the hash and then and only then can they post acceptance. Since they have to pull down their own copy of the data anyway, durably keeping that copy just isn’t that big a deal.
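Manufacturer B's side of this can be sketched as a single function. Everything here is hypothetical scaffolding: `fetch` stands in for the HTTP download, `is_content_correct` for whatever business checks B runs on the manifest, and `local_archive` for B's durable store. I am also assuming SHA-256 as the hash.

```python
import hashlib
from typing import Callable, Dict


def accept_delivery(
    url: str,
    anchored_sha256_hex: str,
    fetch: Callable[[str], bytes],
    is_content_correct: Callable[[bytes], bool],
    local_archive: Dict[str, bytes],
) -> bool:
    """Return True only if it is safe for B to post acceptance to the chain."""
    document = fetch(url)                      # pull the public copy
    if not is_content_correct(document):       # is the manifest itself right?
        return False
    if hashlib.sha256(document).hexdigest() != anchored_sha256_hex:
        return False                           # hash on chain doesn't match
    local_archive[anchored_sha256_hex] = document   # keep our own durable copy
    return True
```

The archive write costs one line because the bytes are already in hand, which is the whole point: the durability the chain no longer provides gets picked up by the counterparty almost for free.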
So we can see that taking data off the chain doesn’t blow up the world.

4 Wait, why didn’t it matter?

It’s worth stopping for a second to think about why going to off chain storage didn’t blow things up. The reason is that the block chain in the supply chain scenario is not being used to maintain a coherent global state. If you are running say a coin system it’s really important that you can prove that no coins are added, lost or change hands except in the allowed manner. This requires a globally consistent state and hence requires all the guarantees of a block chain.
But in the supply chain scenario there is no globally consistent state to enforce. We aren’t worried about Manufacturer A trying to sell the same physical component twice because that component is, well, physical. It exists in the real world so if Manufacturer B has it then we know no one else has it. In this case the physical world acts as our guarantee of global state. If Manufacturer A tries to sell non-existent components then eventually physical delivery won’t happen and lawsuits will fly.

5 A step too far?

Let’s look again at the supply chain scenario in the previous section. Imagine we don’t use a block chain at all. In that case, when Manufacturer A delivers the component to Manufacturer B, the electronic manifest would include a signed attestation of the source of all the parts of the component. Manufacturer B would then durably store this attestation and call it a day. No chain. Nothing. We’re done.
So why do we need the chain?
The main reason I hear from people is that they want to do payments through the chain as well. In other words, it’s not just that Manufacturer A moves its supply chain tracking info via the chain. Once Manufacturer B confirms everything is in order, it can also transfer money to Manufacturer A for the component via the block chain.
Now presumably when it comes to moving money we aren’t using a private chain or a consortium chain, we are using a public chain with a large mining/proof of stake base and active foreign exchange markets. At least if we want to pay in some kind of coin. Because if this is just a letter of credit we can deliver that point to point and we don’t need the chain.
But while we do need a chain for coins, we don’t need it for supply chain tracking. When Manufacturer B is satisfied with physical delivery of the goods and electronic delivery of the supply chain information then they can do a coin transfer to Manufacturer A and even put in a comment to the ID of the off chain transaction. Problem solved.
Sure, we need block chain for coins. But notice that coins all happen on chain. There’s a reason for that. Coins require the full block chain guarantees. But when we have people using off chain storage, by definition, they don’t need all the block chain guarantees.
The other scenario I hear people bring up for why they need the chain, even in the supply chain scenario, is transparency. But by virtue of being off the chain the data is already being published somewhere that people can get to it (hence the URL). So at most the chain is just the world’s most expensive index with a bunch of guarantees that aren’t very interesting in this scenario.
If people just want a distributed storage mechanism then they can use Freenet or IPFS or what have you. They are much cheaper than a block chain.
But there is one more reason people want to use the chain, even for off chain storage scenarios and that one is complicated enough that it gets its own section.

5.1 Key repudiation resistance

Imagine that Manufacturer B gets a lot of complaints about their product. Manufacturer B does a root cause analysis and realizes that the complaints are all related to the component that Manufacturer B purchased from Manufacturer A. Manufacturer B then does a tear down on the component and realizes that parts of the component did not come from the sources that Manufacturer A had claimed in its signed supply chain manifest. So Manufacturer B gears up to take Manufacturer A to court to sue them for fraud.
Just at that moment Manufacturer A announces that their signing key has been compromised! In fact, some criminal has actually posted their private key on the Internet!
So now when Manufacturer B shows up claiming that Manufacturer A defrauded them, Manufacturer A counter claims against Manufacturer B accusing them of fraud! Manufacturer A says that the “signed” shipping manifest produced by Manufacturer B is a fraud generated with the compromised key and that Manufacturer B is just trying to extort money from Manufacturer A.
But if all the shipping manifests (or just their hashes) are on the block chain then we have what amounts to an independent witness. The block chain itself and the proof of work and related signatures can show that the hash was added to the chain at a certain date and has not been altered. This proof is not related to Manufacturer A’s key so even if the private key is leaked the proof still holds.
Unless of course Manufacturer A conveniently says that their forensic evidence points to the key compromise having happened before the record was added to the chain. In which case the witnesses witnessed nothing of consequence.
Furthermore this assumes that courts won’t trust Manufacturer B when it claims to have recorded the correct manifest from Manufacturer A. As I've discussed previously the proof required by courts and the proof required by geeks are very different beasts. Looking around (e.g. here) it doesn’t seem like much has changed. So chances are Manufacturer B’s submission of the signed manifest from A would hold up in court, key compromise or no.
And if Manufacturer B is really paranoid and wants some kind of Notary Public then by all means they can use the block chain to do so. It’s not going to be that expensive (although it will take a while for the statement to be validated, that is, to make it on the chain) but they can do that without any relationship to Manufacturer A. In other words, Manufacturer B can just take whole groups of these signed supply chain statements from a ton of sources, hash them together and put the master hash on a chain and they are done. My guess is that it would be even cheaper (and an order of magnitude faster) for them to just use an online notary service that provides its own signature and that service would still probably backstop its work with a public chain.
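One way to "hash them together" is a Merkle tree, which lets Manufacturer B later prove that a single statement was part of the batch without revealing the others. The text doesn't say which construction to use, so treat this as one possible sketch: SHA-256, the leaf/node prefix bytes and the duplicate-the-last-node padding are all choices I made for brevity, and a production design would want a more careful padding rule.

```python
import hashlib
from typing import List, Tuple


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _next_level(level: List[bytes]) -> List[bytes]:
    # Hash adjacent pairs; the 0x01 prefix keeps inner nodes distinct from leaves.
    return [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def _leaves(statements: List[bytes]) -> List[bytes]:
    if not statements:
        raise ValueError("need at least one statement")
    return [_h(b"\x00" + s) for s in statements]


def merkle_root(statements: List[bytes]) -> bytes:
    """The single 'master hash' that would be posted to the chain."""
    level = _leaves(statements)
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]   # pad odd levels by repeating the last node
        level = _next_level(level)
    return level[0]


def merkle_proof(statements: List[bytes], index: int) -> List[Tuple[bytes, bool]]:
    """Sibling hashes from leaf to root; the bool says 'sibling is on the left'."""
    level, proof = _leaves(statements), []
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level, index = _next_level(level), index // 2
    return proof


def verify_proof(statement: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path to the root from one statement and its proof."""
    node = _h(b"\x00" + statement)
    for sibling, sibling_is_left in proof:
        node = _h(b"\x01" + (sibling + node if sibling_is_left else node + sibling))
    return node == root
```

However many statements go into the batch, only the 32 byte root has to go on chain, which is why this kind of notarization is cheap compared to storing the statements themselves.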
So yes, if you need a notary, the public chain is useful. But notice that is really outside the scope of the interaction between Manufacturer A and Manufacturer B. For the pure supply chain scenario no block chain was needed.

6 So when would someone using off chain storage actually need the chain?

Now let’s tweak the scenario a bit to show an example where we are using off chain storage but we really need the block chain (o.k. even this scenario has some exceptions where we don’t need the block chain but we’ll get to that in another article). Let’s imagine that MaybeEvil Distillery has just corked a single barrel single malt whiskey that is going to age for three years. But MaybeEvil Distillery is a bit cash short and they would like to get paid for the whiskey now instead of three years from now. So what they do is publish an off block chain document that attests to all the components and handling of the whiskey and even includes a signed statement from an independent auditor that physically went to MaybeEvil Distillery and counted the number of barrels and did some random sampling to show that they contained what they claimed to contain. All of this information is published off chain; the chain just contains a URL and hash of the off chain data. That URL and hash are then part of a smart contract to run an auction for the barrels.
Even though we are using off chain storage, we actually have a global state we need to maintain. Specifically, how many barrels there are and who owns them. The buyers want to make sure that MaybeEvil can’t double sell any barrels (a problem that wouldn’t be detected until three years later). In fact, we don’t just need a global state, we need a globally ordered state. The reason is that there is always wastage. Stuff happens, barrels get damaged, corks go bad, etc. So if a barrel is destroyed then the person who bought the last barrel will lose their claim (and have to be refunded) first. So we need to maintain a globally ordered list in order to figure out who to compensate. By having a globally visible and consistent state everything MaybeEvil does is visible. This is exactly the kind of scenario that block chains are very good at.
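Here is a toy model of the state the auction contract would have to keep. It is plain Python rather than a real smart contract, and the `BarrelLedger` class and its method names are invented for illustration. What it shows is the two rules from the paragraph above: no selling more barrels than were attested, and the most recent buyer is the first to lose their claim when a barrel is destroyed.

```python
from typing import List, Optional


class BarrelLedger:
    """Globally ordered record of who bought a barrel, in purchase order."""

    def __init__(self, barrel_count: int) -> None:
        self.barrel_count = barrel_count   # barrels the auditor attested to
        self.claims: List[str] = []        # buyers, oldest purchase first

    def sell(self, buyer: str) -> int:
        """Record a sale and return the buyer's 1-based position in the order."""
        if len(self.claims) >= self.barrel_count:
            raise ValueError("no unsold barrels left: this would be a double sale")
        self.claims.append(buyer)
        return len(self.claims)

    def report_destroyed(self) -> Optional[str]:
        """Remove one barrel; return the buyer to refund, or None if it was unsold."""
        if self.barrel_count == 0:
            raise ValueError("no barrels left to destroy")
        self.barrel_count -= 1
        if len(self.claims) > self.barrel_count:
            return self.claims.pop()       # last buyer in loses their claim first
        return None
```

The refund rule only works if everyone agrees on the order of the sales, and that agreement on ordering is exactly what the chain is supplying here.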
Now this still isn’t perfect. MaybeEvil could “forget” to report damaged barrels or wait until the very end to report them (years, potentially, after the damage happened). Or they could burn down the distillery, collect the insurance and then abscond with the money.
So the block chain isn’t magic. It can only work its full powers if the items being bought and sold are purely virtual (e.g. coins). When we are using virtual coins to reflect physical items then unfortunately the block chain can’t stop crime. But at least we can now see a case where we are using off chain storage but the block chain is still useful.

7 Conclusion

The block chain does some amazing things but accomplishing those goals comes at a very real cost. Byzantine attacks are a pain and require lots of resources to deal with. So if we don’t need to survive them then we don’t need the block chain. This article isn’t about saying the block chain is bad. It is about saying that we should use the right tool for the right problem. If your block chain application is based on off chain storage then there is a non-trivial probability that you may not actually need the block chain. It’s worth at least thinking about.
