So let's say I want to build a nice highly consistent multi-data center store, something like Megastore. Most everyone at this point has something like Bigtable already deployed in their data centers. What they typically don't have is a way to keep different instances of their table stores guaranteed consistent with each other across DCs. Megastore steps in to address this issue. But this begs a fundamental question - what's better, to wrap a Paxos coordinator on top of existing table stores or to build a new Paxos native storage service?

1 The architectures

Figure 1 Two possible Megastore style architectures

In the first architecture, the one on the left, each DC contains a machine that is running Paxos. Think of this as a combination of the replication server and coordinator in the Megastore paper. All reads and writes go through the wrapped Paxos instance. Writes must always be relayed to the local table store instance (which runs on the existing table store infrastructure in the same DC) for persistence. Reads, however, can often be handled directly out of the Wrapped Paxos Instance via its local cache.

In the Native Paxos Instance case there is no existing table store that is being wrapped. Instead each Paxos instance has its own local table store with its own persistent storage (Read: disk) physically on the same machine. There is no call out to a completely stand along table store service.

So really the only difference between the two architectures is that the Wrapped Paxos architecture uses the Paxos boxes as caches with a separate table service and all of its infrastructure handling persistent storage. With the Native Paxos architecture there is no separate table service, no separate naming/replication/locking infrastructure, just a local single instance Table Store.

An example of a wrapped Paxos instance could be a VM running Paxos in Windows Azure that talks to Windows Azure Table Store or a machine in a data center running Paxos talking to a local installation of HBase or Mongo or CouchDB. An example of a Native Paxos instance could be a machine running Paxos who records all data persistently to the local machine using MySQL or single instance (e.g. one box) deployments of HBase, Mongo or CouchDB.

2 Issues to consider in choosing an architecture

2.1 Are local machines recycled with abandon?

If running in a cloud that doesn't provide good guarantees that machines (and their local disks) won't be recycled with abandon then one has little choice but to use a wrapped approach that leverages some existing table store provided by the cloud provider.

It's tempting to argue that the recycling issue isn't that big a deal. Imagine, for example, that one's cloud provider has five DCs in a particular region and one has deployed one's Native Paxos Instances across all 5. In that case the probability of losing all five machines is really low so it's o.k. if a few get recycled here and there (with total data loss), we can just resynch from the survivors. But to think this way is to ignore the very real possibility of what's called a “poisoned value.”

A not uncommon bug in replicated systems is that one gets a request with some value that triggers a bug that causes the local machine processing the request to fail hard and possibly get fully recycled by the underlying cloud infrastructure. The problem is that replicated systems like to retry so the message is likely to get repeated to the next instance of the system who then crashes unrecoverably and so on. No amount of replication can protect against this kind of systemic error. So it's usually considered best practice to have some kind of durable storage, just in case. But if all the crashed machines can potentially be fully recycled with their local disk state lost then there is no real durable storage. So yes, this issue probably matters.

2.2 Lowering cost

The Wrapped Paxos Instance can be made potentially cheaper than the Native Paxos Instance. The reason for this is that with the Native Paxos Instance the capacity of the cluster is limited to the memory and disk space on the smallest machine in the cluster. In the case of the Wrapped Paxos Instance one can represent substantially more data than would fit on a single box's RAM or disk. And typically, especially for cloud based solutions, the cost of storing data in a cloud provider's table store is typically much less than running a VM and storing the data on the VM's disk. So money is saved both in having fewer Paxos clusters but also in storing the persistent data more cheaply. This assumes, of course, that one is willing to take the latency penalty of cache misses, but that is a configurable choice.

Note that in theory the cost savings shouldn't be there. For example, most table stores will replicate a value three times inside of a data center. So if one has one's store replicated across three DCs then a Wrapped Paxos approach will end up with three copies in each DC for a total of six copies. But a Native Paxos approach would just have three copies total (one per DC). But in reality one is highly unlikely to be happy with just three copies in a distributed system (for latency reasons, if nothing else, a single reboot on a machine due to a software or OS upgrade means a DC doesn't have any local representatives). So what's substantially more likely is that one will have either five Paxos instances (two DCs have two representatives and one DC having one, mostly for quorum reasons). So in practice the replication costs between the two solutions aren't that different in pure theory and given the practical realities of the lower cost of storing data in a cloud providers table store the Wrapped Paxos approach is likely to be cheaper.

Note however that this all assumes that it is practical to 'under provision' the Wrapped Paxos Instances. In other words it's o.k. that they don't have copies in their caches of all data. It's o.k. that the write capacity of the system is limited to the capacity of the under provisioned cluster, etc. If latency and throughput requirements prevent under provisioning the Wrapped Paxos Instances then there is likely no real cost savings.

2.3 Increased availability

All things being equal (and when are they ever that?) the Wrapped Paxos approach should be less available than the Native Paxos approach. The reason is that the Wrapped Paxos approach introduces a whole level of extra complexity - a full distributed (within a single DC anyway), fully replicated table store. This is an entire major subsystem whose problems will at best just kill all writes and can easily also kill all reads (if the Wrapped Paxos Instances are under provisioned). Now one of the advantages of a multi-DC approach is that if one DC's table store is having a bad day at least the other DC's table stores are hopefully still working (unless one has a poisoned value). But losing a full DC is likely to do some pretty bad things to latency and throughput thus reducing availability.

How critical this factor is really depends upon the maturity and performance of the table store being used. If it's known to be highly available and highly reliable then in practice this consideration may not necessarily apply.

3 A non-issue - lowering latency

In theory the Native Paxos Instance should be faster than the Wrapped Paxos Instance for writes in particular. After all the Native Paxos Instances only need to write to their local disk drives while the Wrapped Paxos Instances must go through a full write to the table store service which is almost certainly on a separate machine. In the worst case a write to the table store will require a full name resolution on the table store service to figure out what machine currently is the table store master for the desired value and then having to send a write to the table store master who then has to replicate it to its two slaves. But given the latencies involved with doing cross data center writes as part of the Paxos algorithm it isn't clear if this extra overhead on writes really makes all that much of a difference.

Reads may or may not be faster depending on the cost tradeoff made in deploying the Wrapped Paxos Instances. If low latency is a priority then each Wrapped Paxos Instance can have enough cache to hold all the values it is responsible for overseeing (even if they have to spill over to disk).

So one suspects that in practice latency is not a compelling reason to choose one architecture over the other.

4 Conclusion

In the end the choice is not eternal, one should be able to switch from one architecture to the other. So one could start with a wrapped approach since the table store infrastructure might already be available and then switch to native if that should prove to have useful advantages. My general guess is that anyone who can't under provision will probably want to run native mostly because it removes a whole layer (the table store service) of things to go wrong.

Wrapped or Native Paxos?