The CAP theorem and modern data centers – for now, choose consistency!

The dominance of the commodity machine model for data centers is so complete that one forgets that there was ever any other viable choice. But IBM, for one, is still selling lots of mainframes. Nevertheless the world I live in is built on top of data centers that contain a lot of commodity class machines. These machines have a nasty habit of failing on a fairly regular basis. So when I think about the CAP theorem I think about it in the context of a data center filled with a bunch of not completely reliable boxes.

In that case partition tolerance (which, as I explain below, ends up meaning tolerance of machine failure) is a requirement. So in designing frameworks for the data centers I work with the CAP theorem makes me choose between exactly two choices - do I want consistency or availability?

My belief is that for the vast majority of developers, at least for the immediate future, they need to choose consistency.

Um... what's the CAP theorem?

The CAP theorem is explained and proven in this theorem paper which is reasonably approachable. There are plenty of articles on-line that summarize the CAP theorem so I'm not going to write another one. I do however want to point out that the terms used in the CAP theorem, consistency, availability and partition tolerance don't mean what the plain meaning of the words imply.

Consistency is closest to its normal meaning but as the theorem paper points out it might be easier to think of this as meaning atomic and consistent in the AC part of ACID sense.

Availability, as I explore later, doesn't mean availability of the entire system but rather availability of a particular piece of information to be read or written to.

Partition tolerance is a tiny bit tricky. It's plain meaning, e.g. dealing with what happens when machines can't talk to each other, is part of the definition. But it also encompasses what happens if a machine fails. After all, if machine A is trying to talk to machine B and machine A isn't getting a response it's irrelevant if the response didn't come because machine B failed or because there is a network partition. The message didn't get where it was supposed to, therefore the communication failure, from the perspective of the CAP theorem, is modeled as a network partition.

The CAP theorem says of the previous three system qualities, consistency, availability and partition tolerance, we only get to choose two. (Wait, did I just summarize the CAP theorem? D'oh!) Therefore when designing distributed systems I have three choices, consistent/available, consistent/partition tolerant or available/partition tolerant. I explore all three choice below.

Consistent and Available - Not an option

In an ideal world I would like all my service's data to be consistent and available. But CAP says I only get that if I'm willing to essentially fail if there is a network partition and as previously discussed a network partition also includes machine failure.

And to be fair the consistent/available option is actually pretty common. Anyone who is running a single box that hosts their database is choosing this option. So long as the box is up their data is consistent and available but if it (or its network tap) goes down then that's that until the box gets fixed.

But as I mentioned above I come into this situation with a dependency on data centers filled with commodity machines that tend to fail on a pretty frequent basis. So wishing away machine failures (or even network failures which, although rarer, do happen not infrequently) is a non-starter.

So I have no choice, whatever design I use, it must be partition tolerant (read: keep working in the face of machine failure). So choosing consistency and availability over partition tolerance isn't a choice available to me.

Consistent and Partition Tolerant - Easy to program to

I'm primarily in the development platform business. I build platforms that other people use to build their software. So I spend a lot of time worrying about abstractions that my customers can easily understand and live with. The model most programmers are most familiar with is one in which the world is 'consistent'. By which I mean that when one wants to change system state one can do so and either all the changes happen or they don't. Furthermore when someone comes along to read values they will see the changes that have been made. This is a world that is pretty easy to reason about.

But if I want to offer consistency and if, as I have previously argued, I must have partition tolerance, then CAP says I have to give up availability. Which might seem nuts. Who the heck wants a system that isn't available? But remember, availability is not about the global state of the system, it's about pieces of state that have to be mutually consistent.

Imagine you are building a website for your car rental company. You want to host the website in the cloud to save money and reduce time to market. You come to me looking for storage infrastructure and I say "Hey, look, 99.99% of the time when your customer comes to the website they will be able to access their data, put in an order for a rental car, see what cars they have rented, etc. but 0.01% of the time the customer request will fail and btw, typically that failure will resolve itself within a minute or two."

To most businesses this is a fine trade off. Programming and maintaining programs that expect consistency is an order of magnitude less work than dealing with the lack of consistency (see the next section). So a small number of failures that are quickly resolved is probably an acceptable trade off.

Available and Partition Tolerant - Takes a licking and keeps on ticking

Still, some companies, most famously Amazon's Dynamo, take availability very seriously. They don't ever want a customer coming to their website and told 'sorry, we can't help you right now, try again later.' They have done the math and figured out that even rare failures, due to the enormous number of customers Amazon deals with, were costing them real money. So Amazon was willing to deal with the implementation headaches of reducing consistency in return for getting higher availability.

Imagine we are using the previous system which chooses consistency over availability. Let's say that Andres wants to rent a car. He comes to the car rental site. The front end machine tries to access Andres's rental records which are kept on machine Alpha (in reality this would probably be a group of machines using a quorum protocol with an elected master). But Alpha isn't available (i.e. the master has died and the system is in processing of elevating another member of the quorum to master or enough machines have died or been partitioned so that quorum is lost). So the website has no choice but to say "please try again later" until Alpha (or really the quorum) can be brought back online.

Now let's look at a world with a lower level of consistency. Andres comes to the website and the front end machine tries to get to machine Alpha and fails. But rather than sending Andres away the front end machine looks for another back end machine, let's call it Beta, and asks it to handle Andres's rental records. Beta agrees. So Andres goes through the rental process and rents a car. All this information is recorded on machine Beta.

In the meantime it turns out that Beta failed before Alpha came back up so the information that Beta had about Andres is currently offline. Andres navigates back to the website to check on something about his order. The front end machines goes looking for machines who know about Andres and finds Alpha who is now back up. Much to his surprise Andres is now shown that he has no rental order! After all, Beta never told Alpha about the order and Beta is currently down. We have a data inconsistency.

Andres, frustrated by this, puts in a second order and leaves. Meanwhile Beta comes back up and finds Alpha. Now there is some confusion. Both Beta and Alpha have orders from Andres for a car. Did Andres mean to rent two cars? Is the newer order a replacement for the older order? What should the system do?

All of these problems are solvable. It just takes very careful thought about all the possible failure states and code that can identify and resolve those failure states. The process of taking inconsistent data and making it consistent over time as failed systems come back on-line and share what they know is called 'eventual consistency'.

Eventual consistency is an incredibly powerful mechanism for making services more resilient but it isn't free. Modeling and dealing with the potential problems are non-trivial. Much like the inappropriate optimism around optimistic concurrency I suspect that in practice most implementers would do well to stay away from eventual consistency frameworks. At least until they can be reduced to well understood design patterns (a la Amazon's shopping cart example).

Leave a Reply

Your email address will not be published. Required fields are marked *