Optimistic Concurrency – A False Panacea

As soon as an Internet-scale service expects to allow clients to both read and write data, it's a sure bet that optimistic concurrency will come up. After all, how else do you solve the lost update problem without drowning in a sea of performance-crippling locks? Better yet, implementing optimistic concurrency in a service is pretty trivial. You just need some kind of change indication system (dates, e-tags, updategrams, etc.) and a half-decent transactioning system (available off the shelf) and you're pretty much done. But unfortunately all optimistic concurrency does is move the 'lost update' lump under the carpet and make it the client's problem. Moving the lump isn't bad, but it does mean that before declaring victory you absolutely must be sure that your clients have a workable solution for the merge issue.
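To make the "pretty trivial" server side concrete, here is a minimal sketch, assuming an integer version counter as the change indicator (a date or e-tag would play the same role). The names `CustomerRecord`, `ConflictError`, and `update` are invented for illustration and do not come from any particular product.

```python
class ConflictError(Exception):
    """The record changed after the client last read it."""


class CustomerRecord:
    """Hypothetical server-side record guarded by a version counter."""

    def __init__(self, notes: str) -> None:
        self.notes = notes
        self.version = 1  # assumed stand-in for a date, e-tag, or similar

    def read(self) -> tuple[str, int]:
        return self.notes, self.version

    def update(self, new_notes: str, expected_version: int) -> int:
        # A real service would run this check and the write in one transaction.
        if expected_version != self.version:
            raise ConflictError(
                f"expected v{expected_version}, found v{self.version}"
            )
        self.notes = new_notes
        self.version += 1
        return self.version


record = CustomerRecord("Prefers email.")
_, alice_version = record.read()
_, bob_version = record.read()

# Alice syncs first, so her update goes through.
record.update("Prefers email. Renewal due in June.", alice_version)

try:
    record.update("Prefers phone.", bob_version)
    bob_outcome = "saved"
except ConflictError:
    bob_outcome = "conflict"  # the server is done; Bob's client must now merge
```

Note where the sketch stops: the server's job ends at raising the conflict, and nothing here says what Bob's client should do with his rejected change.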

[Ed. Note: I updated the article in response to a number of comments on the web.]

A classic optimistic concurrency scenario is a CRM system. Several sales associates all work with the same customer and they all contribute to maintaining a common customer record in the CRM system. However, sales associates are often offline and make changes while offline that they expect to sync when they come back online. In most cases only one associate makes a change during any given period, so the updates usually happen without conflict. But what happens when there is a conflict? Do you tell the sales associate "Sorry, I've deleted all your changes because there's a conflict"? Or better yet do you show them some kind of generic merge UI that shows current values, their proposed changes and suggestions for how to resolve the two? Anyone who has spent more than 5 seconds with real users understands just how much hatred the first scenario will cause and how much confusion the second one will result in.

Optimistic concurrency fundamentally means that when the optimism isn't justified the client has to clean up the mess, which usually means some kind of merge. And let's face it, humans don't handle merges very well at all. So in practice when there is a conflict things tend to end up on the floor. In fact, most 'consumer oriented' systems I know don't use optimistic concurrency at all for this very reason. They tend instead to use 'last update wins'. This turns out to work pretty well in practice because most users instinctively know they don't want to deal with merging and will quickly break up a work space so that each user has their own section. I see this on wikis all the time, where different people will take ownership of a particular section of the page. People also tend to be pretty good at talking to each other, so that pervasive changes are handled via a mutual agreement that everyone else will stay out until the person making the changes says it's OK. Of course all these mechanisms are really just informal locks. And that makes sense. Humans understand exclusivity and locks make intuitive sense, so if the system (for good performance and other reasons) won't provide locking, the users will figure out their own way to implement locks, even if informally.

Still, there are certainly examples of systems where optimistic concurrency makes a ton of sense. For example, Libor's blog gives an example of a financial trading system where an order is sent in and optimistically recorded, but if the market price changes before execution the order will be rejected. In this case a conflict isn't resolved via a merge but rather via a rejection. Excellent. This solves the merge problem for this scenario.
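As a rough illustration of conflict resolution by rejection, here is a sketch of that trading scenario. It assumes the order carries the price the client saw when submitting it; the `Order` type and `execute` function are hypothetical and are not taken from the system Libor describes.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Order:
    symbol: str
    quantity: int
    quoted_price: Decimal  # the price the client saw when submitting


def execute(order: Order, market_price: Decimal) -> str:
    # The "conflict" here is a price that moved after the order was recorded.
    # There is nothing to merge: the order is simply refused.
    if market_price != order.quoted_price:
        return "rejected"
    return "filled"


order = Order("XYZ", 100, Decimal("10.25"))
unchanged = execute(order, Decimal("10.25"))
moved = execute(order, Decimal("10.30"))
```

The client's recovery path is obvious (look at the new price and resubmit or walk away), which is exactly what the CRM merge case lacks.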

In my own experience I've seen order processing systems where in the vast majority of cases the optimism is justified and orders go through without a problem. But in some cases, often several days after the order was submitted, one or more things will go wrong and the order will have to be rejected. Making the multi-step order process transparent enough to allow a customer to understand what went wrong and potentially try to fix it is usually too expensive, so instead when an error occurs the system will abort the order and raise a red flag. The system admins then have to go dig through the various databases to figure out what the heck happened and how it can be fixed. This isn't pleasant, but because rejections are relatively rare it is more cost effective to just handle those rejections manually than to redesign the entire system.

In both of the previous cases optimistic concurrency worked just fine, but only because there was a well understood way of dealing with cases where the optimism wasn't justified. Which brings us to the point of the article. Optimistic concurrency is good stuff but it isn't magic. Before allowing some server side developer to wave off scalability issues by sprinkling magic 'optimistic concurrency' pixie dust make sure you understand exactly how clients will deal with conflicts.

11 thoughts on “Optimistic Concurrency – A False Panacea”

  1. What would be your concurrency approach for offline data? You can’t lock, so you always have an optimistic assumption (no matter if you have a revision number, timestamp or change journal). It is all optimistic locking, only conflict detection and resolution is different.

  2. Indeed and there’s another article I need to write on ‘off line friendly data’.

    Take comments on this blog: comments are posted using append rather than replace (i.e. this isn’t a comment wiki page). Append designs make it possible for multiple people to add data without conflict.

    I mentioned in the article that when groups need to work on common data they will try to partition the data to make conflicts less likely. Good services will support this directly.

    So, for example, rather than having one big comment field to provide customer information it would be better to have some kind of multiple record structure (possibly in a hierarchy but that’s probably overkill) that can be updated individually. You can think of this as ‘free form’ information about a client showing up as a series of cards where different members of the team can take ownership of their own card and update it at will.

    But eventually you will run into data that simply can’t be easily made conflict resistant, for example, the name of the company may change or their primary address. But in general such data changes very infrequently and if there is a conflict that almost certainly means that something bad really has happened and the people involved should talk to each other (E.g. if 3 sales people think there are 3 different primary addresses for an account that probably means they need to talk).

    There are no miracles, but there are things service designers can do to make their services more off-line friendly. First, though, they have to realize there is a problem. In too many cases I’ve run into service folks who think that once they sprinkle the optimistic concurrency pixie dust on their system they are done and all off-line/conflict issues are resolved. It ain’t necessarily so.

  3. Not sure I follow your point. Yes, it sucks when optimism betrays you. That’s always true. No one ever said OC was a silver bullet AFAIK. It’s all tradeoffs, depends on the app.

    But as for occasionally-connected clients: 10-20 years from now, the whole notion will be very quaint. Why waste energy worrying about them? :)

  4. I agree, OC isn’t a silver bullet but unfortunately I have run into many instances where server side folk think “Oh look, I provided OC, everything is fine” without considering for even a second the burden this places on clients. In many cases OC is the wrong choice because it’s just too damn hard to make work in practice.

    OC will actually still be necessary even in a completely connected world. The issue isn’t so much connectivity as the operational and functional limitations of locking.

  5. It’s funny that you mention wikis, since they use optimistic concurrency control well. Do you really think that people would be so conscious of not trampling other people’s changes if they didn’t even _know_ (by the dreaded “Edit Conflict” screen) that they were doing so? Optimistic concurrency control is much, MUCH better than locking or last writer wins. Making someone wait for no reason is awful, and so is throwing away work without ever telling anyone about it.

    By the way, pcal, there will be many, many more occasionally-connected clients in 20 years than there are today. There are already many places in the world where computers run off generators, occasionally connected over phone or satellite. The lack of stable power won’t be solved soon, but that hasn’t stopped people from being excited about the Internet. And the OLPC project will introduce millions of occasionally-connected laptops. And even in the developed world, there are places without line-of-sight where wires are hard to run.

  6. There are many cases where last update wins is actually the best choice (say, for example, updating your own address book) as well as cases where it’s the worst choice (say, updating a Wiki page).

    And, mentioning wikis, I’d also suggest that conflicts tend to be rare enough in practice that most folks won’t run into the conflict message all that often (hence the ‘optimistic’ in optimistic concurrency) and when they do the other person’s changes are more likely than not in a different part of the article than the one they are working on. But even then I’d point out that the kind of people who can handle merges on wikis are a tiny percentage of the computer using public.

    But as I say in the article the point isn’t that optimistic concurrency is bad. There are many wonderful uses for optimistic concurrency but implementers need to keep in mind the inherent difficulties normal people have in dealing with merging and make sure they are prepared to handle the consequences.

  7. I’m not sure I agree with your address book example. You’re talking about disconnected operation like a PDA, right? What if the clock’s off? It’s not necessarily trivial to determine the last writer in an offline situation.

    But yes, I agree with your general statement that it’s important to choose your merge strategy with care.

  8. The term ‘last write wins’ typically means that all writes are applied in whatever order they are submitted and whoever does the last update wins. But I understand your confusion as the term isn’t clear enough. There is no actual checking of dates.

    What you implied in your comments is a more traditional optimistic concurrency scenario where a writer checks the last modified date to see whether the server state has changed more recently than the client’s last update. In that case the PDA’s clock would matter, but since the PDA has to be on the net to make the update one assumes it can avail itself of NTP to make sure its clock is reasonably accurate. An alternative strategy is etags, which don’t require a clock, but that’s another story.

  9. Ahh, so by last write, you’re talking about the last write to the server, not the last write by an actual user. I understand now. I wouldn’t like using that system, but to each his own, I suppose. (Incidentally, I’m familiar with the term “last writer wins”, but not as applied to offline synchronization, so this distinction has never come up.)

    I disagree with your new description of what I thought you meant the first time. (Rechecking that sentence…yes, I think it’s right.) It could indeed be considered OCC, but your details aren’t quite right.

    In OCC, the server gives the client a token (maybe the server’s timestamp when the last write took place, a UUID, or the user-visible record itself) which the client is supposed to return unmodified along with the new record. Its conflict detection can just do an equality comparison.

    In this algorithm, the clients would be keeping a “last modified” timestamp that they update with their current time whenever the user actually writes a record. It’s not necessarily used by the server’s conflict detection; as you say an etag is fine. But it is used in the merge; the freshest record wins. (And as we’ve both said, it’s comparing different machines’ timestamps, so if their clocks are out of sync, bad stuff happens.) You need the timestamp here; etags give you no way of determining freshness.

  10. It depends on what you mean by freshness.

    In the simplest optimistic concurrency scenario the client says “If the resource is not in the state I expect it to be in, then fail.” In that case all that is needed is an etag that describes the state the client is expecting (e.g. the etag returned on the last GET). If the etag the client submitted and the resource’s current etag don’t match then the request fails.

    Nice and easy and no need for a timestamp.
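The distinction argued over in comments 8 and 9 can be sketched as two different rules for picking a winner. This is only an illustration under stated assumptions: the `Write` type, both function names, and the clock values are invented here, and the hour-slow PDA clock is a made-up example of the skew comment 9 warns about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    author: str
    value: str
    client_time: float  # seconds, as reported by the client's own clock


def last_write_to_server_wins(arrivals: list[Write]) -> Write:
    # No dates are checked: whichever write reaches the server last is kept.
    return arrivals[-1]


def freshest_timestamp_wins(arrivals: list[Write]) -> Write:
    # Compares clocks from different machines, so skew can pick the wrong one.
    return max(arrivals, key=lambda w: w.client_time)


# The PDA edit really happened ten minutes after the laptop edit,
# but the PDA's clock is assumed to run an hour slow.
laptop = Write("laptop", "12 Oak St", client_time=1_000_000.0)
pda = Write("pda", "98 Elm Ave", client_time=1_000_000.0 + 600 - 3600)

arrivals = [laptop, pda]  # order in which the server received the writes

server_order_winner = last_write_to_server_wins(arrivals)
timestamp_winner = freshest_timestamp_wins(arrivals)
```

With the skewed clock the two rules disagree: arrival order keeps the PDA's edit, while the timestamp comparison keeps the laptop's older one. Neither rule detects a conflict; both just choose a survivor.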
