Services built on a peer to peer web inevitably run into the synchronization problem. How do you keep state on multiple peers in synch? Below I walk through the assumptions and requirements that led me to believe a multi-master eventually consistent model is the best base to work off of.
Thanks to Adam Zell for reminding me to explain myself and giving me the link to RAFT which I hadn’t seen before.
2 Motivating scenarios
In a previous article I talked about needing a solution to a multi-master synch problem. But I didn’t provide a motivation for why I cared. The reason is that I believe I need such a solution for the peer to peer web work I’m doing. A few scenarios will help to motivate the core problem:
- A user has a phone, a table and a PC. They add a new user to their address book on the phone. Then move to the tablet where they edit a mail and want to send it to the person they added on the phone but before sending it they put the mail in drafts because they realized they aren’t finished. Finally they go to the PC and pull up the draft, finish it and send it.
- Jane and Josephine are friends. Jane has a private blog that only people in her blog security group are allowed to see. Jane posts new posts to her blog and Josephine can pull up Jane’s new blog posts on her devices.
3 Key Assumptions
When designing real world systems one has to take into account, well, the real world. Here are the key assumptions we make about the world today.
Occasionally Connected Today’s phones, tablets and laptops are not constantly connected. They are often put into sleep mode, powered down or out of range of network connection points.
Multiple Independent Devices Users have multiple independent devices.
My expectation is that in the long run the world will change enough to invalidate both of these assumptions so it’s important we provide future proofing, something I’ll touch on later in the article.
4 Consistency requirements for a single user
4.1 How much pain for what gain?
When I think about consistency I ask myself a fairly straight forward question - how much inconvenience will the user tolerate in order to have high consistency?
For example, if I’m building a cloud app that is spread across multiple data centers and I expect to lose consistency on the order of a few hours a year then I’m perfectly happy to, for those few hours, tell the user they can’t use the system. It’s worth it in order to have high consistency which makes programming, error handling and most important of all, the user interface, as simple as possible.
But with the assumptions above I don’t believe we can limit the times when consistency is lost to a few hours a year.
For example, phones die on a regular basis due to battery issues or lost connectivity. If the user has just two devices (say a phone and a laptop) then if either one isn’t available then the other device must stop accepting updates or high consistency could be lost. Even if a user has multiple devices and we use a consensus protocol (a la Paxos/Raft) it’s highly probable given our assumptions that say both the phone and tablet might be offline due to battery or connectivity reasons and so the PC stops accepting updates. Imagine a user goes on vacation with their phone and their house has a power outage, suddenly their phone effectively bricks because it can’t get to the PC or tablet.
There are solutions to these issues but they all involve the user becoming a computer scientist and understanding terms like manual fail over. I just don’t think that’s workable.
4.2 But... but.... eventual consistency generally sucks!
If one can’t be highly consistent then one has to be eventually consistent (at best). The happy news is that each device can operate completely independently of the other devices in an eventually consistent system. The sad news is that it takes a genius to handle the inevitable inconsistencies that occur in eventually consistent systems.
But in the case of synching a single user’s data we have a get out of jail free card that has proven its utility over the decades - last update wins.
Here’s the thing, if we are synching data created and maintained by the user then the user themselves tends to act as the synch token. So whatever change a user made last tends to be the correct state. So when there is a state conflict one just picks the last update and moves on. Yes, there are some edge cases where this doesn’t work but that’s fairly rare in practice.
A typical failure example is a user has a document they are using as a to do list. They add one entry on their phone and another entry on their tablet. It so happens that the devices can’t talk to each other while these updates were made. So if the devices finally talk and resolve the document’s state the likely result is that the older update (in the phone) will be thrown away in favor of the newer update to the tablet. The correct answer would have been to merge them but the devices didn’t know that. But situations like this turn out to be fairly rare in practice. Users learn, for example, to use a dedicated todo list app. :)
5 Synching data across users
In many, if not most cases, users don’t work on exactly the same file at the same time. Yes, it happens. But it generally seems pretty rare outside of the occasional office app and a wiki. What’s much more common is publishing. One can model email, blogging, mico-blogging and even comments on blogs as independent publishing events. To the extent that anyone else’s actions matter it’s usually just a matter of providing links for context. E.g. a comment might identify another comment it is replying to via a URL but that’s about it. So when users want to synch data from other users it’s typically a one way transaction - “tell me what you want to say to me” and that’s it. There really aren’t any conflicts. At worse you might have dangling references and even then only in multi-user editing/commenting scenarios. That’s why, for example, ATOM/RSS works so well for so many scenarios. Most social networking is really little more than a series of one way conversation.
So I would argue that in most cases a simple eventual consistency approach with ’last update wins’ semantics work here as well. Each ’source’ of data (e.g. say someone’s blog feed) is just read in and if in two readings from the source (say one reading from the remote user’s PC and another from the remote user’s phone) conflict then just pick the entries with the latest date.
The exception to these rules are scenarios that honestly I don’t see terribly often and that’s when two or more users are collaborating on exactly the same document/entry. In that case last update wins is a recipe for confusion. I suspect to support simultaneous editing (where by simultaneous I really just mean multiple users playing in the same file, not necessarily them all working at the exact same moment) one either needs to put in multi-version support so one can see and resolve conflicts (boo!!!!) or one needs high consistency (boo!!!). But honestly if we can nail the basic social network scenarios with the peer to peer web I’d be pretty chuffed.
6 Future Proofing - Logs!
Although I want to start off with eventual consistency using last update wins semantics I strongly suspect that eventually we’ll want high consistency. We’ll want to support multi-user editing of more and more complex data structures and that calls for highly consistent semantics since conflict resolution just sucks. But here’s the funny thing, the difference between multi-master synch and high consistency is less than it appears from an architecture perspective. In both cases there is typically a change log. Which means one has an infrastructure that is designed to put all changes into that log. Furthermore all systems hooked into the log have to be ready to receive changes at any time since a synch can push new items from a remote location into the log. The real difference is how the log is updated. In a multi-master scenario occasional updates using push and/or pull are used. In a high consistency system each update first has to get consensus before it gets into the log. But in either case there is a choke point in updating the log. So we should be able to switch an eventually consistent multi-master system to a highly consistent system with only a medium (as opposed to high) amount of pain.
In fact my guess is that we’ll end up with ’creeping high consistency’. Everything will start eventually consistent with last update wins resolution. Then parts of the system will be ’carved out’ and made highly consistent, either by using cloud services as a master or by entering a situation where user hardware has enough connectivity/battery that we can just assume they are always reasonably available and switch to a distributed consensus algorithm.
So we’ll get there, just not today.
7 Last question - what about the cloud?
In Adam’s comment on my last article that triggered this article he actually put in two links. One was to RAFT and this article was primarily to explain why RAFT (or Paxos) isn’t, in my opinion, the right solution for the peer to peer web at this moment. The other was to Firebase. Firebase is interesting as an example of how the cloud could be used to provide a completely workable high consistency model.
Firebase is a cloud service which works by putting stores on clients and then keeping those stores in synch via Firebase’s cloud. So the client can run off local data but all updates are immediately synch’d with Firebase’s cloud and any other clients who should see it. So the Firebase model, assuming Firebase itself is highly available, should work just fine.
But do we want to require that anyone who wants to work in the peer to peer web has to have an account with someone like Firebase? How do we solve issues like what happens if Firebase goes out of business? What if one wants to move between providers? What about traffic analysis issues (even assuming one is encrypting all the data so Firebase can’t see it)? What happens if you can’t afford to pay your Firebase bill that month?
The point of the peer to peer web is to create a space where users are in absolute control. It’s their hardware running on their software. Nothing centralized about it. That doesn’t preclude solutions like Firebase, it just means we don’t start with them. But I wouldn’t be at all surprised if one of the first places we see ’creeping high consistency’ is the introduction of a Firebase (or similar) adapter into the storage sub-system so that parts of the data can be synch’d via a service like Firebase. One can easily imagine this for enterprise scenarios or for commercial apps that run on the peer to peer web. Heck maybe Firebase will eventually go for a Dropbox/SkyDrive style model (or visa versa) and provide user facing capabilities. One can imagine users that want a clean high consistency system using something like Firebase and if things go horribly wrong they just fall back to the pure peer to peer model, no harm done.