In a previous article I argued that I needed some kind of journaling/backup for my Windows Azure Tables to make it easier to recover from my own screw ups. One type of journaling I suggested was command journaling. In this article I look at the practical limitations of command journals and conclude that, while they are (somewhat) useful for notifying users who might have been affected by data corruption, they aren’t likely to be re-playable in the general case, so their real value is probably less than it might appear.
This article is part of a series.
2 It costs money to implement a command journal replay facility
3 It costs serious money to implement a command journal replay facility correctly
4 So what good is a command journal if we can’t replay it?
A But what if the command journal contained no failed commands?
A command journal is a log of all the commands a service receives from its customers. Command journals as I think about them typically don’t include any information about the response to the command (although this isn’t a requirement, as we’ll see below). It’s also worth keeping in mind that a command journal records the command (or a representation of the command) actually sent by the user, not the internal operations it triggered. This means that a single journaled command could stand for state changes and side effects in more than one place.
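To make the idea concrete, here is a minimal sketch of what a journal entry might look like. The field names, the `CommandJournal` class and the in-memory list standing in for durable storage are all assumptions made for illustration, not a description of any real system.

```python
import json
import time
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class JournalEntry:
    """One user command as it was received (hypothetical schema)."""
    sequence: int       # position in the journal, defines replay order
    user_id: str        # who sent the command
    command: str        # e.g. "rename_item"
    arguments: dict     # the payload the user actually sent
    received_at: float  # wall-clock time of receipt


class CommandJournal:
    """Append-only, in-memory stand-in for a durable journal."""

    def __init__(self):
        self._lines = []

    def append(self, user_id, command, arguments):
        entry = JournalEntry(len(self._lines), user_id, command,
                             arguments, time.time())
        # Serialize immediately so later changes to `arguments`
        # cannot rewrite what the journal says was received.
        self._lines.append(json.dumps(asdict(entry)))
        return entry

    def entries(self):
        return [JournalEntry(**json.loads(line)) for line in self._lines]
```

Note that nothing in the entry says what the command did or how the service responded. It only records what was asked for.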
If I ever need to replay my command journal to recover from a disaster, I’ll first have to write software that can handle the replay. The command journal isn’t going to contain all the authentication and authorization information (or at least it shouldn’t), so I’ll need to create a separate command pathway that can execute the contents of the command journal but bypass authentication and authorization. This isn’t too big a deal, though, because it is probably just a layer on top of my existing command processing system. I’ll also, however, need to either figure out how to disable billing (since I shouldn’t be charging for my own recovery) or compensate for any billable events that re-running the command journal causes. And then, of course, there are the side effects of issuing commands, like system alerts, e-mails, etc. I’ll need to identify all of those and either disable or compensate for them as well. If I plan on replaying more than a few commands, I’ll also need to think about how to run the replay in parallel in such a way that system state isn’t corrupted and nothing is run out of order.
All of this is completely doable, but it isn’t free, and it’s an ongoing expense, since every change in the system’s command functionality will have to be evaluated and potentially compensated for in the context of replaying the command journal.
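The kind of replay pathway described above might be sketched roughly as follows. Everything here is hypothetical: the `rename_item` command, the `Billing` and `Mailer` services and their no-op stand-ins are names I made up for the example, and the replay is strictly serial, which sidesteps the parallelism and ordering problem rather than solving it.

```python
class Billing:
    """Live billing service (hypothetical): records each charge."""

    def __init__(self):
        self.charges = []

    def charge(self, user_id, amount):
        self.charges.append((user_id, amount))


class Mailer:
    """Live e-mail side effect (hypothetical)."""

    def __init__(self):
        self.sent = []

    def send(self, user_id, text):
        self.sent.append((user_id, text))


class NoOpBilling:
    def charge(self, user_id, amount):
        pass  # replay must not bill anyone


class NoOpMailer:
    def send(self, user_id, text):
        pass  # replay must not e-mail anyone


def rename_item(state, billing, mailer, user_id, item, name):
    """Core command logic, shared by the live and replay pathways."""
    state[item] = name
    billing.charge(user_id, 1)
    mailer.send(user_id, f"Renamed {item} to {name}")


def execute_live(state, journal, billing, mailer, authorized,
                 user_id, item, name):
    if user_id not in authorized:  # authorization only on the live path
        raise PermissionError(user_id)
    journal.append({"seq": len(journal), "user_id": user_id,
                    "item": item, "name": name})
    rename_item(state, billing, mailer, user_id, item, name)


def replay(journal):
    """Rebuild state serially, in journal order, side effects disabled."""
    state = {}
    for entry in sorted(journal, key=lambda e: e["seq"]):
        rename_item(state, NoOpBilling(), NoOpMailer(),
                    entry["user_id"], entry["item"], entry["name"])
    return state
```

Even in this toy version, every new side effect added to `rename_item` needs a matching no-op (or compensation) on the replay path, which is where the ongoing cost comes from.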
Let’s imagine that command X is received and that processing it causes some unknown number of state updates, at which point the machine processing command X fails.
Presumably the command journal was updated with command X before any of the other processing could start (which means the command journal now adds an extra internal round trip to at least all write commands). So we know that command X was submitted. What we don’t know, especially since the machine processing command X crashed, is how much of command X was actually applied. So when we replay command X, what should we do? Let it succeed? Skip it?
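A toy version of this scenario, purely as an illustration: the `transfer` command, the two-key state and the `snapshot` we replay from are all invented for the example, and a Python exception stands in for the machine crash.

```python
class MachineCrash(Exception):
    """Stands in for the processing machine dying mid-command."""


def transfer(state, src, dst, amount, crash_midway=False):
    """Command X (hypothetical): two state updates that belong together."""
    state[src] -= amount
    if crash_midway:
        raise MachineCrash()
    state[dst] += amount


def run_scenario():
    snapshot = {"a": 10, "b": 0}  # last known-good backup
    journal = []
    live = dict(snapshot)
    command = {"src": "a", "dst": "b", "amount": 3}

    journal.append(command)  # journaled before any processing starts
    try:
        transfer(live, crash_midway=True, **command)
    except MachineCrash:
        pass  # the journal says X was submitted, not how far it got

    replay_full = dict(snapshot)
    transfer(replay_full, **journal[0])  # option 1: let X succeed
    replay_skip = dict(snapshot)         # option 2: skip X entirely
    return live, replay_full, replay_skip
```

In this sketch the live system ended up at `{"a": 7, "b": 0}`, letting X succeed on replay gives `{"a": 7, "b": 3}`, and skipping it gives `{"a": 10, "b": 0}`. Neither replay choice reproduces what the live system actually did.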
In theory one could argue it doesn’t matter. After all, if no one knows what state the system is in, aren’t we free to put the system in whatever set of states the processing of command X could potentially have generated? The problem, however, is what happens if someone determined the state of the system, either directly (by doing, say, a GET) or indirectly (by issuing a command that depends on the values potentially affected by command X). In that case the external actor is making decisions based on the state the system was actually in as a consequence of command X. So if we just replay command X without exactly replicating its actual (as opposed to theoretical) output, then the state the system is in and the state people think the system is in will not be the same, which introduces more bugs.
This challenge could also be overcome if we, for example, included in the command journal not just all commands but also all responses to those commands. Then we could write a simulator that ran through all the responses and determined whether any of them could, directly or indirectly, reveal the values that command X created. I’m trying hard not to think too much about the expense of writing and maintaining that simulator, or the cost of storing all that data.
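The core of such a simulator might look something like the sketch below. It assumes something a real journal would not hand us for free: that for every entry we can derive the set of keys whose values leaked into its response or influenced its processing (`reads`) and the set of keys it may have changed (`writes`). Both field names are hypothetical, and the analysis is deliberately conservative.

```python
def find_observers(journal, crashed_seq):
    """Return the sequence numbers of later commands that could have
    seen, directly or indirectly, values written by the crashed command."""
    crashed = next(e for e in journal if e["seq"] == crashed_seq)
    tainted = set(crashed["writes"])  # keys whose true value is unknown
    observers = []
    for entry in sorted(journal, key=lambda e: e["seq"]):
        if entry["seq"] <= crashed_seq:
            continue
        if tainted & set(entry["reads"]):
            observers.append(entry["seq"])
            # Anything this command wrote may now depend on X's output.
            tainted |= set(entry["writes"])
    return observers


journal = [
    {"seq": 1, "reads": [], "writes": ["a"]},      # command X, crashed
    {"seq": 2, "reads": ["a"], "writes": []},      # a plain GET of "a"
    {"seq": 3, "reads": ["a"], "writes": ["b"]},   # derives "b" from "a"
    {"seq": 4, "reads": ["c"], "writes": ["c"]},   # unrelated
    {"seq": 5, "reads": ["b"], "writes": []},      # indirect observer
]
observers = find_observers(journal, crashed_seq=1)  # [2, 3, 5]
```

If the list comes back empty, nobody could have seen what X did and we are free to choose. Otherwise the replay has to reproduce X’s actual output, which is exactly the information the crash destroyed.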
All of this having been said, it is completely possible to build a command journal that replays correctly. In certain restricted scenarios it might not even be that painful and could potentially be extremely useful. But in the general case it looks like a nightmare to me.
The scenario where I invoked the need for a command journal was one where my service incorrectly executed commands it received from users and caused data to be corrupted. I wanted the command journal so I could look through the commands I received and potentially identify those that could have tripped the data corruption bug. This would allow me to alert the users who were most at risk from the bug and give them pointers on what might have been damaged. But in the general case, when there is a bug, all users need to be notified, so the only question is whether everyone gets a one-size-fits-all notification or whether certain users also get an additional notification with more targeted information. How useful this facility is depends, of course, on context.
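This use of the journal needs no replay at all, only a scan. A minimal sketch, assuming journal entries are dictionaries with hypothetical `seq`, `user_id`, `command` and `arguments` fields, and assuming the bug can be characterized by a predicate over a single command (the 64-character limit below is invented):

```python
from collections import defaultdict


def users_at_risk(journal, might_trigger_bug):
    """Map each user to the journaled commands that could have hit the bug."""
    at_risk = defaultdict(list)
    for entry in journal:
        if might_trigger_bug(entry):
            at_risk[entry["user_id"]].append(entry["seq"])
    return dict(at_risk)


def long_rename(entry):
    """Hypothetical bug: renames with names over 64 characters corrupt data."""
    return (entry["command"] == "rename_item"
            and len(entry["arguments"]["name"]) > 64)


journal = [
    {"seq": 0, "user_id": "alice", "command": "rename_item",
     "arguments": {"name": "x" * 100}},
    {"seq": 1, "user_id": "bob", "command": "rename_item",
     "arguments": {"name": "short"}},
    {"seq": 2, "user_id": "alice", "command": "delete_item",
     "arguments": {"item": 7}},
]
targeted = users_at_risk(journal, long_rename)  # {"alice": [0]}
```

Here alice would get the targeted notification pointing at command 0, while bob would get only the general one.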
Using two-phase commit techniques, it is theoretically possible to create a system where, if a command fails, any state changes are rolled back. And no, two-phase commit is not a sin in a distributed system. I’ve been on this hobby horse before, but the bottom line is that one can reasonably implement a two-phase commit in a fully distributed system. But let’s say I did that. Let’s say I have a full 2PC system so that all my failures are well defined. I am still not sure that replaying the journal would actually work.
The main reason is that over any non-trivial period of time I probably deployed multiple versions of my website. So to fully replicate the actual behavior I would not just have to replay the commands. I would have to replay each command with the exact version of the software that was in use when the command was issued. In simple cases, where all aspects of the command were run on a single box, I could just record the version the box was running in the command journal. But in non-trivial cases multiple boxes, potentially running different versions of the software (especially if I use a rolling upgrade system), could have been involved. As part of the command journaling process I would need to take a census of all of their versions and of which parts of the command each one handled.
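As a rough illustration of what that census would have to capture, the sketch below assumes each journal entry carries a hypothetical `census` list with one record per box that touched the command. The field names and version strings are made up.

```python
def census_summary(journal):
    """Return every software version a faithful replay would need to host,
    plus the commands that were handled by more than one version."""
    needed = set()
    mixed = []
    for entry in journal:
        versions = {step["version"] for step in entry["census"]}
        needed |= versions
        if len(versions) > 1:
            mixed.append(entry["seq"])
    return sorted(needed), mixed


journal = [
    {"seq": 0, "census": [
        {"host": "web-1", "version": "1.0", "part": "validate"},
        {"host": "worker-1", "version": "1.0", "part": "write"},
    ]},
    {"seq": 1, "census": [  # issued in the middle of a rolling upgrade
        {"host": "web-1", "version": "1.1", "part": "validate"},
        {"host": "worker-1", "version": "1.0", "part": "write"},
    ]},
    {"seq": 2, "census": [
        {"host": "web-2", "version": "1.1", "part": "validate"},
        {"host": "worker-2", "version": "1.1", "part": "write"},
    ]},
]
needed, mixed = census_summary(journal)  # (["1.0", "1.1"], [1])
```

Command 1 is the painful kind: replaying it faithfully means running its validation on version 1.1 and its write on version 1.0, side by side.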
Again, this is all doable. But my guess is that by the time I’ve implemented the 2PC logic with rollback, built the version census system, and built some kind of framework to host multiple simultaneous versions of my software, I’ve probably already gone out of business from cost overruns.
So while, again, I think there are certain limited scenarios where a re-playable command journal is conceivable, I don’t think I’ll be building one any time soon for any of my services.