In a previous article I argued that I needed some kind of
journaling/backup for my Windows Azure Tables in order to make it
easier for me to recover from my own screw ups. One type of journaling
I suggested was command journaling. In this article I look at the
practical limitations of command journals and conclude that while they
are (somewhat) useful for notifying users who might have been affected
by data corruption, they aren’t likely, in the general case, to be
re-playable, so their real value is probably less than it might appear.

This article is part of a series. Click here to see summary and complete list of articles in the series.

1 Defining Command Journaling

A command journal is a log of all the commands a service receives from its
customers. Command journals as I think about them typically don’t include any
information about the response to the command (although this isn’t a requirement as
we’ll see below). It’s also worth keeping in mind that command journals record the
command (or a representation of the command) actually sent by the user, not the
individual state changes that result from it. This means that a single journaled
command could have caused state changes and side effects in more than one place.
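
To make the definition concrete, here is a minimal sketch of what a journal entry and an append-only journal might look like. Everything in it (the `JournalEntry` fields, the `CommandJournal` class, the in-memory list standing in for durable storage) is a hypothetical illustration, not a description of any real service.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict


@dataclass(frozen=True)
class JournalEntry:
    """One user command as it was received (hypothetical schema)."""
    user_id: str
    verb: str        # e.g. "PUT" or "DELETE"
    target: str      # the resource the command addressed
    body: dict       # the payload exactly as the user sent it
    entry_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    received_at: float = field(default_factory=time.time)


class CommandJournal:
    """Append-only journal; a list stands in for durable storage."""

    def __init__(self):
        self._lines = []

    def append(self, entry: JournalEntry) -> None:
        # Store a serialized copy so later mutation of `body` can't alter it.
        self._lines.append(json.dumps(asdict(entry), sort_keys=True))

    def entries(self):
        return [JournalEntry(**json.loads(line)) for line in self._lines]
```

Note that the entry records only what the user asked for. Nothing in it says which tables were touched or whether the command succeeded.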

2 It costs money to implement a command journal replay facility

If a situation arises where, in order to recover from a disaster, I need to replay my
command journal, I’ll first have to write software that can handle such a replay. The
command journal isn’t going to contain all the authentication and authorization
information (or at least it shouldn’t) so I’ll need to create a separate command
pathway that can execute the contents of the command journal but bypass
authentication and authorization. This isn’t too big a deal though because this is
probably just a layer on top of my existing command processing system. I’ll also,
however, need to either figure out how to disable billing (since I shouldn’t be charging
for my own recovery) or compensate for any billable events re-running the command
journal causes. And then of course there are side effects of issuing commands
like system alerts, e-mails, etc. I’ll need to identify all of those and either
disable or compensate for them as well. If I plan on replaying more than a few
commands I’ll also need to think about how to run the replay in parallel
in such a way that system state isn’t corrupted and nothing is run out of
order.
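
A rough sketch of the kind of replay pathway described above, assuming a toy `Service` whose handler checks credentials, bills, and sends e-mail on the normal path. The `Context.replaying` flag, the command dictionary shape, and the `seq` ordering field are all assumptions made for illustration. The replay here is strictly sequential, which sidesteps the parallelism problem rather than solving it.

```python
from dataclasses import dataclass


@dataclass
class Context:
    replaying: bool = False   # True only on the recovery pathway


class Service:
    def __init__(self):
        self.state = {}
        self.billed = []
        self.emails = []

    def handle(self, cmd, ctx):
        if not ctx.replaying:
            self._authorize(cmd)          # journal holds no credentials
        self.state[cmd["key"]] = cmd["value"]
        if not ctx.replaying:
            # Side effects that must not fire a second time during recovery.
            self.billed.append(cmd["user"])
            self.emails.append(f"updated {cmd['key']}")

    def _authorize(self, cmd):
        if not cmd.get("token"):
            raise PermissionError("missing credentials")


def replay(service, journal):
    """Re-run journaled commands in their original order, one at a time."""
    ctx = Context(replaying=True)
    for cmd in sorted(journal, key=lambda c: c["seq"]):
        service.handle(cmd, ctx)
```

Every new side effect added to `handle` needs its own `replaying` check, which is where the ongoing maintenance cost comes from.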

All of this is completely doable but it isn’t free and it’s an ongoing expense since
every change in the system’s command functionality will have to be evaluated
and potentially compensated for in the context of replaying the command
journal.

3 It costs serious money to implement a command journal replay facility correctly

Let’s imagine that command X is received, processing of command X caused some
unknown number of state updates at which point the machine that was processing
command X failed.

Presumably the command journal was updated with command X before any of
the other processing could start (which means the command journal now adds an
extra internal round trip to at least all write commands). So we know that
command X was submitted. What we don’t know, especially since the machine
processing command X crashed, is what part of command X got implemented.
So when we replay command X what should we do? Let it succeed? Skip
it?
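
The ambiguity can be shown with a small simulation. This is only a sketch: the `process` function, the `Crash` exception, and the idea that command X performs three separate updates are invented for the example.

```python
class Crash(Exception):
    """Stands in for the machine dying mid-command."""


def process(cmd, journal, store, crash_before=None):
    # Journal first, so we always know the command was submitted.
    journal.append(cmd["id"])
    for i, (key, value) in enumerate(cmd["updates"]):
        if i == crash_before:
            raise Crash(f"died before update {i}")
        store[key] = value


def run(crash_before):
    journal, store = [], {}
    cmd = {"id": "X", "updates": [("a", 1), ("b", 2), ("c", 3)]}
    try:
        process(cmd, journal, store, crash_before)
    except Crash:
        pass
    return journal, store


journal_1, store_1 = run(crash_before=1)   # only "a" was written
journal_2, store_2 = run(crash_before=2)   # "a" and "b" were written
```

Both runs leave the identical journal (a single entry for X) but different stored state, so the journal alone cannot say which outcome a replay should reproduce.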

In theory one could argue it doesn’t matter. After all, if no one knows what state
the system is in aren’t we free to put the system in whatever set of states the
processing of command X could potentially have generated? The problem however is
what happens if someone either directly (by doing say a GET) or indirectly (by
issuing a command which depends on the values that were potentially affected by
command X) determined the state of the system? In that case the external actor is
making decisions based on the state the system is in as a consequence of command
X. So if we just replay command X without exactly replicating its actual
(as opposed to theoretical) output, then the state the system is in and the
state people think the system is in will not be the same, which introduces
more bugs.

This challenge could also be overcome if we, for example, included not just all
commands in the command journal but also all responses to all commands in the
command journal. Then we could write a simulator that ran through all the
responses and determined if any of them directly or indirectly could tell us the values
that command X created. I’m trying hard not to think too much about the expense
of writing and maintaining that simulator as well as the cost of storing all that
data.
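
As a very rough idea of what the core of such a simulator might do, here is a sketch that propagates "taint" from the keys command X may have written. It assumes something much stronger than storing responses: that each journal entry also lists the keys it read and wrote (the `reads` and `writes` fields are hypothetical). A real simulator would have to derive that from the commands and responses themselves.

```python
def influenced_responses(journal, crashed_seq):
    """Return seq numbers of later commands whose outcome depended, directly
    or indirectly, on keys the crashed command may have written."""
    ordered = sorted(journal, key=lambda e: e["seq"])
    crashed = next(e for e in ordered if e["seq"] == crashed_seq)
    tainted = set(crashed["writes"])
    influenced = []
    for entry in ordered:
        if entry["seq"] <= crashed_seq:
            continue
        reads, writes = set(entry["reads"]), set(entry["writes"])
        if reads & tainted:
            # Someone saw an uncertain value; whatever they wrote
            # as a result is now uncertain too.
            influenced.append(entry["seq"])
            tainted |= writes
        else:
            # A blind overwrite replaces the uncertain value.
            tainted -= writes
    return influenced
```

If the returned list is empty, nobody observed command X's effects and any of its possible outcomes is safe to replay. Otherwise each listed response has to be examined to work out what X actually did.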

All of this having been said it is completely possible to build a correct command
journal. In certain restricted scenarios it might not even be that painful and could
potentially be extremely useful. But in the general case it looks like a nightmare to
me.

4 So what good is a command journal if we can’t replay it?

The scenario where I invoked the need for a command journal was when my service
incorrectly executed commands it received from users and caused data to be
corrupted. I wanted the command journal so I could look through the commands I
received and potentially identify commands that could have tripped the data
corruption bug. This would allow me to alert users who were most at risk from the
bug and give them pointers on what might have been damaged. But in the general
case when there is a bug all users need to be notified, so we are just arguing about
whether there is a one-size-fits-all notification or whether certain users get an
additional notification with more targeted information. How useful this facility
is depends, of course, on context.
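
Scanning the journal for at-risk users is the easy part. A minimal sketch, assuming journal entries are dictionaries with `user` and `target` fields and that I can express the bug’s trigger condition as a predicate (both are assumptions made for the example):

```python
from collections import defaultdict


def users_at_risk(journal, could_trigger_bug):
    """Group the targets of suspicious commands by the user who sent them."""
    hits = defaultdict(list)
    for entry in journal:
        if could_trigger_bug(entry):
            hits[entry["user"]].append(entry["target"])
    return dict(hits)


# Hypothetical bug: DELETE commands on paths ending in "/" hit the wrong row.
def trailing_slash_delete(entry):
    return entry["verb"] == "DELETE" and entry["target"].endswith("/")
```

The result maps each affected user to the resources worth double checking, which is the material for the more targeted notification.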

A But what if the command journal contained no failed commands?

Using two phase commit techniques it is theoretically possible to create a
system where if a command fails any state changes would roll back. And
no, two phase commit is not a sin in a distributed system. I’ve been on
this hobby horse before, but the bottom line is that one can reasonably
implement a two phase commit in a fully distributed system. But let’s say
I did that. Let’s say I have a full 2PC system so that all my failures are
well defined. I am still not sure that replaying the journal would actually
work.

The main reason is that over any non-trivial period of time I probably deployed
multiple versions of my website. So to fully replicate the actual behavior I would not
just have to replay the commands. I would have to replay the commands with the
exact version of the software that was used when the command was issued. In simple
cases where all aspects of the command were run on a single box then I could
just record the version the box was running in the command journal. But
in non-trivial cases multiple different boxes potentially running different
versions of the software (especially if I use a rolling upgrade system) could have
been involved. I would need as part of the command journaling process to
take a census of all of their versions and which parts of the command they
handled.
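
A sketch of what that census might record, with every name (`CensusedEntry`, `record`, the step labels and version strings) invented for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class CensusedEntry:
    """Journal entry plus a census of which box and version ran each step."""
    command_id: str
    steps: list = field(default_factory=list)   # (step, host, version)

    def record(self, step, host, version):
        self.steps.append((step, host, version))

    def versions_needed(self):
        return sorted({version for _, _, version in self.steps})


def versions_to_host(entries):
    """Every software version a faithful replay would have to run."""
    needed = set()
    for entry in entries:
        needed.update(entry.versions_needed())
    return sorted(needed)
```

During a rolling upgrade a single command can easily show two versions in its census, and the replay framework would need both running side by side.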

Again, this is all doable. But my guess is that by the time I’ve implemented the
2PC logic with roll back, the version census system and built some kind of framework
to host multiple simultaneous versions of my software I’ve probably already gone out
of business from cost overruns.

So while, again, I think there are certain limited scenarios where a re-playable
command journal is conceivable, I don’t think I’ll be building one any time soon for
any of my services.