The limits of recovering from application logic failures

I have been blathering on all week about how to prepare for application logic failures in services and how to potentially recover from the damage those errors cause. I have yammered on about command journals (twice), tombstones, versioning, and so on. But none of these techniques is magical. They all have serious limits, which mean that in most non-trivial cases the best one can really do is say to the user: “Here is the command I screwed up, here are the specific mistakes made, here is what the values should have been, do you want to repair this damage?” Below I explore three specific examples of those limits, which I call read syndrome, put syndrome and etag effect.

This article is part of a series. See the series summary for the complete list of articles.

1 Read syndrome
2 Put syndrome
3 Etag effect

1 Read syndrome

Let's say there is a buggy command that incorrectly changes some piece of data. Later the bug is discovered, and now there is a desire to fix the damage. However, between the time the buggy command was issued and the time the bug was discovered, someone may have read the incorrect data and made decisions based on it. Those decisions could then have been used to update other parts of the system, or systems external to the one with the bug.

For example, let's say that due to a bug the title of some object, recorded as a column in a row, was written out as “bar” when it should have been “foo”. After the buggy command someone else comes along, reads the title, sees that it is “bar” and starts writing out “bar” in other locations as a pointer to the object. If I now change the column value from “bar” back to the intended value “foo”, I will break those links and cause damage. In essence the wrong value has, to a certain extent, become the ‘right’ value.

To detect the possibility of read syndrome I need to, at the very least, record the last time someone read a particular value. I also need to track reads made through any command that could leak, directly or indirectly, the incorrect state. But even if I had all of this data I cannot, in the general case, tell the difference between a read that led to action and a read that did not. So once again the best I can do in the general case is lay the facts before the user and let them decide how to compensate.
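As a rough illustration of the bookkeeping described above, here is a minimal Python sketch. The `ReadLog` class, its method names and the (row, column) addressing are all hypothetical, invented for this example. It only answers the narrow question "was this value read at or after the buggy write?"; it cannot say whether anyone acted on what they read.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Tuple

Key = Tuple[str, str]  # (row id, column name); a hypothetical addressing scheme


@dataclass
class ReadLog:
    """Remembers only the most recent read time for each (row, column)."""
    last_read: Dict[Key, datetime] = field(default_factory=dict)

    def record_read(self, row: str, column: str, when: datetime) -> None:
        key = (row, column)
        previous = self.last_read.get(key)
        # Keep the latest timestamp even if reads are reported out of order.
        if previous is None or when > previous:
            self.last_read[key] = when

    def possibly_exposed(self, row: str, column: str,
                         buggy_write_at: datetime) -> bool:
        """True if the value was read at or after the buggy write."""
        last = self.last_read.get((row, column))
        return last is not None and last >= buggy_write_at


log = ReadLog()
buggy_write_at = datetime(2024, 1, 1, 12, 0)
log.record_read("obj1", "title", datetime(2024, 1, 1, 13, 0))
exposed = log.possibly_exposed("obj1", "title", buggy_write_at)
```

A `True` result here is only a reason to raise the question with the user. A `False` result is only as trustworthy as the coverage of the read tracking, since indirect leaks through other commands would have to be logged as well.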

2 Put syndrome

It is common for clients who wish to update the state of a system to first read in the state of the system, locally change the parts they want updated and then upload the entire state back to the system. A classic example of this is using PUT against the table store. PUT replaces everything in the row it is pointed at. So if one uses PUT (rather than MERGE) then one has to read in the entire state of the row and then upload the entire state including desired changes. This behavior complicates recovering from data corruption.
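To make the difference concrete, here is a small sketch that models a table as an in-memory Python dict. The `put`, `merge` and `put_style_client_update` functions are stand-ins written for this example, not the table store's actual API. It shows how a client that only wants to change one column ends up re-sending a buggy value it never looked at.

```python
from typing import Dict

Row = Dict[str, str]
Table = Dict[str, Row]


def put(table: Table, key: str, row: Row) -> None:
    """Replace-style update: the stored row becomes exactly `row`."""
    table[key] = dict(row)


def merge(table: Table, key: str, changes: Row) -> None:
    """Merge-style update: only the supplied columns are overwritten."""
    table.setdefault(key, {}).update(changes)


def put_style_client_update(table: Table, key: str, changes: Row) -> Row:
    """Read the whole row, apply local changes, PUT everything back.

    Returns the full row that was sent, which is what a command
    journal would see for this request.
    """
    row = dict(table[key])
    row.update(changes)
    put(table, key, row)
    return row


# A buggy command wrote title "bar" where "foo" was intended.
table: Table = {"obj1": {"title": "bar", "owner": "ann"}}

# The client only wants to change the owner, yet the request carries "bar" too.
sent = put_style_client_update(table, "obj1", {"owner": "bob"})
```

In the journal the request in `sent` looks like a deliberate write of the title "bar", even though the client only cared about the owner. A merge-style request would have carried the owner column alone.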

If, after a buggy command is issued, a row is updated again with the same buggy value, there is an ambiguity. Did the user intend the buggy value to now be the correct value (perhaps in the sense of the read syndrome defined above), or was the user just using replacement-style PUT logic? Generally one can't tell the difference, and so one has to ask the user to clarify intent.

Versioning can be very useful here, particularly with bugs that produce unpredictable output. With versioning one can see exactly what value the buggy command actually wrote and then see whether the next update used the same value or not. If the value is not the same then one more or less has to assume the update is meaningful. If the value is the same then the ambiguity exists. But at least with versioning one can reduce the number of ambiguous cases.
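The comparison described above can be sketched as follows. This assumes a versioning system that can return the ordered history of a single column's values; the `classify_followup` function and its three labels are hypothetical names chosen for this example.

```python
from typing import Any, List


def classify_followup(versions: List[Any], buggy_index: int) -> str:
    """Classify what happened to a value after a buggy write.

    `versions` is the ordered history of one column's values and
    `buggy_index` is the position written by the buggy command.
    """
    buggy_value = versions[buggy_index]
    later = versions[buggy_index + 1:]
    if not later:
        return "untouched"    # nobody has written since the bug
    if any(value != buggy_value for value in later):
        return "meaningful"   # someone deliberately wrote something else
    return "ambiguous"        # same value re-sent: intent or PUT side effect?


history = ["foo", "bar", "bar"]  # the buggy command wrote index 1
result = classify_followup(history, 1)
```

Only the "untouched" case avoids the put syndrome question entirely, and even then the read syndrome may still apply. The "ambiguous" case is the one that has to go back to the user.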

3 Etag effect

A request is sent to a service to update some values. There is a logic bug and the update is mangled. This is later discovered and the suspect command is identified from the command journal. But the command included an If-Match or similar header that predicated the command's execution on a specific etag or other condition. It's tempting to argue that the error should just be fixed, but the fix is only safe to apply if the system conditions are still the same as those described by the etag. Otherwise the change could just end up causing even more damage.

So in theory, in order to fix the damage, one must first determine whether the system state is the same as the one in the etag. But even assuming that the etag is taken directly from the table store (unlikely for all but the most trivial systems), the put syndrome means it's easy to be fooled into thinking the system state has changed when it has not. At best, what one can do is use a versioning system to see what state the etag represented and then determine whether the system is still in the same state. If so, then perhaps the fix can be applied. Assuming, of course, that the user still wants the value set that way at this point in time. So again, even assuming a versioning system is available, at best what one can do is explain the situation to the user, provide the context information and let the user make a decision.
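A minimal sketch of that check, under some loud assumptions: the versioning system can map an etag to a snapshot of the row, and the command journal records which columns the buggy command wrote. The `still_in_etag_state` function and its parameters are hypothetical. It compares row content rather than etags, because a replacement-style PUT can produce a new etag without changing anything.

```python
from typing import Dict, Iterable, Optional

Row = Dict[str, str]


def still_in_etag_state(snapshots: Dict[str, Row], if_match: str,
                        current: Row,
                        columns_written_by_bug: Iterable[str]) -> Optional[bool]:
    """Compare the current row with the state the If-Match etag described.

    Columns the buggy command wrote are ignored (an assumption of this
    sketch), since those are the ones a fix would overwrite. Returns None
    when there is no snapshot for the etag, so nothing can be decided.
    """
    expected = snapshots.get(if_match)
    if expected is None:
        return None
    ignore = set(columns_written_by_bug)

    def strip(row: Row) -> Row:
        return {k: v for k, v in row.items() if k not in ignore}

    return strip(expected) == strip(current)


snapshots = {"v1": {"title": "foo", "owner": "ann"}}
current = {"title": "bar", "owner": "ann"}  # the buggy command wrote "title"
same = still_in_etag_state(snapshots, "v1", current, ["title"])
```

Even a `True` result only means the fix can be offered with some confidence. Whether the user still wants the value set that way is not something the service can work out from its own data.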

Note that the etag effect applies even if there is no etag. When a user issues a command they do so in a certain context, parts of which may involve information outside the knowledge of the service itself. Undoing an erroneous command without knowledge of that context can potentially cause more damage than it fixes, so once again, in the general case, one must consult the user.
