Techniques to Ease Recovering from Self Inflicted Data Corruption

In a previous article I argued that even with the protections Windows Azure Table Store provides for my data I can still screw things up myself and so need to put in place protections against my own mistakes. Below I walk through the three scenarios I previously listed and explain how command journaling, tombstoning and versioning could make recovering from my errors much easier.

This article is part of a series. See the series summary for the complete list of articles.

1 Application logic failure
1.1 Do nothing extra
1.2 Add command journaling
1.3 Add tombstoning
1.4 Add versioning
1.5 There’s no such thing as a free lunch
2 Table deletion
3 Schema update failure

1 Application logic failure

An application logic failure means my own application logic was supposed to perform some action, screwed it up and ended up corrupting my Windows Azure Table Store. I define corruption in the most generic sense: the store didn’t end up in the state it should have. I generally model application logic failures as my service receiving request X (which might even have been generated internally) and doing the wrong thing. So let’s say that I find out that I have an application logic failure. What should I have done beforehand to make recovering from the error easier?

1.1 Do nothing extra

The bulk of my testing is focused on this kind of error, so I’m already spending lots of time and money trying to prevent application logic failures. Is it really cost effective for me to do more? Below I’ll discuss strategies like journaling, tombstoning and versioning, but those techniques require time and money to implement and maintain. So it’s not unreasonable to do nothing more than I do now, which is test like crazy, especially if the data my service manages is either derived from someplace else (and thus recoverable through external means) or reasonably low value.

1.2 Add command journaling

I get a bug report that my system is corrupting data in my Windows Azure Tables. I investigate and determine that a POST containing a certain kind of JSON is going to do the wrong thing. I need to notify users who were affected by this bug so they can begin the process of dealing with the consequences.

Today the best I can do in this situation is send out a notice to all of my users and wish them the best of luck. I have no idea who issued the POST with the JSON body that could trigger the bug.

But a fairly simple feature to implement that could help me out is command journaling. A command journal is a log of every command given to my system. Most of these commands will come from my users but some will also be generated internally as part of maintenance operations.

If I had a command journal then I could do a search through the journal looking for commands that would trigger the bug, see which account issued that command and then notify that account. With a bit of extra effort (depending on the nature of the bug) I might even be able to suggest what the proper fix is. But I don’t want to oversell the capabilities of a command journal. As I discussed in a dedicated article on the topic there are significant limits to how a command journal can be used in real world situations.

So I tend to think of command journals on their own as a way to identify potentially problematic commands and so hopefully winnow down the users and the data that needs to be examined in order to recover from the bug.
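To make the idea concrete, here is a minimal in-memory sketch of a command journal and the kind of search described above. Everything in it is an assumption for illustration: the names `JournalEntry`, `CommandJournal` and `has_empty_tags` are hypothetical, the "bug signature" (a POST whose JSON body has an empty `tags` list) is invented, and a real journal would live in durable storage rather than a Python list.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class JournalEntry:
    command_id: int   # position in the journal, assigned by record()
    account: str      # issuing account, or "internal" for maintenance work
    method: str       # e.g. "POST"
    path: str
    body: str         # raw request body, stored exactly as received


class CommandJournal:
    """Append-only log of every command the service receives (in-memory stand-in)."""

    def __init__(self) -> None:
        self._entries: List[JournalEntry] = []

    def record(self, account: str, method: str, path: str, body: str) -> JournalEntry:
        entry = JournalEntry(len(self._entries) + 1, account, method, path, body)
        self._entries.append(entry)
        return entry

    def affected_accounts(self, triggers_bug: Callable[[JournalEntry], bool]) -> Set[str]:
        """Return the accounts that issued at least one command matching the predicate."""
        return {e.account for e in self._entries if triggers_bug(e)}


def has_empty_tags(entry: JournalEntry) -> bool:
    """Hypothetical bug signature: a POST whose JSON body has an empty 'tags' list."""
    if entry.method != "POST":
        return False
    try:
        payload = json.loads(entry.body)
    except ValueError:
        return False  # not JSON, so it cannot match this particular bug
    return isinstance(payload, dict) and payload.get("tags") == []
```

Calling `journal.affected_accounts(has_empty_tags)` gives me the set of accounts to notify. Note that the journal is only searched here, never replayed, which keeps it within the limits discussed above.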

1.3 Add tombstoning

The previously discussed limitations of command journaling make recovering from some fairly simple bugs harder than it needs to be. For example, let’s say I have a bug that deletes a row when it should have updated it. Trying to recover the lost data using a command journal would, in the general case, require replaying some or all of the journal. For the reasons explained in the previously linked article I don’t think that’s realistic.

So once data is deleted from my Windows Azure Table Store, that’s it, it’s gone, which makes recovering from a typical delete bug pretty much impossible. A reasonably simple solution to this problem is tombstoning. This is a technique whereby information isn’t deleted; instead it is marked as deleted. And what is marked as deleted can always be unmarked later if necessary.
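A minimal sketch of tombstoning, using a plain dictionary as a stand-in for a table. The class name `TombstoningTable` and the `deleted` flag are assumptions for illustration. Over a real table store the flag would presumably be an extra property on each row, and every read path would have to filter on it.

```python
from typing import Any, Dict, Optional


class TombstoningTable:
    """Rows are never physically removed; delete() only sets a flag."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Any]] = {}

    def upsert(self, key: str, value: Any) -> None:
        # Writing a row also clears any earlier tombstone on that key.
        self._rows[key] = {"value": value, "deleted": False}

    def delete(self, key: str) -> None:
        # Mark the row as deleted but keep its value around.
        if key in self._rows:
            self._rows[key]["deleted"] = True

    def get(self, key: str) -> Optional[Any]:
        # Normal reads treat tombstoned rows as if they did not exist.
        row = self._rows.get(key)
        if row is None or row["deleted"]:
            return None
        return row["value"]

    def undelete(self, key: str) -> bool:
        # Recovery path: returns True only if a tombstone was actually removed.
        row = self._rows.get(key)
        if row is None or not row["deleted"]:
            return False
        row["deleted"] = False
        return True
```

With this in place, a bug that deletes when it should have updated costs me the update but not the row: `undelete` brings back the last value that was written before the bad delete.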

1.4 Add versioning

Now let’s say I have a really nasty bug where it turns out that in some cases I used an object across threads that wasn’t actually thread safe. Rather than just crashing nicely, the shared object caused data corruption. I might have a shot at identifying which commands could run into the problem and then looking in the data store to see if the value written there is the value I would expect from the command.

Except if I have a disconnect between the value I was expecting and the value that was present I have a conundrum. The disconnect could be because the bug manifested itself or it could simply be that someone later came along and overwrote the potentially wrong value. How can I tell the difference? In theory I could replay the command journal and see if anyone ever issued any command that would alter the suspect value(s) but as previously discussed I don’t think that’s realistic in the general case.

Another technique would be to use some kind of command ID for each command in the command journal and then mark any updates with that ID. But that wouldn’t handle the case where someone just blindly wrote back the same value that they previously read in. This would look like an update (since the command ID would be different) but in fact it isn’t.

Another alternative is table versioning. Imagine if every row in my Windows Azure Table Store was versioned. I could find the version in the table store that contains the value written by the command and see if it matches what the command should have done. If it doesn’t, I can look to see if there are any subsequent updates to that row. If there are none, then I know the bad value is still live and I can either fix it or at least tell the user which data in which location is problematic.
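The check described above can be sketched as follows. This is an in-memory illustration under my own assumptions: `VersionedTable`, `Version` and the status strings are hypothetical names, each write is tagged with the ID of the command that caused it, and history is kept as a simple list per row.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Version:
    number: int       # 1-based position in the row's history
    command_id: str   # the journaled command that produced this version
    value: Any


class VersionedTable:
    """Every write appends a new version instead of overwriting the row."""

    def __init__(self) -> None:
        self._history: Dict[str, List[Version]] = {}

    def write(self, key: str, value: Any, command_id: str) -> Version:
        versions = self._history.setdefault(key, [])
        version = Version(len(versions) + 1, command_id, value)
        versions.append(version)
        return version

    def current(self, key: str) -> Optional[Any]:
        versions = self._history.get(key)
        return versions[-1].value if versions else None

    def audit(self, key: str, command_id: str, expected: Any) -> str:
        """Compare what a command actually wrote with what it should have written."""
        versions = self._history.get(key, [])
        for index, version in enumerate(versions):
            if version.command_id != command_id:
                continue
            if version.value == expected:
                return "ok"
            # The command wrote a bad value; check whether anything came after it.
            if index < len(versions) - 1:
                return "corrupt_overwritten"
            return "corrupt_live"
        return "not_found"
```

A result of `corrupt_live` means the bad value is what readers see today. `corrupt_overwritten` means someone has written to the row since, so the row needs a closer look (the later write may have been based on the bad value). Because each version records what was actually written, a blind write-back of an unchanged value shows up as a later version with the same value rather than as an ambiguity.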

1.5 There’s no such thing as a free lunch

All of the previous features can be implemented, today, over the Windows Azure Table Store. If time permits I hope to write a few articles explaining how to do so in ways that are scalable and don’t have huge performance penalties. But for now if I want these features I have to implement them myself. So I’ll have to decide, service by service, whether it’s worth the effort.

In the long run however I hope to see these features available on top of Windows Azure Table Store. Once these are off the shelf functionality the math on which ones to use changes significantly over having to implement them myself.

2 Table deletion

Another self inflicted wound I discussed in my previous article was accidentally deleting one of my Windows Azure Tables. This is incredibly easy to do in Windows Azure Tables, just one HTTP DELETE will do it. As previously explained I don’t believe that Command Journals can be relied upon to recover from total data loss because I don’t feel comfortable that I can replay the journal (and even if I could I doubt I could afford the time and resources necessary to do so). So the only strategy that might help me here is versioning.

But really I don’t think that versioning is the right strategy here either. I think the right strategy is talking to the Windows Azure Table Store team and getting them to do two things:

Implement undelete - We need an undelete command along with some kind of guarantee about how long a deleted table will remain recoverable before being garbage collected.
Add ACLs - Right now every component I have that has any reason to interact with my Windows Azure Table Store can do everything up to and including deleting the table. I would love to have an ACL system so I can lock down components to just the features they need to do their job, which reduces the scope of their screw-ups.

If these are features you would like to see in Windows Azure Table Store then you need to let Microsoft know. I believe one way to do that is to go vote on www.mygreatwindowsazureidea.com. Jamie Thomson started a vote to ask for journaling for Azure Table Store. Personally I’d rather see that modified to ask for a versioning interface. In theory it’s really easy to replay a table journal (which unlike a command journal just contains simple CRUD commands limited to a single table) but in practice there are versioning and other issues that can get in the way (e.g. if Windows Azure Table Store changes or enhances its logic in any way over time). If we had a versioning store instead then we wouldn’t care. The difference is between recording ‘before’ (a table journal) and ‘after’ (a versioning store), and ‘after’ is easier to deal with. But whatever, this can all get figured out if the basic idea of having some kind of versioning/journal story gets adopted by Azure. So if you believe, vote!

3 Schema update failure

The last self inflicted wound from my previous article was screwing up a schema update. This is when I change the structure and meaning of my tables in a non-backwards compatible way. This is an area rich in potential for data corruption.

I generally won’t do a schema update in place. In other words, if I need to make a non-backwards compatible change to the schema of one or more tables, I’ll create a completely new set of tables that use the new schema. Then typically I’m going to tell my users “I’ll support the old service on the old tables for X months then retire it; if you want to be on the new system you need to move your data to the new system.” I will, of course, provide tools to help with the transfer but this is one of those things that I think has to be left to the end user. But even if I’m forced to handle moving the data myself I will still use a model where a user is required to say they want to move, because once they do move their old data won’t be available in the V1 system any longer. They and all their users will have to move.

So at that point moving the user is really just an application logic scenario where the initial command is ‘move data from table A to table B’. Now I can model error recovery using the same techniques I previously discussed for application logic failure.
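As a rough illustration of treating the move as one more command, the sketch below copies a single account’s rows from an old table to a new one and stamps each new row with the ID of the move command, so the move can later be audited like any other command. The function `migrate_account`, the `migrated_by` property, the dictionary-based tables and the `split_name` schema change are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Table = Dict[Tuple[str, str], Row]   # keyed by (account, row key)


def migrate_account(
    old_table: Table,
    new_table: Table,
    account: str,
    upgrade: Callable[[Row], Row],
    command_id: str,
) -> List[str]:
    """Copy every row owned by `account` into the new-schema table.

    The old rows are left untouched so the move itself can be checked
    (or redone) before the V1 data is retired.
    """
    moved: List[str] = []
    for (owner, row_key), row in old_table.items():
        if owner != account:
            continue
        new_row = dict(upgrade(row))          # copy so the old row is never mutated
        new_row["migrated_by"] = command_id   # ties the new row to the move command
        new_table[(owner, row_key)] = new_row
        moved.append(row_key)
    return moved


def split_name(row: Row) -> Row:
    """Hypothetical breaking change: one 'name' field becomes 'first' and 'last'."""
    first, _, last = str(row["name"]).partition(" ")
    return {"first": first, "last": last}
```

Only the account that asked to move is touched, and the returned list of row keys is something I can record in the command journal alongside the move command.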

If, on the other hand, I have to support accounts on both V1 and V2 simultaneously then I doubt a breaking schema change is feasible. See my article on versioning Web Services for more information on my thinking in this area.

3 thoughts on “Techniques to Ease Recovering from Self Inflicted Data Corruption”

Jamie Thomson says:

December 28, 2009 at 4:10 pm

Hi Yaron,
Thanks for your continued discussion of this topic, I’ll continue pointing people back here as long as you keep talking about it!

Maybe you could add some comments to my submission to http://www.mygreatwindowsazureidea.com/pages/34192-windows-azure-feature-voting explaining your thoughts or perhaps simply linking back to your blog posts here.

Great stuff. Thanks again.

-Jamie

Administrator says:
  
  December 30, 2009 at 10:18 am
  
  I’m not completely sure it’s appropriate for a Microsoft employee to vote there. I have sent mail to the person who owns that site internally and asked them if it’s o.k. But due to the holidays I expect some delay in a response.
  
  BTW, for some reason the link you used above doesn’t seem to actually link to your vote even though it looks like it should. The much longer link I used in the article (http://www.mygreatwindowsazureidea.com/pages/34192-windows-azure-feature-voting/suggestions/426220-build-a-journaling-system-for-azure-table-storage?ref=title) does seem to work though.
  
  I really appreciate you commenting and pointing people to the articles!
  
Jamie Thomson says:

December 30, 2009 at 10:59 am

My pleasure. I’ve always enjoyed reading your stuff ever since I first came across your W3S post a couple of years ago so I’m pleased to see you’re now talking about Azure which is something I’m very interested in.