Recovering from self-inflicted data corruption – a summary

Of late I have been torturing myself over a question: even if I build on top of a highly reliable storage service like Windows Azure Table Service, do I still need to worry about backups, versioning, journals and such? The answer would seem to be yes, I do. Mostly because even if the table store works perfectly, I’m still going to have bugs of my own that are going to hork my data.

In fact what I specifically need to do is:

  1. Lobby the Windows Azure Table Storage team to add undelete for tables so that if I accidentally blow away one of my tables I have some hope of recovering it (oh, and ACLs would be nice too)
  2. Be very careful about how I update my schemas
  3. Implement a command journal (and be clear about its limitations)
  4. If time permits implement tombstoning
  5. If I’m feeling really wacko, implement my own versioning system on top of the table store (or just backups if I’m feeling only slightly wacko)
  6. Put in place a realistic plan to take advantage of all of these features while keeping their limitations in mind.

The links in the list above point to the other articles in this series, which I wrote for my blog. Those articles are:

Implementing Versioning in Windows Azure Table Store

In a previous article I argued that I needed some kind of journaling/backup for my Windows Azure Tables in order to handle my own screw-ups. In this article I re-examine the value of versioning for recovering from self-inflicted data corruption, discuss backups as a possible substitute for versioning, look at what versioning might look like if it were a native feature of Windows Azure Table Store, and finish up by proposing a design that would let me implement versioning on top of the table store myself.
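The full article has the actual design, but to make the idea concrete, here is a minimal sketch of one plausible way to layer versioning on a table store: never overwrite an entity in place, and instead write each update under a row key that embeds a version number so older versions stay queryable. The in-memory dictionary, class, and method names below are my own illustration standing in for a real table client, not the design from the article.

```python
# Sketch: versioning layered on a key/value table by embedding a version
# number in the row key instead of overwriting entities in place.
# The in-memory dict stands in for a real table store; all names here
# are illustrative assumptions, not the article's actual design.

class VersionedTable:
    def __init__(self):
        self._rows = {}       # (partition_key, versioned_row_key) -> entity dict
        self._latest = {}     # (partition_key, row_key) -> latest version number

    def _versioned_key(self, row_key, version):
        # Zero-pad so lexical ordering of keys matches numeric version order.
        return f"{row_key};v{version:010d}"

    def put(self, partition_key, row_key, entity):
        """Write a new version of the row; prior versions are preserved."""
        version = self._latest.get((partition_key, row_key), 0) + 1
        self._latest[(partition_key, row_key)] = version
        self._rows[(partition_key, self._versioned_key(row_key, version))] = dict(entity)
        return version

    def get(self, partition_key, row_key, version=None):
        """Read the latest version, or a specific historical version."""
        if version is None:
            version = self._latest[(partition_key, row_key)]
        return self._rows[(partition_key, self._versioned_key(row_key, version))]

    def history(self, partition_key, row_key):
        """All versions of a row, oldest first - the recovery path."""
        latest = self._latest.get((partition_key, row_key), 0)
        return [self.get(partition_key, row_key, v) for v in range(1, latest + 1)]


if __name__ == "__main__":
    table = VersionedTable()
    table.put("customers", "alice", {"email": "alice@example.com"})
    table.put("customers", "alice", {"email": "wrong@example.com"})  # buggy update
    # After spotting the bug, the earlier version is still there to restore from.
    print(table.get("customers", "alice", version=1))
```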

This article is part of a series. Click here to see summary and complete list of articles in the series.

Continue reading Implementing Versioning in Windows Azure Table Store

The limits of recovering from application logic failures

I have been blathering on all week about how to prepare for application logic failures in services and how to potentially recover from the damage those errors cause. I have yammered on about command journals (twice), tombstones, versioning, etc. But none of these techniques is magical. They all have very serious limits, which mean that in most non-trivial cases the best one can really do is say to the user “Here is the command I screwed up, here are the specific mistakes made, here is what the values should have been, do you want to repair this damage?” Below I explore three specific examples of those limits, which I call read syndrome, put syndrome and the e-tag effect.

This article is part of a series. Click here to see summary and complete list of articles in the series.

Continue reading The limits of recovering from application logic failures

Tombstoning on top of Windows Azure Table Store

After command journaling, probably the next most effective protection against application logic errors is tombstoning (keeping a copy of the last version of a deleted row). In this article I propose a design for adding tombstoning to Windows Azure Table Store using two tables: a main table and a tombstone table.
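As a minimal sketch of the two-table idea: before a row is removed from the main table, a copy is written to the tombstone table, so a buggy delete can be undone later. Plain dictionaries stand in for the two tables below, and the tombstone-before-delete ordering plus the deleted_at field are my assumptions, not necessarily the design from the full article.

```python
import time

# Sketch of the two-table tombstoning idea: the tombstone copy is written
# before the row is removed from the main table, so a crash between the two
# steps leaves extra data rather than lost data. Dicts stand in for the
# tables; field names and ordering are illustrative assumptions.

main_table = {}        # (partition_key, row_key) -> entity
tombstone_table = {}   # (partition_key, row_key) -> last version of deleted entity


def insert(partition_key, row_key, entity):
    main_table[(partition_key, row_key)] = dict(entity)


def delete(partition_key, row_key):
    entity = main_table.get((partition_key, row_key))
    if entity is None:
        return
    # Step 1: record the tombstone (the last version of the row plus when it died).
    tombstone = dict(entity)
    tombstone["deleted_at"] = time.time()
    tombstone_table[(partition_key, row_key)] = tombstone
    # Step 2: only now remove the row from the main table.
    del main_table[(partition_key, row_key)]


def undelete(partition_key, row_key):
    """Recovery path: restore a row that was deleted by mistake."""
    tombstone = tombstone_table.pop((partition_key, row_key))
    tombstone.pop("deleted_at", None)
    main_table[(partition_key, row_key)] = tombstone


if __name__ == "__main__":
    insert("customers", "alice", {"email": "alice@example.com"})
    delete("customers", "alice")       # oops, an application bug deleted Alice
    undelete("customers", "alice")     # the tombstone lets us put her back
    print(main_table[("customers", "alice")])
```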

This article is part of a series. Click here to see summary and complete list of articles in the series.

Continue reading Tombstoning on top of Windows Azure Table Store

Thoughts on implementing a command journal

I had previously concluded that command journaling (keeping a journal of all the external user commands and internal maintenance commands I issue) is really useful for recovering from self-inflicted data corruption. In this article I look at the various techniques I can use to implement a command journal and how they trade off system performance against the journal’s utility in recovery.
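To make the idea concrete, here is a minimal sketch of a write-ahead command journal: every command is appended to a durable log before it executes, so there is at least a record of what was attempted when a bug corrupts data. The journal file path, the fields recorded, and the journal-before-execute ordering are my own illustrative assumptions, not the specific techniques evaluated in the full article.

```python
import json
import time
import uuid

# Sketch of a write-ahead command journal. Each record captures who asked for
# what and when, so a later recovery effort can at least identify which
# commands ran during the window when data was being corrupted.

JOURNAL_PATH = "command_journal.log"  # hypothetical location


def journal_command(command_name, **args):
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "command": command_name,
        "args": args,
    }
    with open(JOURNAL_PATH, "a", encoding="utf-8") as journal:
        journal.write(json.dumps(record) + "\n")  # one JSON record per line
    return record["id"]


def run_command(command_name, handler, **args):
    """Journal first, then execute, so even a crashing command leaves a trace."""
    command_id = journal_command(command_name, **args)
    handler(**args)
    return command_id


if __name__ == "__main__":
    def rename_customer(old_name, new_name):
        print(f"renaming {old_name} -> {new_name}")

    run_command("rename_customer", rename_customer,
                old_name="alice", new_name="alicia")
```

The obvious cost is an extra write per command; batching or asynchronously flushing journal entries reduces that cost but weakens the journal's usefulness during recovery, which is exactly the trade-off the article is about.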

This article is part of a series. Click here to see summary and complete list of articles in the series.

Continue reading Thoughts on implementing a command journal

Techniques to Ease Recovering from Self-Inflicted Data Corruption

In a previous article I argued that even with the protections Windows Azure Table Store provides for my data, I can still screw things up myself and so need to put in place protections against my own mistakes. Below I walk through the three scenarios I previously listed and explain how command journaling, tombstoning and versioning could make recovering from my errors much easier.

This article is part of a series. Click here to see summary and complete list of articles in the series.

Continue reading Techniques to Ease Recovering from Self-Inflicted Data Corruption

The Limits of Command Journals

In a previous article I argued that I needed some kind of journaling/backup for my Windows Azure Tables in order to make it easier to recover from my own screw-ups. One type of journaling I suggested was command journaling. In this article I look at the practical limitations of command journals and conclude that while they are (somewhat) useful for notifying users who might have been affected by data corruption, they aren’t likely to be replayable in the general case, so their real value is probably less than it might appear.

This article is part of a series. Click here to see summary and complete list of articles in the series.

Continue reading The Limits of Command Journals

Do I need to backup/journal my Windows Azure Table Store?

Windows Azure provides a highly scalable, reliable, fault-resistant table store. So in theory my service can dump data into the table store and walk away, secure in the knowledge that I’ll get back what I put in and that the data will be there when I need it. So is there any reason I should care about backing up or journaling my Windows Azure Tables? As I argue below, the answer is yes. But the reason isn’t to protect me against Azure’s mistakes, it’s to protect me from myself.

This article is part of a series. Click here to see summary and complete list of articles in the series.

Continue reading Do I need to backup/journal my Windows Azure Table Store?

What do program managers on the Cosmos team do anyway?

In previous articles (here and here) I have talked about what software program managers do, and in another article I talked about Cosmos. In this article I bring the two topics together and talk about what Cosmos program managers actually do. (For those just joining us, Cosmos is Microsoft's internal platform for reliably storing and processing petabytes of information, such as all of Microsoft's log data from its various websites.) The question of what PMs on the Cosmos team do is near and dear to my heart because I'm the lead program manager for Cosmos and we are hiring!

Continue reading What do program managers on the Cosmos team do anyway?

What is Microsoft's Cosmos service?

Cosmos is Microsoft's internal data storage/query system for analyzing enormous amounts (as in petabytes) of data. As the lead Program Manager for Cosmos I can't say too much about it, but what I can do is take a tour of the information Microsoft has published about Cosmos. So read on if you are interested in the architecture Microsoft uses to store and query petabytes of data and the technical issues that approach brings up.

Continue reading What is Microsoft's Cosmos service?