The CAP theorem and modern data centers – for now, choose consistency!
Friday February 05th 2010, 2:33 pm
Filed under: SOA/Web/Etc., Uncategorized

The dominance of the commodity machine model for data centers is so complete that one forgets that there was ever any other viable choice. But IBM, for one, is still selling lots of mainframes. Nevertheless the world I live in is built on top of data centers that contain a lot of commodity class machines. These machines have a nasty habit of failing on a fairly regular basis. So when I think about the CAP theorem I think about it in the context of a data center filled with a bunch of not completely reliable boxes.

In that case partition tolerance (which, as I explain below, ends up meaning tolerance of machine failure) is a requirement. So in designing frameworks for the data centers I work with, the CAP theorem leaves me exactly two choices: do I want consistency or availability?

My belief is that the vast majority of developers, at least for the immediate future, need to choose consistency.

(more...)

Recovering from self inflicted data corruption – a summary
Friday January 01st 2010, 10:03 pm
Filed under: SOA/Web/Etc.

Of late I have been torturing myself with a question: even if I build on top of a highly reliable storage service like Windows Azure Table Service, do I still need to worry about backups, versioning, journals and such? The answer would seem to be yes, I do, mostly because even if the table store works perfectly, I’m still going to have bugs I introduced that are going to hork my data.

In fact what I specifically need to do is:

  1. Lobby the Windows Azure Table Storage team to add undelete for tables so if I accidentally blow away one of my tables I have some hope (oh, and ACLs would be nice too)
  2. Be very careful about how I update my schemas
  3. Implement a command journal (and be clear about its limitations)
  4. If time permits implement tombstoning
  5. If I’m feeling really wacko implement my own versioning system on top of the table store (or just backups if I’m feeling only slightly wacko)
  6. Put into place a realistic plan to take advantage of all these features while keeping in mind the limitations of these techniques.

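The links in the previous text are to the other articles in this series that I wrote for my blog. Those articles are:

  1. Do I need to backup/journal my Windows Azure Table Store?
  2. The Limits of Command Journals
  3. Techniques to Ease Recovering from Self Inflicted Data Corruption
  4. Thoughts on implementing a command journal
  5. Tombstoning on top of Windows Azure Table Store
  6. The limits of recovering from application logic failures
  7. Implementing Versioning in Windows Azure Table Store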

Implementing Versioning in Windows Azure Table Store
Friday January 01st 2010, 9:43 pm
Filed under: SOA/Web/Etc.

In a previous article I argued that I needed some kind of journaling/backup for my Windows Azure Tables in order to handle my own screw ups. In this article I re-examine the value of versioning for recovering from self inflicted data corruption, discuss backups as a possible substitute for versioning, look at what versioning might look like if added as a native feature of Windows Azure Table Store, and finish up by proposing a design that would let me implement versioning on top of Windows Azure Table Store.
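To make the idea of versioning layered on top of a table store concrete, here is a minimal sketch. It is not the design proposed in the full article; it assumes one common approach, where an update never overwrites a row but writes a new row keyed by a version number. The `VersionedTable` class and its methods are hypothetical, and plain dictionaries stand in for the table store.

```python
class VersionedTable:
    """In-memory stand-in for a table where every write creates a new version."""

    def __init__(self):
        self.rows = {}    # (partition_key, row_key, version) -> row contents
        self.latest = {}  # (partition_key, row_key) -> latest version number

    def put(self, partition_key, row_key, row):
        """Store the row as a new version and return that version number."""
        key = (partition_key, row_key)
        version = self.latest.get(key, 0) + 1
        self.rows[(partition_key, row_key, version)] = dict(row)
        self.latest[key] = version
        return version

    def get(self, partition_key, row_key, version=None):
        """Return the requested version, or the latest one if none is given."""
        if version is None:
            version = self.latest[(partition_key, row_key)]
        return self.rows[(partition_key, row_key, version)]

    def revert(self, partition_key, row_key, version):
        """Restore an old version by writing it again as the newest version."""
        old = self.get(partition_key, row_key, version)
        return self.put(partition_key, row_key, old)
```

Reverting by writing a new version, rather than deleting the bad ones, keeps the corrupted versions around for later inspection.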

This article is part of a series. Click here to see summary and complete list of articles in the series.

(more...)

The limits of recovering from application logic failures
Friday January 01st 2010, 8:29 pm
Filed under: SOA/Web/Etc.

I have been blathering on all week about how to prepare for application logic failures in services and how to potentially recover from the damage those errors cause. I have yammered on about command journals (twice), tombstones, versioning, etc. But none of these techniques is magical. They all have very serious limits that mean in most non-trivial cases the best one can really do is say to the user “Here is the command I screwed up, here are the specific mistakes made, here is what the values should have been, do you want to repair this damage?” Below I explore three specific examples of those limits that I call: read syndrome, put syndrome and e-tag effect.

This article is part of a series. Click here to see summary and complete list of articles in the series.

(more...)

Tombstoning on top of Windows Azure Table Store
Thursday December 31st 2009, 4:32 pm
Filed under: SOA/Web/Etc.

After command journaling, probably the next most effective protection against application logic errors is tombstoning (keeping a copy of the last version of a deleted row). In this article I propose a design for adding tombstoning to Windows Azure Table Store using two tables, a main table and a tombstone table.
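A minimal sketch of the two-table idea follows. It uses plain dictionaries as stand-ins for the main table and the tombstone table rather than any real Azure API, and the `TombstonedTable` class, its method names, and the `deleted_at` field are all hypothetical.

```python
from datetime import datetime, timezone


class TombstonedTable:
    """In-memory stand-in for a main table paired with a tombstone table."""

    def __init__(self):
        self.main = {}        # (partition_key, row_key) -> row contents
        self.tombstones = {}  # same key -> last version of the deleted row

    def upsert(self, partition_key, row_key, row):
        self.main[(partition_key, row_key)] = dict(row)

    def delete(self, partition_key, row_key):
        key = (partition_key, row_key)
        row = self.main[key]  # raises KeyError if the row does not exist
        # Write the tombstone before deleting, so a failure between the two
        # steps leaves an extra copy instead of losing the row.
        self.tombstones[key] = {
            "row": dict(row),
            "deleted_at": datetime.now(timezone.utc),
        }
        del self.main[key]

    def undelete(self, partition_key, row_key):
        key = (partition_key, row_key)
        if key in self.main:
            raise ValueError("row exists again; refusing to overwrite it")
        tombstone = self.tombstones.pop(key)
        self.main[key] = tombstone["row"]
        return self.main[key]
```

Because only the last version of a deleted row is kept, deleting the same key twice overwrites the earlier tombstone.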

This article is part of a series. Click here to see summary and complete list of articles in the series.

(more...)

Thoughts on implementing a command journal
Wednesday December 30th 2009, 2:39 pm
Filed under: SOA/Web/Etc.

I had previously concluded that command journaling (creating a journal of all the external user commands and internal maintenance commands I issue) is really useful for recovering from self inflicted data corruption. In this article I look into the various techniques I can use to implement a command journal so as to trade off between system performance and the journal’s utility in recovery.
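As a rough illustration of what a command journal records, here is a minimal sketch. It assumes an append-only list in place of durable storage, and the `CommandJournal` class, the `set_email` command, and the entry fields are hypothetical names chosen for the example.

```python
import json
import time
import uuid


class CommandJournal:
    """Append-only record of the commands issued against the service."""

    def __init__(self):
        self.entries = []  # stand-in for durable, append-only storage

    def record(self, source, command, args):
        """Journal one command; source is 'user' or 'maintenance'."""
        entry = {
            "id": str(uuid.uuid4()),
            "timestamp": time.time(),
            "source": source,
            "command": command,
            # Storing the full arguments makes the entry more useful during
            # recovery but costs more to write than a short summary would.
            "args": json.dumps(args, sort_keys=True),
        }
        self.entries.append(entry)
        return entry["id"]

    def find(self, predicate):
        """Return every journal entry for which predicate(entry) is true."""
        return [e for e in self.entries if predicate(e)]


def set_email(store, journal, user_id, email):
    """Example external command: journal it first, then apply it."""
    journal.record("user", "set_email", {"user_id": user_id, "email": email})
    store[user_id] = email
```

After a bug is found in `set_email`, a call such as `journal.find(lambda e: e["command"] == "set_email")` lists the commands, and therefore the users, that might have been affected.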

This article is part of a series. Click here to see summary and complete list of articles in the series.

(more...)

Techniques to Ease Recovering from Self Inflicted Data Corruption
Monday December 28th 2009, 12:02 pm
Filed under: SOA/Web/Etc.

In a previous article I argued that even with the protections Windows Azure Table Store provides for my data I can still screw things up myself and so need to put in place protections against my own mistakes. Below I walk through the three scenarios I previously listed and explain how command journaling, tombstoning and versioning could make recovering from my errors much easier.

This article is part of a series. Click here to see summary and complete list of articles in the series.

(more...)

The Limits of Command Journals
Wednesday December 23rd 2009, 6:32 pm
Filed under: SOA/Web/Etc.

In a previous article I argued that I needed some kind of journaling/backup for my Windows Azure Tables in order to make it easier for me to recover from my own screw ups. One type of journaling I suggested was command journaling. In this article I look at the practical limitations of command journals and conclude that while they are (somewhat) useful for notifying users who might have been affected by data corruption, they aren’t likely in the general case to be re-playable, so their real value is probably less than it might appear.

This article is part of a series. Click here to see summary and complete list of articles in the series.

(more...)

Do I need to backup/journal my Windows Azure Table Store?
Tuesday December 22nd 2009, 5:25 pm
Filed under: SOA/Web/Etc.

Windows Azure provides a highly scalable, reliable, fault-resistant table store. So in theory my service can dump data into the table store and walk away secure in the knowledge that I’ll get back what I put in and that the data will be there when I need it. So is there any reason I should care about backing up or journaling my Windows Azure Tables? As I argue below, the answer is yes. But the reason isn’t to protect me against Azure’s mistakes, it’s to protect me from myself.

This article is part of a series. Click here to see summary and complete list of articles in the series.

(more...)

What do program managers on the Cosmos team do anyway?
Friday July 18th 2008, 12:00 am
Filed under: SOA/Web/Etc.

In previous articles (here and here) I have talked about what software program managers do. And in another previous article I talked about Cosmos. In this article I bring the two topics together and talk about what Cosmos program managers actually do. (For those just joining us Cosmos is Microsoft's internal platform for reliably storing and processing petabytes of information such as all of Microsoft's log data from its various websites.) The issue of what PMs on the Cosmos team do is near and dear to my heart because I'm the lead program manager for Cosmos and we are hiring!

(more...)