The CAP theorem and modern data centers – for now, choose consistency!
The dominance of the commodity machine model for data centers is
so complete that one forgets that there was ever any other viable
choice. But IBM, for one, is still selling lots of mainframes.
Nevertheless the world I live in is built on top of data centers that
contain a lot of commodity class machines. These machines have a
nasty habit of failing on a fairly regular basis. So when I think
about the CAP theorem I think about it in the context of a data
center filled with a bunch of not completely reliable boxes.
In that case partition tolerance (which, as I explain below, ends
up meaning tolerance of machine failure) is a requirement. So in
designing frameworks for the data centers I work with, the CAP theorem
leaves me exactly two choices: do I want consistency or
availability?
My belief is that the vast majority of developers, at least
for the immediate future, need to choose consistency.
(more...)
Recovering from self inflicted data corruption – a summary
Friday January 1st 2010, 10:03 pm
Filed under:
SOA/Web/Etc.
Of late I have been torturing myself with a question: even if I build on top
of a highly reliable storage service like Windows Azure Table Service, do I still
need to worry about backups, versioning, journals and such? The answer
would seem to be yes, I do, mostly because even if the table store works
perfectly, I’m still going to have bugs I introduced that are going to hork my
data.
In fact what I specifically need to do is:
- Lobby the Windows Azure Table Storage team to add undelete for tables
so if I accidentally blow away one of my tables I have some hope (oh and
ACLs would be nice too)
- Be very careful about how I update my schemas
- Implement a command journal (and be clear about its limitations)
- If time permits implement tombstoning
- If I’m feeling really wacko implement my own versioning system on top of
the table store (or just backups if I’m feeling only slightly wacko)
- Put into place a realistic plan to take advantage of all these features while
keeping in mind the limitations of these techniques.
The links in the previous text are to the other articles in this series that I wrote for my
blog. Those articles are:
Implementing Versioning in Windows Azure Table Store
In a previous article I argued that I needed some kind of
journaling/backup for my Windows Azure Tables in order to handle my
own screw ups. In this article I re-examine the value of versioning for
recovering from self inflicted data corruption, discuss backups as a possible
substitute for versioning, look at what versioning might look like if added
as a native feature of Windows Azure Table Store, and finish up by
proposing a design that would let me implement versioning on top of
Windows Azure Table Store.
This article is part of a series. Click here to see summary and complete list of articles in the series.
(more...)
The limits of recovering from application logic failures
I have been blathering on all week about how to prepare for application
logic failures in services and how to potentially recover from the damage
those errors cause. I have yammered on about command journals (twice),
tombstones, versioning etc. But none of these techniques is magical. They
all have very serious limits that mean in most non-trivial cases the best
one can really do is say to the user “Here is the command I screwed up,
here are the specific mistakes made, here is what the values should have
been, do you want to repair this damage?” Below I explore three specific
examples of those limits that I call: read syndrome, put syndrome and
e-tag effect.
(more...)
Tombstoning on top of Windows Azure Table Store
Thursday December 31st 2009, 4:32 pm
Filed under:
SOA/Web/Etc.
After command journaling probably the next most effective protection
against application logic errors is tombstoning (keeping a copy of the last
version of a deleted row). In this article I propose a design for adding
tombstoning to Windows Azure Table Store using two tables, a main table
and a tombstone table.
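To make the two-table idea concrete, here is a minimal in-memory sketch. It is not the design from the article and it does not use the Windows Azure Table Store API; the class name TombstonedStore, its methods, and the "deleted_at" field are hypothetical, and plain dictionaries stand in for the main table and the tombstone table.

```python
import copy
from datetime import datetime, timezone


class TombstonedStore:
    """Hypothetical stand-in: two dicts play the main and tombstone tables."""

    def __init__(self):
        self.main = {}        # (partition_key, row_key) -> row
        self.tombstones = {}  # (partition_key, row_key) -> last deleted version

    def upsert(self, partition_key, row_key, row):
        self.main[(partition_key, row_key)] = copy.deepcopy(row)

    def delete(self, partition_key, row_key):
        key = (partition_key, row_key)
        row = self.main[key]  # raises KeyError if the row does not exist
        # Write the tombstone before removing the row, so a failure between
        # the two steps leaves a spare tombstone rather than a lost row.
        self.tombstones[key] = {
            "row": copy.deepcopy(row),
            "deleted_at": datetime.now(timezone.utc).isoformat(),
        }
        del self.main[key]

    def undelete(self, partition_key, row_key):
        key = (partition_key, row_key)
        if key in self.main:
            raise ValueError("row exists in the main table; not overwriting it")
        self.main[key] = self.tombstones.pop(key)["row"]
        return self.main[key]
```

Only the last version of a deleted row is kept here, so deleting the same key twice overwrites the earlier tombstone. Against a real table store the two writes in delete would not be atomic, which is why the sketch orders them the way it does.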
(more...)
Thoughts on implementing a command journal
Wednesday December 30th 2009, 2:39 pm
Filed under:
SOA/Web/Etc.
I had previously concluded that command journaling (creating a
journal of all the external user commands and internal maintenance
commands I issue) is really useful for recovering from self inflicted data
corruption. In this article I look into the various techniques I can use
to implement a command journal so as to trade off between system
performance and the journal’s utility in recovery.
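As a rough illustration of what a command journal records, here is a minimal in-memory sketch. It is an assumption-laden example rather than the implementation discussed in the article: the CommandJournal class, its field names, and the "user"/"maintenance" source labels are all hypothetical, and a Python list stands in for durable storage.

```python
import time
import uuid


class CommandJournal:
    """Hypothetical append-only journal of user and maintenance commands."""

    def __init__(self):
        self.entries = []  # a real journal would need durable storage

    def record(self, source, name, args):
        # Record the command before executing it, so the journal also
        # captures commands that went on to fail or corrupt data.
        entry = {
            "id": str(uuid.uuid4()),
            "timestamp": time.time(),
            "source": source,    # e.g. "user" or "maintenance"
            "name": name,
            "args": dict(args),  # shallow copy so later mutation is not journaled
        }
        self.entries.append(entry)
        return entry["id"]

    def find(self, predicate):
        # Return the journaled commands matching a predicate, for example
        # every command that touched a user affected by a bug.
        return [entry for entry in self.entries if predicate(entry)]
```

A service would call record and then run the command. Writing every entry synchronously gives the most useful journal but costs a write on every command; batching or asynchronous writes would be cheaper but can lose the most recent entries, which is the kind of performance versus utility trade-off the excerpt refers to.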
(more...)
Techniques to Ease Recovering from Self Inflicted Data Corruption
Monday December 28th 2009, 12:02 pm
Filed under:
SOA/Web/Etc.
In a previous article I argued that even with the protections Windows
Azure Table Store provides for my data I can still screw things up myself
and so need to put in place protections against my own mistakes. Below
I walk through the three scenarios I previously listed and explain how
command journaling, tombstoning and versioning could make recovering
from my errors much easier.
(more...)
The Limits of Command Journals
Wednesday December 23rd 2009, 6:32 pm
Filed under:
SOA/Web/Etc.
In a previous article I argued that I needed some kind of
journaling/backup for my Windows Azure Tables in order to make it
easier for me to recover from my own screw ups. One type of journaling
I suggested was command journaling. In this article I look at the
practical limitations of command journals and conclude that while they
are (somewhat) useful for notifying users who might have been affected
by data corruption they aren’t likely in the general case to be re-playable
so their real value is probably less than it might appear.
(more...)
Do I need to backup/journal my Windows Azure Table Store?
Tuesday December 22nd 2009, 5:25 pm
Filed under:
SOA/Web/Etc.
Windows Azure provides a highly scalable, reliable, fault resistant table
store. So in theory my service can dump data into the table store and
walk away secure in the knowledge that I’ll get back what I put in and
that the data will be there when I need it. So is there any reason I should
care about backing up or journaling my Windows Azure Tables? As I
argue below the answer is - yes. But the reason isn’t to protect me against
Azure’s mistakes, it’s to protect me from myself.
(more...)
What do program managers on the Cosmos team do anyway?
In previous articles (here and here) I have talked about what software program managers do. And in another previous article I talked about Cosmos. In this article I bring the two topics together and talk about what Cosmos program managers actually do. (For those just joining us, Cosmos is Microsoft's internal platform for reliably storing and processing petabytes of information, such as all of Microsoft's log data from its various websites.) The issue of what PMs on the Cosmos team do is near and dear to my heart because I'm the lead program manager for Cosmos and we are hiring!
(more...)