Tombstoning on top of Windows Azure Table Store

After command journaling probably the next most effective protection against application logic errors is tombstoning (keeping a copy of the last version of a deleted row). In this article I propose a design for adding tombstoning to Windows Azure Table Store using two tables, a main table and a tombstone table.

This article is part of a series. Click here to see summary and complete list of articles in the series.

1 In place - the traditional tombstone design and why I don’t like it
2 An alternative approach - two tables
3 Don’t forget command journal IDs
4 Isn’t tombstoning just soft deletes?
5 Is tombstoning worth the effort?
A Racing Deletes
A.1 The problem
A.2 Why I can live with this
B Racing row keys
B.1 The problem
B.2 Why I can live with this

1 In place - the traditional tombstone design and why I don’t like it

Traditionally tombstones are implemented by adding a column to a table whose semantic is ”If I’m marked true then this row has been deleted.” But if I implement tombstoning using this approach on top of Windows Azure Table Store then I have to intercept all methods going to the table to make sure they interact properly with tombstoned rows. For example should POST’ing to a row that is tombstoned fail because the row ’exists’? Probably not. But that won’t be the native behavior of the table store. All GETs in particular also have to be intercepted to make sure that they don’t return tombstones unless they were explicitly marked as wanting to do so. This level of intervention convinces me that implementing tombstones the traditional way on top of the Windows Azure Table Store where I have multiple parallel writers and no equivalent of stored procedures is probably not a great idea.

2 An alternative approach - two tables

In the two table approach I have my main table and a separate tombstone table. When a row is deleted it is first copied over to a tombstone table and only then deleted from the main table. This approach only requires that the DELETE method be intercepted. Other methods can be left alone since they will only hit the main table.

Although there are a few ways to implement the DELETE logic in the face of tombstoning probably the most robust is a simple serial implementation where:

A GET is issued to the main table row (use the if-match header if an etag was going to be used on the original delete)
A PUT is then issued to the tombstone table with if-match set to ”*”. The contents of the PUT should be identical to the GET including partition and row keys.
A DELETE is then issued to the row in the main table (using the if-match header if an etag was going to be used on the original delete). If the DELETE fails then issue a GET to the main table to see if the row is still there (failures come in many flavors). If the row is gone then declare success and go home. Otherwise it would be nice (but not strictly necessary) to delete the row created in the PUT in step 2 and retry the three steps.

It’s worth pointing out that the tombstone table is not a versioning table. It’s not there to record every time a row was deleted. In practice it will sometimes record deletion information about rows that have been subsequently recreated but that is just a side effect and not a feature.

While the three step approach is reasonably robust it can still have consistency issues if the PUT succeeds but the DELETE doesn’t, for example the machine crashed after issuing the PUT but before the DELETE.

But the damage from this kind of situation is minimal since the main table is the ’truth’. If the main table says a row exists then it exists, end of story.

If I was feeling brave (or bug prone) I could intercept MERGE/POST/PUT in addition to DELETE via my own library so that when I create a row I look to see if it has a tombstone and if so I delete the tombstone. This could even be done as a lazy process that doesn’t block the main action. But it isn’t clear to me that it’s worth the effort. So long as I always check the main table before the tombstone table when doing clean up operations then having a row both in the main table and the tombstone table won’t cause confusion.

Cleaning up tombstones if they start eating up too much disk space is also a low risk since a screw up in deleting tombstone entries will harm the tombstone table but leave the main table untouched.

3 Don’t forget command journal IDs

In the most simple application recovery scenario I think a particular row was accidentally deleted. I look the row up and it doesn’t exist in the main table. So now I go to the tombstone table where I find an entry for the row and so undelete it.

But the undelete might have been an error. For example, let’s say that after the row was accidentally deleted the user noticed the incorrect deletion and recreated the row themselves and then later deleted the row for their own reasons. If I implemented the logic in the previous paragraph I would be bringing a row back from the dead that the user didn’t want.

To prevent this kind of scenario I need to mark the entries in my command journal with IDs and then use those IDs in the tombstone table. That way I can see if the last delete was caused by the buggy command. If not, I do nothing.

4 Isn’t tombstoning just soft deletes?

A soft delete is a delete in which one marks an entry as deleted rather than actually deleting it. Sound familiar? Dare Obasanjo recently posted a blog entry about soft deletes in which he mostly talked against them. I actually agree with his general argument and believe that keeping data around forever (cost and privacy policy permitting) is a good thing but as Dare’s article points out it’s probably better to handle keeping data around forever by design instead of ’hiding’ things behind soft deletes.

However the soft delete example and tombstoning, while identical in effect, are different in intention. Tombstones are meant exclusively as a recovery mechanism for application logic failures. They are not intended as a way to keep data around forever. In other words if a user accidentally deletes something they should not my system will not be set up to undo the damage using the tombstone table. Rather the tombstone table will only be called upon if my application logic is faulty and I incorrectly deleted something that should not have been deleted.

5 Is tombstoning worth the effort?

As a service implementer I have an almost primal fear of deletion because there is no way back. And I can’t and don’t want to get rid of deletion. Even if I had no economic reason to ever delete things there are extremely good privacy and security reasons to delete things. So I want to stay in the deletion business, I’d just like it to be a bit less dangerous. A tombstone table buys me some breathing room if I screw up. Sorta.

The ’sorta’ is because in most interesting cases just undeleting a value isn’t enough. Interesting rows tend to point at each other which brings up fun problems if I just randomly undelete some row which contains links to other rows. Are those links still correct? How do I validate them? It’s not enough to just implement a tombstone system. One also has to think carefully about how rows can be safely undeleted without creating more chaos. Unfortunately it’s hard to generalize about how difficult the ’undelete’ process will be without getting into the specifics of particular services. But even so, in the worse case, if I have the tombstone table at least I have something to undelete. Without it, I have nothing.

Ideally I would rather the Windows Azure Table Store implement tombstoning (or better yet, full versioning). But they don’t do that currently and building my own tombstoning system isn’t too hard (think it would make a good open source project?) and isn’t too dangerous to the integrity of my system. So if after implementing a command journal I’m still feeling paranoid then the next thing I would look at is tombstoning.

A Racing Deletes

A.1 The problem

The three step process outlined above still suffers from race conditions beyond those previously described. For example, imagine that two people are simultaneously trying to delete the same row. Let’s call them command A and command B who are both trying to delete row alpha.

Command A does a GET on row alpha in the main table
Command B does a GET on row alpha in the main table
Command A does a PUT on row alpha in the tombstone table and puts in command ID A
Command B does a PUT on row alpha in the tombstone table and puts in command ID B
Command A deletes row A in the main table
Command B tries to delete row A in the main table but fails because it’s already deleted
Command B issues a GET for row A in the main table, sees it’s gone and declares victory

At some time later I determine that command B was buggy. It actually shouldn’t have deleted row alpha. But due to the ordering of the above the tombstone is going to show command B as the last command and so I will undelete row alpha. But in fact there was a separate, non-buggy, command, command A that also wanted the row deleted and actually got there first! So the correct action is to not undelete the row. But I have no record of command A in the tombstone and so I don’t recognize what happened.

Note, BTW, that even if both commands used etags the race condition would still happen exactly as described.

A.2 Why I can live with this

Race conditions are the bane of all loosely coupled systems. It goes with the territory. And in most cases I just live with them. Stuff is going to go wrong. Not because the code is necessarily wrong but because the cost of preventing these errors exceeds the benefit therein derived. For example, to prevent the previously described race condition in a robust way I would have to introduce some form of locking or at least serialization using two phase commit. Both techniques have significant downsides. So sometimes I just accept that stuff is going to break. This is one of those examples and a good reason why I probably should never automatically recover data but rather should notify users and present them with the data I think is wrong and what changes I think would fix it but let them make the final call on recovery.

B Racing row keys

B.1 The problem

Let’s imagine I have a row that has a randomly generated row key but can be uniquely identified by several of its columns. If the user causes the row to be deleted then a tombstone entry will be created with the row instance’s random row key. Later on I realize that the delete was a mistake due to an application bug. To track down the delete I check the command journal but the journal isn’t likely to tell me the row key (which is random anyway) since this isn’t typically the kind of things I tell my users so they wouldn’t have the row key in their command and it therefore wouldn’t be in the journal. So now I need to do a search on the unique columns to see if the row (whose row key I don’t know) exists. If I don’t find it in the main table then I need to search the tombstone table to see who deleted it.

When I do the search I get back two results that both contain the column values I’m looking for. The reason being that the row was created and deleted twice. Once by accident due to my bug and once intentionally and correctly. The question I now have to answer is - what order did the deletes occur in?

If the buggy delete happened first then no further action is needed.

If the buggy delete happened second then I may need to recreate the tombstoned row.

But how the heck do I figure out the ordering? I could look at the time stamp on each of the two tombstones (which is actually prohibited by Azure) and see their ordering but that is misleading since the timestamps represent when the tombstones were created not when the original entries were created and thus the entries could have been created/deleted in a different order than appears in the tombstone table.

Another approach is to look at the command journal ID on the tombstone entries and then check the command journal to see what order those IDs appear in. But that is also potentially misleading, especially if the two command journal entries are in different partitions. This can mean that the entries were created on different machines in Azure and thanks to the wonders of clock skew its possible for there to be small differences in clocks that can cause the real ordering to be flipped.

B.2 Why I can live with this

For this scenario to happen I have to have a row that uses a random row key. If a row used a non-random row key then if the same row exists two times it will have the same row key both times and a single tomb stone entry (since tomb stone entries are to have the same partition key and row key as their main table entries). But if I have a way of uniquely identifying a row and seeing if it was repeated then it has columns that can be used together to create a unique key and I should have used that unique key rather than a random key for the row key. In other words if this situation comes up then I have a design flaw in my system.

Generally speaking if I’m using a random row key its because I have cases where two rows can have absolutely identical values but need separate identifies. In that case the only way I would find a row is not via query but rather because some other row is pointing at that row. So disaster recovery would only apply if I had some row that did exist that pointed at the row with the random row key and the pointed at row should exist but doesn’t. In that case I know exactly where to look in the tombstone table and the problem described above doesn’t apply.

Contents