How Should An Exactly Once SOA Reliable Messaging System Be Designed?

So my guess is that one can design a really nice exactly once reliable messaging protocol using exactly two headers (a message ID and a time stamp) along with a few standard error responses. For bonus points I can throw in a header giving an idea of how long the system remembers message IDs. Below I explain how I reached the conclusion that this is all that's needed.

Message IDs

I discussed at length in a previous article the problems related to idempotency. Based on that discussion I'm going to say that an exactly once reliable messaging system for SOA needs to include a dedicated message ID. I realize this isn't controversial to most people but I suspect the RESTifarians might be displeased. But until someone can help me see how to achieve the exactly once goals reliably and simply without a message ID, I'm going to stick with the message ID.

Just to make things more fun I'm going to assume that the message ID is generated by the client. The ID should probably be a UUID. They are dirt cheap to generate and do a nice job of preventing collisions. Of course well designed receivers will always index the message ID with an identifier for the sender. That way if the sender screws up and starts re-using ID values only that sender (and no one else) gets hurt.

Synchronous Reliable Messaging

The point of synchronous reliable messaging is to handle the case where you have sent a request but for any number of reasons did not manage to get back your synchronous reply. In that case you want to be able to try the request again. From an architecture perspective this is a caching problem. In other words the first time the receiver successfully gets your request it will generate a response and then it needs to cache that response so if you ever repeat the identical request (e.g. same message ID from same authenticated source), it can return the cached output.

Asynchronous Reliable Messaging

The term asynchronous reliable messaging in a certain sense is misleading because it can cover two very different scenarios. My concern is that only one of these two possible scenarios actually matters in the real world.

Scenario 1 – One Way Reliable Message Exchange Pattern (MEP)

Service A                            Service B
-------------Request--------------------->
<------------Acknowledgment---------------

In this scenario Service A sends a one way message to Service B who then acknowledges it. From a protocol perspective the transaction is completed. If Service A repeats its request (because it never received the Acknowledgment) the only responsibility Service B has is to repeat the Acknowledgment.

Scenario 2 – Asynchronous Request/Response MEP

Service A                            Service B
-------------Request--------------------->
<------------Acknowledgment---------------
<------------Response---------------------
-------------Acknowledgment-------------->

In this scenario Service A sends a request and receives an acknowledgment. Later on Service B sends the response and receives an acknowledgment. But, and this is the important part for the purposes of this article, the fact that there is supposed to be a response is known at the protocol level. The impact of this 'knowledge' is that one expects more sophisticated behavior than in the first scenario. For example, let's say that Service A successfully delivers its request to Service B who has processed it and generated a response. But Service B died just as it was trying to send the response and due to a bug in its software (and yes, protocols should expect bugs) when it came back up it didn't try to deliver the response again. So one can imagine Service A repeating its request and expecting this to act as a 'prod' to Service B not just to return an immediate Acknowledgment but also to immediately send down the Response as well.

I'm not 100% convinced that scenario 2 actually exists. For example, although WSDL 1.1 explicitly provides for asynchronous request/responses real world implementations never supported them. What you had to do was specify two one-ways and then handle their relationship at the application level. This actually makes a lot of sense because once you get into asynchronous messaging you find out that you often end up with nasty MEPs ranging from 1:N to various other strange interactions.

So my general instinct is to not worry about asynchronous request/response MEPs from a reliability perspective and to exclusively focus on one-way reliable messaging. The only difference then between synchronous reliable messaging and one way reliable messaging is that a one way, rather by definition, doesn't have a response. So while in synchronous reliable messaging we can treat the response as being the acknowledgment in the case of one-ways we will need an explicit acknowledgment message.

What was I Going to Say? Oh Yeah, Forgetting IDs

One possible problem is what happens if a receiver forgets a message ID. In that case the same message could be processed twice. It is very tempting to just say 'remember all message IDs, forever' but in practice this can be very expensive. Let's take the cheapest possible case, reliable one-way (where you don't have to remember the response). Let's say that the message IDs are around 32 bytes long (yeah, yeah, I know, UUIDs are only 16 bytes but you can't guarantee that someone will always use a UUID). Let's also say that the user ID is in the 32 byte range as well. That means that one needs a database record with 64 bytes. I'll ignore indexing costs for the moment and assume the database uses some nasty byte packing scheme to save space. Let's say the service is pretty successful and gets 300 messages/second every day, all day. In a period of one year the system would have to remember 64 bytes/m * 300 m/s * 31,536,000 s/year / 1.099512e+12 bytes/terabyte = 0.55 terabytes per year. That's a non-trivial amount of data to waste on storing IDs, the vast majority of whom will never be used or needed again. The situation is potentially much worse in the case of synchronous request/responses where the actual response has to be stored in the database as well. In any case, I suspect we are going to need a way to safely 'forget' IDs.

A simple way to deal with the ID expiry problem is for the sender to know how long the receiver will remember an ID for. The sender then knows not to repeat any messages where the time that has elapsed since the first copy was sent exceeds the time the receiver can remember IDs for. The problem is that networks can and do hold on to messages for very long periods of time. Throw in an intermediary and all bets are off for how long a message can be deep sixed before showing up again. The traditional solution to this problem is to put an expiration date on the message so as to guarantee that the receiver won't process the request past its 'memory' window and so can't accidentally process the same message twice. But even that solution is problematic because receivers can lie. Due to system crashes, misconfiguration or other ills it is possible that the memory window the sender thinks the receiver is using and the memory window the receiver is actually using aren't the same. In that case the expiration date could be too far into the future (e.g. the receiver is using a 2 hour window but the sender thinks the receiver is using a 3 hour window) and thus result in double processing a message.

A simpler solution would be to put a 'first instance' time stamp on each copy of the message that indicates the time at which the sender sent the first copy of the message. The receiver would then look at the 'first instance' time stamp on any message it receives and if the time indicated is beyond it's memory window (e.g. the receiver is using a 2 hour memory window but the 'first instance' time stamp indicates a time 3 hours in the past) then the message would be rejected. As a matter of manners the receiver, in its error, should indicate what memory window its using but senders need to understand that the returned value is a weak promise not a firm commitment.

The 'first instance' time stamp is also useful to deal with situations that, strictly speaking, shouldn't happen. For example, a decently implemented reliable messaging system shouldn't loose IDs just because it crashed. But unfortunately this can and does happen. So a nice addition to a 'first instance' time stamp implementation is a check of 'how far back does my memory go'? E.g. even if the system does support a 2 hour window if the system crashed an hour ago and lost all state then it would have an extra check that says 'oh and reject anything that was sent before an hour ago'.

Admittedly this solution does require that both the sender and the receiver have reasonably synchronized clocks but NTP is now 20 years old and I think that's enough time to expect people to know how to use it. Yes, I know, some devices don't even have clocks. Um, tough, put one in. It's just not that hard and anyway I'm mostly worried about Enterprise SOA scenarios where the whole network is just lousy with clocks.

Memory Window Sizes and Repeat Intervals

Most reliable messaging protocols provide some provision for discovering memory window sizes and for desired repeat intervals (e.g. if you haven't received the acknowledgment then retry your request every 20 or 30 seconds). To a very large extent I think this kind of configuration is more trouble than its worth. Memory window sizes strike to the very nature of a transaction and a system that was expecting a 10 hour memory window and gets 5 minutes is probably just going to break on the spot. So my guess is that in most real world situations the designers on both ends will already have to know roughly what kind of memory windows they are going to get because it affects so much of their code. It doesn't really matter all that much if their guesses aren't quite identical, the 'first instance' time stamp will catch any mis-match, it's more important that the two values be in the same ball park. Still, it would be friendly if it was possible to ask a system what its memory window is and the protocol should provide for that. It's useful for debugging if nothing else.

Repeat intervals are really more a matter of manners than anything else. Just because a receiver says 'please, don't repeat your messages more frequently than every X seconds using an exponential back off function' doesn't mean anyone will pay attention. Sure, the receiver could try to enforce it's requested behavior by rejecting badly behaving nodes but those kind of cases are extreme anyway. My guess is that very quickly 'good behavior' will be agreed upon and the world will go on its way.

Conclusion

It would seem that one can design a quite nice reliable messaging protocol using nothing more than a sender generated message ID and a 'first instance' time stamp. Throw in a couple of standard error messages and maybe a nice friendly 'here's my usual time out window' and your done. In an upcoming post I'll specify just such a system for HTTP.

2 thoughts on “How Should An Exactly Once SOA Reliable Messaging System Be Designed?”

Olli says:

October 15, 2005 at 3:24 am

Hi,

what happens if

Sender sends request with timestamp
Receiver processes request and sends acknowledgement
Acknowledgement get’s lost because of network break
Receiver repeats request with first-instance timestamp but the network is fixed only after the window, so the request will be rejected because of the first-instance timestamp is too old
Now the sender does not know if the request is processed and it has no way to repeat because if he would use a new ID with a new first-instance timestamp this can (and in this scenario would) lead to a double processed request.

Administrator says:

October 15, 2005 at 8:56 am

Olli, unfortunately the scenario you describe is unavoidable if a system is allowed to forget IDs. But as I describe in the article being able to forget IDs is a requirement (especially for synchronously reliable systems where the actual response has to be cached) because of space considerations.

The key then is in designing a particular service to make sure that the time windows are ‘reasonable’ given the expected usage. But yes, there will always be failure cases. Without infinite memory they are unavoidable.