Does SOA Need A Reliable Messaging Protocol?

I believe that there is a real need for an 'exactly once' reliable messaging protocol in SOA, but that the other forms of reliable messaging (e.g. at most once, at least once and ordered delivery) do not make it into the 80% column, so we shouldn't bother with them, at least in the standards world.

Politics to the Rescue

It is the dream of all SOA developers to be able to guarantee that a message arrived at its destination or to know for sure that the message never arrived. Unfortunately this guarantee is impossible to provide in the real world. But, where technology has failed us, politics can save us.

It turns out that when most people talk about a reliable messaging protocol what they really want is a blame transference protocol. That is, they want a protocol that takes the responsibility for making sure something happened off of them and puts it on someone else. The simplest and most widely accepted blame transference protocol uses some form of acknowledgment.

Service A                            Service B
-----------------Request---------------->
<--------Acknowledgment of Request-------

As soon as Service A receives the acknowledgment of the request, the request is officially Service B's problem. The obvious question, however, is: what happens if Service A sends a request and doesn't get an acknowledgment back? Just sending a single request and then declaring the situation someone else's problem is not going to enable a sufficient level of blame transference. A little more effort is required. Typically Service A will be expected to repeat its request a few times to see if it can get through. If, after a number of tries, no acknowledgment is received, then Service A can declare the transmission a failure and blame Service B.
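
To make that retry-and-give-up behavior concrete, here is a minimal sketch of the sender side in Python. The retry count, the timeout and the send_request/wait_for_ack hooks are all illustrative assumptions, not part of any particular protocol.

MAX_RETRIES = 3          # assumed policy: how many times Service A repeats itself
ACK_TIMEOUT_SECONDS = 5  # assumed policy: how long to wait for each acknowledgment

def send_with_blame_transference(request, send_request, wait_for_ack):
    """Send a request, retrying until an acknowledgment arrives or we give up.

    send_request and wait_for_ack are hypothetical transport hooks:
    send_request(request) transmits the message and wait_for_ack(timeout)
    returns True if an acknowledgment arrived within the timeout.
    """
    for _ in range(MAX_RETRIES):
        send_request(request)
        if wait_for_ack(ACK_TIMEOUT_SECONDS):
            return "acknowledged"     # the request is now officially Service B's problem
    return "transmission failed"      # no acknowledgment after MAX_RETRIES tries: blame Service B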

The only problem with repeating a request multiple times is that Service B may process the request multiple times. For example:

Service A                            Service B
-----------------Request---------------->
X<---Acknowledgment of Request-----
-----------------Request---------------->
X<---Acknowledgment of Request-----
-----------------Request---------------->
X<---Acknowledgment of Request-----
etc.

In the previous example Service A keeps repeating its request, but due to some network or other problem none of Service B's acknowledgments are getting through. So Service A just keeps on repeating its request until it finally gives up. The problem is that if the request is "move $100 from account Foo to account Bar" then repeating it over and over again until an acknowledgment is received is likely to get Service A into serious trouble. And no, we cannot just make all methods idempotent and so make the whole reliable messaging issue go away (see Appendix 2).

Different Flavors of Reliability

The problem I set up above is a pretty classic example of an 'exactly once' messaging system which promises to deliver a message one and only one time. But there are several other flavors of reliable messaging. For example, WS-ReliableMessaging includes four different kinds of reliable delivery: exactly once, at least once, at most once and ordered delivery. Does the world really need this many flavors of reliability? I tend to think not.

Exactly Once Messaging

This style of messaging purports to guarantee that a message will arrive one time and only one time. Of course no such promise can be made in the real world; what it really means is that every message will have a unique ID (to detect repeated messages) and that there is some kind of acknowledgment mechanism to let the sender know when their request has arrived. Since 'reliable messaging' is actually impossible to implement in the real world, this style of messaging is really just a blame transference protocol. It is also, in my experience, the most requested form of reliable messaging. For enterprise SOA, being able to place the blame on someone else is a business requirement and often plays a part in contract issues.
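
As a rough sketch of what the receiving end of such a protocol might look like, assuming each message carries a unique ID (the in-memory set, the message fields and the hooks below are illustrative assumptions, not a real protocol definition):

processed_ids = set()  # in a real service this would be durable storage, not memory

def receive(message, process, send_ack):
    """Handle one incoming message in an 'exactly once' style.

    message is assumed to be a dict with 'id' and 'body' fields; process and
    send_ack are hypothetical hooks supplied by the hosting service.
    """
    if message["id"] not in processed_ids:
        process(message["body"])           # side effects happen at most one time
        processed_ids.add(message["id"])
    send_ack(message["id"])                # acknowledge even duplicates so the sender stops retrying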

At Least Once Messaging

This form of reliable messaging "guarantees" that a message arrives at least one time, but it could arrive more times than that. In other words this is reliable messaging using acknowledgments but not message IDs. This means messages can be repeatedly executed, but at least there are acknowledgments to let you know if any of the repeats arrived. At least once messaging is typically used for notification operations like "The stock price is 6". But in real world systems this type of messaging is usually implemented using a 'blast them' strategy. Basically, instead of sending the notification once and waiting for an acknowledgment, the system will just keep repeating the message every few seconds and not bother with acknowledgments at all. It's usually cheaper to just preemptively repeat the messages than to manage an acknowledgment system. Personally I don't think this scenario rises above the 20% in the 80/20 trade-off, so I don't think it's worth standardizing.
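
A minimal sketch of the 'blast them' strategy, assuming a hypothetical fire-and-forget publish hook and a made-up repeat schedule:

import time

def blast_notification(publish, notification, interval_seconds=5, repeats=12):
    """Repeat a notification on a fixed schedule instead of tracking acknowledgments.

    publish is a hypothetical fire-and-forget send. Duplicates are expected and
    harmless because the notification states an absolute value ("The stock
    price is 6") rather than an increment.
    """
    for _ in range(repeats):
        publish(notification)
        time.sleep(interval_seconds)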

At Most Once Messaging

This only guarantees that a message won't be repeated, not that it will ever arrive. In other words this is reliable messaging with message IDs but without acknowledgments. The idea is that if the message is repeated a bunch of times for whatever reason, you may never find out whether the message was received, but you are guaranteed it won't be processed more than once. In theory this type of messaging is useful for 'fire and forget' systems where repeats could be dangerous. For example, imagine that instead of saying "The stock price is 6", the system instead said "The stock price is $1 higher than its current price." In that case preventing duplicates is a big deal. But this is also really bad design; besides, how can it make sense to care about preventing duplicates but not about whether the message ever arrived? I just can't convince myself that this example rises above the 20% realm either.

Ordered Delivery

This type of messaging is typically used with 'exactly once' delivery to make sure that not only will messages arrive but they will arrive in a specific order. The way this works is that each message is given a sequence number so the receiver will only process the messages in the specified order. Ordered delivery only makes sense for systems where multiple requests can be pipelined without first having to know the response to previous requests but where ordering matters. A classic example is incremental editing.

Imagine there is an XML structure stored on some service of the form <foo><bar/></foo>. Now imagine you send a command "Insert <bar/> as the first child of <foo>", which changes the structure to <foo><bar/><bar/></foo>. Now imagine you send the command "replace the first <bar/> child of <foo> with <ick/>". So now the structure is <foo><ick/><bar/></foo>. In theory the two commands, to add <bar/> and to then replace it with <ick/>, could be pipelined. But ordering really matters, since the <ick/> command would produce the wrong result if it were executed before the "add <bar/>" command. So suddenly message ordering seems reasonable.

Except what happens if the "add <bar/>" command fails? In that case you aren't going to want the <ick/> command to be executed. So now a generic reliable messaging system isn't going to be enough. You have to have application-level knowledge so you know that if any member of the sequence fails then all following commands have to be failed out as well. The situation can get arbitrarily complex. For example, imagine that some of the later commands add an attribute to the <foo> element. In that case it doesn't matter that the "add <bar/>" command failed, at least to commands that edit parts of the document outside the scope of the failed command. So in reality the dependency ordering can get really dizzying.

My guess is that pipelining in general is not an 80% case. Furthermore, even for those who can handle pipelining, the results usually are so application specific that a generic infrastructure just isn't all that useful. Most folks are probably better off skipping the possible performance benefits of pipelining and just serializing their requests (e.g. send request, wait for response, send next request, wait for response, etc.). For those few cases where the performance benefits are so enormous as to be irresistible I'd suggest using some kind of reference system so that a message can explicitly say "I depend on the successful completion of message X". This allows for significantly more sophisticated parallelism than sequence numbers can deliver.
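
To show how such a dependency reference differs from a plain sequence number, here is a sketch of the receiver-side scheduling decision. The 'depends_on' field and the three-way result are illustrative assumptions, not an existing protocol:

def ready_to_process(message, completed_ids, failed_ids):
    """Decide whether a pipelined message can run yet.

    message is assumed to carry a 'depends_on' list of message IDs. A message
    runs only once everything it depends on has completed; if any dependency
    failed, the message is failed out too, but messages with no dependency on
    the failure (e.g. an edit to an unrelated attribute) still run.
    """
    deps = message.get("depends_on", [])
    if any(dep in failed_ids for dep in deps):
        return "fail"   # propagate failure only along declared dependencies
    if all(dep in completed_ids for dep in deps):
        return "run"    # independent messages can run in parallel
    return "wait"       # some dependency hasn't completed yet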

For me the bottom line is that Ordered Delivery is too expensive (and too sophisticated) for the average case and offers too little value for the complicated case to be worth implementing.

Conclusion

Every form of reliable messaging has its audience. But just because a protocol feature is useful to someone doesn't mean it should be foisted on the entire community or be made into a minimum requirement. This is especially the case with something like SOA which is already complicated enough. If a feature is not absolutely mission critical to a large percentage of the community then we shouldn't add to the standards noise by standardizing it. Therefore my two cents is that the only form of reliable messaging that falls into the 80% column (at least for SOA in the Enterprise) is exactly-once messaging and so that's the type I'm going to try and put together a brain dead simple protocol for.

Appendix 1 – Why "Reliable Messaging" isn't so Reliable

Imagine that Service A sends a request to Service B. All Service A wants to know is: did the request arrive? To answer this question Service B is programmed, upon receiving the request, to immediately send back an acknowledgment message. But the problem is, how long should Service A wait to receive the acknowledgment before assuming the request never arrived at Service B? Let's say the limit is N seconds. So fine, N seconds pass, no acknowledgment message arrives at Service A, and so Service A assumes the request never arrived. Of course it's completely possible that the request did arrive and that Service B did send the acknowledgment but, due to network conditions, firewall problems, sun spots, etc., the acknowledgment didn't make it to Service A in the N second window. Now the situation is a complete mess. Service B did get the request and probably has processed it, but Service A doesn't know this and thinks the message was never delivered and hence not processed.

Over time various attempts have been made to work around the previous problem. For example, what if Service A sent an acknowledgment of the acknowledgment? And then, just to make things more interesting, what if Service B refused to actually process the message until it received Service A's acknowledgment of B's acknowledgment that it received the message? In other words, what if we used a variant of a two phase commit protocol? A picture might help to clarify:

Service A                            Service B
-----------------Request---------------->
<--------Acknowledgment of Request-------
----Acknowledgment of Acknowledgment---->
(Now Service B processes the Request)

This exchange would seem to guarantee that Service B will only process the request if Service B knows that Service A knows that Service B received the request (say that three times fast). This may seem foolproof, but what happens if Service A's acknowledgment of the acknowledgment never gets delivered? Since Service A has no way of knowing that its acknowledgment of the acknowledgment wasn't delivered, it will assume the message has been delivered and act on that assumption. Service B, on the other hand, having never received the acknowledgment of the acknowledgment, assumes that Service A doesn't know that the request was successfully received and so refuses to process the request. Of course one could have an acknowledgment of the acknowledgment of the acknowledgment…

The way this situation is usually dealt with is via a 'fail safe'. Because Service B will refuse to process the request when it fails to receive the acknowledgment of the acknowledgment, no 'harm' will be done to the system, since the request won't be processed. But 'harm' is in the eye of the beholder. For example, what if the request was for a part that a just-in-time factory absolutely must have or the production line will be shut down? The factory (Service A) thinks its request was delivered and doesn't know that the manufacturer (Service B) who was to deliver the part hasn't processed the request at all. Somebody is going to get fired.

I can't blame most people for being confused about what reliable messaging actually delivers. Unfortunately there is a lot of misinformation about reliable messaging. Users have been repeatedly told that the outcome of a reliable messaging protocol is either "Message was delivered" or "Message couldn't be delivered." But as the above shows there is actually a third state that nobody likes to talk about, mostly I suspect because it would make reliable messaging look a lot less valuable. That third state is "I have no idea what happened", and unfortunately it is not at all rare.

So when you write code using reliable messaging make sure that you can get back all three states as a possible response and that your code is able to handle all three states.
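
In code that means the send operation should report three outcomes, not two. A minimal sketch (the enum and handler below are hypothetical, not part of any standard API):

from enum import Enum

class DeliveryOutcome(Enum):
    DELIVERED = "receiver acknowledged the request"
    NOT_DELIVERED = "receiver is known not to have processed the request"
    UNKNOWN = "no acknowledgment arrived; the request may or may not have run"

def handle(outcome):
    if outcome is DeliveryOutcome.DELIVERED:
        pass   # carry on
    elif outcome is DeliveryOutcome.NOT_DELIVERED:
        pass   # safe to retry or report the failure
    else:
        pass   # the hard branch: investigate before retrying, the request may already have run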

Appendix 2 – Idempotence

Idempotence means, in essence, that one can repeat an operation multiple times and in each case the resource will end up in the same state. A classic example of a "naturally" idempotent operation is delete. You can delete the same resource multiple times and the outcome is always the same: the resource is deleted. If all operations were idempotent then we really wouldn't need a reliable messaging protocol. We could just repeat operations until we get a response.

The problem is that making sure an operation is truly idempotent across all possible scenarios is so difficult in practice that I strongly suspect most folks would rather just use a reliable messaging protocol.

Below is an example of the use of the delete operation across HTTP; in other words, all communication is synchronous:

Service A                    Service C                        Service B
--------------------Delete /foo/bar----------------------------> (Resource is deleted)
X<----------------------Confirm Success-------------------------

                             ------Create /foo/bar------------> (Resource is created)
                             <--------Confirm Success----------

--------------------Delete /foo/bar----------------------------> (Resource is deleted)
<----------------------Confirm Success--------------------------

In the previous example the TCP/IP connection broke before Service A could get Service B's response telling Service A that the resource had been deleted. But, before Service A could repeat its request, Service C showed up and recreated the deleted resource. When Service A repeated its delete request, did it intend to delete the new resource that Service C created? This scenario is a variant of the 'lost update' problem. Depending on the intended semantics of the situation the previous example was either exactly correct (Service A wanted the resource deleted, period) or it was wrong (Service A wanted a particular version of the resource deleted and hadn't intended to delete Service C's work).

First and foremost the interface designer has to really understand their operation's semantics and be very clear on the consequences of what they are doing. If they wanted the resource deleted then the 'lost update' isn't a problem. But if they only wanted to delete a particular version of the resource then they need to do more work. In the case of HTTP they could use the ETag header. An ETag is an identifier for a particular 'state' of a resource. So if Service A used an ETag then its second delete request would have been rejected, because Service C's create command would have changed the state of the resource and thus invalidated the ETag value included in the delete (which was retrieved before the resource was changed by Service C).
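
A sketch of what that ETag-guarded delete could look like over HTTP, using Python's standard library. The host and path are placeholders; the point is the If-Match header, which a conforming server answers with 412 Precondition Failed if the resource's state has changed since the ETag was fetched:

import http.client

def delete_specific_version(host, path):
    conn = http.client.HTTPConnection(host)

    # Fetch the resource first to learn the ETag of the state we intend to delete.
    conn.request("GET", path)
    response = conn.getresponse()
    etag = response.getheader("ETag")
    response.read()
    if etag is None:
        raise RuntimeError("server did not return an ETag; cannot guard the delete")

    # Delete only if the resource is still in that state.
    conn.request("DELETE", path, headers={"If-Match": etag})
    response = conn.getresponse()
    if response.status == 412:
        # Someone (like Service C) changed or recreated the resource after we
        # read it, so the server refused the delete.
        return "resource changed; delete refused"
    return "deleted"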

So yes, delete can be made fully idempotent without a reliable messaging protocol but only if the developer is really on their toes. But other operation types don't work so well. For example, imagine that a bank wants to implement an 'add X dollars' operation to an account management interface. This operation is clearly not idempotent since calling it multiple times will constantly change the state of the resource (each call will add an additional X dollars).

One way to solve this problem is to change the operation to be "add the currency unit with ID Z (which is worth X dollars) to this account." In other words, rather than incrementing a counter by X you would instead move around 'currency units' which have fixed values and unique IDs. Issuing the currency unit command multiple times would be idempotent because the service could compare the currency ID in the operation to the currency IDs already in the account and detect if the currency unit is already there. Unfortunately the consequences of this design are fairly unpleasant. For example, instead of recording a single floating point value one now has to keep track of a whole bunch of bogus "currency unit" IDs. And what exactly do you do when you need to withdraw money? How do you break a currency unit? Oh, and doesn't all the ID tracking sound exactly like what you have to do in a reliable messaging protocol anyway? Wouldn't it be easier to just use a reliable messaging protocol, let it handle this nonsense in a modular way, and keep these issues out of your application's architecture?
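
Just to show how much bookkeeping the workaround drags into the application, here is a sketch of an idempotent 'currency unit' deposit; the account structure and field names below are assumptions:

def deposit_currency_unit(account, unit_id, value):
    """Idempotent deposit: adding the same currency unit twice has no effect.

    account is assumed to be a dict with a 'units' mapping of unit ID -> value.
    Note that the application is now tracking unique IDs itself, which is
    exactly the bookkeeping a reliable messaging layer would have done for it.
    """
    if unit_id not in account["units"]:
        account["units"][unit_id] = value
    return sum(account["units"].values())   # the balance is derived, not stored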

The idempotency problems get even worse, btw, with asynchronous messaging. Going back to our favorite delete operation:

Service A                            Service B
---------------DELETE #1---------------->
(Time Out)
---------------DELETE #2---------------->
<--------RESPONSE TO DELETE #1 - REJECT--
X<---RESPONSE TO DELETE #2 - ACCEPT--

Service A sent a delete request to Service B. Not having received an asynchronous response in time Service A decides to just send another delete operation. The first delete was received by service B and rejected because Service A did not have authorization to delete the resource. Service B immediately sent a response but the response got held up in the network. Between the time Service B received the first delete and the second delete the authorization configuration was changed to allow Service A to delete the resource. So the second delete was processed successfully. Unfortunately, due to a network error, the second delete response never arrived but Service A is happy since it got the first response. The end result is that Service A thinks that it couldn't delete the resource and that the resource is still there, which in fact is not the case at all.

For an operation to be idempotent it has to always produce the same resource state. But in the previous example a change in part of the behavior of the resource altered the state that the delete operation would put the resource in. If Service A had sent the delete operation multiple times before the authorization change or only after the authorization change then the operations would have been idempotent. But Service A got caught in the middle of a system state change that rendered its method no longer idempotent.

This problem could be solved in a similar way to the previous delete problem, but HTTP ETags were invented for caching purposes and don't capture authorization state, so another kind of tag would have to be introduced. Somehow, though, adding an entirely new kind of tag, as well as a new round trip to retrieve its value, just so the delete method (and pretty much only the delete method) can be idempotent in an asynchronous messaging situation seems a bit overdone. Besides, isn't that new tag more than a little bit like the message ID that reliable messaging systems use?

The summary then is that:

  1. Knowing when a method is truly idempotent is non-trivial and therefore very easy to get wrong.

  2. Generally speaking the amount of work needed to make an arbitrary method idempotent is significantly more than what it would take to just use reliable messaging.

4 thoughts on “Does SOA Need A Reliable Messaging Protocol?”

  1. From my point of view Java needs a SOA API. There are currently only two APIs which allow the kind of QOS you need in business applications: JAXM and JMS. The first is somewhat dead and only very generally specified, and both are not web-service oriented, since you can’t really easily transport an RPC call.

    So the question is whether we will see RM providers that work as JMS providers (including transaction propagation?) or as JAX-RPC providers.

    Regards,
    Bernd

  2. I’m not too worried about the language bindings actually. I think the key is to come up with a protocol that calls for an architecture that is brain-dead easy to implement. In that case there will be tons of implementations in multiple languages and people can choose whatever solution they prefer. So long as the protocol enables interoperability the cacophony of languages won’t be a big deal. I think that is certainly a lesson we can take away from HTTP and, to a lesser extent, XML.

  3. Yes, but I as a service consumer/provider don’t care about the protocol. If I have an ESB (JMS) or an app server (remote session bean) I have perfectly safe XA transactions without inventing any new protocol. And I don’t want to use another API. If there is an API which can be expected to work I can start to implement SOA today and don’t have to wait for the third WS-R spec…

    Regards,
    Bernd

  4. Yeah, but using an XA interface would actually be bad because reliable messaging will never be able to get even close to the kind of guarantees that XA provides. Nor, btw, do I think we should try to get to that level of guarantee across a network.

    I’m hoping at some point to put together some code for my reliable messaging solution as an open source project. I’ll probably start out with an Apache plug-in (since my protocol will be HTTP based) and then see about exposing it across JMS and maybe JBI.
