As part of creating a service platform, i.e. a platform focused on the creation and consumption of services by other services, we realized that we needed a consistent way to model messages within our system. This isn't about creating some universal message model for everyone, everywhere. This is strictly about creating a message model for our use. But I think the issues we are dealing with are fairly universal, so I thought it would be interesting to share our current thinking. Please keep in mind that this is all very preliminary and subject to change without notice.
[Note: Updated to add a section on extensibility]
As part of our work in Live land we are trying to figure out how to represent information in messages, i.e. we need an infoset. In our case the work we are doing is focused exclusively on machine-to-machine communication. This isn't about, say, markup languages, which are primarily focused on adding machine-processable semantics to primarily human-focused data (e.g. strings decorated with elements). But even in machine-to-machine communication we share many of the same issues around extensibility and multi-platform support that markup languages have to deal with.
To make matters more complex, it's pretty obvious to us that we need to support multiple data serializations. At a minimum we think we have to support XML and JSON, but following the old computer science rule that there are only three numbers, "0, 1 and infinity", this means we will eventually have to support more. So we want an infoset that will allow us to serialize across these various formats. Inevitably this means either taking a lowest common denominator approach or 'tunneling' our infoset through other infosets. Tunneling is particularly scary because tunneling is what leads to things like the massive stack of WS-* specs. But, to a certain extent, that's just tough for us. We have to support multiple serializations and, for sanity's sake, we really can only have one infoset, so we'll try to create one that strives for the lowest common denominator, but there is no doubt in my mind that we will end up tunneling. When the tunneling gets too painful we will have to revisit our infoset and/or the serializations that we support. There are no magic solutions.
I don't expect to see any code written directly to the infoset specified here. Rather, in another article, I will publish our thinking around a schema based on this infoset. So the infoset is not something our users would run into on a regular basis. But what kinds of structures we can and cannot put into our messages will be controlled by the infoset we choose.
The Proposed Solution
The Live infoset consists of two information items, the element information item and the string information item. Each information item has the properties specified below. When we refer to Unicode it is to the abstract Unicode character planes rather than to any specific encoding (e.g. UTF-8, UTF-16, UTF-32, etc.).
Element Information Item
Name – A globally unique name which MUST consist of a reverse DNS path. Note, however, that the RFC 3490 (IDNA) encoding is not necessary; Unicode characters may be used directly.
Parent – A pointer to the parent element of the current element, or null if the current element is the root of an infoset.
Children – A list of pointers to element information items, or a pointer to a single string information item.
Ordered – A Boolean value. If true the list of children is ordered, if false then it is not.
The parent/children relationship MUST form an acyclic single rooted tree. That is, each element can only be listed in at most one children list and the pointed at element's parent MUST be the same element whose children list the element appears in.
String Information Item
Value – Contains an ordered series of zero or more Unicode characters.
Parent – A pointer to the element that contains the string.
A string MUST appear in exactly one children list of an element.
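To make the two information items concrete, here is a minimal Python sketch of them. It is an illustration under stated assumptions, not our implementation: the class names, the append helper and the com.example names are all made up for this example, and it does not check that a name is a well-formed reverse DNS path. It enforces the single-parent and acyclic rules above, plus the restriction discussed later that an element holds either child elements or a single string.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass(eq=False)
class StringItem:
    """String information item: zero or more Unicode characters."""
    value: str = ""
    parent: Optional["Element"] = field(default=None, repr=False)


@dataclass(eq=False)
class Element:
    """Element information item, named with a reverse DNS path."""
    name: str
    ordered: bool = False
    parent: Optional["Element"] = field(default=None, repr=False)
    children: List[Union["Element", StringItem]] = field(default_factory=list)

    def append(self, child: Union["Element", StringItem]) -> None:
        # An item may appear in at most one children list.
        if child.parent is not None:
            raise ValueError("item already appears in a children list")
        # Adding this element or one of its ancestors would create a cycle.
        ancestor = self
        while ancestor is not None:
            if ancestor is child:
                raise ValueError("adding this child would create a cycle")
            ancestor = ancestor.parent
        # Children are either a series of elements or a single string.
        if isinstance(child, StringItem) and self.children:
            raise ValueError("a string must be the only child")
        if any(isinstance(c, StringItem) for c in self.children):
            raise ValueError("element already holds a string")
        self.children.append(child)
        child.parent = self


# Hypothetical names, for illustration only.
person = Element("com.example.schemas.person")
first_name = Element("com.example.schemas.firstName")
person.append(first_name)
first_name.append(StringItem("Ada"))
```

Keeping the parent pointer in sync inside append is what lets the tree rule be checked with a simple walk up the ancestors.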
Discussion About The Proposed Solution
Below I walk through the decision making that went into creating our infoset. Yes, I know, it has two components, how complex can its creation be? But in reality we started with the XML infoset and then cut things out. The discussion below explains what we left out and why.
Diffing off the XML InfoSet
The obvious place to start is the XML infoset because, well, it exists and lots of people have had time to review it. In going through the infoset we can throw out the processing instruction information item, the unexpanded entity reference information item, the document type declaration information item, the unparsed entity information item and the notation information items without a second thought. These are all XMLisms that are of no relevance to anything we are trying to do.
I believe we can also throw out the comment information item because comments are not something we are going to make available in our infoset. They might be in the serializations but they aren't part of our processing model. That is, we will not specify any semantics that directly or indirectly rely on the use of comments.
I would also throw out the namespace information items. The split between 'local names' and 'namespace names' in XML is an artifact of XML's history and not something that we would want in our infoset (see here for some background). Elements have globally unique names and that should just be that.
The character information items make some sense to me, especially because markup languages are all about interspersing text and elements. But our infoset will be simpler. For reasons explored later on we prefer to have a string information item rather than character information items.
A subject of more than a little debate around here has been the attribute information item. I was discussing this with Mark Nottingham from Yahoo and he made the point that attributes are great so long as no one but him is allowed to use them. The point being that there are certainly cases where having an attribute available makes the data model simpler (think IDs, for example) but that in practice people abuse attributes. The fundamental problem with attributes is that they are not, strictly speaking, necessary (e.g. you can always encode in elements what you can express in attributes) and they are not extensible since they are just strings. After much discussion our opening position is that our infoset won't support attributes. On balance we think attributes cause more harm than good. We'll see in practice how well this holds up.
The document information item contains lots of interesting data that isn't relevant to us. It has things like PIs, comments, DTDs, etc. It also has version information, but this is version information for the serialization, not for the XML infoset. The two are not necessarily the same. I will talk about infoset versioning issues in a later section but in practical terms we do not need the document information item.
If the logic holds up this means that our internal infoset needs exactly two information items, a string information item and an element information item.
Element Information Item – Naming, Order, Banning Markup and Such
The element information item has a number of properties in the XML Infoset. Since our infoset gives each element a globally unique name we can throw away all the detritus of the hacked in XML namespace model. So the namespace name, local name, prefix, namespace attributes and in-scope namespaces properties can all be quickly discarded and replaced with a single name property that provides the globally unique name for the element.
For simplicity's sake we will name our elements using DNS as a mechanism to both provide uniqueness and still make the result human readable. Specifically we will use the reverse DNS convention. E.g. a first name element could be named com.microsoft.live.schemas.firstName. For cases where we want to serialize to XML we can trivially break this up into a namespace (e.g. data:com.microsoft.live.schemas) and a local name (e.g. firstName).
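The split into a namespace and a local name can be sketched in a few lines of Python. This assumes the split happens at the last dot and that the namespace is simply the remaining path prefixed with data:, as in the example above; the function name is hypothetical.

```python
from typing import Tuple


def split_name(name: str) -> Tuple[str, str]:
    """Split a reverse DNS element name into (XML namespace, local name)."""
    # Assumption: the local name is everything after the last dot.
    path, dot, local = name.rpartition(".")
    if not dot or not path or not local:
        raise ValueError("expected a dotted reverse DNS name: %r" % name)
    return "data:" + path, local


namespace, local_name = split_name("com.microsoft.live.schemas.firstName")
```

Here namespace is data:com.microsoft.live.schemas and local_name is firstName, matching the example in the text.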
As we aren't supporting attributes we can get rid of the attributes property. The base URI property won't be supported because none of our early scenarios require it but I am confident that we will end up adding it back in as one of our first extensions. We will need the parent property although we can restrict it so that its only legal value is an element.
This leaves the children property. In the XML infoset the contents of the children property are ordered. But in our infoset most data will be unordered. Order is a very big deal for markup languages because the human understandable semantics are embedded in the ordering (e.g. "putting off" does not mean the same thing as "off putting"). But in a machine focused language order is typically not all that relevant. If one thinks of, say, a buddy list, the actual order of the buddies is usually a secondary consideration that deals more with human needs than machine ones. For example, if we are returning a buddy list, the IM client receiving it could display the contents in any number of orders, so the order in which the list is sent over the wire doesn't really matter.
There is however one major counter example, search order. When an expensive search operation is requested it is common to also specify the order in which the results are to be returned. Such an ordering can always be explicitly encoded into the results (e.g. each result member could have an explicit order number associated with it) but this doesn't seem to happen in practice.
I suspect the subject of ordering needs more investigation, but for now our intention is to add an explicit ordering property with a Boolean value that, if true, means the children property's value is ordered; otherwise it is unordered.
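One practical consequence of the ordering property is how two infosets get compared. The Python sketch below is only an illustration: it represents each element as a plain (name, ordered, content) tuple, where content is either a string or a list of child tuples, and the function names and com.example names are invented for the example. Children of an unordered element are sorted into a canonical order before comparing, so their wire order is ignored; children of an ordered element are compared in sequence.

```python
def canonical(node):
    """Return a comparable form in which unordered children are sorted."""
    name, ordered, content = node
    if isinstance(content, str):
        # Tag strings so they never get compared against child tuples.
        return (name, ordered, ("s", content))
    kids = [canonical(child) for child in content]
    if not ordered:
        kids.sort()  # wire order carries no meaning here
    return (name, ordered, ("e", tuple(kids)))


def same_infoset(a, b):
    """True if a and b are equal, ignoring order where order is not significant."""
    return canonical(a) == canonical(b)


alice = ("com.example.buddy", False, "alice")
bob = ("com.example.buddy", False, "bob")

# A buddy list is unordered, so both wire orders mean the same thing.
buddies_1 = ("com.example.buddyList", False, [alice, bob])
buddies_2 = ("com.example.buddyList", False, [bob, alice])

# Search results are ordered, so swapping them changes the meaning.
results_1 = ("com.example.searchResults", True, [alice, bob])
results_2 = ("com.example.searchResults", True, [bob, alice])
```

With these definitions same_infoset(buddies_1, buddies_2) holds while same_infoset(results_1, results_2) does not.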
The children property itself is restricted so that its values must either be pointers to a series of elements or a pointer to a single string. The justification for this restriction is that we are only interested in self describing data and strings are not self describing (at least not to a machine). So any time a string is used we need it to be wrapped in an element in order to provide machine processable semantics. Since we are only worried about machine processing this seems a reasonable restriction.
String Information Item – What To Do About Whitespace?
The XML Infoset character information item has three properties: character code, element content whitespace and parent. In our case we will use a string, not a single character, and we will allow all Unicode characters without restriction. We view it as a serialization problem to escape the Unicode characters in order to fit in any particular serialization format. We will address such serialization issues on a serialization-by-serialization basis (e.g. our solution for JSON will be different from our solution for XML because they have different reserved characters). All Unicode characters are relevant in our infoset so we don't have the concept of whitespace handling that XML has. But when we specify an XML serialization we will have to address the whitespace issue.
There seems to be a fairly obvious pattern in the growth of data formats over time. First, there was ASCII (ANSI, EBCDIC, etc.) which was linear. Then there was HTML (SGML, XML, etc.) which is hierarchical. So the next step would seem to be something graph based. Yes, I know, RDF will save us all. But until we can truly welcome our new RDF overlords my suspicion is that we will just have to make do with hierarchical rather than graph based data formats. For all of the use cases we have floating around hierarchical is more than sufficient. Until we start to get compelling use cases for a graph format within our system we will require our infoset to be hierarchical.
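As a small illustration of why escaping belongs to the serialization rather than the infoset, the Python sketch below writes the same string value out for JSON and for XML element content using only the standard library. The wrapper function names are hypothetical, and this says nothing about what our actual serializations will look like.

```python
import json
from xml.sax.saxutils import escape, unescape


def to_json_string(value: str) -> str:
    """Serialize a string value as a JSON string literal."""
    # JSON reserves the quote, the backslash and control characters.
    return json.dumps(value)


def to_xml_text(value: str) -> str:
    """Serialize a string value as XML element content."""
    # XML element content reserves &, < and >.
    return escape(value)


value = 'Tom & "Jerry" <cat>'
json_form = to_json_string(value)
xml_form = to_xml_text(value)

# Both forms decode back to the same infoset value.
assert json.loads(json_form) == value
assert unescape(xml_form) == value
```

The infoset value is identical in both cases; only the reserved characters, and therefore the escaped text on the wire, differ.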
Versioning the Infoset
One of the cardinal rules of system design is not to paint oneself into a corner. So it would seem natural to worry about versioning the infoset itself. But what would such versioning mean? Would we put into each message we serialize based on a specific infoset an identifier for that infoset? That would be our first example of tunneling. What would such an identifier achieve? The only time I can see such an identifier being important is if we implicitly rather than explicitly include semantics in a message that are derived from the infoset.
For example, let's say that we extend the infoset to specify that all relative URLs in the infoset are relative to the root element. But when we serialize an infoset instance into, say, JSON, we include relative URLs but we do not include any object or other identifier that explicitly states "relative URLs are relative to the root object". Someone trying to resolve our URLs in the message won't be able to unless they are using the same version of the infoset and so understand how to resolve the URL.
In a case such as the above it would make sense for us to include an identifier for which infoset we are using so that systems could understand when a message may have semantics they don't support.
But I actually think this approach would be a mistake. After all, Live needs to talk to lots of non-Microsoft third parties and we certainly cannot expect those third parties to be using our infoset. Therefore I would argue that any time infoset semantics leak into a message (e.g. ordering, base URLs, whatever) those semantics need to be explicitly and individually marked in the message. So, for example, in the case of relative URLs we would either need to introduce our own JSON object saying "this message interprets relative URLs relative to the root" or, much better, we would work with the JSON community to come up with a community standard to handle relative URL resolution. But in no case should we put ourselves in a position where third parties need to know which version of our infoset a particular service is using in order to figure out how to use that service.
So long as we stick to the rule that the infoset's semantics are explicitly made known in the message there is no need for us to explicitly version the infoset itself.
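To show what "explicitly made known in the message" could look like for the relative URL example, here is a hedged Python sketch. The com.example.baseUrl and com.example.photo member names and the resolve helper are invented for illustration; no such convention exists today, and a community standard would be preferable to anything we make up ourselves. The point is only that a receiver can resolve the URL from the message alone, without knowing anything about our infoset or its version.

```python
import json
from urllib.parse import urljoin

# Hypothetical member name that states the base for relative URLs.
BASE_KEY = "com.example.baseUrl"


def resolve(message: dict, relative_url: str) -> str:
    """Resolve a relative URL using only what the message itself states."""
    base = message.get(BASE_KEY)
    if base is None:
        # Nothing in the message says how to resolve, so do not guess.
        raise ValueError("message does not state a base for relative URLs")
    return urljoin(base, relative_url)


wire_text = """
{
  "com.example.baseUrl": "https://example.com/users/42/",
  "com.example.photo": "photos/1.jpg"
}
"""
message = json.loads(wire_text)
photo_url = resolve(message, message["com.example.photo"])
```

A receiver that has never heard of our infoset gets https://example.com/users/42/photos/1.jpg from this message, and a message that omits the marker fails loudly instead of being resolved by guesswork.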