Writing Backwards Compatible XML Schema 1.0 Schemas Using the XML Ignore Rule

Writing an XML Schema is a challenge, but as the first part of this document explains, writing a V2 schema that can accept V1 documents is in most cases impossible if you use XML Schema 1.0. In other words, if you want to write a backwards compatible schema you probably won't be able to do it using XML Schema 1.0 alone. In an ideal world we would take the lessons learned from XML Schema and use them to start over, probably with RelaxNG. But until we can move over to a new standard we need a way to write backwards compatible schemas in XML Schema 1.0. The second part of this document therefore explains how to use the XML Ignore Rule in conjunction with XML Schema 1.0 to create a validator that makes backwards compatible schemas possible. The XML Ignore Rule is best summarized as "if you don't recognize it, ignore it."

A Problem

Folks like David Orchard, in his blog, have written a lot about the problems of backwards compatibility in XML Schema. I'm not going to repeat all that text; instead I'll show one example that illustrates why XML Schema 1.0 by itself is unsuitable for XML applications that require backwards compatible schemas.

Imagine that party A has defined the following XML Schema:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="DoSomething">
      <xs:complexType>
         <xs:sequence>
            <xs:any maxOccurs="unbounded" processContents="lax"/>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>

Other parties should be able to take the schema and extend it any way they want.[1] So let's imagine that party B shows up and decides to add a DidSomething extension element, which contains an integer, as a child of the DoSomething element. E.g.:

<DoSomething>
   <DidSomething>123</DidSomething>
</DoSomething>

What's nice about the previous XML is that if someone who only supported party A's version of the schema were to receive the previous XML they wouldn't fail. After all, party A's schema clearly said that it was possible to put arbitrary content into DoSomething. This is the essence of backwards compatibility in XML. Someone can define a new schema that is an extension of an old schema and still have their new XML understood by processors that only understand the old schema.

Having created this extension party B now wants to write a schema that can achieve two things:

Validate any version of the DoSomething element that is valid against party A's schema
Validate that if someone sends them a DoSomething element containing a DidSomething element, there is exactly one instance of the DidSomething element and that instance contains an integer.

Another way to phrase these two requirements is that party B wants to write a schema that restricts party A's schema only in that if someone uses the extension created by party B then they must follow party B's rules. The previous sentence, btw, is another good definition of backwards compatibility.

Party B therefore, naively, creates the following schema:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified" attributeFormDefault="unqualified">
   <xs:element name="DoSomething">
      <xs:complexType>
         <xs:sequence>
            <xs:choice>
               <xs:sequence>
                  <xs:any processContents="lax" maxOccurs="unbounded"/>
               </xs:sequence>
               <xs:element name="DidSomething" type="xs:integer" 
                           minOccurs="0" maxOccurs="1"/> 
            </xs:choice>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>

The schema says that DoSomething can contain either arbitrary content or at most one instance of DidSomething. Unfortunately the previous schema is illegal because it violates XML Schema's ambiguity rules.

The issue of ambiguity is the crux of a lot of XML Schema's issues so it's worth understanding. XML Schema 1.0's rule against ambiguity (the Unique Particle Attribution constraint) requires that there is never a point in the schema where a single element could be validated by more than one rule. For example, in the previous schema if the contents of DoSomething are DidSomething then there is an ambiguity. DidSomething could be processed by the DidSomething element declaration but it could also be processed by the xs:any. Both rules could be equally applied to the DidSomething element. That is ambiguous and therefore the schema is illegal.[2]
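
To make the conflict concrete, here is a toy sketch in Python. It is not the real Unique Particle Attribution check from the XML Schema specification; the ElementParticle and Wildcard classes and the competing_particles function are hypothetical names invented for this illustration. It only asks which branches of a choice could claim a given element name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementParticle:
    """Stands in for an xs:element declaration inside the choice."""
    name: str

    def accepts(self, element_name):
        return element_name == self.name


@dataclass(frozen=True)
class Wildcard:
    """Stands in for an xs:any that accepts every element."""

    def accepts(self, element_name):
        return True


def competing_particles(choice, element_name):
    """Return every branch of the choice that could claim the element."""
    return [particle for particle in choice if particle.accepts(element_name)]


# The two branches of the xs:choice in party B's naive schema.
choice = [Wildcard(), ElementParticle("DidSomething")]

claimed_by = competing_particles(choice, "DidSomething")
if len(claimed_by) > 1:
    print("ambiguous:", claimed_by)
```

An element with any other name is claimed only by the wildcard, so the conflict arises exactly for the element party B cares about.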

What the ambiguity rules mean in practice is that it is impossible for party B to write a schema that can meet its requirements because it can't both honor the xs:any and validate the proper use of its extension element. So even if someone remembers to put xs:any elements all over the place it doesn't matter much, since no one can take advantage of those xs:any's to write a backwards compatible schema.

A Solution

I think we have learned a lot of useful lessons from XML Schema and ideally we would now replace it. I suspect that RelaxNG would be a good candidate technology for a new XML Schema standard. Unfortunately many companies have made enormous investments in XML Schema and they aren't about to turn away from it. They have baked it so deep into their infrastructure that they are stuck. Eventually some revolution will come along to overthrow the existing order but until it happens we need a way to enable XML Schema to express backwards compatible schemas.

Thankfully I'm not the first, second or third person to worry about this subject. The general conclusion the various experts seem to be coming to is that the only way forward is to use something along the lines of the XML Ignore Rule defined in the WebDAV Specification and broken out in detail in FXPP.

The XML Ignore Rule says that if you don't recognize an attribute then ignore it, and if you don't recognize an element then ignore it and the entire sub-tree it roots. The XML Ignore Rule algorithm as applied to XML Schema 1.0 would be something along the lines of:

Read through the schema and create a central list[3] of all the element and attribute local/namespace name pairs used/declared within it. It is not necessarily an error for two or more attributes or elements with the same local and namespace name to be declared in different local declarations. If such duplication occurs then just record the duplicate local/namespace name pair one time in the central list. [4]
Parse the XML document that is to be validated into an infoset. [5]
Start at the document element of the infoset and walk the infoset in document order. If an attribute encountered during the walk has a local name/namespace name that is not in the central list then remove the attribute from the infoset. If an element encountered during the walk has a local name/namespace name that is not in the central list then remove the element and all of its children from the infoset. Continue walking the infoset until all remaining descendants of the document element have been visited.
Validate the infoset output from the previous step using the schema processed in the first step.
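
The first three steps can be sketched in Python with the standard library's ElementTree. This is a rough sketch under stated assumptions, not a reference implementation: the function names (split_name, central_list, prune, ignore_unknown) are made up for this example, declarations that use ref instead of name are skipped, imported or included schemas are not followed, the document element itself is always kept, and the final validation step is left to whatever XML Schema validator you already use.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"


def split_name(tag):
    """Turn ElementTree's '{namespace}local' form into a (namespace, local) pair."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


def central_list(schema_xml):
    """Step 1: collect each declared element and attribute name exactly once."""
    schema = ET.fromstring(schema_xml)
    target_ns = schema.get("targetNamespace", "")
    form_default = {
        "element": schema.get("elementFormDefault", "unqualified"),
        "attribute": schema.get("attributeFormDefault", "unqualified"),
    }
    global_decls = set(schema)  # direct children of xs:schema
    names = {"element": set(), "attribute": set()}
    for decl in schema.iter():
        namespace, kind = split_name(decl.tag)
        if namespace != XS or kind not in names or decl.get("name") is None:
            continue  # not a named declaration (ref="..." is skipped here)
        form = decl.get("form", form_default[kind])
        qualified = decl in global_decls or form == "qualified"
        names[kind].add((target_ns if qualified else "", decl.get("name")))
    return names["element"], names["attribute"]


def prune(elem, elements, attributes):
    """Step 3: drop unknown attributes, and unknown elements with their subtrees."""
    for attr in list(elem.attrib):
        if split_name(attr) not in attributes:
            del elem.attrib[attr]
    for child in list(elem):
        if split_name(child.tag) in elements:
            prune(child, elements, attributes)
        else:
            elem.remove(child)  # also drops any text that trailed the child


def ignore_unknown(schema_xml, document_xml):
    elements, attributes = central_list(schema_xml)
    root = ET.fromstring(document_xml)  # step 2: parse into a tree
    prune(root, elements, attributes)
    return root  # step 4: hand this tree to a real schema validator


# A hypothetical schema and message, just to exercise the functions.
schema = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="DoSomething">
      <xs:complexType>
         <xs:sequence>
            <xs:element name="DidSomething" type="xs:integer" minOccurs="0"/>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>"""
document = ('<DoSomething new="1"><DidSomething>123</DidSomething>'
            '<Extra><DidSomething>5</DidSomething></Extra></DoSomething>')
cleaned = ignore_unknown(schema, document)
print(ET.tostring(cleaned, encoding="unicode"))
```

In this run the undeclared new attribute and the whole Extra subtree are removed, including the DidSomething nested inside Extra, while the DidSomething that sits directly under DoSomething survives.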

Depending on the particular application one may want to either return the original XML document or the 'cleaned' infoset. Which one to return is an implementation specific decision and validators should offer both choices to their callers. What's especially nice about the XML Ignore Rule is that one doesn't need to worry about shoving in xs:any's all over the place. Having to explicitly mark extension points is a hopeless task anyway since no one ever has enough foresight to know ahead of time where extensibility is needed.

In a world with the XML Ignore Rule the schema that party A would define would be:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="DoSomething"/>
</xs:schema>

Party B's schema would be:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="DoSomething">
      <xs:complexType>
         <xs:sequence>
            <xs:element name="DidSomething" type="xs:integer" minOccurs="0"/>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>

Now imagine that the following party B message was sent to party A:

<DoSomething><DidSomething>123</DidSomething></DoSomething>

Party A would navigate the document down to the DidSomething element, see that it is not on its central list and so delete it. The resulting XML document that party A would 'see' is therefore:

<DoSomething></DoSomething>

Which is perfectly valid against party A's schema, so all works just fine.

Still, the XML Ignore Rule isn't perfect. Lots of patterns, especially ones that involve re-using elements from the V1 schema, will be somewhere between difficult and impossible to implement. But given the alternatives I think the XML Ignore Rule is the simplest and most robust option.

A Conclusion

The first section of this document explained how XML Schema 1.0's rules around ambiguity make it all but impossible to write backwards compatible schemas. The second section of the document introduced the XML Ignore Rule which enables one not only to write backwards compatible schemas but also allows schema authors to get out of the business of having to guess exactly where extensibility should be allowed. The XML Ignore Rule is far from perfect but it's probably the simplest solution available for making backwards compatible schemas possible using XML Schema 1.0.

[1] There are some nasty complications around namespace handling but I'm going to ignore those for the moment by assuming that each and every extension must come from a different namespace. Strictly speaking this assumption is unnecessary and if one wishes it is possible to structure xs:any's in ways that can be more flexible about namespaces. But namespace handling really isn't the central point of this paper and I don't want to complicate things by focusing on them.

[2] The XML Schema working group is thinking of coming out with XML Schema 1.1, which would include 'low priority wildcards'. Essentially these would be xs:any's that are always chosen last. So in the example above there would be no ambiguity because the validator would first look at the DidSomething element declaration before it looked at the low priority wildcard. I don't believe low priority wildcards are the best way to solve the backwards compatibility problems in XML Schema 1.0. First, low priority wildcards require schema authors to be seers who can see into the future in order to figure out exactly where extensibility in their schema is needed. Second, since schema authors aren't seers, best practice will be to pour low priority wildcards absolutely everywhere in a schema so as to cover all possibilities, with the consequence that the schemas will be unreadable. Third, low priority wildcards are not backwards compatible with existing schemas, so we would have the odd situation where the 1.1 version of a spec is not backwards compatible with 1.0. If the XML Schema group wants to introduce low priority wildcards, something I hope they don't do, then they should at least change the spec's name to XML Schema 2.0. [6]

[3] Requiring a central list will make it more difficult, although not impossible, to re-use existing elements in a schema in new places. We could solve this problem by annotating parts of the schema with local information about what element/attributes are 'visible' at that particular location in the schema but this would be significantly more complex to specify and thus likely lead to incompatibility in implementations. I suspect it is better to stay with the simplicity of a central list even at the cost of some loss in functionality.
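
As a rough illustration of that cost, the Python sketch below assumes a central list containing only DoSomething and DidSomething (no namespaces) and a hypothetical later message that wraps one known element in an unknown Wrapper and re-uses DoSomething in a new place. The strip_unknown helper is a made-up name and handles elements only.

```python
import xml.etree.ElementTree as ET

# Assumed central list for this example: element names only, no namespaces.
KNOWN = {"DoSomething", "DidSomething"}


def strip_unknown(elem, known):
    """Remove every child whose name is not on the central list."""
    for child in list(elem):
        if child.tag in known:
            strip_unknown(child, known)
        else:
            elem.remove(child)  # the whole subtree goes with it


# Hypothetical newer message: a known element nested in a new position.
message = ET.fromstring(
    "<DoSomething>"
    "<Wrapper><DidSomething>1</DidSomething></Wrapper>"
    "<DoSomething/>"
    "</DoSomething>"
)
strip_unknown(message, KNOWN)
result = ET.tostring(message, encoding="unicode")
print(result)
```

The unknown Wrapper disappears along with the DidSomething inside it, but the nested DoSomething is on the central list, so it survives and is handed to the validator, where an older schema that never allowed DoSomething inside DoSomething would presumably reject it.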

[4] XML Schema unfortunately allows for the use of local declarations, which make it possible to define two elements or attributes with identical local/namespace names but completely different definitions. For example, the following is actually legal in XML Schema:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 
           elementFormDefault="qualified" attributeFormDefault="unqualified" 
           targetNamespace="http://example.com/">
    <xs:element name="Dad">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="Child" type="xs:string"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
    <xs:element name="Mom">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="Child" type="xs:integer"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>

In my opinion the value of namespaces is to provide globally unique names that would prevent collisions, a goal that XML Schema makes impossible by enabling the redefinition of namespace qualified attributes and elements. In terms of the XML Ignore Rule we don't worry about these types of contradictory declarations. We simply record a local/namespace name pair once in the central list no matter how many times it is locally defined within the schema.
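
A small Python sketch of that bookkeeping, using the Dad/Mom schema above: it assumes every element declaration is namespace qualified (true here because of elementFormDefault="qualified") and the variable names are illustrative only. Storing the pairs in a set is what collapses the two Child declarations into one entry.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified" attributeFormDefault="unqualified"
           targetNamespace="http://example.com/">
    <xs:element name="Dad">
        <xs:complexType><xs:sequence>
            <xs:element name="Child" type="xs:string"/>
        </xs:sequence></xs:complexType>
    </xs:element>
    <xs:element name="Mom">
        <xs:complexType><xs:sequence>
            <xs:element name="Child" type="xs:integer"/>
        </xs:sequence></xs:complexType>
    </xs:element>
</xs:schema>"""

schema = ET.fromstring(SCHEMA)
target_ns = schema.get("targetNamespace", "")

# Every xs:element declaration, in document order, duplicates included.
declared = [(target_ns, decl.get("name")) for decl in schema.iter(XS + "element")]

# The central list keeps each namespace/local name pair only once.
central = set(declared)

print(len(declared), "declarations,", len(central), "central list entries")
```

The differing types of the two Child declarations play no part in the list; only the names do.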

[5] There really is no need to convert the XML document into an infoset before processing it, but I find it easier to discuss XML when I can use infoset terms.

[6] To be fair there is a long history of XML standards making incompatible changes without bumping up their version numbers. XML 1.1, for example, is not backwards compatible with XML 1.0. It changes the legal characters available for XML element names, attribute names, etc. The end result is that one can create a perfectly legal XML 1.1 document that will cause an XML 1.0 processor to fail. Section 1.3 of the XML 1.1 specification outlines other incompatible differences between XML 1.0 and 1.1. But XML 1.1 already had a good precedent in XML Namespaces, which effectively changed the meaning of XML element and attribute names from the XML 1.0 specification by re-defining, in a completely non-backwards compatible way, how to process those names.
