Yaron's Rules of thumb for writing good XML protocol schemas

Some important best practices for using XML Schema.

Use ANY – To make a schema extensible, it must be possible to add new children to an existing element. As a rule, all elements in a schema should be extensible. This derives from the general architectural principle "don't paint yourself into corners". As such, throw in ANY elements everywhere. In the early W3C schema specs it was possible to just set an attribute on the schema that made everything extensible, but unfortunately this feature was removed.
An alternative is to use type substitution, but this is a bad idea for protocols because type substitution only works if the processor has the extended schema, which is almost never the case. (See http://www.xfront.com/ExtensibleContentModels.pdf)
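
As a rough sketch of what this buys you, the Python below embeds a hypothetical schema fragment with an xs:any wildcard, plus a receiver that reads the child it knows about and steps over the rest. The namespaces urn:example:files and urn:example:audit, the element names and the read_file_name helper are all made up for illustration. Nothing here validates against the schema; it only shows the kind of processing the wildcard is meant to permit.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
F = "urn:example:files"  # hypothetical target namespace

# Hypothetical schema fragment: after the known child, any number of
# elements from other namespaces are allowed.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:example:files"
           elementFormDefault="qualified">
  <xs:element name="file">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:any namespace="##other" processContents="lax"
                minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

# A message from a newer sender that added an element this receiver has never seen.
MESSAGE = """\
<f:file xmlns:f="urn:example:files" xmlns:a="urn:example:audit">
  <f:name>report.txt</f:name>
  <a:checkedBy>alice</a:checkedBy>
</f:file>
"""

def read_file_name(xml_text):
    """Return (name, skipped), where skipped lists the unknown child tags."""
    root = ET.fromstring(xml_text)
    name = root.findtext(f"{{{F}}}name")
    skipped = [child.tag for child in root if child.tag != f"{{{F}}}name"]
    return name, skipped

wildcards = ET.fromstring(SCHEMA).findall(f".//{{{XS}}}any")
name, skipped = read_file_name(MESSAGE)
print(len(wildcards), name, skipped)
```

The receiver still finds the name it was written for, and the extension element from the other namespace is simply carried along as something it does not understand.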

Avoid default attributes – Default attributes make it all but impossible to digitally sign a piece of XML, unless you want to send your entire schema with every single message you transmit. As such, avoid default attribute values.
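
To see why, compare what goes on the wire with what a schema-aware receiver ends up with. This is a minimal sketch: the compressed attribute, its default of "false" and the urn:example:files namespace are invented for illustration, and a SHA-256 digest over canonical XML stands in for a real signature.

```python
import hashlib
import xml.etree.ElementTree as ET

# What the sender serialized and signed: the attribute is absent because the
# (hypothetical) schema declares compressed="false" as a default.
AS_SENT = '<f:file xmlns:f="urn:example:files"/>'

# What a schema-validating receiver sees once defaults are filled in.
AFTER_DEFAULTS = '<f:file xmlns:f="urn:example:files" compressed="false"/>'

def digest(xml_text):
    """Hash the canonical form of the XML; a stand-in for signing it."""
    canonical = ET.canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# A receiver without the schema never learns the attribute exists.
print(ET.fromstring(AS_SENT).get("compressed"))    # None
# The two views of the "same" message hash differently.
print(digest(AS_SENT) == digest(AFTER_DEFAULTS))   # False
```

Unless both sides agree on which of the two forms was signed, and both have the schema needed to move between them, the digests cannot be compared.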

Use XML namespaces – All elements and attributes should be qualified with an XML namespace. This enables new elements and attributes to be added later without fear of collision and without requiring a central authority to vet all names.

Set elementFormDefault to qualified – In protocol schemas it's critical to know where all elements came from; this is especially the case when debugging. Nothing is ever hurt by setting elementFormDefault to qualified and it can help. (see http://www.xfront.com/HideVersusExpose.pdf)
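
The difference shows up directly in instance documents. In the sketch below both messages are hypothetical: the first is what a schema with elementFormDefault="unqualified" expects for a locally declared child, the second is what "qualified" expects. The urn:example:files namespace and the child_tags helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Unqualified: the locally declared child carries no namespace at all.
UNQUALIFIED = '<f:file xmlns:f="urn:example:files"><name>a.txt</name></f:file>'

# Qualified: every element says which namespace it came from.
QUALIFIED = '<f:file xmlns:f="urn:example:files"><f:name>a.txt</f:name></f:file>'

def child_tags(xml_text):
    """Return the expanded names of the root's children, as '{namespace}local'."""
    return [child.tag for child in ET.fromstring(xml_text)]

print(child_tags(UNQUALIFIED))  # ['name']
print(child_tags(QUALIFIED))    # ['{urn:example:files}name']
```

When a bare name turns up in a log or a debugger, there is no telling which schema it belongs to; the qualified form answers that question by itself.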

Avoid uncontrolled namespaces for non-XML element enumerations – If an enumerated type is expressed as a string (for example, the value of an attribute is an enumerated type) then make the string type "anyURI". This will ensure that the enumerated values are globally unique and extensible without requiring a centrally administered namespace manager.
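
A small sketch of the idea, with every URI, the method attribute and the disposal_method helper invented for illustration: the original specification defines its values under its own domain, and a third party mints a new value under a domain it controls without asking anyone.

```python
import xml.etree.ElementTree as ET

# Values defined by the original (hypothetical) specification.
SHRED = "http://example.com/disposal/shred"
BURN = "http://example.com/disposal/burn"
KNOWN_METHODS = {SHRED, BURN}

# A third party adds a value under its own domain; no central registry is involved.
MESSAGE = '<f:file xmlns:f="urn:example:files" method="http://example.org/ext/dissolve"/>'

def disposal_method(xml_text):
    """Return (uri, recognized) for the method attribute."""
    uri = ET.fromstring(xml_text).get("method")
    return uri, uri in KNOWN_METHODS

print(disposal_method(MESSAGE))  # ('http://example.org/ext/dissolve', False)
```

The receiver can tell at once that the value is an extension it does not know, and the value cannot be mistaken for one that somebody else defined.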

Avoid attributes – The power of XML is its hierarchical structure that enables rich, structured and extensible expressions. Attributes rob us of all three qualities. Much like GOTOs, there are times when attributes make sense. The first case is when 99.9999% of the time the schema is to be typed in by humans; in that case attributes are easier to type. The second case is when providing limited metadata about the contents of an element (such as its language). In that case attributes can be more elegant. (see http://lists.w3.org/Archives/Public/w3c-dist-auth/1998JulSep/0084.html)

Use namespaces for schema versioning – I believe that explicit versioning is just confusing in schemas. There is no need for version numbers, which only cause fights over what is a minor versus a major upgrade, how you decide which version you can support, whether you advertise the highest version you support or the version that the party you are talking to supports, and so on. It is much simpler for everyone to just use Option 3 listed in http://www.xfront.com/Versioning.pdf, which basically says "If you change the meaning of the schema so that older systems can't process it then change the namespace". The key is to make sure that the namespace on the root element of your content is different from the old namespace. That way incompatible systems will fail immediately.
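
Here is one way a receiver might enforce that, sketched under assumptions: the two namespace URIs, the UnsupportedNamespace exception and the parse_message helper are hypothetical names, not part of any specification.

```python
import xml.etree.ElementTree as ET

# Namespace this (hypothetical) receiver was written against.
SUPPORTED_NS = "urn:example:files:2003"

class UnsupportedNamespace(Exception):
    """Raised when the root element is not in a namespace we understand."""

def parse_message(xml_text):
    root = ET.fromstring(xml_text)
    # Element tags look like '{namespace}local'; pull out the namespace part.
    namespace = root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else ""
    if namespace != SUPPORTED_NS:
        raise UnsupportedNamespace(namespace)
    return root

OLD = '<f:file xmlns:f="urn:example:files:2003"><f:name>a.txt</f:name></f:file>'
NEW = '<f:file xmlns:f="urn:example:files:2004"><f:name>a.txt</f:name></f:file>'

parse_message(OLD)       # accepted
try:
    parse_message(NEW)   # incompatible revision: rejected before any processing
except UnsupportedNamespace as err:
    print("rejected", err)
```

No version attribute is consulted and no negotiation takes place; the check on the root element's namespace is the whole mechanism.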

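Global vs. Local – Re-use is the lifeblood of good protocols, so generally the salami slice design, in which every element is declared globally and can therefore be referenced from elsewhere, is best. (see http://www.xfront.com/GlobalVersusLocal.pdf)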

Avoid strings for enumerated types – Too often people use strings for enumerated types, but XML elements play the role much better. By using XML elements for enumerated types it is possible to later add content inside of the enumerated type (e.g. <f:fileDisposalMethod><f:shredder/></f:fileDisposalMethod> can later be extended to <f:fileDisposalMethod><f:shredder><f:milspecCompliant/></f:shredder></f:fileDisposalMethod>). Also, by using namespace-qualified XML elements for enumerated types one can avoid name collisions. This enables 3rd parties to extend the initial list of enumerated types without colliding with anyone else's extensions. It's true that the last feature could be implemented using QName strings, but I suspect you will find that most XML Schema processors don't handle restriction with a base of QName and a set of enumerations correctly.
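
The shredder example can be sketched as follows. The urn:example:files namespace bound to the f: prefix and the disposal_method helper are assumptions; the element names come from the example above.

```python
import xml.etree.ElementTree as ET

F = "urn:example:files"  # hypothetical namespace bound to the f: prefix

ORIGINAL = (
    '<f:fileDisposalMethod xmlns:f="urn:example:files">'
    "<f:shredder/>"
    "</f:fileDisposalMethod>"
)
EXTENDED = (
    '<f:fileDisposalMethod xmlns:f="urn:example:files">'
    "<f:shredder><f:milspecCompliant/></f:shredder>"
    "</f:fileDisposalMethod>"
)

def disposal_method(xml_text):
    """Return the enumerated value: the expanded name of the single child."""
    root = ET.fromstring(xml_text)
    (choice,) = list(root)  # exactly one child is expected
    return choice.tag

# An old receiver gets the same answer from both forms; it simply never
# looks inside <f:shredder>.
print(disposal_method(ORIGINAL) == f"{{{F}}}shredder")  # True
print(disposal_method(EXTENDED) == f"{{{F}}}shredder")  # True
```

A string-valued enumeration has nowhere to put the extra milspecCompliant detail without changing the value that old receivers compare against.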

XML:Lang – The xml:lang attribute allows one to specify the language a string of text is written in. Not only should this attribute always be used but, as a general rule, all data structures that provide for human-readable text should be structured to allow for multiple 'equivalent' strings. The typical example is an element that contains someone's name. Generally such an element would be of the form <name xml:lang="en">Yaron Y. Goland</name>. But in most applications there are good reasons why multiple versions in multiple languages are necessary. As such the element should take the form <nameContainer><name xml:lang="en">Yaron Y. Goland</name><name xml:lang="he">ירון י. גולנד</name></nameContainer>.
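
A consumer of such a container might pick a string as sketched below. The nameContainer and name elements follow the form above; the city-name data, the pick_name helper and the exact-match fallback rule are assumptions for illustration, and real language matching is more involved than comparing tags for equality.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DOC = """\
<nameContainer>
  <name xml:lang="en">Vienna</name>
  <name xml:lang="de">Wien</name>
</nameContainer>
"""

def pick_name(xml_text, preferred, fallback="en"):
    """Return the name in the preferred language, else the fallback, else None."""
    root = ET.fromstring(xml_text)
    by_lang = {n.get(XML_LANG): n.text for n in root.iter("name")}
    return by_lang.get(preferred, by_lang.get(fallback))

print(pick_name(DOC, "de"))  # Wien
print(pick_name(DOC, "he"))  # Vienna (no Hebrew entry, so the fallback is used)
```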

Make sure to check out David Orchard's page on XML versioning. It provides a lot of material explaining why versioning using XML Schema is essentially impossible.

Thanks to David Orchard for pointing me at http://www.xfront.com/BestPracticesHomepage.html. Thanks to Larry Masinter for pointing me at http://search.ietf.org/internet-drafts/draft-hollenbeck-ietf-xml-guidelines-00.txt. Thanks to Simon Basin for pointing out that my original text on avoiding strings for enumerated types needed fixing.
