1. Feb 1st, 2007

    Friendly XML, trusted XML and parity bits

    Anne van Kesteren: XML with graceful error handling?

    Yes, please.

    Tim Bray has some doubts about the practicality:

    There’s a spectrum of situations: at one end, if an electronic-trading system receives an XML message for a transaction valued at €2,000,000, and there’s a problem with a missing end tag, you do not want the system guessing what the message meant, you want to report an error. At the other end, if someone sends a blog post from their cellphone with a picture of a cute kitten, you don’t want to reject it because there’s an “&” in the wrong spot. The world is complicated.

    My first thought was to suggest having two conformance levels. One that requires the XML to be well-formed, valid and all the other good stuff. And one that doesn’t. We have the technology. We can do HTML and XHTML, and spot HTML that’s pretending to be XHTML. We just need to work that knowledge back into the spec.

    But then it struck me. Occasionally, I will bump into an XML parser that has difficulties with non-ASCII text, or that doesn’t do entity expansion, or has its own interpretation of the more obscure parts of Namespaces in XML. Well-formed and valid are only good enough if you implement the specs correctly.

    What if, instead, you take your data in its canonical form, hash it, and send that hash along with the XML? At the other end, you read the XML as best you can understand it, hash its canonical form, and compare the two hashes. You might find that both ends agree on the document but, after differences of opinion over the content encoding and some namespace gymnastics, disagree on what the document says.

    Think of it as an XML parity bit.
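
    A minimal sketch of that idea, assuming Python 3.8+ and its standard-library `canonicalize` function (which implements Canonical XML 2.0). The helper name `xml_parity`, the choice of SHA-256 and the sample documents are illustrative assumptions, not part of any spec:

```python
import hashlib
from xml.etree.ElementTree import canonicalize  # Python 3.8+, C14N 2.0


def xml_parity(xml_text: str) -> str:
    """Return a SHA-256 hex digest of the document's canonical form."""
    canonical = canonicalize(xml_data=xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Sender: hash the canonical form and ship the digest with the document.
sent = '<post b="2" a="1"><title>Kitten</title><body/></post>'
digest = xml_parity(sent)

# Receiver: same data, different serialization (attribute order,
# quoting style, explicit end tag instead of an empty-element tag).
received = "<post a='1' b='2'><title>Kitten</title><body></body></post>"
same = xml_parity(received) == digest

# A receiver that understood the title differently would not match.
misread = "<post a='1' b='2'><title>Kitte</title><body></body></post>"
differs = xml_parity(misread) != digest
```

    Here `same` comes out true and `differs` comes out true: the hash ignores how the document was written down and only reacts to what each end thinks it says.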

    1. Feb 1st, 2007

      Eran

      Hashing XML files is a very delicate process.
      Just look at the XML Signature spec, which has a section about hashing an XML document (or part of it) for change verification purposes.

      There are a couple of problems there:
      1) Encoding: some implementations will not handle the various encodings correctly and will hash something that is not the exact encoding, thus producing a wrong hash.
      2) Whitespace – the holy pain in the butt for XML – some will hash with whitespace, some without.

      Even hashing this simple, plain crap is going to be a nightmare, and it might suffer from the same problems every other XML document has.

      Then again, good idea though… It’s not that I’m just seeing the half empty glass :-)

    2. Feb 1st, 2007

      Aristotle

      The problem will be defining a canonical form that everyone agrees on and that actually looks like what any application would use directly. If you define a canonical form that’s just another intermediate format, it might help a little, but you won’t really gain that much.

    3. Feb 1st, 2007

      Assaf

      Eran,

      Canonical XML (which XML Signature uses) is UTF-8 encoded. So for hashing purposes, there’s only one encoding to get right. And if you use any other encoding for the actual document, it verifies you got the implementation right (or wrong, but consistently on both ends).
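
      To illustrate the encoding point, here is a rough sketch, assuming Python 3.8+ and its standard-library `canonicalize` (Canonical XML 2.0, standing in for whichever Canonical XML version the two ends agree on). The helper `digest_of_bytes` and the sample price element are made up for the example:

```python
import hashlib
import io
from xml.etree.ElementTree import canonicalize  # Python 3.8+, C14N 2.0


def digest_of_bytes(raw: bytes) -> str:
    """Parse raw bytes, canonicalize, and hash the UTF-8 canonical form."""
    canonical = canonicalize(from_file=io.BytesIO(raw))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


body = "<price currency='EUR'>\u20ac2,000,000</price>"

# The same document on the wire in three different ways.
as_utf8 = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")
as_utf16 = ('<?xml version="1.0" encoding="UTF-16"?>' + body).encode("utf-16")
as_ascii = b"<price currency='EUR'>&#8364;2,000,000</price>"

encodings_agree = (
    digest_of_bytes(as_utf8)
    == digest_of_bytes(as_utf16)
    == digest_of_bytes(as_ascii)
)

# Simulated bug: UTF-8 bytes labelled as ISO-8859-1, so the parser
# decodes the euro sign into three unrelated characters.
mislabelled = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>' + body
).encode("utf-8")
bug_detected = digest_of_bytes(mislabelled) != digest_of_bytes(as_utf8)
```

      All three correct serializations hash to the same value, while the mislabelled one does not, which is exactly the kind of disagreement the hash is meant to surface.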

      Whitespace is a bit trickier, but that too is solvable.
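
      One possible way to handle it, sketched with Python 3.8+ and the `strip_text` option of the standard-library `canonicalize`. Whether stripping is acceptable depends on the vocabulary, since it also trims leading and trailing whitespace inside text content; the helper `digest` and the sample order are assumptions for the example:

```python
import hashlib
from xml.etree.ElementTree import canonicalize  # Python 3.8+, C14N 2.0


def digest(xml_text: str, strip_text: bool = False) -> str:
    """Hash the canonical form, optionally stripping surrounding whitespace."""
    canonical = canonicalize(xml_data=xml_text, strip_text=strip_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


compact = "<order><item>kitten</item><qty>1</qty></order>"
pretty = """<order>
  <item>kitten</item>
  <qty>1</qty>
</order>"""

# By default whitespace between elements is part of the document,
# so the pretty-printed copy hashes differently.
strict_match = digest(compact) == digest(pretty)

# If both ends agree to strip it, the two copies hash the same.
relaxed_match = digest(compact, strip_text=True) == digest(pretty, strip_text=True)
```

      Here `strict_match` is false and `relaxed_match` is true, so the whitespace rule has to be part of what the two ends agree on, alongside the hash function.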

      The point is to make a strong assertion that two implementations are working on the same data, at least in the XML sense of it. Then you can relax well-formed to be just a means to that end, and turn it off in some implementations.
