Fuzzy Pointers - pointing into SGML or invalid documents.

Version 1
25th April 2005.

Jim Ley, Jibbering.com


XPointer is very successful at identifying content in valid XML documents. It is less successful at identifying content in HTML documents as read by real-world parsers, where elements may be optional and validity errors are common.

Fuzzy Pointers

A solution to this is the Fuzzy Pointer. It uses a subset of the elements in a document to construct a pointer, and takes as its basis the hierarchy of elements that a parser constructs, even from an invalid document. These pointers are less accurate than XPointers in XML content, as there is the opportunity for ambiguity. In most cases, however, the pointer will be unambiguous and interoperable between implementations.

A fuzzy pointer looks exactly like an XPointer, for example /html/body/table[2]/tr[1]/td[0]/img[@src='moomin.gif'], but unlike an XPointer it may leave out some elements. In this example a tbody would commonly be found between the table and the tr; however, not all implementations insert a tbody into tables. For this reason it is left out of the fuzzy pointer, and a user agent consuming the pointer knows to ignore any tbody in its tree and just count the tr elements.
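To make the consuming side concrete, here is a minimal Python sketch of how a user agent might resolve such a pointer while looking through tbody. It is an illustration, not either of the implementations mentioned in this note. The names IGNORED, visible_children and resolve are hypothetical, the trees are built with xml.etree.ElementTree from well-formed markup rather than by a real HTML parser, and positions are assumed to count from 1 as in XPath.

```python
import re
import xml.etree.ElementTree as ET

# Elements treated as transparent when walking the tree (assumption: only tbody).
IGNORED = {"tbody"}

# One pointer step: a name, then an optional [n] position or [@attr='value'] test.
STEP = re.compile(r"^(\w+)(?:\[(\d+)\]|\[@(\w+)='([^']*)'\])?$")


def visible_children(node):
    """Yield the children of node, looking through ignored elements."""
    for child in node:
        if child.tag in IGNORED:
            yield from visible_children(child)
        else:
            yield child


def resolve(root, pointer):
    """Return the element a fuzzy pointer identifies, or None if there is none."""
    # Limitation of this sketch: attribute values must not contain "/".
    steps = pointer.strip("/").split("/")
    first = STEP.match(steps[0])
    if first is None or first.group(1) != root.tag:
        return None
    node = root
    for step in steps[1:]:
        match = STEP.match(step)
        if match is None:
            return None
        name, position, attr, value = match.groups()
        candidates = [c for c in visible_children(node) if c.tag == name]
        if attr is not None:
            candidates = [c for c in candidates if c.get(attr) == value]
        # Assumption: positions count from 1; a bare name means the first match.
        index = int(position) - 1 if position else 0
        if not 0 <= index < len(candidates):
            return None
        node = candidates[index]
    return node


# The same table as two parsers might build it: one inserts tbody, one does not.
with_tbody = ET.fromstring(
    "<html><body><table><tbody><tr><td><img src='moomin.gif'/></td></tr>"
    "</tbody></table></body></html>"
)
without_tbody = ET.fromstring(
    "<html><body><table><tr><td><img src='moomin.gif'/></td></tr>"
    "</table></body></html>"
)
pointer = "/html/body/table[1]/tr[1]/td[1]/img[@src='moomin.gif']"
found_a = resolve(with_tbody, pointer)
found_b = resolve(without_tbody, pointer)
```

Both calls find the img element, because tbody never takes part in the count. A pointer that runs off the tree, such as one asking for a second table here, resolves to None.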

The choice of which elements to ignore is the next problem, and there can be no perfect solution. In the original testing [try and find the rest of the discussion!], OpenSP, IE and Mozilla all constructed similar trees for the content of the body, even on invalid content; tbody was the only difference we could find. The head differed more. The conclusion was that if you included /body/ as the start of the pointer, you only needed to leave out tbody to get a mostly interoperable pointer. Pointing into the head of an invalid HTML document was much less reliable, and no reliable interoperable pointer was found.
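The producing side can be sketched in the same way: given an element, climb to the root and emit one step per ancestor, leaving out tbody. This is a rough Python illustration only. The function make_pointer and the IGNORED set are hypothetical names, the trees come from xml.etree.ElementTree rather than from OpenSP, IE or Mozilla, and positions are assumed to be 1-based.

```python
import xml.etree.ElementTree as ET

# Assumption: tbody is the only element on which parsers disagree.
IGNORED = {"tbody"}


def visible_children(node):
    """Yield the children of node, looking through ignored elements."""
    for child in node:
        if child.tag in IGNORED:
            yield from visible_children(child)
        else:
            yield child


def make_pointer(root, target):
    """Build a fuzzy pointer string leading from root down to target."""
    if target.tag in IGNORED:
        raise ValueError("cannot point at an ignored element")
    # ElementTree keeps no parent links, so build a child-to-parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        # Climb to the nearest ancestor that is not ignored.
        parent = parents[node]
        while parent.tag in IGNORED:
            parent = parents[parent]
        same_name = [c for c in visible_children(parent) if c.tag == node.tag]
        position = next(i for i, c in enumerate(same_name, start=1) if c is node)
        steps.append(f"{node.tag}[{position}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


# One document as two parsers might see it, with and without an inserted tbody.
tree_a = ET.fromstring(
    "<html><body><table><tbody><tr><td>x</td></tr>"
    "<tr><td>y</td><td><img src='moomin.gif'/></td></tr>"
    "</tbody></table></body></html>"
)
tree_b = ET.fromstring(
    "<html><body><table><tr><td>x</td></tr>"
    "<tr><td>y</td><td><img src='moomin.gif'/></td></tr>"
    "</table></body></html>"
)
pointer_a = make_pointer(tree_a, tree_a.find(".//img"))
pointer_b = make_pointer(tree_b, tree_b.find(".//img"))
```

Under these assumptions both trees yield the same pointer string, which is the property that makes the pointer usable across implementations.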

As well as the full pointer, which uses every interoperable element in the page, we also looked at limited pointers defined over an even smaller subset of elements: perhaps just headings and paragraphs when pointing out spelling mistakes, or just images when reporting missing ALT attributes. The problem with these is communicating between implementations which type of pointer is being used. For that reason we should probably define just the basic pointer and possibly a couple more.
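One way to picture a limited pointer is as a position within a named subset of elements, counted in document order. The Python sketch below is purely illustrative: the subset names "text" and "images", the "name:position" syntax, and the functions limited_pointer and resolve_limited are all invented here and were not part of the original work. Carrying the subset name inside the pointer is one possible answer to the communication problem, since both sides must agree on what each name means.

```python
import xml.etree.ElementTree as ET

# Hypothetical named subsets; both sides would have to agree on these in advance.
SUBSETS = {
    "text": {"h1", "h2", "h3", "h4", "h5", "h6", "p"},
    "images": {"img"},
}


def members_of(root, subset):
    """Return the elements of the named subset in document order."""
    names = SUBSETS[subset]
    return [element for element in root.iter() if element.tag in names]


def limited_pointer(root, target, subset):
    """Describe target as 'subset:position', counting from 1."""
    for position, element in enumerate(members_of(root, subset), start=1):
        if element is target:
            return f"{subset}:{position}"
    raise ValueError("target is not a member of this subset")


def resolve_limited(root, pointer):
    """Return the element named by a 'subset:position' pointer, or None."""
    subset, _, position = pointer.partition(":")
    members = members_of(root, subset)
    index = int(position) - 1
    return members[index] if 0 <= index < len(members) else None


page = ET.fromstring(
    "<html><body><h1>Title</h1><div><p>Frist paragraph</p></div>"
    "<p>Second paragraph</p><img src='moomin.gif'/></body></html>"
)
misspelt = page.findall(".//p")[0]
text_pointer = limited_pointer(page, misspelt, "text")
image_pointer = limited_pointer(page, page.find(".//img"), "images")
```

Because the div is not in the "text" subset, wrapping or unwrapping the paragraph in a div leaves its pointer unchanged; only adding or removing headings and paragraphs before it would shift the count.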

The fuzziness sacrifices some reliability in identifying an element in exchange for a pointer that survives more changes to the document; for the Annotea project this was important. The problem then was detecting whether the pointer was still valid, and here we looked at various hashes of the document.

Document Hashes

Storing a simple hash of the document allows us to detect whether the document has changed. A hash of the whole document, however, will change if even the spelling of one word changes. The alternative is to store hashes of just parts of the content, such as the structure. If we store a hash of the same elements we use in constructing the fuzzy pointer, we can tell whether the fuzzy pointer is likely to still be valid. Not definitely, of course, as it could be a completely different document with the same structure. Nick Kew's server implementation of this is sadly unavailable at the moment; mine is still available, but not well written up.
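A structural hash of this kind might be computed roughly as follows. This Python sketch is not a description of either implementation mentioned above. The choice of SHA-256, the token format, and the names structure_tokens and structure_hash are assumptions made for illustration, and text and attributes are deliberately left out of the hash.

```python
import hashlib
import xml.etree.ElementTree as ET

# Assumption: the same elements are ignored here as in the pointer itself.
IGNORED = {"tbody"}


def structure_tokens(node):
    """Yield open and close tokens for every element that is not ignored."""
    if node.tag not in IGNORED:
        yield f"<{node.tag}>"
    for child in node:
        yield from structure_tokens(child)
    if node.tag not in IGNORED:
        yield f"</{node.tag}>"


def structure_hash(root):
    """Hash the element structure only; text and attributes play no part."""
    joined = "".join(structure_tokens(root))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


original = ET.fromstring(
    "<html><body><p>A speling mistake</p><table><tbody><tr><td>1</td></tr>"
    "</tbody></table></body></html>"
)
# Same structure: a word is corrected and the parser did not insert tbody.
respelt = ET.fromstring(
    "<html><body><p>A spelling mistake</p><table><tr><td>1</td></tr>"
    "</table></body></html>"
)
# Different structure: a second row has been added to the table.
restructured = ET.fromstring(
    "<html><body><p>A spelling mistake</p><table><tr><td>1</td></tr>"
    "<tr><td>2</td></tr></table></body></html>"
)
unchanged = structure_hash(original) == structure_hash(respelt)
changed = structure_hash(original) != structure_hash(restructured)
```

A matching hash only says that the structure the pointer relies on is the same. An unrelated document that happens to share that structure would also match, which is why the check indicates likely validity rather than proving it.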