PolicyStat's Dev Blog

Structure Aware Change Tracking

At PolicyStat, whenever we write a chunk of code that looks broadly useful, we like to release it as open source. We have recently released HTML Tree Diff, a library for showing diffs between HTML documents in a structure-aware way. It is written in Python, and you can get the source code on GitHub or install it from the Python Package Index.

We work with HTML documents every day, and we were disappointed to find no existing library that could display “track-changes” style diffs between HTML documents. This code has been in production use since June 2009, and we’re excited to share it with the community.

Documents

Let’s say you have a document. A Very Important Document™. And some Very Important People are interested in what’s in the document. Now, as very important as these people are, they don’t have the time to read through the entire thing each time it gets updated, but they would like to know exactly what the changes are. That’s where diff comes in.

Let me show you an example:

[Images: the old document (left) and the new document (right)]

Someone has changed the jelly donut schedule! It’s been the same for years! How will we remember the new one? Everybody panic!

PANIC

Diffs

But wait, through the clever use of technology, you can calm the panic by showing them exactly what changed between the two documents:

[Image: inline diff of the two documents]

As you may know, this is a diff. It concisely shows which lines have changed between successive versions. We can even fancy it up and use HTML styling to make it more readable:

[Images: source of the styled diff (left) and the rendered result (right)]
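A line-based diff of this kind can be sketched with Python's standard `difflib` module. The schedule text and the `line_diff` helper below are made up for illustration; they are not the documents pictured above.

```python
import difflib


def line_diff(old_text, new_text):
    """Return a unified diff of two flat-text documents as a list of lines."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="old", tofile="new", lineterm=""
        )
    )


# Hypothetical documents, invented for this example.
OLD = "Jelly Donut Schedule\nMonday: raspberry\nFriday: lemon"
NEW = "Jelly Donut Schedule\nTuesday: raspberry\nFriday: lemon"

# Removed lines are prefixed with "-", added lines with "+",
# and unchanged context lines with a single space.
print("\n".join(line_diff(OLD, NEW)))
```

Because the comparison works on whole lines, a one-word change shows up as one removed line and one added line.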

HTML Documents

This works pretty well for files that are just flat text, but what if our Very Important Document is in HTML format? It turns out that PolicyStat has exactly this situation. We have tens of thousands of Very Important Documents that are stored in HTML format, and have multiple versions.

So let’s look at what happens when we try the same thing on an HTML document:

[Images: HTML source of the text-based diff (left) and the rendered result (right)]

Disaster! That’s not even valid HTML! The Very Important People are now Very Angry!
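To see how this kind of breakage can happen, here is a minimal sketch of a naive diff that treats HTML source as a flat sequence of tags and words. The helper names (`naive_html_diff`, `is_well_nested`) and the sample markup are assumptions made up for this illustration, and the nesting check is deliberately simple: it ignores void elements such as `<br>`.

```python
import difflib
import re
from html.parser import HTMLParser

# Split markup into tags, runs of whitespace, and words.
TOKEN_RE = re.compile(r"<[^>]+>|\s+|[^<\s]+")


def naive_html_diff(old_html, new_html):
    """Diff two HTML strings as flat token lists, wrapping changes in <del>/<ins>."""
    old_tokens = TOKEN_RE.findall(old_html)
    new_tokens = TOKEN_RE.findall(new_html)
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(old_tokens[i1:i2])
            continue
        if i2 > i1:  # tokens that exist only in the old version
            out.append("<del>" + "".join(old_tokens[i1:i2]) + "</del>")
        if j2 > j1:  # tokens that exist only in the new version
            out.append("<ins>" + "".join(new_tokens[j1:j2]) + "</ins>")
    return "".join(out)


class NestingChecker(HTMLParser):
    """Track open tags on a stack and flag any close tag that does not match."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.well_nested = True

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.well_nested = False


def is_well_nested(html):
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.well_nested and not checker.stack


# Hypothetical input: one paragraph is split into two.
OLD = "<p>Monday Friday</p>"
NEW = "<p>Monday</p><p>Friday</p>"

result = naive_html_diff(OLD, NEW)
print(result)
print(is_well_nested(OLD), is_well_nested(NEW), is_well_nested(result))
```

Both inputs are well nested, but the diff is not: the inserted span contains a closing `</p>` whose opening tag sits outside it, so the `<ins>` element straddles a paragraph boundary.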

HTML Diffs

What do we do about this? It turns out that this is not a trivial problem to solve. You have to consider that HTML is not flat like a text file, but actually a tree structure.
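To make the tree shape concrete, here is a small sketch that uses only the standard library's `html.parser`. `TreeBuilder`, `parse_tree`, `outline`, and the sample markup are illustrative names invented here, not part of HTML Tree Diff, and the builder assumes tidy markup in which every tag is properly closed.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Build a nested dict/list tree from well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self._open = [self.root]  # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self._open[-1]["children"].append(node)
        self._open.append(node)

    def handle_endtag(self, tag):
        if len(self._open) > 1 and self._open[-1]["tag"] == tag:
            self._open.pop()

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self._open[-1]["children"].append(data)


def parse_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, one line per node."""
    if isinstance(node, str):
        return ["  " * depth + repr(node)]
    lines = ["  " * depth + node["tag"]]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical markup, invented for this example.
SAMPLE = "<ul><li>Monday</li><li>Friday</li></ul>"
print("\n".join(outline(parse_tree(SAMPLE))))
```

The text "Monday" is not simply the third token in a stream; it is a leaf under an `li`, which is itself a child of the `ul`. A diff that ignores those parent-child relationships has no way to know which tags must stay paired.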

So, to create a diff between two HTML documents, the diff algorithm needs to be aware of the tree structure. There has been some research in this area, but none of what we found had been implemented in a practical form that holds up on real-world documents.

To solve this, I wrote a library that does structure-aware diffs of HTML trees. Here’s some example output:

[Images: HTML source of the structure-aware diff (left) and the rendered result (right)]
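The algorithm inside HTML Tree Diff is not reproduced here. As a rough illustration of the general idea only, the sketch below walks two trees in parallel and confines word-level changes to text nodes, so a tag is never split. It is a much simpler technique than a real tree diff: it assumes well-formed, XML-compatible markup with a single root element, it ignores attributes, and it marks a whole subtree as replaced whenever the element structure differs. All function names and the sample markup are hypothetical.

```python
import difflib
from xml.dom import minidom


def _wrap(doc, tag, child):
    """Return a new <tag> element containing the given child node."""
    element = doc.createElement(tag)
    element.appendChild(child)
    return element


def _diff_text(doc, old_text, new_text):
    """Word-level diff of two strings, returned as a list of DOM nodes."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    pieces = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            pieces.append(doc.createTextNode(" ".join(old_words[i1:i2])))
            continue
        if i2 > i1:
            text = doc.createTextNode(" ".join(old_words[i1:i2]))
            pieces.append(_wrap(doc, "del", text))
        if j2 > j1:
            text = doc.createTextNode(" ".join(new_words[j1:j2]))
            pieces.append(_wrap(doc, "ins", text))
    nodes = []
    for index, piece in enumerate(pieces):
        if index:  # put a single space between neighbouring pieces
            nodes.append(doc.createTextNode(" "))
        nodes.append(piece)
    return nodes


def _diff_node(doc, old, new):
    """Return a list of nodes describing how `old` became `new`."""
    if old.nodeType == new.nodeType == old.TEXT_NODE:
        return _diff_text(doc, old.data, new.data)
    same_shape = (
        old.nodeType == new.nodeType == old.ELEMENT_NODE
        and old.tagName == new.tagName
        and len(old.childNodes) == len(new.childNodes)
    )
    if same_shape:
        merged = doc.createElement(new.tagName)  # attributes ignored in this sketch
        for old_child, new_child in zip(old.childNodes, new.childNodes):
            for node in _diff_node(doc, old_child, new_child):
                merged.appendChild(node)
        return [merged]
    # Structure differs: mark the whole old subtree deleted, the new one inserted.
    return [
        _wrap(doc, "del", doc.importNode(old, True)),
        _wrap(doc, "ins", doc.importNode(new, True)),
    ]


def tree_diff(old_markup, new_markup):
    old_root = minidom.parseString(old_markup).documentElement
    new_root = minidom.parseString(new_markup).documentElement
    doc = minidom.Document()
    return "".join(node.toxml() for node in _diff_node(doc, old_root, new_root))


# Hypothetical schedule, invented for this example.
OLD = "<ul><li>Raspberry on Monday</li><li>Lemon on Friday</li></ul>"
NEW = "<ul><li>Raspberry on Tuesday</li><li>Lemon on Friday</li></ul>"
print(tree_diff(OLD, NEW))
```

Because every `<del>` and `<ins>` is created as a complete element in the output tree, the result is well formed by construction. The cost of this simplification is precision: any structural edit, such as adding a list item, is reported as a replacement of the entire parent.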

The day is saved! Everyone shows up on time for jelly donuts, and the donut rebellion of 2011 is quelled!
