Comparing XML files ignoring order of attributes and child elements

I need to diff some XML files.

For these particular XML files, order is not important. The XML is being used to contain a set of things, not a list – the order of the elements has no significance. Similarly, the order of the attributes within each element isn’t significant.

For example, for my purposes, these two XML files are equivalent:

<myroot>
    <mychild id="123">
        <fruit>apple</fruit>
        <test hello="world" brackets="angled" question="answers"/>
        <comment>This is a comment</comment>
    </mychild>
    <mychild id="456">
        <fruit>banana</fruit>
    </mychild>
    <mychild id="789">
        <fruit>orange</fruit>
        <test brackets="round" hello="greeting">
            <number>111</number>
        </test>
        <dates>
              <modified>123</modified>
              <created>253</created>
              <accessed>44</accessed>
        </dates>
    </mychild>
</myroot>

<myroot>
    <mychild id="789">
        <fruit>orange</fruit>
        <test hello="greeting" brackets="round">
            <number>111</number>
        </test>
        <dates>
              <accessed>44</accessed>    
              <modified>123</modified>
              <created>253</created>
        </dates>
    </mychild>
    <mychild id="123">
        <test question="answers" hello="world" brackets="angled"/>
        <comment>This is a comment</comment>
        <fruit>apple</fruit>
    </mychild>
    <mychild id="456">
        <fruit>banana</fruit>
    </mychild>
</myroot>

I needed to compare some large XML files, which have big differences in the order of elements, and I couldn’t find a tool that would do the job. So I wrote a bit of Python to do it for me.

How it works

I cheated.

Diff tools are complex, and I’m in a hurry, with no time to implement one.

Instead, to compare two of my XML files, my approach is to sort them both so they have a consistent order, and then diff the sorted files using an existing visual diff tool. (On Windows, I prefer vsdiff from SlickEdit. On Mac, I prefer diffmerge. My approach works with either of these.)
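The sorting step can be sketched roughly as follows. This is not the code from the gist: it is a minimal illustration that assumes the standard library's `xml.etree.ElementTree` rather than lxml, and the names `sort_key`, `sort_element` and `sorted_xml` are made up for this example. It sorts each element's attributes by name and its children by a key built from tag, attributes, text and (already sorted) descendants, then re-indents the result so that a line-based diff tool sees a consistent layout.

```python
import xml.etree.ElementTree as ET


def sort_key(elem):
    """Build a comparable key from tag, sorted attributes, text and children."""
    attrs = tuple(sorted(elem.attrib.items()))
    text = (elem.text or "").strip()
    children = tuple(sort_key(child) for child in elem)
    return (elem.tag, attrs, text, children)


def sort_element(elem):
    """Recursively sort attributes and child elements in place."""
    # Sort the descendants first, so each child's key is already canonical.
    for child in elem:
        sort_element(child)
    # Re-insert the attributes in name order.
    sorted_attrs = sorted(elem.attrib.items())
    elem.attrib.clear()
    elem.attrib.update(sorted_attrs)
    # Reorder the children by their canonical key.
    elem[:] = sorted(elem, key=sort_key)


def sorted_xml(xml_text):
    """Return a sorted, consistently indented serialisation of xml_text."""
    root = ET.fromstring(xml_text)
    sort_element(root)
    ET.indent(root, space="    ")  # needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")
```

With this sketch, two documents that differ only in attribute order or child order should serialise to the same text, so only real differences show up in the diff. Like the original, it reads the whole document into memory.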

Example

For example, consider the following simple test files:

testA.xml

<myroot>
    <mychild id="123">
        <fruit>apple</fruit>
        <test hello="world" testing="removed" brackets="angled" question="answers"/>
        <comment>This is a comment</comment>
    </mychild>
    <mychild id="456">
        <fruit>banana</fruit>
        <comment>This will be removed</comment>
    </mychild>
    <mychild id="789">
        <fruit>orange</fruit>
        <test brackets="round" hello="greeting">
            <number>111</number>
        </test>
        <dates>
              <modified>123</modified>
              <created>880</created>
              <accessed>44</accessed>
        </dates>
    </mychild>
</myroot>

testB.xml

<myroot>
    <mychild id="789">
        <fruit>orange</fruit>
        <test hello="greeting" brackets="round">
            <number>111</number>
        </test>
        <dates>
              <accessed>49</accessed>    
              <modified>123</modified>
              <created>253</created>
        </dates>
    </mychild>
    <mychild id="123">
        <test question="answers" hello="world" brackets="angled"/>
        <comment>This is a comment</comment>
        <fruit>apple</fruit>
    </mychild>
    <mychild id="456">
        <fruit>banana</fruit>
    </mychild>
</myroot>

On Mac, I run:
$ python xmldiff.py diffmerge testA.xml testB.xml

And get:
[Screenshot: Screen Shot 2014-10-06 at 02.48.02]

On Windows, I run:
$ python xmldiff.py vsdiff testA.xml testB.xml

And get:
[Screenshot: screenshot-windows-20141006-0255]
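The command lines above pass the name of the diff tool followed by the two files to compare. A wrapper with that shape might look something like the sketch below. Again, this is an illustration rather than the gist's actual code: it assumes the standard library's `xml.etree.ElementTree`, assumes the diff tool is on the PATH and accepts two file paths, and the helper names (`canonical_key`, `sort_in_place`, `write_sorted_copy`, `main`) are hypothetical.

```python
import subprocess
import tempfile
import xml.etree.ElementTree as ET


def canonical_key(elem):
    # Hypothetical helper: tag, sorted attributes, text, then children.
    return (elem.tag, sorted(elem.attrib.items()), (elem.text or "").strip(),
            [canonical_key(child) for child in elem])


def sort_in_place(elem):
    # Sort descendants first, then this element's attributes and children.
    for child in elem:
        sort_in_place(child)
    attrs = sorted(elem.attrib.items())
    elem.attrib.clear()
    elem.attrib.update(attrs)
    elem[:] = sorted(elem, key=canonical_key)


def write_sorted_copy(path):
    """Write a sorted copy of the XML file at path to a temp file; return its name."""
    root = ET.parse(path).getroot()
    sort_in_place(root)
    ET.indent(root, space="    ")  # needs Python 3.9 or later
    with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
        tmp.write(ET.tostring(root, encoding="utf-8"))
    return tmp.name


def main(argv):
    if len(argv) != 4:
        print("usage: python xmldiff.py <difftool> <fileA> <fileB>")
        return 2
    tool, file_a, file_b = argv[1:4]
    # Assumption: the tool is on PATH and takes the two files as arguments.
    return subprocess.call([tool, write_sorted_copy(file_a), write_sorted_copy(file_b)])
```

To use it as a script, call `main(sys.argv)` from the usual `if __name__ == "__main__":` guard. The temporary files are deliberately left on disk (`delete=False`) so that the diff tool can still read them after the script has written them.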

Source

The source showing how this works is available in a gist at gist.github.com/dalelane.

It’s a quick hack to let me compare a handful of files, so it’s not been rigorously tested. But it’s a very simple little tool, and was good enough for my purposes tonight!


4 Responses to “Comparing XML files ignoring order of attributes and child elements”

  1. Adrian Thomson says:

    Thanks for this Dale, we use a product that has the annoying knack of reordering attributes in its config files when an upgrade is applied, so on an initial diff it looks like a lot of things have changed.

    On running xmldiff however, we can see the important things that have changed (down to 3 lines different from about 80!).

    One thing to note – I needed to pip install lxml before it would work – this was on a pretty much new install of OS X on the Mac so a clean python install.

  2. Chris says:

    Another thank you from me, Dale! I was about to write something similar, but found your elegant and simple solution that did exactly what I was planning to do (and more, I did not plan to directly integrate the diff tool, but why not!). Cheers!

  3. Sahil Sethi says:

    That was quite informative! I am trying to do something similar too. But the XML files I’m working on are 4-5 GB in size, so an entire XML file won’t fit into memory. Will this method work for them? Or do you have any ideas for how to implement it?

  4. dale says:

    Hiya – Sorry, no, I didn’t do it in a streaming way. I was in a hurry so just read it and sorted it in memory.