I need to diff some XML files.
For these particular XML files, order is not important. The XML is being used to contain a set of things, not a list – the order of the elements has no significance. Similarly, the order of the attributes within each element isn’t significant.
For example, for my purposes, these two XML files are equivalent:
<myroot>
    <mychild id="123">
        <fruit>apple</fruit>
        <test hello="world" brackets="angled" question="answers"/>
        <comment>This is a comment</comment>
    </mychild>
    <mychild id="456">
        <fruit>banana</fruit>
    </mychild>
    <mychild id="789">
        <fruit>orange</fruit>
        <test brackets="round" hello="greeting">
            <number>111</number>
        </test>
        <dates>
            <modified>123</modified>
            <created>253</created>
            <accessed>44</accessed>
        </dates>
    </mychild>
</myroot>

<myroot>
    <mychild id="789">
        <fruit>orange</fruit>
        <test hello="greeting" brackets="round">
            <number>111</number>
        </test>
        <dates>
            <accessed>44</accessed>
            <modified>123</modified>
            <created>253</created>
        </dates>
    </mychild>
    <mychild id="123">
        <test question="answers" hello="world" brackets="angled"/>
        <comment>This is a comment</comment>
        <fruit>apple</fruit>
    </mychild>
    <mychild id="456">
        <fruit>banana</fruit>
    </mychild>
</myroot>
I needed to compare some large XML files, which have big differences in the order of elements, and I couldn’t find a tool that would do the job. So I wrote a bit of Python to do it for me.
How it works
I cheated.
Diff tools are complex, and I’m in a hurry, with no time to implement one.
Instead, to compare two of my XML files, my approach is to sort them both so they have a consistent order, and then diff the sorted files using an existing visual diff tool. (On Windows, I prefer vsdiff from SlickEdit. On Mac, I prefer diffmerge. My approach works with either of these.)
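The sorting step could be sketched roughly as follows. This is not the code from the gist. It is a minimal illustration that assumes each file fits in memory and uses the standard library's xml.etree.ElementTree, where a commenter below reports that the original needs lxml. The function names canonicalize and sorted_xml are made up for this example, and ET.indent needs Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def canonicalize(elem):
    """Sort attributes and child elements in place, recursively."""
    # Re-insert attributes in name order; serialization follows
    # insertion order on Python 3.8+ (older versions sort them anyway).
    attrs = sorted(elem.attrib.items())
    elem.attrib.clear()
    elem.attrib.update(attrs)

    # Drop whitespace-only text so indentation does not affect the result.
    if elem.text is not None and not elem.text.strip():
        elem.text = None

    for child in elem:
        canonicalize(child)
        if child.tail is not None and not child.tail.strip():
            child.tail = None

    # Children are already canonical, so their serialized form is a
    # stable sort key covering tag, attributes, text and subtree.
    elem[:] = sorted(elem, key=ET.tostring)


def sorted_xml(text):
    """Return a consistently ordered, re-indented copy of an XML document."""
    root = ET.fromstring(text)
    canonicalize(root)
    ET.indent(root)  # Python 3.9+
    return ET.tostring(root, encoding="unicode")
```

Passing the two equivalent files from the top of this post through sorted_xml should produce identical text, so a diff tool would show no differences between them.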
Example
For example, consider the following simple test files:
testA.xml
<myroot>
    <mychild id="123">
        <fruit>apple</fruit>
        <test hello="world" testing="removed" brackets="angled" question="answers"/>
        <comment>This is a comment</comment>
    </mychild>
    <mychild id="456">
        <fruit>banana</fruit>
        <comment>This will be removed</comment>
    </mychild>
    <mychild id="789">
        <fruit>orange</fruit>
        <test brackets="round" hello="greeting">
            <number>111</number>
        </test>
        <dates>
            <modified>123</modified>
            <created>880</created>
            <accessed>44</accessed>
        </dates>
    </mychild>
</myroot>
testB.xml
<myroot>
    <mychild id="789">
        <fruit>orange</fruit>
        <test hello="greeting" brackets="round">
            <number>111</number>
        </test>
        <dates>
            <accessed>49</accessed>
            <modified>123</modified>
            <created>253</created>
        </dates>
    </mychild>
    <mychild id="123">
        <test question="answers" hello="world" brackets="angled"/>
        <comment>This is a comment</comment>
        <fruit>apple</fruit>
    </mychild>
    <mychild id="456">
        <fruit>banana</fruit>
    </mychild>
</myroot>
On Mac, I run:
$ python xmldiff.py diffmerge testA.xml testB.xml
On Windows, I run:
$ python xmldiff.py vsdiff testA.xml testB.xml
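For anyone curious how a command like the ones above might hand the sorted files to the diff tool, here is a rough sketch. It is an illustration, not the gist's code. It assumes the tool name given on the command line is on the PATH, that the tool accepts the two files to compare as positional arguments, and that it blocks until its window is closed. The function names are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def write_sorted_copies(sorted_a, sorted_b, workdir):
    """Write two already-sorted XML strings to files and return their paths."""
    workdir = Path(workdir)
    path_a = workdir / "sorted_a.xml"
    path_b = workdir / "sorted_b.xml"
    path_a.write_text(sorted_a, encoding="utf-8")
    path_b.write_text(sorted_b, encoding="utf-8")
    return path_a, path_b


def diff_command(tool, path_a, path_b):
    # Assumption: the tool takes the two files as positional arguments.
    return [tool, str(path_a), str(path_b)]


def launch_diff(tool, sorted_a, sorted_b):
    """Open the two sorted documents side by side in the given diff tool."""
    # The temporary directory is removed once the tool exits, so this
    # only works if the tool does not return before its window is closed.
    with tempfile.TemporaryDirectory() as workdir:
        path_a, path_b = write_sorted_copies(sorted_a, sorted_b, workdir)
        return subprocess.call(diff_command(tool, path_a, path_b))
```

Writing sorted copies to a temporary directory leaves the original files untouched, which matters because the sorted order is only useful for comparison.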
Source
The source showing how this works is available in a gist at gist.github.com/dalelane.
It’s a quick hack to let me compare a handful of files, so it’s not been rigorously tested. But it’s a very simple little tool, and was good enough for my purposes tonight!
Thanks for this, Dale. We use a product that has the annoying knack of reordering attributes in its config files when an upgrade is applied, so on an initial diff it looks like a lot of things have changed.
On running xmldiff however, we can see the important things that have changed (down to 3 lines different from about 80!).
One thing to note: I needed to pip install lxml before it would work. This was on a pretty much new install of OS X, so a clean Python install.
Another thank you from me, Dale! I was about to write something similar, but found your elegant and simple solution that did exactly what I was planning to do (and more, I did not plan to directly integrate the diff tool, but why not!). Cheers!
That was quite informative! I am trying to do something similar too. But the XML files I’m working on are 4-5 GB in size, so an entire XML file won’t fit into memory. Will this method work for them? Or do you have any ideas for how to implement it?
Hiya – Sorry, no, I didn’t do it in a streaming way. I was in a hurry so just read it and sorted it in memory.