At work, I had the need to get a biggish (65MB) XML file into a DB and out again. To check that this all worked, I wanted to compare the two xml files. For this comparison, it didn’t matter in which order the elements and attributes were, just that all of them were there in the same way as before. The main problem was that the XML file was ~65MB big.
Some xml diff programs I found on the net took more than a night (didn’t wait for them to finish). Stackoverflow had some ideas how to diff xml files and ignore attribute order, but not element order. Given that I didn’t want a readable diff, just some confirmation that the file content was the same, I settled for this pipeline:
- Make a canonical XML all attributes are in the same order
- Pretty print it so there is one element per line and the indention is the same
- Sort it
- Diff the sorted xml
Here is a makefile snippet to do it and show the top 30 lines of the diff:
# Prepare the original file
tmp/orig_sorted.xml: data/orig.xml
mkdir -p tmp/
xmllint --exc-c14n data/orig.xml | xmllint --format /dev/stdin | sort > tmp/orig_sorted.xml
.PHONY: test-export
test-export: tmp/orig_sorted.xml
do-export -o exports/whatever.xml
xmllint --exc-c14n exports/whatever.xml | xmllint --format /dev/stdin | sort > tmp/whatever_sorted.xml
diff -u --speed-large-files tmp/orig_sorted.xml tmp/whatever_sorted.xml > tmp/diff.txt || true
head -n 30 tmp/diff.txt