Adventures in Python optimisation

A hobby of mine is converting Gutenberg transcriptions into nice ebooks for the Standard Ebooks project. Most recently I’ve been working on Pepys’s Diary which, in the edition I’m working with, is stuffed full of footnotes.

Because footnotes aren’t very meaningful in a format where pages shift, the default approach is to convert them to endnotes. That worked fine for the initial build, but plenty were missing from the Gutenberg transcription I’m using, which means that I’m having to go back in and insert new endnotes, causing all the following ones to be renumbered. Luckily for me, Standard Ebooks’ tooling includes the reorder-endnotes script, but it’s pretty slow. Running it against the current state of the Diary (~1600 endnotes) to increment every endnote by one takes on average 100+ seconds. That’s pretty painful, and was seriously making me consider upgrading my laptop. But first I wanted to see if there were any software optimisations I could make.

Standard Ebooks’ tooling is written in Python 3, a language I’m not that familiar with. Having said that, most optimisations are logic-based rather than language-specific, so the first thing I did was check the hot loops. The first part of the script only works on endnotes.xhtml, so I skipped that, as there were unlikely to be major speed optimisations for code that works on a single file. Looking at the second part of the code, though, I saw two uses of replace, and had a sneaking suspicion. Sure enough, looking at the replace documentation, there’s a count argument. We know that there’ll only be one instance of each endnote link in a file, so we can set that to 1 and retime the function. Unfortunately that only shaved a second and a half off the run. Nothing to sniff at, but I wanted more. Still, worth committing.
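
Roughly, the change looks like this — a hypothetical snippet rather than the script’s actual code — using str.replace()’s optional count argument:

```python
# Hypothetical illustration, not the real reorder-endnotes code: str.replace()
# takes an optional count, so if each endnote link appears exactly once per
# file we can stop scanning after the first match.
xhtml = '<p>So to bed.<a href="endnotes.xhtml#note-42" id="noteref-42">42</a></p>'

# Default behaviour keeps scanning the whole string for further occurrences...
renumbered = xhtml.replace("note-42", "note-43")

# ...whereas a count of 1 stops as soon as one replacement has been made.
renumbered = xhtml.replace("note-42", "note-43", 1)
```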

Next up, could we do anything about the loop? There were actually two loops in play: one looping over all the source files, and then another looping inside each file over the endnotes. As we’re opening each file individually and running the same inner loop, this looked like a potential case for parallelisation. Now, I’m no parallel computing expert, but Web Workers aren’t exactly difficult, so I had a look around in the Python docs for something similar and hit on concurrent.futures. The given examples made it look reasonably simple: import the library, create a ProcessPoolExecutor, then submit your function to the executor as the first thing inside your loop. Easy to test and easy to use, and as it turned out it nearly doubled the speed of the script on my dual-core machine.
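
A minimal sketch of the idea, assuming a hypothetical process_file() helper that does the per-file renumbering (the real script is structured differently):

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def process_file(path: Path, increment: int) -> None:
    # Stand-in for the per-file work: read the file, renumber its endnote
    # links, and write it back out.
    text = path.read_text()
    # ... renumber the endnote references in `text` here ...
    path.write_text(text)

def process_all(paths: list[Path], increment: int) -> None:
    # Each file is independent, so hand one job per file to a pool of worker
    # processes, then wait for them all to finish.
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(process_file, path, increment) for path in paths]
        for future in futures:
            future.result()  # surfaces any exception raised in a worker
```

Because ProcessPoolExecutor uses separate processes rather than threads, each file’s work runs on its own core, at the cost of pickling the arguments for each job — a trade-off that works out fine when every job is a self-contained file.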

Finally, what about the replacement loop? I realised that it’d loop over every possible endnote, searching the file to see what it could replace. It needs to do that to start with, but endnotes are in order within a file (and book). If it’s already started replacing them and then can’t find one, it by necessity won’t find any more in that file, and we can break out of the loop. To my surprise, adding a dirty flag and breaking the loop halved the time again. A 75% reduction in function time is nothing to be sniffed at, and I opened a PR to get that into the project.
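
In outline, the dirty flag works something like this (hypothetical names, and I’m assuming the numbers are processed highest-first so an increment never collides with a link that’s just been rewritten):

```python
def renumber_in_file(text: str, note_numbers: list[int], increment: int) -> str:
    replaced_any = False  # the "dirty" flag
    for number in sorted(note_numbers, reverse=True):
        old = f"#note-{number}"
        new = f"#note-{number + increment}"
        if old in text:
            text = text.replace(old, new, 1)
            replaced_any = True
        elif replaced_any:
            # Endnotes are contiguous within a file, so once we've started
            # replacing and then miss one, none of the remaining numbers can
            # appear in this file either.
            break
    return text
```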

Now that those were merged into master, I carried on working on Pepys. A massive improvement in speed, obviously, but as more and more endnotes were added the script started slowing down again. What had been 16s to run was now up to 22s; still an improvement, but it left me wondering if there were any more gains to be had. Time to dive into the code again.

Looking back at the replacement loop again, I tried disabling each of the three lines. Turning off the two replacement lines led to a small improvement in speed, but disabling the line with the regular expression dropped the processing time from 22s to 5s! Could we do anything about that code? I was wondering about caching of regular expressions, but it looks like Python does that automatically. The problem, though, is that it still needs to build a matching regex and a replacement regex for each endnote: at this point over 3000 of them. It turns out that there’s a count argument for regex substitution as well, but this didn’t help much. Was it possible to simplify the regex a bit? Negative sets are pretty painful from a performance point of view, and we were just looking for an a element with attributes. Replacing the negative search for a > with a positive search for a space (\s) knocked a second off, but I wanted more. In the end I managed to replace the regular expression completely with a normal string replacement, and sure enough the performance matches that of the other string replacements: the overall function with ~1600 endnotes now takes around 5s.
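
As an illustration — with made-up patterns rather than the script’s actual regexes — the shape of the change was roughly:

```python
import re

text = '<a href="endnotes.xhtml#note-42" id="noteref-42" epub:type="noteref">42</a>'

# Regex version: a pattern and a substitution string built per endnote, and
# the negative set forces the engine to test character by character.
with_regex = re.sub(r'(<a[^>]+?href="endnotes\.xhtml)#note-42"', r'\1#note-43"', text, count=1)

# Plain-string version: if the link is predictable, an ordinary replace does
# the same job here with none of the regex machinery.
with_replace = text.replace('href="endnotes.xhtml#note-42"', 'href="endnotes.xhtml#note-43"', 1)
```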

Overall I’m super happy with a twenty-times increase in speed. It’ll save me many hours of CPU time over the course of finishing Pepys’s Diary, and it’s given me a better insight into Python performance. It’s probably worth pointing out that I’m the outlier here: the usual process of making it work, making it correct and making it fast had stopped at correct, purely because for most books (with maybe 10-20 endnotes) it was fast enough already. I’m just trying to build a book with many thousands of endnotes on a six-year-old ultrabook, and if I’m able to extract another year or two out of it I’m more than happy to spend a couple of hours on software optimisation.