Slow numerical Python? Numba to the rescue!
It is no secret that Python is a slow language, and that the only reason it is viable for numerical computing is the low-level implementation of its numerical routines (thanks, numpy people).
However, it sometimes happens that it is not easy (or even possible) to write your algorithm in a numpy-compatible form. In such cases, when you need a lot of loops and conditions, it is very easy to get lost. Should I write some C/C++ and interface it with Python (terrible idea)? Should I write some low-level code in Cython (Cython is awesome, but needs some setup work at the beginning)? Should I wait for a compatible JIT Python system like Pyston (I have high expectations for this project, but it is way too early, and it does not even support Python 3…)?
A possible solution comes from Numba, a NumPy-aware compiler for numerically-oriented Python code.
The Algorithm
A relevant case happened to me when I was implementing the simple and elegant image search algorithm based on integral images published at ICLR 2016.
The basic idea is simple: you take a CNN feature map (for instance pool5 of the VGG16 network), which has shape \((N_{filters}, H, W)\). Given a region, its descriptor is obtained by summing the feature map over the spatial dimensions within that region, giving an \(N_{filters}\)-dimensional vector that can be compared directly.
Making the integral image out of it is relatively easy:
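For instance (the function name here is my own, assuming the feature map is a NumPy array of shape \((N_{filters}, H, W)\)), two cumulative sums do the job:

```python
import numpy as np

def make_integral_image(feature_map):
    """Integral image of a (N_filters, H, W) feature map.

    After cumsum along H then W, entry (f, i, j) holds the sum of
    feature_map[f, :i+1, :j+1].
    """
    return feature_map.cumsum(axis=1).cumsum(axis=2)
```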
Then, a function giving back the normalized sum of a region from the integral image is:
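A sketch of such a function (names and the L2 normalization are my own choices), using the usual inclusion-exclusion lookup on the integral image:

```python
import numpy as np

def region_descriptor(integral, y0, x0, y1, x1):
    """L2-normalized sum over the region [y0:y1, x0:x1) for every filter.

    `integral` is a (N_filters, H, W) integral image; the region sum is
    recovered in O(1) by inclusion-exclusion on its four corners.
    """
    s = integral[:, y1 - 1, x1 - 1].copy()
    if y0 > 0:
        s -= integral[:, y0 - 1, x1 - 1]
    if x0 > 0:
        s -= integral[:, y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0:
        s += integral[:, y0 - 1, x0 - 1]
    norm = np.sqrt((s * s).sum())
    return s / norm if norm > 0 else s
```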
Note: here, for simplicity, I just consider the sum of the region computed directly from the integral image. The original algorithm uses a variation that allows a max-pool approximation instead of a sum-pool.
Now, finding the best matching region in your image given a query vector is relatively straightforward:
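An exhaustive version might look like this (a sketch with my own names: every rectangular window is scored by the dot product between the query and the normalized region sum):

```python
import numpy as np

def region_sum(integral, y0, x0, y1, x1):
    """Sum over [y0:y1, x0:x1) for each filter, via inclusion-exclusion."""
    s = integral[:, y1 - 1, x1 - 1].copy()
    if y0 > 0:
        s -= integral[:, y0 - 1, x1 - 1]
    if x0 > 0:
        s -= integral[:, y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0:
        s += integral[:, y0 - 1, x0 - 1]
    return s

def search_one_integral_image(query, integral):
    """Return (best score, best box) over every rectangular window."""
    _, H, W = integral.shape
    best_score, best_box = -np.inf, None
    for y0 in range(H):
        for y1 in range(y0 + 1, H + 1):
            for x0 in range(W):
                for x1 in range(x0 + 1, W + 1):
                    d = region_sum(integral, y0, x0, y1, x1)
                    n = np.sqrt((d * d).sum())
                    if n > 0:
                        d = d / n
                    score = float(query @ d)
                    if score > best_score:
                        best_score, best_box = score, (y0, x0, y1, x1)
    return best_score, best_box
```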
Note: again, this is a huge simplification of the original algorithm, which uses a branching strategy to avoid evaluating all possible windows as we are doing here.
Numba magic
So the previous implementation is not slow, but it is definitely limiting if you want to evaluate many potential images against your query vector: it takes roughly 90ms per image. But here comes the Numba magic: just adding a function annotation telling it to compile the region-sum function already brings the timing down to 61.5ms.
And then, adding the same decoration to search_one_integral_image brings it down to 16ms, more than 5X faster with two lines!
Additionally, the call to search_one_integral_image now releases the Global Interpreter Lock, which allows for easier parallelism. For instance, the following code can naturally use more than one CPU, which is sometimes a challenge in numerically-heavy Python:
All in all, if you have never used Numba, give it a try: it does not solve everything, but it can be very practical in many situations.