Numba for automatic optimisation
Overview
Teaching: 50 min
Exercises: 25 min
Questions
How can I get performance close to that of machine-code from my hand-written Python code?
How can I target GPUs from Python?
Objectives
We know that due to various design decisions in Python, programs written using pure Python operations are slow compared to equivalent code written in a compiled language. We have seen that Numpy provides a lot of operations written in compiled languages, which we can use to escape the performance overhead of pure Python. However, sometimes we do still need to write our own routines from scratch. This is where Numba comes in.

Numba provides a just-in-time (JIT) compiler. If you have used languages like Java, you may be familiar with this idea. While Python can’t easily be compiled ahead of time in the way languages like C and Fortran are, due to its flexible type system, what we can do is compile a function for a given data type once we know what type it is being given. Subsequent calls to the same function with the same type make use of the machine code that was generated the first time. This adds a significant overhead to the first run of a function, since this compilation takes longer than the less optimised translation that Python performs when it runs a function; however, subsequent calls to that function are generally significantly faster. If a different type is supplied later, the function can be compiled a second time for that type.
Numba makes extensive use of a piece of Python syntax known as “decorators”. Decorators are labels or tags placed before function definitions and prefixed with @; they modify function definitions, giving them some extra behaviour or properties.
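For example, here is a minimal sketch of a decorator written in plain Python (shout is a made-up name, purely for illustration); it wraps a function so that every call announces itself:

def shout(func):
    # A decorator is a function that takes a function and returns a new one
    def wrapper(*args, **kwargs):
        print("Calling", func.__name__)
        return func(*args, **kwargs)
    return wrapper

@shout
def greet():
    print("Hello")

greet()  # prints "Calling greet", then "Hello"

Numba’s decorators use the same syntax; the extra behaviour they attach is compilation.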
First example of Numba
(Adapted from the 5-minute introduction to Numba.)
from numba import jit
import numpy as np

@jit(nopython=True)
def a_plus_tr_tanh_a(a):
    # Sum tanh over the diagonal of a, then add that scalar to every element
    trace = 0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace
Some things to note about this function:
- The decorator @jit(nopython=True) tells Numba to compile this code in “no Python” mode (i.e. if it can’t work out how to compile this function entirely to machine code, it should give an error rather than partially using Python).
- The function accepts a Numpy array; Numba performs better with Numpy arrays than with e.g. Pandas dataframes or objects from other libraries.
- The array is operated on with Numpy functions (np.tanh) and broadcast operations (+), rather than arbitrary library functions that Numba doesn’t know about.
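As an aside, if you want to check what Numba has actually compiled, the decorated function keeps a record of the type signatures it has been compiled for. A quick sketch (run in the same directory as first_numba.py):

import numpy as np
from first_numba import a_plus_tr_tanh_a

a = np.arange(10000).reshape((100, 100))
a_plus_tr_tanh_a(a)                 # the first call triggers compilation
print(a_plus_tr_tanh_a.signatures)  # the list of signatures compiled so far

Calling the function again with, say, an array of floats would add a second entry to this list.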
To time this, it’s important to run the function once during the
setup step, so that it gets compiled before we start trying to time
its run time. Save the script as first_numba.py
and try the
following:
$ python -m timeit --setup='import numpy as np; \
from first_numba import a_plus_tr_tanh_a; \
a = np.arange(10000).reshape((100, 100)); \
a_plus_tr_tanh_a(a)' \
'a_plus_tr_tanh_a(a)'
100000 loops, best of 5: 3.2 usec per loop
How does this compare with the naive version? Commenting out the @jit decorator in first_numba.py and re-running the same timing command gives:
2000 loops, best of 5: 133 usec per loop
This is a 42x speedup. Not quite as fast as we got in some of our Numpy optimisations, but in the same ballpark.
It might be possible to rearrange this function so that it uses pure Numpy operations throughout rather than a regular Python loop, but in many cases it either isn’t possible or significantly reduces the readability of the code. In these cases, Numba can provide an alternative route to a comparable level of performance, with a lot less work, and more readable code at the end of it.
Getting parallel
The @jit decorator accepts a relatively wide range of parameters. One is parallel, which tells Numba to try to optimise the function to run with multiple threads. As before, we need to control the thread count at run time; this time it is done with the NUMBA_NUM_THREADS environment variable.
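For example, the edited function might look like the following sketch. With parallel=True, Numba will try to parallelise plain range loops automatically; using numba.prange instead marks the loop as explicitly safe to run in parallel, which is the form the Numba documentation recommends:

from numba import jit, prange
import numpy as np

@jit(nopython=True, parallel=True)
def a_plus_tr_tanh_a(a):
    trace = 0.0
    # prange splits the iterations across threads; Numba recognises the +=
    # as a reduction and combines the per-thread partial sums at the end
    for i in prange(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace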
Editing first_numba.py
to add the parallel=True
option to the
@jit
decorator and re-running with two threads:
$ NUMBA_NUM_THREADS=2 python -m timeit --setup='import numpy as np; \
from first_numba import a_plus_tr_tanh_a; \
a = np.arange(10000).reshape((100, 100)); \
a_plus_tr_tanh_a(a)' \
'a_plus_tr_tanh_a(a)'
20000 loops, best of 5: 16.5 usec per loop
Parallelism has successfully multiplied our run time by a factor of five. This is because when running in parallel, Numba (in fact, the OpenMP runtime) needs to spin up a team of threads to run the code, and then keep them synchronised. This takes a finite amount of time, so on a very small function like the one we’ve run here it costs more than the time saved by running in parallel.
Re-running this with a larger matrix size:
$ for NUM in 1 2 3 4 6 8 10
> do
> NUMBA_NUM_THREADS=${NUM} python -m timeit --setup='import numpy as np; \
from first_numba import a_plus_tr_tanh_a; \
a = np.arange(1000000).reshape((1000, 1000)); \
a_plus_tr_tanh_a(a)' \
'a_plus_tr_tanh_a(a)'
> done
500 loops, best of 5: 644 usec per loop
500 loops, best of 5: 596 usec per loop
500 loops, best of 5: 266 usec per loop
1000 loops, best of 5: 323 usec per loop
1000 loops, best of 5: 218 usec per loop
2000 loops, best of 5: 160 usec per loop
2000 loops, best of 5: 151 usec per loop
It’s important to check the case with no parallelisation first, and then a range of thread counts, to see whether parallelisation gives any performance boost at all, and to get an idea of where the optimal performance lies.
In this case, 10 threads gives the best time per call to the function, but in terms of CPU cost it is quite expensive: 151 microseconds times 10 CPU cores is 1510 core-microseconds per call, compared to 644 for the serial version. The 3-core version, however, at 798 core-microseconds, is only a little less efficient than the serial version, and is still nearly two and a half times faster (compared to the 10-core version’s slightly better than four times).
Universal functions
(Adapted from the Scipy 2017 Numba tutorial)
Recall how in Numpy there are functions that act on whole arrays,
element-wise. These are known as “universal functions”, or “ufuncs”
for short. Numpy has the facility for you to define your own ufuncs,
but it is quite difficult to use. Numba makes this much easier with
the @vectorize
decorator. With this, you are able to write a
function that takes individual elements, and have it extend to operate
element-wise across entire arrays.
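As a reminder, np.sqrt is one such built-in ufunc:

import numpy as np

a = np.array([1.0, 4.0, 9.0])
np.sqrt(a)  # array([1., 2., 3.]) -- the function is applied element-wise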
For example, consider the (relatively arbitrary) trigonometric function:
import math

def trig(a, b):
    return math.sin(a ** 2) * math.exp(b)
Let’s put this function in a file, trig.py.
If we try calling this function on a Numpy array, we correctly get an
error, since the math
library doesn’t know about Numpy arrays, only
single numbers.
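You can check this for yourself; the exact message varies between Numpy versions, but the call fails along these lines:

import numpy as np
from trig import trig

a = np.ones((5, 5))
trig(a, a)
# TypeError: only size-1 arrays can be converted to Python scalars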
However, if we use Numba to “vectorize” this function, then it becomes a ufunc, and will work on Numpy arrays!
import numpy as np
from numba import vectorize
from trig import trig
a = np.ones((5, 5))
b = np.ones((5, 5))
vec_trig = vectorize()(trig)
vec_trig(a, b)
How does the performance compare with using the equivalent Numpy
whole-array operation? Adding the following lines to trig.py
:
import numpy as np
from numba import vectorize

vec_trig = vectorize()(trig)

def numpy_trig(a, b):
    return np.sin(a ** 2) * np.exp(b)
$ python -m timeit --setup='import numpy as np; \
import trig; \
a = np.random.random((1000, 1000)); \
b = np.random.random((1000, 1000))' \
'trig.vec_trig(a, b)'
$ OMP_NUM_THREADS=1 python -m timeit --setup='import numpy as np; \
import trig; \
a = np.random.random((1000, 1000)); \
b = np.random.random((1000, 1000))' \
'trig.numpy_trig(a, b)'
Numba: 10 loops, best of 5: 31.2 msec per loop
Numpy: 50 loops, best of 5: 5.17 msec per loop
So Numba isn’t quite competitive with Numpy in this case. But Numba still has more to give here: notice that we’ve forced Numpy to only use a single core. What happens if we use ten cores with Numpy?
$ OMP_NUM_THREADS=10 python -m timeit --setup='import numpy as np; \
import trig; \
a = np.random.random((1000, 1000)); \
b = np.random.random((1000, 1000))' \
'trig.numpy_trig(a, b)'
50 loops, best of 5: 5.56 msec per loop
Numpy doesn’t parallelise this particular operation very well. But
Numba can also parallelise. If we alter our call to vectorize
, we
can pass the keyword argument target='parallel'
. However, when we do
this, we also need to tell Numba in advance what kind of variables it
will work on—unlike with @jit
, it can’t work this out and then
parallelise. So our vectorize
call becomes:
vec_trig = vectorize('float64(float64, float64)', target='parallel')(trig)
This tells Numba that the function accepts two variables of type
float64
(8-byte floats, also known as “double precision”), and
returns a single float64
. Adding this into trig.py
and re-running
the timing gives:
$ NUMBA_NUM_THREADS=10 python -m timeit --setup='import numpy as np; \
import trig; \
a = np.random.random((1000, 1000)); \
b = np.random.random((1000, 1000))' \
'trig.vec_trig(a, b)'
50 loops, best of 5: 4.26 msec per loop
This has used 42.6 core-milliseconds across the ten threads, compared with 31.2 core-milliseconds for the serial Numba version: around 73% parallel efficiency, which is pretty good!
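As an aside, vectorize will also accept a list of signatures, so that a single ufunc can handle several input types; Numba picks the matching compiled version at call time. A sketch:

vec_trig = vectorize(['float32(float32, float32)',
                      'float64(float64, float64)'],
                     target='parallel')(trig)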
How parallel can this go?
The previous example parallelises almost linearly to 10 cores. We have 40 cores available in a node, so can Numba use all of these efficiently?
We’ve been running on a 10-core interactive session, so will need to submit a job to test 40 cores.
Write a job script to submit the above timeit command to run on 40 cores on a single node. (You can request 1 node in --exclusive mode to get this.) How does the timing compare with 10 cores?
Solution
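One possible job script, as a sketch (this assumes a Slurm scheduler, which is what the --exclusive flag suggests; the exact header lines, partition names, and module loads will vary between clusters):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --time=00:10:00

NUMBA_NUM_THREADS=40 python -m timeit --setup='import numpy as np; \
import trig; \
a = np.random.random((1000, 1000)); \
b = np.random.random((1000, 1000))' \
'trig.vec_trig(a, b)'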
I see:
200 loops, best of 5: 1.32 msec per loop
Compared to 31.2 ms on one core, this is 59% efficient, so not perfect, but by no means bad. It’s certainly better than Numpy.
Another ufunc
Try creating a ufunc to calculate the discriminant of a quadratic equation, $\Delta = b^2 - 4ac$. (For now, make it a serial function.)
Compare the timings with using Numpy whole-array operations in serial. Do you see the results you might expect?
Solution
from numba import vectorize

@vectorize
def discriminant(a, b, c):
    return b ** 2 - 4 * a * c
Timing this gives me 3.73 microseconds, whereas the equivalent b ** 2 - 4 * a * c Numpy whole-array expression takes 13.4 microseconds, almost four times as long. This is because each of the Numpy arithmetic operations needs to create a temporary array to hold its results, whereas the Numba ufunc creates only the single output array, holding the intermediate values one element at a time.
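To make that concrete, the Numpy expression is evaluated roughly like this behind the scenes (a sketch, not how you would write it):

# Each step allocates a full-size temporary array
t1 = b ** 2        # temporary
t2 = 4 * a         # temporary
t3 = t2 * c        # temporary
result = t1 - t3   # the final output array

The ufunc instead makes a single pass over the elements, keeping the equivalents of t1 to t3 as scalars.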
Programming GPUs
We don’t have time to look at this in detail, but an example of how GPUs can be programmed with Numba:
from numba import vectorize

@vectorize(['int64(int64, int64)'], target='cuda')
def add_ufunc(x, y):
    return x + y
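Calling this looks just like calling any other ufunc; with plain Numpy arrays as input, Numba copies the data to the GPU, runs the addition there, and copies the result back. (A sketch; it needs a CUDA-capable GPU to run, and for repeated calls, transferring the arrays once with numba.cuda.to_device avoids the repeated copies.)

import numpy as np

x = np.arange(10, dtype=np.int64)
y = 2 * x
add_ufunc(x, y)  # computed on the GPU; returns the element-wise sum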
More information on programming your GPU with Numba can be found at this tutorial given at the GPU Technology Conference 2018.
Key Points
- Use the @jit decorator to just-in-time compile Python functions with Numba
- Pay attention to the kinds of object and operation you use in your functions, to allow Numba to optimise them