This lesson is being piloted (Beta version)

Whole-array operations


Teaching: 30 min
Exercises: 30 min
  • How can I avoid looping when performing operations across arrays in Numpy?

  • What about when there are conditionals?

  • What about when there are complex array multiplications?

  • Understand the arithmetic operations available across whole arrays in Numpy

  • Be able to use masks and einsum to broaden the range of whole-array operations available

One-dimensional arrays

Let’s start off with an example with one-dimensional arrays, calculating the Euclidean distance between two vectors. We’d like to convert the following function to use Numpy in the most performant way possible:

def naive_dist(p, q):
    square_distance = 0
    for p_i, q_i in zip(p, q):
        square_distance += (p_i - q_i) ** 2
    return square_distance ** 0.5

Let’s test this to get a baseline performance measure:

p = [i for i in range(1000)]
q = [i + 2 for i in range(1000)]

%timeit naive_dist(p, q)

On my machine, this gives:

485 µs ± 365 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Now, we know that Numpy can do the subtraction and square operations element-wise all at once, rather than requiring an explicit loop. Let’s try that:

import numpy as np

def simple_numpy_dist(p, q):
    return (np.sum((p - q) ** 2)) ** 0.5

To test this, we’ll need to use a Numpy array rather than the lists we made for the previous test, since Python can’t subtract lists.

p = np.arange(1000)
q = np.arange(1000) + 2

%timeit simple_numpy_dist(p, q)
26.7 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We’re already 18 times better! (This will vary from machine-to-machine; the wider your CPU’s vector units, the better this will do.) But can we do better?

In general, it’s best to use the most specific operation that’s available. This is because the more specialised the function, the more of it is implemented in compiled C, and the less time is spent in the Python glue logic. Here we are calculating the difference between the two vectors, and then finding the length of the result. Finding the length of a vector is something you’d imagine would be provided by any good numerical library, and Numpy is no exception.

def numpy_norm_dist(p, q):
    return np.linalg.norm(p - q)

%timeit numpy_norm_dist(p, q)
18.4 µs ± 594 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Using the specialised function here has given a 31% improvement over the more generic operations.

Higher dimensionalities

For one-dimensional arrays, translating from naive to whole-array operations is normally quite direct. But when it comes to multi-dimensional arrays, some additional work may be needed to get everything into the right shape.

Let’s extend the previous example to work on multiple vectors at once. We would like to calculate the Euclidean distances between $M$ pairs of vectors, each of length $N$. In plain Python we could take this as a list of lists, and re-use the previous function for each vector in turn.

def naive_dists(ps, qs):
    return [naive_dist(p, q) for p, q in zip(ps, qs)]

We’ll need to generate some multi-dimensional test data here:

ps = [[i + 1000 * j for i in range(1000)] for j in range(1000)]
qs = [[i + 1000 * j + 2 for i in range(1000)] for j in range(1000)]

%timeit naive_dists(ps, qs)
497 ms ± 5.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Half a second per iteration is still not that long in the grand scheme of things, but if this were in an inner loop of a larger program, it could slow things down quite quickly.

Moving this to Numpy, we can subtract as we did previously, but for the summation, we need to be able to tell Numpy to leave us with a one-dimensional array of distances, rather than a single number. To do this, we pass the axis keyword argument, which tells Numpy which axis to sum over. In this case, axis 0 controls which vector we are selecting, and axis 1 controls which element of the vector. Thus here we only want to sum over axis 1, leaving axis 0 still representing the vector of sums.

def simple_numpy_dists(ps, qs):
    return np.sum((ps - qs) ** 2, axis=1) ** 0.5

axis is a keyword argument that is quite important when working with higher-dimensional arrays, so it’s worth taking some time to make sure that the values make sense.

To test the performance of this, we need to generate the same sample data we previously used. To do this, we’ll use Numpy’s reshape function, to turn a long one-dimensional array into a multidimensional one. This is something you will see used very frequently for quickly creating sample data. It splits the array into equal-length slices and stacks them, working left-to-right, top-to-bottom; for the general higher-dimensional case, the indices are filled left-to-right.

An illustration showing a long rainbow-coloured array being divided into four blocks, with the four blocks stacked top-to-bottom.

ps = np.arange(1_000_000).reshape((1000, 1000))
qs = np.arange(1_000_000).reshape((1000, 1000)) + 2

%timeit simple_numpy_dists(ps, qs)
6.7 ms ± 384 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Once again we’ve achieved almost a 100x speedup by switching over to Numpy. What about the norm function we tried previously? This also supports the axis keyword argument.

def numpy_norm_dists(ps, qs):
    return np.linalg.norm(ps - qs, axis=1)

%timeit numpy_norm_dists(ps, qs)

Timing this gives:

10.3 ms ± 318 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Unlike the 1-dimensional case, using the dedicated norm function provided by Numpy is slower here than the explicit computation! As you can see, it’s always important to test your improvements on data that resemble those that you will be performing your computation on, as the performance characteristics will change from case to case.

Numpy does have one more trick up its sleeve, however…


The problem with these more complex cases is that we’re having to use optional arguments to np.sum and np.linalg.norm. By necessity, Numpy can’t optimise all cases of these general functions as well as it would be able to optimise the specific case that we are interested in.

Numpy does give us a way to express an exact reduction that we would like to perform, and can execute it in a highly optimised way. The function that does this is np.einsum, which is short for “Einstein summation convention”. If you’re not familiar with this, it is a notation used in physics for abbreviating common expressions that have many summations in them.

np.einsum requires as arguments a string specifying the operations to be carried out, and then the array(s) that the operations will be carried out on. The string is formatted in a specific way: indices to look up (starting from i, j, k, …) are given for each array, separated by commas, followed by -> (representing an arrow), and then the indices for the resulting array are given. Any indices not on the right-hand side are summed over. So the operation:

C = np.einsum('ijk,jkl->ijl', A, B)

is equivalent to the matrix (or tensor) operation:

\[C_{ijl} = \sum_{k} A_{ijk} B_{jkl}\]

For example, 'ii->i' gives a one-dimensional array (or a vector) containing the diagonal elements of a two-dimensional array (or matrix). This is because the $i$th element of this vector contains the element $(i, i)$ of the matrix.

In this case, we want to calculate the array’s product with itself, summing along axis 1. This can be done via 'ij,ij->i', although the array will need to be given twice or this to work. Because of this, we’ll need to put the difference array into a variable before passing it into np.einsum. Putting this together:

def einsum_dists(ps, qs):
    difference = ps - qs
    return np.einsum('ij,ij->i', difference, difference) ** 0.5

Timing this:

1000 loops, best of 3: 1.7 msec per loop

Wow! At the expense of some readability, we’ve gained another factor of 2 in performance. Since the resulting line of code is significantly harder to read, and any errors in the np.einsum syntax are likely to be impenetrable, it is important to leave a comment explaining exactly what you are trying to do with this syntax when you use it. That way, if someone (yourself included) comes back to it in six months, they will have a better idea of what it does, and can double-check that it does indeed do this. And if there is a bug, you can check whether the bug is in the design (was the line trying to do the wrong thing) or in the implementation (was it trying to do the right thing, but the np.einsum syntax for that thing was implemented wrong).

np.einsum is best used sparingly, at the most performance-critical parts of your program. Used well, though, it can be a game-changer, since it can express many operations that would be difficult to express any other way without using explicit for loops.

Avoiding conditionals

Sometimes we would like to diverge between two code paths based on some condition. Typically we would do this with an if block. However, to do this on an element-by-element basis requires introducing a loop. Since Python loops are the enemy of performance when working with Numpy, we would like to find an alternative to loops + if blocks that still let us distinguish between different cases.

What we can do in these cases is to use masks. Masks take various forms, but in general they are arrays of boolean (true/false) values that can be used to identify which data to consider and which to discard from a calculation.

For example, consider the peaked data generated by:

import matplotlib.pyplot as plt

x_range = np.arange(-30, 30.1, 0.1)
y_range = np.arange(-30, 30.1, 0.1)
x_values, y_values = np.meshgrid(x_range, y_range, sparse=False)
peaked_function = (np.sin(x_values**2 + y_values**2) /
                   (x_values**2 + y_values**2) ** 0.25)

If we are only interested in the highest points, we can isolate these using the > sign as usual. This gives us back an array of boolean values that is the same shape as the input array:

peaked_function > 0.8
array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

Plotting this, we can see that the highest values are a ring around the centre of the distribution:

plt.imshow(peaked_function > 0.8)

This array is an example of a mask. We can use this as its own array; for example, if we want to know how many points of the plot are larger than 0.8, then we can use np.sum(peaked_function > 0.8).

Numpy also gives three ways of using this to perform calculations with the original array.


By multiplying a mask array with the input array, we suppress the unwanted elements of the original array, without changing its shape. For example, removing only the negative elements of the original array:

mask = peaked_function >= 0
plt.imshow(mask * peaked_function)

Fancy Indexing

Numpy also lets you use a mask array as an index. This will pick out only the array elements selected by the mask, discarding the unwanted ones. This gives a one-dimensional array rather than preserving the shape.

For example:

mask = peaked_function > 0.9
print('Values of peaked_function greater than 0.9:')

This can also be assigned to (i.e. used on the left-hand side of an =)! For example, another way to remove the negative elements would be:

peaked_function[peaked_function < 0] = 0

Numpy recognises that this kind of operation is something we will want to do frequently, and gives us a function for it, called np.where(). The above is equivalent to:

peaked_function = np.where(peaked_function < 0, 0, peaked_function)

The first argument is the condition, the second the value to assign if the condition is True, and the third the value to assign if it is False. All three of these can be arrays, provided they can be broadcast together (see the next episode for more details on that).

Masked arrays

Numpy also allows arrays with a built-in mask to be constructed, so that all function calls on these arrays ignore unwanted values (while still leaving the ignored data intact).

mask = peaked_function > 0.9
masked_peaks =, mask=~mask)

Note the ~mask! This is because the mask keyword expects the opposite convention to the one we have been using: True values are dropped, and False values are included. ~ is the logical NOT operator.

Performance comparison

The mean that we calculated using a masked arrays could equally well have been calculated using multiplication or fancy indexing. Implement these two methods, and compare their performance.


With fancy indexing:

mask = peaked_function > 0.9
343 µs ± 8.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

With masks:

masked_peaks =, mask=~mask)
2.59 ms ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With multiplication:

mask = peaked_function > 0.9
np.sum(mask * peaked_function) / np.sum(mask)
1.71 ms ± 7.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Calculating Pi

A common example problem in numerical computing is the Monte Carlo computation of $\pi$. The way this works is as follows.

Consider the unit circle, centered on the origin. Now, imagine firing bullets at random locations in the square from the origin to the point (1, 1). The area of that square that overlaps with the circle is $\frac{\pi}{4}$, while the area of the square is 1. So the proportion of bullets that will land in the circle is $\frac{\pi}{4}$.

So, by generating a large number of random points in the unit square, and counting the number that lie within the unit circle, we can find an estimate for $\frac{\pi}{4}$, and by extension, $\pi$.

A diagram illustrating the description above.

A plain Python function that would achieve this might look as follows:

from random import random

def naive_pi(number_of_samples):
    within_circle_count = 0

    for _ in range(number_of_samples):
        x = random()
        y = random()

        if x ** 2 + y ** 2 < 1:
            within_circle_count += 1

    return within_circle_count / number_of_samples * 4

Time this for a million iterations. Check that it gives the correct result for $\pi$.

Try and convert this to Numpy. Think about how you can utilise whole-array operations. Time your result and see how it compares to the plain Python case. How much faster can you get?

You’ll want to use np.random.random for your random numbers. Take a look at the documentation for that function to see what arguments it takes that will be helpful for this problem.


Two possible solutions:

import numpy as np

def numpy_pi_1(number_of_samples):
    # Generate all of the random numbers at once to avoid loops
    samples = np.random.random(size=(number_of_samples, 2))

    # Use the same np.einsum trick that we used in the previous example
    # Since we are comparing with 1, we don't need the square root
    squared_distances = np.einsum('ij,ij->i', samples, samples)

    # Identify all instances of a distance below 1
    # "Sum" the true elements to count them
    within_circle_count = np.sum(squared_distances < 1)

    return within_circle_count / number_of_samples * 4

def numpy_pi_2(number_of_samples):
    within_circle_count = 0

    xs = np.random.random(size=number_of_samples)
    ys = np.random.random(size=number_of_samples)

    r_squareds = xs ** 2 + ys ** 2

    within_circle_count = np.sum(r_squareds < 1)

    return within_circle_count / number_of_samples * 4

While these are competitive with each other in performance, which is the fastest depends on various factors. As always, test your own specific workloads and hardware setup to see how solutions perform on them.


Wikipedia states: The Mandelbrot set is the set of complex numbers \(c\) for which the function \(f_c(z) = z^2 + c\) does not diverge when iterated from \(z=0\) , i.e., for which the sequence $f_c(0)$, $f_{c}(f_{c}(0))$, etc., remains bounded in absolute value.

We can approximate the set numerically by iterating some large number of times and testing for exceeding some threshold. We can then plot the resulting set, including using colour for the number of iterations required to exceed the threshold.

An example of doing this with Numpy arrays but without whole-array operations would be (adapted from JF Puget’s examples:

def mandelbrot(c, maxiter):
    z = c
    for n in range(maxiter):
        if abs(z) > 2:
            return n
        z = z*z + c
    return 0

def mandelbrot_set(xmin, xmax, ymin, ymax, width, height, maxiter):
    real_range = np.linspace(xmin, xmax, width)
    imaginary_range = np.linspace(ymin, ymax, height)
    result = np.empty((width, height))
    for i in range(width):
        for j in range(height):
            result[i,j] = mandelbrot(real_range[i] + 
    return result

(If you’ve not complex numbers done in Python before, 1j is Python’s notation for the imaginary unit.)

You can plot an example view of the resulting fractal with:

plt.imshow(mandelbrot_set(-2, 0.5, -1.25, 1.25, 500, 500, 20).T)

Try to rewrite this using Numpy whole-array operations. How does the performance compare between the two versions?


def mandelbrot_wholearray(c, maxiter):
    output = np.zeros_like(c, dtype=np.int32)
    z = np.zeros_like(c, dtype=np.complex64)
    for n in range(maxiter):
        z = z*z + c
        done = np.abs(z) > 2.0
        c[done] = 0
        z[done] = 0
        output[done] = n
    return output

def mandelbrot_set_wholearray(xmin, xmax, ymin, ymax, width, height, maxiter):
    real_range = np.linspace(xmin, xmax, width)
    imaginary_range = np.linspace(ymin, ymax, height)
    real_parts, imaginary_parts = np.meshgrid(
        real_range, 1j * imaginary_range, sparse=False
    return mandelbrot_wholearray(real_parts + imaginary_parts, maxiter)

Multiple cores?

Many functions in Numpy will try to take advantage of multi-core parallelism in your machine. While this is better than ignoring parallelism completely, it isn’t perfect. In many cases there is no speedup from the parallel implementation; and many functions aren’t parallelised at all.

Something to beware is that if you try and take advantage of parallelism yourself via, for example, GNU Parallel, then Numpy might not be aware of this, and will try and use every CPU core you have available—for every copy of the program that is executing. While the parallelism may not have sped things up much, this will severely slow down all of your programs, since the processor is spending most of its time swapping between different threads rather than doing actual work.

If you do want multi-core parallelism out of your Numpy code, it’s better to do it more explicitly as we’ll explore in a later section.

To disable Numpy’s parallelism, you need to set the environment variable OMP_NUM_THREADS to 1 before launching your program. Within a Jupyter notebook, this can be done by placing the following line magic at the top of your notebook:


Key Points

  • Numpy will broadcast operations to all elements, and has many functions to implement commonly-performed tasks across arrays

  • Conditionals can be recast as array masks, arrays of true/false values

  • Use numpy.einsum for more complex array multiplications and reductions, but only where the performance boost is worth the reduced readability