Profiling to identify bottlenecks
Overview
Teaching: 20 min
Exercises: 15 min

Questions
Why is it important to know which parts of my software run slowly?
How do I find out where to target optimisation effort?
Objectives
Be able to identify what functions and lines your program spends the most time on
Use this information to guide approaches to optimisation
For the remainder of this workshop, we will be looking at ways to dig deep into our Python code and adjust and tune it so that it goes faster. Many of these methods take some amount of time to implement, and may make the source code more difficult to read or maintain, so it is important that we target optimisation where it is going to make a difference—both so that we don’t waste our time, and so that we don’t make our software harder to maintain in future.
For example, if you are working with a piece of simulation software that performs a simulation and then plots the output, then the simulation is probably going to take a lot longer than the plot. For the sake of argument, say that the simulation is 99% of the time, and the plot is the remaining 1%. No matter how much effort we spend optimising the performance of the plotting section, even if we make it run in no time whatsoever, we can only reduce the run time of the software as a whole by 1%. Instead, a moderate optimisation of the simulation part that decreased its run time by only 10% would decrease the run time of the program as a whole by 9.9%—almost 10 times better than the really difficult plotting optimisation.
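As a sanity check, here is the arithmetic behind those numbers as a short Python sketch (the 99%/1% split is the hypothetical one from the example above):

simulation, plotting = 0.99, 0.01

# Eliminating the plotting step entirely shaves only 1% off the total run time
print(f"plot eliminated: {1 - simulation:.1%} faster")

# A modest 10% speed-up of the simulation shaves off almost ten times as much
print(f"simulation 10% faster: {1 - (0.9 * simulation + plotting):.1%} faster")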
Because of this, it’s essential to know exactly where your software is spending its time before you start spending time on optimising it. Now, depending on your relationship with the software you are trying to optimise, you may have some idea of where the bottlenecks are, and where the software spends the bulk of its time. If you wrote the software from scratch, for example, you might have some strong ideas about where to start looking. However, software has a nasty tendency to belie our expectations about performance. Even domain experts will swear up and down that their software spends over 90% of its time in function X, only to be confronted with hard data showing the program spending over 30% of its time in function Y, in an entirely separate bit of code. This could be for any number of reasons—perhaps the knowledge is based on a different generation of computer, with a different balance of performance between its internal components, or perhaps it was based on a previous version of the software, before an optimisation was found that trebled the speed of the function that previously dominated.
No matter what you already know, it’s crucial to have some good data before starting optimisation. We need to use tools to help with this. A piece of software that tells us where our software is spending its time is called a profiler. A wide range of profilers is available on the market, some free and some commercial, and each with its own strengths and weaknesses.
Using `cProfile` at the command line
Fortunately, Python comes with a profiler built in, which gives us more than enough information to get us started optimising our software. `cProfile` is a built-in Python module which can be run like a program. It sits between the Python interpreter and your Python program, intercepting each operation that Python does, and counting it up for later analysis. Since it is a C extension (i.e. it is compiled, not interpreted), it can do this while adding a minimum of overhead to your Python programs.
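Although in this episode we will run it from the command line, `cProfile` can also be called from inside a Python session; here is a minimal sketch (the statement being profiled and the file name profile.out are arbitrary examples):

import cProfile
import pstats

# Profile a single statement and save the results to a file
cProfile.run("sum(i ** 2 for i in range(100000))", "profile.out")

# Load the saved profile, sort by internal time, and print the ten biggest entries
pstats.Stats("profile.out").sort_stats("time").print_stats(10)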
Let’s start by looking at `mc.py` from the code package you copied earlier. This performs a small Monte Carlo simulation of a single variable, saving the generated Markov chain to a file, as well as some metadata on what happened at each step. To run it without profiling, type:
$ python mc.py 0 1.0 0.1 1000000 mc.dat
This tells the program to start at zero, use a temperature of 1.0, a step size of 0.1, to perform a million iterations, and to write the resulting chain to the file `mc.dat`.
Choosing your problem size
It’s important that the test that you profile runs long enough that it is not dominated by the time it takes to start the Python interpreter and load all of the libraries needed to run your program. Conversely, it’s important for it not to run for so long that you have to walk away and get a coffee each time you make a change, since you’ll want to re-run your profiling regularly as you change your program. The iteration count of 1,000,000 here is large enough that the program takes a few seconds to run. This is longer than most overheads, but short enough not to be bothersome.
In your own applications, you may not be able to get the run time this low, but you want to aim for only a couple of minutes. If your current test case takes an hour to run, then think about assembling a smaller test case; while this will take some of your time, you would lose a lot more time waiting for your long test case to run every time you need an updated profile.
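If you’re not sure how long your test case currently takes, the Unix `time` command will tell you; for example:

$ time python mc.py 0 1.0 0.1 1000000 mc.dat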
To use `cProfile` to generate a profile for this program, add `-m cProfile` after `python` and before the program name. (Note the capital P in `cProfile`; without it, you will get an error.)
$ python -m cProfile mc.py 0 1.0 0.1 1000000 mc.dat
This will output a long list of functions that are called during the program’s execution, most of which you never see because they are internal to Python. Worse, most of the functions take no time to run at all, or at least less than shows up in the output from `cProfile`.
We’d like to get an overview of the top hotspots where the program is spending its time. To do this, we’ll tell `cProfile` to sort by time; for good measure, we’ll also use the `head` command to only show the first twenty lines of output.
$ python -m cProfile -s time mc.py 0 1.0 0.1 1000000 mc.dat 2> /dev/null | head -n 20
What the head?

If the `| head` syntax is not familiar, then you may want to take a look at the Pipes and Filters episode of the Software Carpentry tutorial on the Unix Shell. For now, you can take it on trust that adding `| head -n 20` to the end of a command will make it output only the top 20 lines.

The `2> /dev/null` addition discards any error output the command produces. `cProfile` by default doesn’t behave nicely with `head` (it raises an error when `head` closes the pipe after 20 lines), so we tell the shell to suppress the error output.
You’ll now see a one-line report of what the program did (specifically, call about 7.5 million functions in a few seconds), and then a list of functions that got called in descending order of time taken.
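Each row of this table uses `cProfile`’s standard columns:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)

Here `ncalls` is the number of times the function was called, `tottime` is the time spent in the function itself (excluding functions it calls), `cumtime` is the total time including sub-calls, and each `percall` column is the time to its left divided by `ncalls`.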
Taking a cursory look through these, the first call is the Metropolis update algorithm, which is where we might expect a Metropolis Monte Carlo program to spend its time. The second-longest time, however, is spent in the `writerow` method of `_csv.writer` objects. This program does indeed write each line of output to disk using a CSV writer—apparently this is quite expensive. Since this isn’t the core activity that this program is supposed to be doing, and it is taking a significant chunk of the runtime, it might be a good target for optimisation.
Saving time on writes
How might one go about improving the performance of writing to CSV in this kind of program? Discuss your thoughts in groups of two or three.
If you’re confident you’ve found a solution that will work, try implementing it in this program.
Solution
The expense in going to disk is that the file has to be locked and unlocked safely each time that it is written to. The volume of data isn’t a problem (this program only writes 22MB of data, which should take a lot less than two seconds); the problem is the number of individual writes (1,000,001).
To reduce this number, one option is to group writes together, so that the program keeps successive results in memory, only outputting to disk every 100 or 1000 elements. This reduces the number of calls to the `writerow` method by a factor of 100 or 1000, which should make things run more snappily.
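Below is a minimal sketch of this batching approach (it is not mc.py’s actual code, and the row contents here are placeholder data):

import csv
import random

N_STEPS = 1_000_000
BUFFER_SIZE = 1000  # accumulate this many rows in memory before writing

with open("mc.dat", "w", newline="") as f:
    writer = csv.writer(f)
    buffer = []
    for step in range(N_STEPS):
        # Stand-in for one Monte Carlo update producing one row of output
        buffer.append((step, random.random()))
        if len(buffer) == BUFFER_SIZE:
            writer.writerows(buffer)  # one call writes 1000 rows at once
            buffer.clear()
    writer.writerows(buffer)  # write any leftover rows at the end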
Other sort options

Other options for sorting the `cProfile` output are:

`calls` (call count)
`cumulative` (cumulative time)
`cumtime` (cumulative time)
`file` (file name)
`filename` (file name)
`module` (file name)
`ncalls` (call count)
`pcalls` (primitive call count)
`line` (line number)
`name` (function name)
`nfl` (name/file/line)
`stdname` (standard name)
`time` (internal time)
`tottime` (internal time)

Try profiling `mc.py` again using the `cumtime` sort option. What is the difference between this and the `time` sort option we previously used?
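The command is the same as before, with only the sort option changed:

$ python -m cProfile -s cumtime mc.py 0 1.0 0.1 1000000 mc.dat 2> /dev/null | head -n 20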
Solution

`cumtime` includes the time taken within function calls, while `time` only includes the time taken within the function body itself. This means that with `cumtime`, the main program will always take the longest, and this time will be equal to the time that the program executed for, whereas any function might appear at the top of the list for `time`.
Viewing `cProfile` results graphically
Frequently when displaying results as plain text, or in a table, we don’t see the full picture; for this reason, we tend to utilise graphs and other visualisations when viewing the results of computations. Profiling works exactly the same way—while the text-mode profile is very useful in itself, we can get more information out of the profile by using a visualisation tool.
While `cProfile` doesn’t itself include any visualisation tooling, it will save its output into a file so that other tools can display it.
SnakeViz is one tool that we can use to look at the profile data from a different angle. To start, re-run the program, but now save the profile to a file rather than outputting it to the display.
$ python -m cProfile -o mc.prof mc.py 0 1.0 0.1 1000000 mc.dat
This will create a file called `mc.prof`, containing the profiling data. Now, since displaying graphics from the cluster on your own machine isn’t always easy, we’ll instead copy the profile to our local machine to view it. This can be done with FileZilla, or at a Bash prompt.
To do this at the shell, open a new terminal (running on your own machine), and run the command:
$ # This runs on your computer, not on the supercomputer
$ scp s.your.username@sunbird.swansea.ac.uk:high-performance-python-examples/mc.prof ~/Desktop/
Now we can install SnakeViz and visualise the profile:
$ # This should also happen on your own computer
$ pip install snakeviz
$ snakeviz ~/Desktop/mc.prof
This should automatically open a web browser window with the profile result. If the window doesn’t open, the terminal output will indicate where you can point your browser to load the result. Once the result is loaded, you can press Ctrl+C in the terminal window to stop SnakeViz.
The results show the same data that was in the full profile we saw initially, but now it is organised so that it is much easier to read which times include each other. You can also see what sort of time the many tiny function calls in the original profile add up to. For this small example, the overhead of starting a Python interpreter and loading the required modules is still quite large compared to the total run time, so if we wanted to run many such examples, we would gain speed by using Pathos rather than GNU Parallel—see this afternoon’s session.
Using `timeit` for quick and dirty timings
Sometimes you don’t need a full profile; all you’re interested in is how long one particular operation takes. While you could use the Unix `time` command to get this for you, Python provides a more precise alternative. The `timeit` module will run a particular expression or line of code repeatedly, to get a good estimate of how long it takes. This removes the overhead of starting Python from the problem.
To use `timeit` at the command line, use `python -m timeit` followed by the expression you want to time. For example,
$ python -m timeit 'x = [i ** 2 for i in range(1000)]'
will measure how long it takes to calculate a particular list comprehension.
List comprehension?
If you’ve not encountered a list comprehension before, this syntax provides a one-line way to construct a list. The result is equivalent to the loop:
x = []
for i in range(1000):
    x.append(i ** 2)
However, the list comprehension will normally be quicker.
The output will tell you how many times Python ran the code to get an average:
1000 loops, best of 3: 265 usec per loop
If you want to test an operation that involves a library function, you can import that library outside of the timed region. (Since normally you load libraries once at the start of a program, but use a function a lot, you are not normally interested in the load time of the library.) To do this, pass the operations to run before the timed region with the `--setup` option. For example, to look at the `metropolis` function from the Monte Carlo code above:
$ python -m timeit --setup 'import mc' 'mc.metropolis(1.0, 0.1, 1.0)'
Comparing the loop with the list comprehension
We asserted above that the list comprehension would be quicker than the loop. But at the top of this episode we said that we should test our assertions about performance, not rely on common wisdom. Check how the run time of the loop version compares with that of the list comprehension.
Solution
On my computer, the list comprehension takes 265 microseconds per iteration, while the loop takes 312 microseconds—a whopping 18% slower!
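If you’d like to reproduce this comparison, `python -m timeit` accepts a multi-line snippet by giving each line as a separate argument; for example:

$ python -m timeit 'x = []' 'for i in range(1000): x.append(i ** 2)'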
`timeit` magic in Jupyter notebooks

You can use `timeit` within a Jupyter notebook to test the performance of code you are writing there, too. In a new cell, use `%timeit` followed by the function or expression you want to time, and use `%%timeit` at the top of a cell to time the execution of the entire cell.

If you have Jupyter installed on your machine, open a new notebook and try this now for the list comprehension and loop comparison. Do the timings match up with what you see at the command line?
Key Points

Use the `cProfile` module to get an overview of where your program is spending time
Save the results to a file and copy them to your machine to analyse them visually
Check profiles again after optimisation to see whether it was successful