People often ask me about improving the performance of their Python code. They even sometimes complain about Python being slow overall. This complaint often stems from the fact that people tend to use Python quite naively, without realizing the performance impact of certain constructs.

While optimization and performance tuning of existing code is important, there are some things to keep in mind while writing Python code. These things will let you avoid your code being unnecessarily slow in most cases.

In this post, we're going to take a bit of a dive, and look into how Python executes code; into functions and functional programming through the lens of performance.

Pythonically Opinionated

In my mind, you can think of Python as "opinionated" in terms of programming style. Some things are regarded as idiomatic ("Pythonic"), and others -- are not. The reasons for this are twofold: code style and performance.

This philosophy is also quite well explained by the following line from the Zen of Python:

There should be one-- and preferably only one --obvious way to do it.

Explaining what is Pythonic, people will tell you about the importance of being idiomatic, about things like iteration, context managers, the object protocol etc, but they tend to neglect performance.

I think the essence of why you should care is this:

When you learn the Python's way of doing things (the "one and preferably only one obvious way"), you not only make your code neater and more concise, you also usually make it faster.

This is because the most common use cases have been carefully thought through and optimized by the Python contributors and the community. Things like string manipulation, regular expressions, I/O operations, dict operations, comprehensions etc... are actually quite fast in Python, but you just have to use the right constructs (or libraries e.g. NumPy for numerical operations).

For ins and outs of what constitutes idiomatic Python, I refer to the excellent book "Fluent Python" by Luciano Ramalho and, of course, the Python documentation.

Throughout this post, I will be exploring the performance of functions and functional constructs, referring to Python implementation details where appropriate. Let's get into it!

Function call overhead

Let's begin by making the following observation: function calls in Python are not very fast. Don't get me wrong, this doesn't mean that you should not use functions at all, quite the contrary. The point is that function calls alone can take non-trivial amount of time and that you should be aware of this.

To prove my point, let's perform a quick experiment. We'll measure how long it takes to call an empty function.

(In case you're wondering, the %timeit "magic" function will only work in IPython shell. I really recommend replacing your vanilla Python shell with IPython for interactive experiments. You can also set it as the default shell for IDEs like PyCharm. Using IPython will make the code below copy-pasteable, as it intelligently skips >>>).

>>> def nothing():
>>>    pass
>>> %timeit nothing()
# 36.9 ns ± 0.16 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

For an interesting comparison, a single dict lookup takes roughly 20 ns on my machine:

>>> d = {1: 2}
>>> %timeit d[1]
# 20.2 ns ± 0.134 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

You'd think a dict lookup is much more complicated than an empty function call, and should be slower. The cost of calling an empty function is actually quite high!

Let's explore this further. This time, let's use a method (dict.get) to retrieve the value under key 1.

>>> d = {1: 2}
>>> %timeit d.get(1)
# 35.3 ns ± 0.193 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Do you see a pattern emerging? Function (or method) call is a bit slower than "native" dict access. Let's get into the technical details after exploring a few more examples.

Functional? Imperative? Pythonic?

def process_functional(values):
    processed = map(lambda x: x * 3,
                    filter(lambda x: x % 2 == 0, values))

    # materialize the processed elements in a list
    return list(processed)

See anything wrong? Coming from a functional, compiled language this feels perfectly natural and is the last place you would suspect of any inefficiency. Mapping and filtering should be a zero-cost abstraction. But this is Python!

Let's measure this construct against what I would call a "Pythonic" equivalent: a list comprehension:

def process_comprehension(values):
    return [x * 3 for x in values if x % 2 == 0]

Observe that this is much less code. For better understanding of what's going on, let's also include a function with "naive" imperative approach to solving the same problem.

def process_loop(values):
    result = []

    for value in values:
        if value % 2 == 0:
            result.append(value * 3)

    return result

Let's measure all these:

>>> many_many_values = list(range(10 ** 7))
>>> %timeit process_functional(many_many_values)
# 940 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit process_comprehension(many_many_values)
# 409 ms ± 896 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit process_loop(many_many_values)
# 512 ms ± 904 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

The function using comprehensions wins big time, taking less than half of the mean run time compared to functional approach. The "naive" loop solution is slower than comprehension, but not that much. To fully appreciate why this is, we need to keep digging deeper.

The cost of being dynamic

In Python, everything is interpreted and dynamic. This means that:

  1. Every line of code is executed as-is, with little-to-none optimization (see this blog post from Chris Wellons).

  2. Variable lookups (including globals and builtins) are not free. Python maintains dicts of variable names in each "execution frame", and variable lookups are really dict lookups behind the scenes.

    Side note: there are some optimizations for local variables variables.

  3. Everything is dynamically type-checked before it's executed every time. For example, in x * 3, we need to check if the types are right to perform the multiplication (and which implementation to call, because you can provide custom implementations using __mul__).

  4. Built-in and compiled mechanisms are usually going to be faster than "pure Python" code. The overhead of fully dynamic execution can be mitigated or otherwise optimized using techniques simply not available in pure Python.

Yes, the CPython interpreter is quite dumb. In Python C source code simplicity and clarity is usually preferred over performance. But at the same time, this is in part why Python is so popular and successful, it's easy to extend and to integrate with other code.

Also worth nothing is that it's just hard to anticipate how many function calls are going to be made in each line of Python code. Python is a very-high level language and all this abstraction and convenience costs.

There might be other Python implementations out there (such as PyPy) that can make things faster, but CPython is still the default choice for most users and I'm going to focus on CPython exclusively.

What's going on behind the scenes

Let's take another look at our functions, this time getting down to the bytecode level. Python's virtual machine operates on bytecode. Before anything is executed, the code is first parsed into AST (Abstract Syntax Tree) and then compiled into bytecode that's understandable by the CPython virtual machine:

.py code --------> AST ----------> bytecode -----> execute
          (parse)       (compile)

Whereas the process of parsing and compiling .py code is hidden from the end user, Python makes it easy to access both the AST and bytecode.

We can easily explore the bytecode of any code file and any Python object. The dis.dis standard library function enables just that, as it prints the bytecode in a (fairly) human-readable way.

Disassembling the examples

Let's break down the dict access examples shown earlier:

>>> import dis
>>> d = {1: 2}
>>> def dict_lookup():
>>>     return d[1]
>>> def dict_get():
>>>     return d.get(1)
>>> dis.dis(dict_lookup)
  3           0 LOAD_GLOBAL              0 (d)
              2 LOAD_CONST               1 (1)
              4 BINARY_SUBSCR
              6 RETURN_VALUE

>>> dis.dis(dict_get)
  6           0 LOAD_GLOBAL              0 (d)
              2 LOAD_METHOD              1 (get)
              4 LOAD_CONST               1 (1)
              6 CALL_METHOD              1
              8 RETURN_VALUE

We can immediately tell that there's more instructions to be executed when there's a method call. Dict lookup uses a dedicated instruction. This explains why dict lookup is faster, as measured before.

Let's take a look at our more involved examples:

>>> def process_functional(values):
>>>     processed = map(lambda x: x * 3,
>>>                     filter(lambda x: x % 2 == 0, values))
>>>     return list(processed)
>>> dis.dis(process_functional)
  2           0 LOAD_GLOBAL              0 (map)
              2 LOAD_CONST               1 (<code object <lambda> at 0x7fd1773865b0, ...snip...>)
              4 LOAD_CONST               2 ('process_functional.<locals>.<lambda>')
              6 MAKE_FUNCTION            0

  3           8 LOAD_GLOBAL              1 (filter)
             10 LOAD_CONST               3 (<code object <lambda> at 0x7fd177383660, ...snip...>)
             12 LOAD_CONST               2 ('process_functional.<locals>.<lambda>')
             14 MAKE_FUNCTION            0
             16 LOAD_FAST                0 (values)
             18 CALL_FUNCTION            2

  2          20 CALL_FUNCTION            2
             22 STORE_FAST               1 (processed)

  6          24 LOAD_GLOBAL              2 (list)
             26 LOAD_FAST                1 (processed)
             28 CALL_FUNCTION            1
             30 RETURN_VALUE

Disassembly of <code object <lambda> at 0x7fd1773865b0, file "<...snip...>", line 2>:
  2           0 LOAD_FAST                0 (x)
              2 LOAD_CONST               1 (3)
              4 BINARY_MULTIPLY
              6 RETURN_VALUE

Disassembly of <code object <lambda> at 0x7fd177383660, file "<...snip...>", line 3>:
  3           0 LOAD_FAST                0 (x)
              2 LOAD_CONST               1 (2)
              4 BINARY_MODULO
              6 LOAD_CONST               2 (0)
              8 COMPARE_OP               2 (==)
             10 RETURN_VALUE

In the "functional" example, there's a lot of function calls. Each of the lambda functions gets created from a code object (which is pre-compiled and cached). These are then called by the built-in map and filter functions to produce the final result list.

Compare that to the comprehension example:

>>> def process_comprehension(values):
>>>    return [x * 3 for x in values if x % 2 == 0]
>>> dis.dis(process_comprehension)
  2           0 LOAD_CONST               1 (<code object <listcomp> at 0x7f6ba0703660, ...snip...>)
              2 LOAD_CONST               2 ('process_comprehension.<locals>.<listcomp>')
              4 MAKE_FUNCTION            0
              6 LOAD_FAST                0 (values)
              8 GET_ITER
             10 CALL_FUNCTION            1
             12 RETURN_VALUE

Disassembly of <code object <listcomp> at 0x7f6ba0703660, file "<...snip...>", line 2>:
  2           0 BUILD_LIST               0
              2 LOAD_FAST                0 (.0)
        >>    4 FOR_ITER                24 (to 30)
              6 STORE_FAST               1 (x)
              8 LOAD_FAST                1 (x)
             10 LOAD_CONST               0 (2)
             12 BINARY_MODULO
             14 LOAD_CONST               1 (0)
             16 COMPARE_OP               2 (==)
             18 POP_JUMP_IF_FALSE        4
             20 LOAD_FAST                1 (x)
             22 LOAD_CONST               2 (3)
             24 BINARY_MULTIPLY
             26 LIST_APPEND              2
             28 JUMP_ABSOLUTE            4
        >>   30 RETURN_VALUE

Python breaks down this code into a single "function" which it calls to retrieve the list value. The function simply loops over its argument, not doing any further function calls. Fewer instructions, fewer function calls, means faster performance.

I won't go through more details and what all of this means, as in real world it's not as useful as a good profiler, but it's still good to know how the bytecode works :)

Bytecode in details

The bytecode contains a limited subset of "codes" (instructions) for the virtual machine, with some instructions taking arguments, others being argument-less. This is almost like a CPU and assembly, except on a much higher level of abstraction. All the possible instructions vary between Python versions, but are easily found in the Python documentation and CPython source code (the links point to all opcodes for Python 3.8.3).

You can see the C code for how to execute each of these instructions in the ceval.c file of the Python source. Turns out, Python interpreter is "just" one giant loop executing the instructions one by one! It's also evident that it essentially implements a simple virtual "CPU" on top of our hardware CPU. It's essentially why Python is slower than compiled languages.

Functional constructs

Python does have the essential feature of any functional programming language: first class functions. This means that functions are objects and can be returned out of other functions, used as arguments, applied partially ("bound") etc.

Python's "functional" utilities have been aggregated in the functools standard library module.

Using them over "ad-hoc" constructs is considered idiomatic, but doesn't necessarily come with performance benefits.

Let's consider partial over a simple lambda:

>>> import functools
>>> def add(a, b):
>>>     return a + b
>>> add_two_adhoc = lambda v: add(v, 2)
>>> add_two_partial = functools.partial(add, 2)
>>> %timeit add_two_adhoc(3)
# 82.8 ns ± 0.143 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit add_two_partial(3)
# 87.4 ns ± 0.828 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

partial has actually made our code a little bit slower.

Let's try lru_cache:

>>> from functools import lru_cache
>>> def square(v):
>>>    return v ** 2
>>> cache = {}
>>> def simplecache(func):
>>>     def wrapper(v):
>>>         if v in cache:
>>>             return cache[v]
>>>         ret = func(v)
>>>         cache[v] = ret
>>>         return ret
>>>     return wrapper
>>> @simplecache
>>> def cached_square(v):
>>>     return v ** 2
>>> @lru_cache(maxsize=None)
>>> def lru_square(v):
>>>     return v ** 2
>>> %timeit square(10)
# 167 ns ± 0.116 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit cached_square(10)
# 79.1 ns ± 0.157 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit lru_square(10)
# 41 ns ± 0.0496 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

lru_cache is twice as fast as a simple decorator with a dict-based cache. It definitely will yield performance benefits because of its optimized C implementation.

Method calls

Method calls used to be quite slow in Python. People sometimes advised against using classes and Object-Oriented style in performance-critical parts of systems written in Python.

It's easy to see why if we compare the disassembly of a simple method call between Python 2.7 and 3.7:

>>> import dis
>>> d = {}
>>> get_key = lambda k: d.get(k)

# Python 2.7
>>> dis.dis(get_key)
  1           0 LOAD_GLOBAL              0 (d)
              3 LOAD_ATTR                1 (get)
              6 LOAD_FAST                0 (k)
              9 CALL_FUNCTION            1
             12 RETURN_VALUE

# Python 3.7
>>> dis.dis(get_key)
  1           0 LOAD_GLOBAL              0 (d)
              2 LOAD_METHOD              1 (get)
              4 LOAD_FAST                0 (k)
              6 CALL_METHOD              1
              8 RETURN_VALUE

Python 3.7 introduced dedicated opcodes for loading and calling methods. This is a nice little optimization that made method calls quite a lot faster. Compare Python 2.7 and Python 3.8:

>>> def nothing():
>>>     pass
>>> class A(object):
>>>     def empty(self):
>>>         pass
>>> a = A()

# Python 2.7
>>> %timeit nothing()
# 10000000 loops, best of 3: 37.6 ns per loop
>>> %timeit a.empty()
# 10000000 loops, best of 3: 57.5 ns per loop

# Python 3.8
>>> %timeit nothing()
37.5 ns ± 0.0742 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit a.empty()
48.3 ns ± 0.12 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

So, things have improved and avoiding methods and OOP might no longer be as necessary as it once was. However, it's still worth noting that method calls have an extra overhead compared to free functions.

Take home

What it all means for Python programmers, is the following:

  • Too many function calls will slow down tight and nested loops.

  • Any time you have a lot of elements to process quickly, prefer comprehensions over explicit loops over function calls. Comprehensions are optimized and are complied down to native bytecode instructions.

  • Functional constructs are perfectly fine to use, as long as you don't "overdo" it and lean on fast standard library mechanisms.

  • It's good to be aware of the underlying implementations of Python and its libraries. In most cases, libraries implemented in compiled languages like C will be faster than pure Python.

  • Method calls are a bit slower than free function calls, even though things have improved with Python 3.7.

  • It's good to optimize and refactor for performance, but this should only come when it's really necessary. You should avoid premature optimization and Make it work, make it right, make it fast.

    Avoiding function calls is a micro-optimization that can make your code structure worse. In the extreme case, it can lead to duplication of code fragments, to avoid functions or methods altogether. It's not unheard of though -- I've seen it a few times in performance-critical production code.


That's all for this post. If you're interested in the inner workings of CPython, I recommend these resources: