8,242 reads

FAT Python : the next chapter in Python optimization

by Anthony ShawJuly 10th, 2017

Too Long; Didn't Read

The FAT Python project was started by Victor Stinner in October 2015 to try to solve issues of previous attempts of “static optimizers” for Python. Victor has created a set of changes to CPython (Python Enhancement Proposals or “PEPs”), some example optimizations and benchmarks.

featured image - FAT Python : the next chapter in Python optimization

We’ll explore those 3 levels in this article.

Optimizations in the AST

Not sure what the AST is? Then read my other article first.

Transforming an Abstract Syntax Tree (AST) is a more logical way to optimize code. By staying at a higher-level you can make decisions traversing up, down and across a branch.

For example, some optimizations which can be implemented with an AST optimizer:

Copy propagation : replace x=1; y=x with x=1; y=1
Constant folding : replace 1+1 with 2
Dead code elimination

Using guards (see next chapter), it is possible to implement a much wider choice of optimizations. Examples:

Simplify iterable: replace range(3) with (0, 1, 2) when used as iterable
Loop unrolling
Call pure builtins: replace len(“abc”) with 3
Copy used builtin symbols to constants

How does FAT work

FAT Optimizer is a set of static optimizations, based on 3 major changes to CPython. These changes may be merged into Python 3.7 or 3.8, but for now we have to compile them by hand.

Why do we need to make changes to CPython? Well, CPython doesn’t have an API for optimizing the AST. The AST is not pluggable, so enter PEP 511..

PEP 511

PEP 511 is a proposal to add a process to optimize an AST instance. The AST instance is a object-oriented representation of your code. The optimizer could be generic, looking at general optimizations, or more bespoke.

A bespoke optimizer could look at a set of domain specific changes, e.g. NumPy or Pandas “anti-patterns” and optimize them in the syntax tree. In replacement of a static linter that simply recommends changes, the optimizer could make those changes for you.

PEP 511 draws out a change to the sys module, to set optimizers which can be used, then it provides the API for optimizer instances.

In future, optimizers could be included in Python packages and shared on PyPi (pretty cool-huh!?).

PEP 511 also assumes that in order to be effective, the optimizers will need some other changes to core CPython internals, enter PEP 509..

PEP 509

In Python, the builtin dict type is used by many instructions. For example, the LOAD_GLOBAL instruction looks up a variable in the global namespace, or in the builtins namespace (two dict lookups). Python uses dict for the builtins namespace, globals namespace, type namespaces, instance namespaces, etc. The local namespace (function namespace) is usually optimized to an array, but it can be a dict too.

Python is hard to optimize because almost everything is mutable: builtin functions, function code, global variables, local variables, … can be modified at runtime. Implementing optimizations respecting the Python semantics requires to detect when “something changes”: we will call these checks “guards”.

The speedup of optimizations depends on the speed of guard checks. PEP 509 proposes to add a private version to dictionaries to implement fast guards on namespaces.

How do you get guards in Python? Enter PEP 510

PEP 510

PEP 510 proposes to add a public API to the Python C API to add specialized codes with guards to a function. When the function is called, a specialized code is used if nothing changed, otherwise use the original bytecode.

It’s simpler to see all this in action so let’s install those PEPs and test out an optimizer.

Getting FAT running

Compile a copy of CPython 3.6 patched with PEP 509, PEP 510 and PEP 511. You’ll need a C compiler to do this, if you don’t have one already, GCC is the easy option.

git clone https://github.com/haypo/cpython -b fatpython fatpythoncd fatpython./configure CFLAGS='-O0' --enable-sharedmake

On my mac, I had to copy _sysconfigdata_dm_darwin_darwin.py from the build directory to the lib directory.

Install fat. FAT is an implementation of PEP510 for specialised guards.

git clone https:**//github.com/haypo/**fatcd fat**../python setup.**py buildcp **-v build/lib*/fat.***so **../**Libcd ..

For OS X users, use ./python.exe instead of ./python throughout this article

Install fatoptimizer, an implementation of PEP 511 with a set of optimizations.

git clone https:**//github.com/haypo/**fatoptimizercd Libln **-**s **../fatoptimizer/**fatoptimizer .

Now we have a working CPython executable with FAT. A simple test can show the guards described in PEP 510.

Now to compare speed, run a basic timeit, checking the call of a function that returns the length of a basic string.

$ ./python.exe -X fat -m timeit -s ‘def f(): return len(“abc”)’ ‘f()’ 2000000 loops, best of 5: 115 nsec per loop

I tested this again (removing -X fat ) with CPython 3.6 and 2.7. That’s a 24% improvement over 3.6.

The optimizations themselves are described in the fatoptimizer documentation.

How does it work?

If you disassemble the basic test, it looks normal. Call a builtin function, return value.

>>> import dis>>> dis.dis(func)1 0 LOAD_GLOBAL 0 (len)3 LOAD_CONST 1 ('abc')6 CALL_FUNCTION 1 (1 positional, 0 keyword pair)9 RETURN_VALUE

But, with fat you can access the specialized (i.e. optimized) version of that function. This was generated when the method was loaded into the ast. Fat preserves the original ast and stores an optimized method.

>>> dis.dis(fat.get_specialized(func)[0][0])1 0 LOAD_CONST 1 (3)3 RETURN_VALUE

When we called timeit over our method, it was running the optimized function, which just returns 3 instead of calling len . It’s clear to see why that’s faster.

However, in Python you can replace anything with anything, including a builtin in the global namespace. If you were to replace len with something else, it removes the guard and executes the original ast.

What next for FAT?

There are connections to the FAST_METHODCALL implementation that are being reintroduced (I think) in 3.7. This means less overhead for calling methods, a big slowdown in CPython.

Combining these 3 PEPs, we could see both implementation of guards as well as well as a range of optimizers out on PyPi.