Python Fellow, ASF Member and hacker
The FAT Python project was started by Victor Stinner in October 2015 to try to solve issues of previous attempts of “static optimizers” for Python. Victor has created a set of changes to CPython (Python Enhancement Proposals or “PEPs”), some example optimizations and benchmarks.
We’ll explore those 3 levels in this article.
Not sure what the AST is? Then read my other article first.
Transforming an Abstract Syntax Tree (AST) is a more logical way to optimize code. By staying at a higher-level you can make decisions traversing up, down and across a branch.
For example, some optimizations which can be implemented with an AST optimizer:
Using guards (see next chapter), it is possible to implement a much wider choice of optimizations. Examples:
FAT Optimizer is a set of static optimizations, based on 3 major changes to CPython. These changes may be merged into Python 3.7 or 3.8, but for now we have to compile them by hand.
Why do we need to make changes to CPython? Well, CPython doesn’t have an API for optimizing the AST. The AST is not pluggable, so enter PEP 511..
PEP 511 is a proposal to add a process to optimize an AST instance. The AST instance is a object-oriented representation of your code. The optimizer could be generic, looking at general optimizations, or more bespoke.
A bespoke optimizer could look at a set of domain specific changes, e.g. NumPy or Pandas “anti-patterns” and optimize them in the syntax tree. In replacement of a static linter that simply recommends changes, the optimizer could make those changes for you.
PEP 511 draws out a change to the
sys module, to set optimizers which can be used, then it provides the API for optimizer instances.
In future, optimizers could be included in Python packages and shared on PyPi (pretty cool-huh!?).
PEP 511 also assumes that in order to be effective, the optimizers will need some other changes to core CPython internals, enter PEP 509..
In Python, the builtin dict type is used by many instructions. For example, the
LOAD_GLOBAL instruction looks up a variable in the global namespace, or in the builtins namespace (two dict lookups). Python uses dict for the builtins namespace, globals namespace, type namespaces, instance namespaces, etc. The local namespace (function namespace) is usually optimized to an array, but it can be a dict too.
Python is hard to optimize because almost everything is mutable: builtin functions, function code, global variables, local variables, … can be modified at runtime. Implementing optimizations respecting the Python semantics requires to detect when “something changes”: we will call these checks “guards”.
The speedup of optimizations depends on the speed of guard checks. PEP 509 proposes to add a private version to dictionaries to implement fast guards on namespaces.
How do you get guards in Python? Enter PEP 510
PEP 510 proposes to add a public API to the Python C API to add specialized codes with guards to a function. When the function is called, a specialized code is used if nothing changed, otherwise use the original bytecode.
It’s simpler to see all this in action so let’s install those PEPs and test out an optimizer.
git clone https://github.com/haypo/cpython -b fatpython fatpython
./configure CFLAGS='-O0' --enable-shared
On my mac, I had to copy
_sysconfigdata_dm_darwin_darwin.pyfrom the build directory to the lib directory.
Install fat. FAT is an implementation of PEP510 for specialised guards.
git clone https://github.com/haypo/fat
../python setup.py build
cp -v build/lib*/fat.*so ../Lib
For OS X users, use
./pythonthroughout this article
fatoptimizer, an implementation of PEP 511 with a set of optimizations.
git clone https://github.com/haypo/fatoptimizer
ln -s ../fatoptimizer/fatoptimizer .
Now we have a working CPython executable with FAT. A simple test can show the guards described in PEP 510.
Now to compare speed, run a basic timeit, checking the call of a function that returns the length of a basic string.
$ ./python.exe -X fat -m timeit -s ‘def f(): return len(“abc”)’ ‘f()’
2000000 loops, best of 5: 115 nsec per loop
I tested this again (removing
-X fat ) with CPython 3.6 and 2.7. That’s a 24% improvement over 3.6.
The optimizations themselves are described in the fatoptimizer documentation.
If you disassemble the basic test, it looks normal. Call a builtin function, return value.
>>> import dis
1 0 LOAD_GLOBAL 0 (len)
3 LOAD_CONST 1 ('abc')
6 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
But, with fat you can access the specialized (i.e. optimized) version of that function. This was generated when the method was loaded into the ast. Fat preserves the original ast and stores an optimized method.
1 0 LOAD_CONST 1 (3)
When we called timeit over our method, it was running the optimized function, which just returns 3 instead of calling
len . It’s clear to see why that’s faster.
However, in Python you can replace anything with anything, including a builtin in the global namespace. If you were to replace
len with something else, it removes the guard and executes the original ast.
There are connections to the FAST_METHODCALL implementation that are being reintroduced (I think) in 3.7. This means less overhead for calling methods, a big slowdown in CPython.
Combining these 3 PEPs, we could see both implementation of guards as well as well as a range of optimizers out on PyPi.