Free-threaded Python (PEP 703) was released as an experimental build with Python 3.13 in October 2024. It enables true multi-threaded execution without the restriction of the GIL.
I previously covered this in Free Threaded Python With Asyncio; the example used there was selected because it very clearly demonstrates the performance increase, or rather, the threads "running free".
But that's not really code that solves a problem; it simply repeats the same calculation. As I've been having a lot of fun with Advent of Code this December, it provided a very good testbed for free threading.
AOC Day 6
Spoilers for Day 6 of 2024 ahead! Full source code here.
Day 6 was a good example of a problem that can be solved using parallelisation.
Part 1 involved tracing the path of a guard through a map. Part 2 then asks for locations where we could place a single obstacle and cause the guard to form a loop.
Here's my full solution:
from __future__ import annotations

from contextlib import contextmanager
from dataclasses import dataclass
from itertools import pairwise
from pathlib import Path
from typing import Hashable, Iterable, Iterator, Self, assert_never

type Pair = tuple[int, int]

up = 0
right = 1
down = 2
left = 3


def turn(direction: int) -> int:
    return (direction + 1) % 4


@dataclass(slots=True)
class Ranges:
    """Traversable location in the map."""

    rows: range
    cols: range

    def __getitem__(self, key: tuple[slice, int] | tuple[int, slice]) -> Iterator[Pair]:
        match key:
            case int() as r, slice() as cols:
                return ((r, c) for c in self.cols[cols])
            case slice() as rows, int() as c:
                return ((r, c) for r in self.rows[rows])
            case _ as other:
                assert_never(other)

    def walk(self, start: Pair, direction: int) -> Iterator[Pair]:
        match direction:
            case 0:
                return ((row, start[1]) for row in self.rows[start[0] :: -1])
            case 2:
                return ((row, start[1]) for row in self.rows[start[0] :: 1])
            case 3:
                return ((start[0], col) for col in self.cols[start[1] :: -1])
            case 1:
                return ((start[0], col) for col in self.cols[start[1] :: 1])
            case _:
                assert False


@dataclass(slots=True)
class Grid:
    start: Pair
    tiles: list[list[bool]]
    ranges: Ranges

    @classmethod
    def from_lines(cls, lines: Iterable[str]) -> Self:
        start = None
        coord = (0, 0)
        rows = []
        for r, line in enumerate(lines):
            row: list[bool] = []
            for c, char in enumerate(line):
                coord = (r, c)
                if char == "^":
                    start = coord
                row.append(char == "#")
            rows.append(row)
        assert start is not None
        return cls(start, rows, Ranges(range(len(rows)), range(len(rows[0]))))

    def __getitem__(self, key: Pair) -> bool:
        return self.tiles[key[0]][key[1]]


@contextmanager
def read_lines(path: Path) -> Iterator[Iterator[str]]:
    with path.open() as f:
        yield (line.rstrip() for line in f)


def walk(grid: Grid, obstruction: Pair | None = None) -> Iterator[Pair]:
    direction = up
    start = grid.start
    yield start
    while True:
        walk = pairwise(grid.ranges.walk(start, direction))
        for prev, curr in walk:
            if grid[curr] or curr == obstruction:
                direction = turn(direction)
                start = prev
                break
            yield curr
        else:
            return


def loops[T: Hashable](it: Iterator[T]) -> bool:
    visited = set()
    for e in it:
        if e in visited:
            return True
        visited.add(e)
    return False


if __name__ == "__main__":
    with read_lines(Path("inputs") / "d6.txt") as lines:
        grid = Grid.from_lines(lines)

    path = set(walk(grid))
    print("part1", len(path))

    candidates = (node for node in path if node != grid.start)
    print(
        "part2",
        sum(1 for node in candidates if loops(pairwise(walk(grid, node)))),
    )
A few details I want to highlight here:
- Grid implements __getitem__ so we can access locations on the map more directly, e.g. grid[1, 2] will access location (1, 2) (see the short sketch after this list)
- Ranges represents the full range of locations in the map and allows us to easily trace paths in a single direction
- walk brings everything together and handles the logic when we hit an obstruction.
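To make these helpers concrete, here is a tiny usage sketch of my own on a hypothetical 3x3 map (not part of the original solution):

grid = Grid.from_lines(["..#", ".^.", "..."])  # obstacle at (0, 2), guard at (1, 1)

print(grid[0, 2])   # True: grid[r, c] reads a tile via Grid.__getitem__
print(grid.start)   # (1, 1)

# Ranges.walk yields the tiles from a start point in one direction (start included):
print(list(grid.ranges.walk(grid.start, up)))  # [(1, 1), (0, 1)]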
Part 2 is what we're more interested in here. We iterate over all locations on the path (around 5,000 in my case) and check whether adding an obstruction there forms a loop.
Parallelisation
I timed just part 2, and it takes around 4 seconds on my 11-core M3 Mac. However, it should be simple to parallelise, as each of these walks is independent and the grid is only ever read, never modified.
Using ThreadPoolExecutor, we can change the last part of the code to run in parallel:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    results = executor.map(
        lambda node: loops(pairwise(walk(grid, node))),
        candidates,
    )
    print("part2", sum(1 for r in results if r))
I was then shocked to find that the execution was almost twice as slow, at nearly 10 seconds.
Hypotheses
There are several potential reasons to explain this kind of behaviour.
Specialization
Python 3.11 shipped with performance enhancements via specialisation (PEP 659).
Free-threaded Python has specialisation disabled in threaded code; see https://peps.python.org/pep-0703/#improved-specialization for more details.
The overhead should, however, be fairly small: according to the PEP, the slowdown should be within 10%.
Lock contention
The GIL's job was to protect objects from concurrent access. This is still important in free-threaded Python, so per-object locks take the place of the GIL.
These locks can still cause threads to block while waiting on them.
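As a rough illustration (my own sketch, not from the post), a hot loop that mutates a single shared built-in from many threads will serialise on that object's lock:

from concurrent.futures import ThreadPoolExecutor

def append_many(target: list[int], n: int = 100_000) -> None:
    for i in range(n):
        target.append(i)  # each append acquires the list's per-object lock

shared: list[int] = []
with ThreadPoolExecutor(max_workers=8) as executor:
    for _ in range(8):
        executor.submit(append_many, shared)  # all eight workers contend on one lock

Timing this against eight private lists would show the difference; the point is simply that "no GIL" does not mean "no locks".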
Thread start-up time
This is more of an issue for processes and sub-interpreters than for threads; see Anthony Shaw's post. So start-up time cannot explain the slowness.
Profiling
To test my theories, my immediate thought was to just run cProfile:
python -m cProfile d6.py
... and nothing! The process hangs, and no output is shown. It turns out there's a bug in cProfile for free-threaded builds of Python 🤦.
Well, this made the problem considerably harder. I tried a few other profiler options without success. I briefly considered looking into sys.monitoring, but it would probably take a lot more research to write my own profiler.
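For what it's worth, sys.monitoring does expose enough to build a crude call counter. A minimal sketch, assuming all we want is per-function call counts (my own rough idea, nowhere near a real profiler):

import sys
from collections import Counter

call_counts: Counter[str] = Counter()

def on_py_start(code, instruction_offset):
    # Fires whenever a Python function starts executing.
    call_counts[code.co_qualname] += 1

TOOL = sys.monitoring.PROFILER_ID
sys.monitoring.use_tool_id(TOOL, "crude-call-counter")
sys.monitoring.register_callback(TOOL, sys.monitoring.events.PY_START, on_py_start)
sys.monitoring.set_events(TOOL, sys.monitoring.events.PY_START)

# ... run the workload, then disable events and inspect call_counts.most_common(10)

Whether this behaves well on the free-threaded build is exactly the kind of thing I'd have to research first.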
Experiment, Research, Repeat
Without good profiling the only thing to do is to understand the problem better and try a bunch of things.
Time per thread count
The first thing I checked was the performance with respect to the thread count. By default, Python sets max_workers to min(32, n + 4), where n is the number of cores in the system.
for workers in range(1, 16):
    candidates = (node for node in path if node != grid.start)
    with timer(f"{workers = }: "):  # timer: a small context manager that prints the elapsed time
        with ThreadPoolExecutor(max_workers=workers) as executor:
            results = executor.map(
                lambda node: loops(pairwise(walk(grid, node))),
                candidates,
            )
            print("part2", sum(1 for r in results if r), end="; ")
(advent2024-python) ➜ advent2024-python git:(main) ✗ python d6.py
part1 4883
part2 1655; workers = 1: 4.060725 s elapsed
part2 1655; workers = 2: 4.789434 s elapsed
part2 1655; workers = 3: 5.203642 s elapsed
part2 1655; workers = 4: 5.516271 s elapsed
part2 1655; workers = 5: 6.114952 s elapsed
part2 1655; workers = 6: 7.461776 s elapsed
part2 1655; workers = 7: 7.562252 s elapsed
part2 1655; workers = 8: 8.026832 s elapsed
part2 1655; workers = 9: 8.335291 s elapsed
part2 1655; workers = 10: 8.731357 s elapsed
part2 1655; workers = 11: 9.152564 s elapsed
part2 1655; workers = 12: 9.476967 s elapsed
part2 1655; workers = 13: 9.589462 s elapsed
part2 1655; workers = 14: 9.955716 s elapsed
part2 1655; workers = 15: 9.728167 s elapsed
So more threads means less performance here; this strongly suggests some sort of lock contention.
How locking works
Built-in types use internal locks to guarantee thread safety.
I read the PEP, and it appears that the locking is rather complicated. For the most part, read access is "optimistic" and avoids taking the lock. Since we only ever read the list, it's unlikely to be the culprit. However, to eliminate any doubt, I switched to immutable tuples, which do not require locking:
@dataclass(slots=True)
class Grid:
    start: Pair
    tiles: tuple[tuple[bool, ...], ...]
    ranges: Ranges
As expected, this did not solve the issue: the performance degradation did not go away.
Reference counting
This is going very deep into Python internals, not something I'm particularly comfortable with. I did get a bit of a clue here, where someone else also ran into similar performance issues.
Reference counting is the primary method Python uses to manage memory: each object stores a count of the references to it, and when that count hits 0 the object is freed.
The bigger implication here is that merely referencing an object requires its counter to be incremented, which has to be synchronised when the object is referenced from multiple threads.
Free-threaded Python uses a form of reference counting called "biased reference counting". This allows the thread that created an object to update its reference count more cheaply than other threads can.
Overall, this means that even read-only access can be slowed down due to reference counting.
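To get a feel for this, here's a rough micro-benchmark of my own (illustrative only, not from the solution above): each worker hammers a read-only method on either one shared object or its own private object. On a free-threaded build, the shared case can come out slower simply because every thread is bumping the same reference count.

import time
from concurrent.futures import ThreadPoolExecutor


class Box:
    __slots__ = ("value",)

    def __init__(self, value: int) -> None:
        self.value = value

    def get(self) -> int:
        return self.value


def hammer(box: Box, n: int = 500_000) -> int:
    total = 0
    for _ in range(n):
        total += box.get()  # read-only access, but it still touches box's refcount
    return total


def bench(shared: bool, workers: int = 8) -> float:
    shared_box = Box(1)
    boxes = [shared_box if shared else Box(1) for _ in range(workers)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        list(executor.map(hammer, boxes))
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"shared object:      {bench(shared=True):.3f} s")
    print(f"per-thread objects: {bench(shared=False):.3f} s")

The exact numbers depend heavily on the interpreter version and the hardware, so treat it as a probe rather than a benchmark.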
Immortal objects
See PEP 683: this is another recent change, which skips reference counting entirely by marking certain objects as immortal. Immortalization is used in free-threaded Python as well, but it's only really applied to True, False and None, as well as classes and top-level functions.
My grid already uses booleans, so it's already benefiting from immortal objects. I did use Enum in an earlier version, and switching away from it definitely improved performance, though not by much. Nevertheless, immortalization might solve performance issues in other situations.
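A quick way to spot immortal objects (an aside of mine, not something from the solution): their reported reference count is a huge sentinel value that never changes.

import sys

x = object()
print(sys.getrefcount(x))     # a small, ordinary count
print(sys.getrefcount(None))  # a huge constant value: None is immortal
print(sys.getrefcount(True))  # likewise for True and False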
The Breakthrough
After many iterations, I finally looked at the following line:
if grid[curr] or curr == obstruction:
This uses the magic method __getitem__:
def __getitem__(self, key: Pair) -> bool:
    return self.tiles[key[0]][key[1]]
which references the Grid object and then, in turn, the tiles attribute, both of which are shared across all threads.
This code also lives in the walk function, inside the double loop, so it's called very frequently!
def walk(grid: Grid, obstruction: Pair | None = None) -> Iterator[Pair]:
    ...
    while True:
        ...
        for prev, curr in walk:
            if grid[curr] or curr == obstruction:
                ...
            ...
        ...
I'd like to claim that I intuitively looked here, but really I looked in a lot of different places before finding this.
In any case, we can bypass the __getitem__ call:
for prev, (r, c) in walk:
    if grid.tiles[r][c] or (r, c) == obstruction:
Running the code again:
(advent2024-python) ➜ advent2024-python git:(main) ✗ python d6_threads.py
part1 4883
part2 1655; workers = 1: 3.801581 s elapsed
part2 1655; workers = 2: 3.10458 s elapsed
part2 1655; workers = 3: 2.925348 s elapsed
part2 1655; workers = 4: 2.944652 s elapsed
part2 1655; workers = 5: 3.200314 s elapsed
part2 1655; workers = 6: 4.337672 s elapsed
part2 1655; workers = 7: 4.507129 s elapsed
part2 1655; workers = 8: 4.739261 s elapsed
part2 1655; workers = 9: 4.932432 s elapsed
part2 1655; workers = 10: 5.219045 s elapsed
part2 1655; workers = 11: 5.798776 s elapsed
part2 1655; workers = 12: 6.20557 s elapsed
part2 1655; workers = 13: 6.459673 s elapsed
part2 1655; workers = 14: 6.571967 s elapsed
part2 1655; workers = 15: 6.590083 s elapsed
This is actually the first time we've seen a positive performance gain from adding threads. Curiously, the higher thread counts still hurt performance.
Taking this further, we still reference grid inside the loop, so we can hoist tiles into a local variable outside the loop instead:
ranges = grid.ranges  # hoisted as well, since the loop below uses it directly
tiles = grid.tiles
while True:
    walk = pairwise(ranges.walk(start, direction))
    for prev, (r, c) in walk:
        if tiles[r][c] or (r, c) == obstruction:
            ...
Once again:
(advent2024-python) ➜ advent2024-python git:(main) ✗ python d6_threads.py
part1 4883
part2 1655; workers = 1: 3.572743 s elapsed
part2 1655; workers = 2: 2.257732 s elapsed
part2 1655; workers = 3: 1.665534 s elapsed
part2 1655; workers = 4: 1.38678 s elapsed
part2 1655; workers = 5: 1.383013 s elapsed
part2 1655; workers = 6: 2.540588 s elapsed
part2 1655; workers = 7: 3.520335 s elapsed
part2 1655; workers = 8: 4.365446 s elapsed
part2 1655; workers = 9: 5.072404 s elapsed
part2 1655; workers = 10: 5.722957 s elapsed
part2 1655; workers = 11: 6.087521 s elapsed
part2 1655; workers = 12: 6.179896 s elapsed
part2 1655; workers = 13: 6.174642 s elapsed
part2 1655; workers = 14: 6.194982 s elapsed
part2 1655; workers = 15: 6.100845 s elapsed
A decent performance gain when using 3, 4 or 5 workers.
Worker count
It seems to me like there's a sweet spot between the worker count and the reference count contention. 4 workers is quite consistently good on my machine; add more and the reference count contention starts to dominate the timing.
It's also possible that it's an indication of performance cores vs efficiency cores on the M3 processor, but I believe this is less likely as we've already demonstrated how dramatic the differences are when it comes to reference count contention.
Key Takeaways
When looking at free threading in Python, here are the key things to look out for.
Threads are not free
Threads are not free in terms of performance; you definitely have to pay for them. And concurrent access means you can't really be free of locks.
Avoid shared mutable writes
In this example we didn't have any shared writes; however, based on my previous testing, they can be extremely slow.
Try to structure objects so that they don't need to be mutated across threads.
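A pattern that tends to work well here (a hedged sketch, not something from the solution above): let each worker accumulate into its own structure and merge the results in the main thread, so no object is ever written from two threads at once.

from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(chunk: list[str]) -> Counter[str]:
    local: Counter[str] = Counter()
    for item in chunk:
        local[item] += 1  # thread-local write only
    return local


def count_all(chunks: list[list[str]]) -> Counter[str]:
    total: Counter[str] = Counter()
    with ThreadPoolExecutor() as executor:
        for partial in executor.map(count_chunk, chunks):
            total.update(partial)  # merge in the main thread
    return total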
Minimise shared reads
I don't believe shared reads can be avoided entirely; easy access to shared memory is the biggest advantage of multi-threading. But, as we saw with grid.tiles, hoisting hot shared reads into locals can make a big difference.
Worker count may be important
A direct takeaway from the idea that "threads are not free" is that you may need to reduce the number of threads based on the situation.
It's important to test the code with different numbers of worker threads.
Profiling is important
Profiling is extremely important; I hope cProfile gets fixed, or that some other profiler can help with this situation.
Conclusion
There's a lot to unpack here, but my biggest thought is that multi-threading in Python is actually deceptively difficult. Simple Python operations may have an amplified performance impact.
Debugging performance problems should get easier when we have profiling working again, so this shouldn't be a big issue for engineers looking to solve parallel problems.
However, I'm a bit worried about existing multi-threaded code. Before free-threaded Python, threads were used for IO-bound operations, where the GIL is released. It will be a big challenge if, one day, the GIL is disabled by default and that code suddenly runs many times slower.
I'm starting to think that the approach sub-interpreters are taking, with brand-new concurrency primitives and shared-memory data structures, has a lot of merit. I like the idea of adding a new dimension of concurrency rather than modifying the current one. Perhaps I'll try to implement my AOC solution using sub-interpreters.
Finally, despite the effort it took me here, I can see that a lot of work has gone into making the GIL optional in Python, from immortalization to biased reference counting and many more changes I haven't covered. There is definitely a lot of room for performance improvement, and I can't wait to see how framework authors take advantage of free-threading.