Asyncio is neither fast nor slow

Don't listen to random benchmarks..

I recently came across an article benchmarking Python performances in web frameworks, comparing asyncio and sync performance.

The author sets out to measure performance of FastAPI/Django web servers running with postgresql comparing async and non-async workloads. The methodology is pretty reasonable, the following is his result for an endpoint with a single postgres database read:

Server type	workers	RPS	Latency avg	Latency max	Median
Sync Django	1	456	140ms	262ms	153ms
Sync Django	2	669	96ms	262ms	132ms
Sync Django Pooled	1	569	112ms	171ms	117ms
Sync Django Pooled	2	1822	35ms	98ms	50ms
Async Django	1	205	312ms	467ms	331ms
Async Django	2	541	118ms	304ms	196ms
FastAPI	1	236	271ms	372ms	287ms
FastAPI	2	409	156ms	433ms	224ms

As a result the author concludes:

These benchmarks show just how much optimization for sync web services Django and the Python has. Sync Django Pooled outperforms or matches all other configurations. Even FastAPI only performs better when it’s the sole bottleneck.

When I read these result I had some doubts, I noticed that:

1.⁠ ⁠it shows that the best sync django scenario with 2 workers, has more than 4x throughput that the equivalent fastapi async solution. 2.⁠ ⁠2 workers of sync django pooled is 3x the performance of a single worker.

The former doesn't line with the my personal experience. Asyncio does add overhead in some cases, but the scenario laid out here should have postgres as the bottleneck. Overhead caused by asyncio's event loop shouldn't be significant.

The latter also seemed a little odd, as the throughput increase should be proportional to the number of workers.

Reviewing the code

Since the results don't match my own expectation then either my mental model is wrong or it's the benchmark. So let's review the benchmark code.

The code for the fastapi endpoint is as follows:

@app.get("/quote")
async def quote(db: AsyncSession = Depends(get_db)):
    statement = (
        select(Quote).order_by(func.random()).options(selectinload(Quote.author))
    )
    result = await db.execute(statement)
    quote = result.scalar()

    return {
        "quote": quote.quote_text,
        "author": quote.author.name,
    }

Here a single random quote is fetched from the table. SQLAlchemy is used as an orm to facilitate query building. Everything seems in order.

But wait, there's a subtle problem here! By default SQLAlchemy does not batch results, instead it will fetch all results into memory first when ⁠ execute ⁠ is called. So whilst ⁠ .scalar() ⁠only returns the first result, in the background all results are fetched.

Recreating the benchmark

So I recreated the benchmark with my code fix for fastapi.

statement = (
    select(Quote).order_by(func.random()).limit(1).options(selectinload(Quote.author))
)

I tried where possible to keep the same code and parameters to the original.

•⁠ ⁠Hardware: Macbook Pro M3 11 cores 18 GB of RAM •⁠ ⁠Connection Pool Configuration: min 5 - max 15 connections •⁠ ⁠Data: 100 authors with 1000 quotes •⁠ ⁠Benchmark: ⁠ rewrk -d 30s -c 64 --host http://localhost:8000/quote/ ⁠

Results

And the results are as follows:

Server type	workers	RPS	Latency avg	Latency max	Median
Sync Django	1	341.81	186.61ms	418.90ms	223.96ms
Sync Django	2	338.74	188.07ms	467.42ms	227.23ms
Sync Django Pooled	1	440.46	144.92ms	281.12ms	150.94ms
Sync Django Pooled	2	419.01	152.30ms	374.59ms	167.25ms
Async Django	1	323.32	197.28ms	497.76ms	249.85ms
Async Django	2	322.78	197.45ms	453.63ms	249.18ms
FastAPI	1	687.26	92.95ms	326.49ms	110.29ms
FastAPI	2	673.85	94.78ms	332.11ms	121.00ms

As expected the FastAPI implementation does a lot better here compared to before. Performing better than the best django version.

What's weird is that the worker count makes very little difference in the performance, often slowing down rather than speeding up. My guess is that my setup has more rows in the database, therefore we reach a saturation point a lot quicker.

The dilemma of benchmarks

Still you might be a bit disappointed with this benchmark as I was, the original had a much clearer trend between async/sync worker count and pooled database connections. However this is precisely what happens with these "real world" benchmarks. Here we're trying to represent a real use case, not necessarily to prove a point.

On the other hand, there exist many micro-benchmarks out there, where variables are controlled and a smaller more predictable scenario is measured.

There are places for both types of benchmarks, real world benchmarks can be more interesting but the data is messy and requires more critical analysis.

Critical analysis

My results are convenient for my beliefs and one might conclude from it that asyncio is indeed much faster.

But looking at the numbers I wondered why Django is so much slower, where does the slow down actually come from. I expected asyncio to do quite well but I assumed we can make up the performance with more threads or workers, but that does not seem to be the case.

This lead me into a deep dive, trying almost every combination of configuration, implementation and framework.

First I wanted more control over the variables, Django is a very different framework compared to FastAPI and a there are too many possibilities when it comes to the performance discrepancy. Flask is a lighter weight framework compared to Django as such it is a much better sync vs async comparison.

Secondly, the original benchmark makes a point of the benefits of pooled connections. However, async django can also benefit from pooled connections. Sqlalchemy also uses a connection pool by default. So I see no reason to not use connection pools across the board.

The new results

This time around I decided to run different thread and worker configurations through the gauntlet and only keep the best performing configurations.

This was generally 3 workers and 5 threads on my machine, giving me a total of 15 threads. However other configurations with similar total thread counts also performed very similarly.

Server type	workers	threads	RPS	Latency avg	Latency max	Median
FastAPI (Pooled)	1	-	687.26	92.95ms	326.49ms	110.29ms
Flask (Pooled)	3	5	682.73	93.66ms	239.67ms	104.34ms
Django (Pooled)	3	5	411.91	155.17ms	299.19ms	169.44ms
Django Async (Pooled)	1	-	406.90	152.79ms	279.93ms	160.32ms

NOTE: asgi for the async frameworks don't use threads

And this time, the results are more even with the fastapi implementation pulling a little ahead over flask, though over many different runs the results are practically identical.

This is more an expected result, as once again the main bottleneck is the database and not the web framework.

These results show that there is no significant difference between the django implementations. The original discrepancies are easily explained by the the connection pooling.

The only thing that doesn't make sense to me is why django is slower than the flask implementation all else being equal.

Digging even deeper

My immediate thought was skill issues, I don't have a lot of experiences with django and it's possible I made a mistake that lead to performance degradation. Checking the code and the orm generated SQL query, nothing really stood out. Then I confirmed that the database driver we're using is indeed psycopg 3 so no differences there.

Finally, I decided to give a closer look at the connection pool used in django, and that's where it all clicked. django uses psycopg_pool an implementation provided directly by psycopg. Where as sqlalchemy has its own implementation. The differences are actually significant,

sqlalchemy has a lazy pool implementation, which means that connections are created or destroyed only when the pool is being accessed.
⁠On the other hand, psycopg-pool actively maintains the pool and processes tasks using background thread workers.

So one more time, I modified my fastapi and flask implementation to use the same connection pool as Django:

@app.get("/quote/")
async def quote() -> PlainTextResponse:
    async with pool.connection() as conn:
        async with conn.cursor() as cur:

            await cur.execute(
                "SELECT q.id, q.quote_text, a.name FROM quotes_quote q "
                "JOIN quotes_author a ON q.author_id = a.id "
                "ORDER BY RANDOM() LIMIT 1"
            )
            row = await cur.fetchone()
            return PlainTextResponse(f"{row[1]}\n\n--{row[2]}")

And I got the following results:

Server type	workers	threads	RPS	Latency avg	Latency max	Median
FastAPI	1	-	687.26	92.95ms	326.49ms	110.29ms
FastAPI (psycopg_pool)	1	-	526.21	121.49ms	248.73ms	128.85ms
Flask	3	5	682.73	93.66ms	239.67ms	104.34ms
Flask (psycopg_pool)	3	5	532.82	119.95ms	289.91ms	172.75ms
Django	3	5	411.91	155.17ms	299.19ms	169.44ms
Django Async	1	-	406.90	152.79ms	279.93ms	160.32ms

So it's quite clear here that psycopg_pool is the root cause in this particularly scenario.

There is still some differences between django and the other frameworks, but I think that's likely because we're using raw sql queries over django's orm.

Conclusion

If you were hoping for an answer of "just use async" or "don't use async", life is not so simple. The lesson I hope you take is that there are nuances when it comes to this topic.

Time and time again there's been posts on async performances, and we often see misleading benchmarks or analysis.

Let's sum up my own thoughts:

Asyncio's advantage has never been speed in particular, but the cost. We can achieve the same performance with no added threads or processes.
Even then you might consider using django or other sync frameworks if you're more familiar with them or if they provide something you don't get elsewhere, e.g. django's rich middleware plugin system.
Benchmarks cannot be taken at face value, try to reproduce it for your requirements (LLMs are pretty good at this).
Lastly when you get a result from a benchmark that doesn't match expectation, it's prudent that you investigate where the discrepancy comes from.

Future Steps

This whole exercise, as exhausting as it is, barely scratches the itch. There's quite a few unexplored questions:

What is the advantage of psycopg_pool's implementation, are there cases where it performs better?
Are there any configuration or ticks I'm missing to speed up Django?
There's not a lot of analysis on async django as it uses threads to run database queries and middlewares, but I think it's worth looking at which situations where async django wins.
Is there a general rule for thread/process numbers that maximise the performances and what are the trade offs of having more threads.

In the near future I intend to put together a more comprehensive comparison of different concurrency models and explore these questions.

The best I could do right now is to provide the full source code and hopefully encourage people to make their own measurements and share their results. In case there's been any mistakes the feedback is very much welcome there.