Research: demonstrate if parallel SQL queries are worthwhile

I added parallel SQL query execution here:

https://github.com/simonw/datasette/issues/1723

My hunch is that this will take advantage of multiple cores, since Python’s sqlite3 module releases the GIL once a query is passed to SQLite.

I’d really like to prove this is the case though. Just not sure how to do it!

Larger question: is this performance optimization actually improving performance at all? Under what circumstances is it worthwhile?
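
One rough way to test the hunch outside Datasette is to time the same CPU-heavy query run serially and then across threads. This is a minimal sketch, not Datasette's code: the recursive query, the function names and the thread count are all assumptions chosen for illustration. The query returns a single row so that almost all the time is spent inside SQLite rather than in Python.

```python
import sqlite3
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical CPU-heavy query: it aggregates 300,000 generated rows into one
# value, so very little Python code runs while SQLite is working.
SQL = """
WITH RECURSIVE counter(n) AS (
    SELECT 1 UNION ALL SELECT n + 1 FROM counter WHERE n < 300000
)
SELECT sum(n * n % 7) FROM counter
"""


def run_query(_=None):
    # Open a fresh connection per call so no connection is shared across threads.
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(SQL).fetchone()[0]
    finally:
        conn.close()


def time_serial(n):
    # Run the query n times, one after another.
    start = time.perf_counter()
    results = [run_query() for _ in range(n)]
    return time.perf_counter() - start, results


def time_threaded(n):
    # Run the query n times, each in its own worker thread.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(run_query, range(n)))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    serial_s, _ = time_serial(4)
    threaded_s, _ = time_threaded(4)
    print(f"serial:   {serial_s * 1000:.0f}ms")
    print(f"threaded: {threaded_s * 1000:.0f}ms")
```

If the GIL really is released while SQLite runs the query, the threaded time should come out noticeably below the serial time on a multi-core machine. A query that returns many rows would be a different test, because building Python row objects needs the GIL.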

simonw commented, Apr 29, 2022

OK, I just got the most incredible result with that!

I started up a container running bash like this, from my datasette checkout. I’m mapping port 8005 on my laptop to port 8001 inside the container because laptop port 8001 was already doing something else:

docker run -it --rm --name my-running-script -p 8005:8001 -v "$PWD":/usr/src/myapp \
  -w /usr/src/myapp nogil/python bash

Then in bash I ran the following commands to install Datasette and its dependencies:

pip install -e '.[test]'
pip install datasette-pretty-traces # For debug tracing

Then I started Datasette against my github.db database (from github-to-sqlite.dogsheep.net/github.db) like this:

datasette github.db -h 0.0.0.0 --setting trace_debug 1

I hit the following two URLs to compare the parallel vs. non-parallel implementations:

http://127.0.0.1:8005/github/issues?_facet=milestone&_facet=repo&_trace=1&_size=10
http://127.0.0.1:8005/github/issues?_facet=milestone&_facet=repo&_trace=1&_size=10&_noparallel=1

And… the parallel one beat the non-parallel one decisively, on multiple page refreshes!

Not parallel: 77ms

Parallel: 47ms

So yeah, I’m very confident this is a problem with the GIL. And I am absolutely stunned that @colesbury’s fork ran Datasette (which has some reasonably tricky threading and async stuff going on) out of the box!
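
Refreshing the page by hand works, but the comparison can also be scripted. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and it measures wall-clock time for the whole HTTP request, which is not the same number as the timing shown by `_trace=1`. The two URLs are the ones quoted above.

```python
import statistics
import time
import urllib.request

PARALLEL_URL = (
    "http://127.0.0.1:8005/github/issues"
    "?_facet=milestone&_facet=repo&_trace=1&_size=10"
)
SERIAL_URL = PARALLEL_URL + "&_noparallel=1"


def fetch_url(url):
    # Download the full response body so the server finishes all its work.
    with urllib.request.urlopen(url) as response:
        return response.read()


def median_ms(url, runs=20, fetch=fetch_url):
    # Time each request and return the median in milliseconds,
    # which is less sensitive to a single slow outlier than the mean.
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fetch(url)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


# Example usage, with Datasette running as described above:
#   print("parallel:    ", median_ms(PARALLEL_URL))
#   print("not parallel:", median_ms(SERIAL_URL))
```

The `fetch` parameter exists only so the timing loop can be exercised without a running server.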

fgregg commented, Sep 26, 2022

from your analysis, it seems like the GIL is blocking on loading of the data from sqlite to python (particularly in the fetchmany call).

this is probably a simplistic idea, but what if you had the python code in the execute method iterate over the cursor and yield out rows or small chunks of rows?

something like:

            with sqlite_timelimit(conn, time_limit_ms):
                try:
                    cursor = conn.cursor()
                    cursor.execute(sql, params if params is not None else {})
                except:
                    ...
            max_returned_rows = self.ds.max_returned_rows
            if max_returned_rows == page_size:
                max_returned_rows += 1
            if max_returned_rows and truncate:
                # stream rows one at a time, stopping at the row limit
                for i, row in enumerate(cursor):
                    yield row
                    if i == max_returned_rows - 1:
                        break
            else:
                # no limit: stream every row
                for row in cursor:
                    yield row
                truncated = False

this kind of thing works well with a postgres server side cursor, but i’m not sure if it will hold for sqlite.

you would still spend about the same amount of time in python and would be contending for the gil, but it could be non-blocking.

depending on the data flow, this could also have some benefit for memory. (data stays in more compact sqlite-land until you need it)
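
A self-contained version of the idea, runnable against any sqlite3 connection, might look like the sketch below. It is not Datasette's execute method: the function name `iter_rows` and its `chunk_size` and `max_rows` parameters are hypothetical, and it fetches small chunks with `fetchmany` rather than one row at a time.

```python
import sqlite3


def iter_rows(conn, sql, params=None, chunk_size=100, max_rows=None):
    # Yield rows lazily, pulling chunk_size rows from SQLite at a time
    # instead of materialising the whole result set up front.
    cursor = conn.cursor()
    cursor.execute(sql, params if params is not None else {})
    yielded = 0
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            return
        for row in chunk:
            # Stop early once the optional row limit has been reached.
            if max_rows is not None and yielded >= max_rows:
                return
            yield row
            yielded += 1


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO t (id) VALUES (?)", [(i,) for i in range(250)])
    rows = list(iter_rows(conn, "SELECT id FROM t ORDER BY id", max_rows=120))
    print(len(rows))
```

Whether this reduces GIL contention in practice is exactly the open question. The sketch only shows that the rows can be consumed incrementally.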
