Research: demonstrate if parallel SQL queries are worthwhile

I added parallel SQL query execution here:

My hunch is that this will take advantage of multiple cores, since Python’s sqlite3 module releases the GIL once a query is passed to SQLite.

I’d really like to prove this is the case though. Just not sure how to do it!

Larger question: is this performance optimization actually improving performance at all? Under what circumstances is it worthwhile?
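One way to probe this hunch outside Datasette is a small standalone timing script. The following is a minimal sketch, not the code used in the issue: the recursive-CTE query, the function names, and the in-memory database are all assumptions chosen so that nearly all the work happens inside SQLite rather than in Python.

```python
import sqlite3
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical CPU-heavy query: SQLite counts to a limit and sums the values,
# so almost no time is spent building Python objects.
QUERY = """
WITH RECURSIVE counter(n) AS (
    SELECT 1 UNION ALL SELECT n + 1 FROM counter WHERE n < ?
)
SELECT sum(n) FROM counter
"""


def run_query(db_path, limit):
    # Open one connection per call: by default a sqlite3 connection
    # may only be used from the thread that created it.
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(QUERY, (limit,)).fetchone()[0]
    finally:
        conn.close()


def time_serial(db_path, limit, n_queries):
    # Run the queries one after another on the calling thread.
    start = time.perf_counter()
    results = [run_query(db_path, limit) for _ in range(n_queries)]
    return results, time.perf_counter() - start


def time_threaded(db_path, limit, n_queries):
    # Run the same queries on a thread pool, one worker per query.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_queries) as pool:
        results = list(pool.map(lambda _: run_query(db_path, limit), range(n_queries)))
    return results, time.perf_counter() - start


serial_results, serial_s = time_serial(":memory:", 200_000, 4)
threaded_results, threaded_s = time_threaded(":memory:", 200_000, 4)
print(f"serial: {serial_s:.3f}s  threaded: {threaded_s:.3f}s")
```

If the threaded time is clearly lower than the serial time on a multi-core machine, that suggests the queries really do overlap. A query that returns many rows would shift more of the work into Python and may behave quite differently, so it is worth trying both shapes.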

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 32 (30 by maintainers)

Top GitHub Comments

2 reactions
simonw commented, Apr 29, 2022

OK, I just got the most incredible result with that!

I started up a container running bash like this, from my datasette checkout. I’m mapping port 8005 on my laptop to port 8001 inside the container because laptop port 8001 was already doing something else:

docker run -it --rm --name my-running-script -p 8005:8001 -v "$PWD":/usr/src/myapp \
  -w /usr/src/myapp nogil/python bash

Then in bash I ran the following commands to install Datasette and its dependencies:

pip install -e '.[test]'
pip install datasette-pretty-traces # For debug tracing

Then I started Datasette against my github.db database (from github-to-sqlite.dogsheep.net/github.db) like this:

datasette github.db -h 0.0.0.0 --setting trace_debug 1

I hit the following two URLs to compare the parallel vs. non-parallel implementations:

  • http://127.0.0.1:8005/github/issues?_facet=milestone&_facet=repo&_trace=1&_size=10
  • http://127.0.0.1:8005/github/issues?_facet=milestone&_facet=repo&_trace=1&_size=10&_noparallel=1
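To repeat this comparison over many refreshes instead of by hand, a small script could build both URLs and time them. This is only a sketch: `build_url` and `median_fetch_ms` are hypothetical helper names, and the wall-clock time it measures includes HTTP and page-rendering overhead, so it will not match the per-query numbers from the trace.

```python
import statistics
import time
import urllib.request
from urllib.parse import urlencode

# Base URL and query parameters taken from the two URLs above.
BASE = "http://127.0.0.1:8005/github/issues"
PARAMS = [("_facet", "milestone"), ("_facet", "repo"), ("_trace", "1"), ("_size", "10")]


def build_url(noparallel=False):
    # Append _noparallel=1 to get the non-parallel variant of the page.
    params = PARAMS + ([("_noparallel", "1")] if noparallel else [])
    return f"{BASE}?{urlencode(params)}"


def median_fetch_ms(url, repeats=10):
    # Fetch the page several times and return the median wall-clock time in ms.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as response:
            response.read()
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)
```

With the Datasette server from the previous step running, calling `median_fetch_ms(build_url())` and `median_fetch_ms(build_url(noparallel=True))` gives one number per variant to compare.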

And… the parallel one beat the non-parallel one decisively, on multiple page refreshes!

Not parallel: 77ms

Parallel: 47ms
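For scale, those two timings work out to roughly a 1.6x speedup, or about 39% less time. The arithmetic, using only the two numbers reported above (the variable names are illustrative):

```python
# Timings reported above, in milliseconds.
not_parallel_ms = 77
parallel_ms = 47

# Speedup factor and fraction of time saved by the parallel version.
speedup = not_parallel_ms / parallel_ms
reduction = (not_parallel_ms - parallel_ms) / not_parallel_ms

print(f"speedup: {speedup:.2f}x, time saved: {reduction:.0%}")
```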

[Screenshots: CleanShot 2022-04-28 at 22 10 54, CleanShot 2022-04-28 at 22 10 21]

So yeah, I’m very confident this is a problem with the GIL. And I am absolutely stunned that @colesbury’s fork ran Datasette (which has some reasonably tricky threading and async stuff going on) out of the box!

0 reactions
fgregg commented, Sep 26, 2022

from your analysis, it seems like the GIL is blocking on loading of the data from sqlite to python (particularly in the fetchmany call)

this is probably a simplistic idea, but what if you had the python code in the execute method iterate over the cursor and yield out rows or small chunks of rows?

something like:

            with sqlite_timelimit(conn, time_limit_ms):
                try:
                    cursor = conn.cursor()
                    cursor.execute(sql, params if params is not None else {})
                except:
                    ...
            max_returned_rows = self.ds.max_returned_rows
            if max_returned_rows == page_size:
                max_returned_rows += 1
            if max_returned_rows and truncate:
                # yield rows one at a time, stopping once the limit is reached
                for i, row in enumerate(cursor):
                    yield row
                    if i == max_returned_rows - 1:
                        break
            else:
                # no limit: stream every row from the cursor
                for row in cursor:
                    yield row
                truncated = False

this kind of thing works well with a postgres server side cursor, but i’m not sure if it will hold for sqlite.

you would still spend about the same amount of time in python and would be contending for the GIL, but it could be non-blocking.

depending on the data flow, this could also have some benefit for memory (data stays in more compact sqlite-land until you need it).
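The idea of yielding small chunks of rows can be tried on its own with the standard sqlite3 module. The following is a minimal sketch under stated assumptions: `iter_rows`, `chunk_size`, and `max_rows` are hypothetical names, not part of Datasette, and whether this actually reduces GIL contention would still need to be measured.

```python
import sqlite3


def iter_rows(conn, sql, params=None, chunk_size=100, max_rows=None):
    """Yield rows lazily, pulling them from SQLite in small chunks."""
    cursor = conn.cursor()
    cursor.execute(sql, params if params is not None else {})
    yielded = 0
    while True:
        # fetchmany returns an empty list once the result set is exhausted.
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        for row in chunk:
            if max_rows is not None and yielded >= max_rows:
                return
            yield row
            yielded += 1


# Small in-memory table to exercise the generator.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(250)])

first_ten = list(iter_rows(conn, "SELECT n FROM t ORDER BY n", chunk_size=4, max_rows=10))
print(first_ten)
```

Because the generator stops fetching as soon as `max_rows` is reached, rows beyond the limit are never converted into Python objects, which is the memory benefit described above.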
