Research: demonstrate if parallel SQL queries are worthwhile
I added parallel SQL query execution here:
My hunch is that this will take advantage of multiple cores, since Python’s sqlite3
module releases the GIL once a query is passed to SQLite.
I’d really like to prove this is the case though. Just not sure how to do it!
Larger question: is this performance optimization actually improving performance at all? Under what circumstances is it worthwhile?
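One way to probe the question is to time the same CPU-heavy query run serially and then from several threads. The following is a minimal sketch, not Datasette's own code: the recursive-CTE query, the helper names `run_query`, `time_serial` and `time_threaded`, and the per-thread in-memory connections are all assumptions made for illustration. If SQLite work really does run outside the GIL, the threaded wall-clock time should come out noticeably lower on a multi-core machine.

```python
import sqlite3
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical CPU-heavy query: a recursive CTE that counts to a limit
# entirely inside SQLite, so almost no time is spent in Python code.
QUERY = """
WITH RECURSIVE counter(n) AS (
    SELECT 1 UNION ALL SELECT n + 1 FROM counter WHERE n < ?
)
SELECT count(*) FROM counter
"""


def run_query(limit):
    # Each call opens its own in-memory connection so threads share nothing.
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(QUERY, (limit,)).fetchone()[0]
    finally:
        conn.close()


def time_serial(limit, repeats):
    # Run the query `repeats` times, one after another.
    start = time.perf_counter()
    results = [run_query(limit) for _ in range(repeats)]
    return time.perf_counter() - start, results


def time_threaded(limit, repeats):
    # Run the same `repeats` queries concurrently, one thread per query.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=repeats) as pool:
        results = list(pool.map(run_query, [limit] * repeats))
    return time.perf_counter() - start, results


if __name__ == "__main__":
    serial_s, _ = time_serial(500_000, 4)
    threaded_s, _ = time_threaded(500_000, 4)
    print(f"serial: {serial_s:.2f}s  threaded: {threaded_s:.2f}s")
```

This only measures time spent inside SQLite. A query that returns many rows also spends time building Python objects, which does need the GIL, so a realistic test should include result-heavy queries too.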
Issue Analytics
- State:
- Created a year ago
- Comments:32 (30 by maintainers)
Top GitHub Comments
OK, I just got the most incredible result with that!
I started up a container running bash like this, from my datasette checkout. I’m mapping port 8005 on my laptop to port 8001 inside the container because laptop port 8001 was already doing something else:
Then in bash I ran the following commands to install Datasette and its dependencies:
Then I started Datasette against my github.db database (from github-to-sqlite.dogsheep.net/github.db) like this:
I hit the following two URLs to compare the parallel vs. non-parallel implementations:
http://127.0.0.1:8005/github/issues?_facet=milestone&_facet=repo&_trace=1&_size=10
http://127.0.0.1:8005/github/issues?_facet=milestone&_facet=repo&_trace=1&_size=10&_noparallel=1
And… the parallel one beat the non-parallel one decisively, on multiple page refreshes!
Not parallel: 77ms
Parallel: 47ms
So yeah, I’m very confident this is a problem with the GIL. And I am absolutely stunned that @colesbury’s fork ran Datasette (which has some reasonably tricky threading and async stuff going on) out of the box!
from your analysis, it seems like the GIL is blocking on loading of the data from sqlite to python (particularly in the fetchmany call).
this is probably a simplistic idea, but what if you had the python code in the execute method iterate over the cursor and yield out rows or small chunks of rows?
something like:
this kind of thing works well with a postgres server side cursor, but i’m not sure if it will hold for sqlite.
you would still spend about the same amount of time in python and would be contending for the gil, but it could be non-blocking.
depending on the data flow, this could also have some benefit for memory. (data stays in more compact sqlite-land until you need it)
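The chunked-iteration idea in this comment can be sketched roughly as follows. This is not the commenter's original snippet and not Datasette's execute method: the function name `iter_rows` and the `chunk_size` default are hypothetical, and only the standard-library `sqlite3` cursor methods are used.

```python
import sqlite3


def iter_rows(conn, sql, params=(), chunk_size=100):
    # Hypothetical helper: pull rows from SQLite in small chunks and yield
    # them one at a time, instead of materialising the whole result at once.
    cursor = conn.execute(sql, params)
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        yield from chunk


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(250)])
    print(sum(1 for _ in iter_rows(conn, "SELECT id FROM t", chunk_size=100)))
    conn.close()
```

Each fetchmany call still builds Python objects under the GIL, so the total work is roughly unchanged. What changes is that other threads get a chance to run between chunks, and unread rows are not yet held as Python objects.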