Parallel compile_sql requests to dbt RPC server cause tasks to never exit the running state.
Describe the bug
Parallel compile_sql requests (at just moderate volumes) to dbt RPC server cause tasks to never exit the running state. We also notice several child dbt rpc processes getting created, which may or may not be symptomatic of the perpetual running state.
Steps To Reproduce
Use the following Dockerfile, which starts from the official dbt image, switches to the root user, installs the ps Linux utility, and bootstraps a project with dbt init:
FROM fishtownanalytics/dbt:0.16.1
USER root
# We need ps to see the spawned dbt rpc processes.
RUN apt-get update && apt-get install -y procps
WORKDIR /
RUN dbt init my_project
WORKDIR /my_project
On the host, cd into the directory that contains this Dockerfile and run:
# build the image:
docker build --no-cache -t dbt_issue_2484 .
# run the dbt rpc server, exposing port 8580 to the host
docker run --publish 8580:8580 -it dbt_issue_2484 dbt rpc
With the rpc server running, in another terminal, find the running Docker container, and bash-in so we can watch the process list with ps:
[sudo] docker ps
<get container id>
[sudo] docker exec -it <container id> bash
Now, inside the container, “watch” the dbt rpc process list; for now there will be only one process:
watch 'ps aux | grep rpc'
Now, back on the host machine, let’s hit the RPC server w/ multiple, serial compile_sql requests to show it works.
Create a file called my_first_query.json with the following contents (straight out of the official docs). Note the TASK_ID placeholder, which we will fill in with a random number for each request, using sed, in a moment; also note that the sql value is base64-encoded (see the note just after the JSON).
{
"jsonrpc": "2.0",
"method": "compile_sql",
"id": "TASK_ID",
"params": {
"timeout": 60,
"sql": "c2VsZWN0IHt7IDEgKyAxIH19IGFzIGlk",
"name": "my_first_query"
}
}
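As an aside, the sql value above is the base64 encoding of the query text. If you want to substitute your own SQL, you can produce the encoded string yourself, for example in Python:
import base64

# Encode the SQL exactly as the example payload above does.
sql = "select {{ 1 + 1 }} as id"
print(base64.b64encode(sql.encode()).decode())
# prints: c2VsZWN0IHt7IDEgKyAxIH19IGFzIGlk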
Then create a poll.json:
{
"jsonrpc": "2.0",
"method": "poll",
"id": "TASK_ID",
"params": {
"request_token": "TOKEN",
"logs": false,
"logs_start": 0
}
}
And two shell scripts, one to compile_sql and the other to poll for results:
compile_sql.sh:
#!/bin/bash
id=$RANDOM
# get the contents of my_first_query and change TASK_ID to a random int.
query_json=$(cat my_first_query.json | sed --expression "s/TASK_ID/$id/g")
# we only really need the token back.
curl -s -X POST -H 'Content-Type: application/json' -d "$query_json" http://localhost:8580/jsonrpc | jq -r .result.request_token
poll.sh (which takes a token as its only arg):
#!/bin/bash
token=$1
id=$RANDOM
poll_json=$(cat poll.json | sed --expression "s/TOKEN/$token/g" | sed --expression "s/TASK_ID/$id/g")
# truncates response to just state and elapsed time.
curl -s -X POST -H 'Content-Type: application/json' -d "$poll_json" http://localhost:8580/jsonrpc | jq '.result | .state, .elapsed'
Now, make them both executable with chmod +x and try them:
> ./compile_sql.sh
1c5368d0-d610-4efd-99e7-dda256d3ede0
./poll.sh 1c5368d0-d610-4efd-99e7-dda256d3ede0
"success"
0.113002
Great! Also note that the process list still contains a single dbt rpc.
Now, let’s go parallel. Make a parallel.sh script with the following contents. This will invoke the compile_sql.sh script 20 times, with at most 8 processes running in parallel.
#!/bin/bash
xargs -I % -P 8 ./compile_sql.sh \
< <(printf '%s\n' {1..20})
Then run it:
./parallel.sh
a28e2ceb-a6c5-41d7-bbc6-b03eed263f1e
da6339a3-37ed-4e66-9fd4-d51dd247812a
450fd50d-5a4f-4649-b7d7-9ac2dda89f67
1cc385d0-7a27-4193-bf11-126b3f9b0490
93039ba2-d2fb-4b75-a84a-d7e1bd27b9ab
02fa72e4-c75e-42fe-ac60-36d184399031
920c0796-eac3-4106-b1ed-e62f2d29e498
e45bed44-3908-45b9-9a03-bc55bb19fcdb
788bb3f9-5e6c-4001-8990-cd74450117bf
6e41947d-0b23-4f01-9edb-8b823e86644a
b699d934-07ba-4378-9b4e-61d6805eb629
e37b1a70-a63b-4142-a8eb-434e6bb8d947
e7a9101c-5625-48d3-a313-4438bd77bee0
5b1d98de-7d71-40c7-b0ff-4ce9296db977
...
...
Now, notice your ps output. It will likely contain child dbt rpc processes. This could be expected behaviour, we’re not sure; but we did find that once this happens, that’s when the polls go into the perpetual running state. If you do not have multiple dbt rpc processes, run this again until you do.
Finally, now for the star of the show… try polling again. Try some of the tokens printed to your CLI:
./poll.sh 5b1d98de-7d71-40c7-b0ff-4ce9296db977
"running"
89.967127
Note, this is my_first_query, i.e., a simple select {{ 1 + 1 }} as id, which has now been running for 89 seconds and will never terminate or provide results.
In addition, the dbt rpc server is now toast. 🍞 Any subsequent compile_sql or poll requests will never return results.
Sorry, that is a bit long-winded. It’s a specific workflow that needs to be followed to produce this behaviour.
Expected behavior
At any volume, dbt rpc should eventually complete compile_sql requests.
System information
Which database are you using dbt with?
- redshift
Technically Redshift, but only because dbt init sets up a Redshift profiles.yml by default. We are only using the compile_sql dbt RPC method, so there are no real dbt runs happening on any warehouse.
The output of dbt --version:
docker run dbt_issue_2484 dbt --version
installed version: 0.16.1
latest version: 0.16.1
Up to date!
The operating system you’re using:
The one that underlies the official dbt Docker image:
docker run dbt_issue_2484 lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 10 (buster)
Release: 10
Codename: buster
The output of python --version:
docker run dbt_issue_2484 python --version
Python 3.8.1
Additional context
This is actually affecting a (small, but important) part of a production application for us. We’re happy to hot-patch files rather than wait until the next release, if possible. So, if you have any easy fixes, please let us know.
Possibly related to https://github.com/fishtown-analytics/dbt/issues/1848
Happy Friday 🎉

I’ve spent some time on this and I figured I’d get it up here for you all to chew on! There are really two issues here:
One, there is definitely a deadlock going on here on process fork. This has to do with a number of apparently well-known issues in Python where forking a new process from a multithreaded program is bad. I thought this was only true on macOS, but it’s actually a problem on all OSes that support fork() - macOS is just where it’s most obvious. Basically, if Python does a fork()-without-exec() in one thread while another is holding an important internal lock, the forked process copies the memory, but not the threads. Crucially, the lock will still be held (because that’s a process-level memory item), but the thread that would unlock it in the parent doesn’t exist in the child. There are assorted places this happens in Python, and recent releases of Python 3.x appear to have been a long game of whack-a-mole on the relevant bugs, culminating in a documentation note that forking from a threaded process is, basically, a terrible idea. Noted! This is a fundamental design flaw in the RPC server.
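For anyone who wants to see this failure mode in isolation, here is a minimal, self-contained Python sketch (not dbt code) of the hazard described above: a lock held by a background thread is copied into the forked child in the locked state, and nothing in the child will ever release it.
import os
import threading
import time

lock = threading.Lock()

def hold_lock():
    # A background thread takes the lock and holds it for a while.
    with lock:
        time.sleep(2)

threading.Thread(target=hold_lock).start()
time.sleep(0.1)  # give the background thread time to acquire the lock

pid = os.fork()
if pid == 0:
    # Child: the lock's memory was copied in the "held" state, but the thread
    # that would release it only exists in the parent, so this never returns.
    lock.acquire()
    os._exit(0)
else:
    os.waitpid(pid, 0)  # the parent blocks forever waiting on the hung child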
We’re going to talk in the coming days about how exactly to mitigate this and come up with a timeline. As fork()’s copy-on-write semantics are really desirable (it’s much faster for large manifests!), I’m thinking of a custom fork-server-like model. The idea is that, before spinning up the webserver (and therefore threads!), we’d fork a process early whose job is to fork new task processes for requests. We’d make that early-forked process control the manifest, so that when it called fork() its children would receive it.
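For reference, the Python standard library already exposes this general shape via the forkserver start method; the sketch below only illustrates that idea and is not the planned dbt change:
import multiprocessing as mp

def compile_task(sql):
    # Stand-in for the real task body.
    return f"compiled: {sql}"

if __name__ == "__main__":
    # The fork server is started here, before any webserver threads would exist,
    # so later forks never inherit a lock held by some other thread.
    ctx = mp.get_context("forkserver")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(compile_task, ["select 1", "select 2", "select 3"]))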
Two, the results are mismatching the inputs even when using spawn (and forkserver). I’ve only started on tracking this down, but here’s what I do know: internally, the server wraps every request/task/result in a RequestTaskHandler object, which knows the arguments, and I’ve determined that it definitely does have the correct arguments for each task! However, the results for each task are unique but reflect the “wrong” argument, so the issue must happen somewhere between handling the HTTP request and performing the task dispatch.
Update: And it does! The issue is that there’s a race where set_args is writing to the same object for each task, because it hasn’t yet kicked off the process. I’ll open a PR to fix this for 0.17.1, at least.
thx @beckjake & @drewbanin !!
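To make the shape of that race concrete, here is a hypothetical, dbt-free Python sketch; the names and structure are invented for illustration and are not the real set_args / RequestTaskHandler code:
import threading
import time

class SharedArgs:
    sql = None  # one object shared by every request thread

shared = SharedArgs()
seen = []

def handle_request(sql):
    shared.sql = sql      # each request overwrites the shared arguments...
    time.sleep(0.01)      # ...during the window before its task is kicked off
    seen.append((sql, shared.sql))  # what the "task" actually observes

threads = [threading.Thread(target=handle_request, args=(f"query {i}",)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Any mismatched pairs show a task that ran with another request's arguments;
# giving each task its own copy of the args removes the race.
print([pair for pair in seen if pair[0] != pair[1]])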