Parallel compile_sql requests to dbt RPC server cause tasks to never exit the running state.

Describe the bug

Parallel compile_sql requests (at just moderate volumes) to dbt RPC server cause tasks to never exit the running state. We also notice several child dbt rpc processes getting created, which may or may not be symptomatic of the perpetual running state.

Steps To Reproduce

Use the following Dockerfile. It extends the official dbt image to run as the root user, installs the ps Linux utility, and bootstraps a project w/ dbt init:

FROM fishtownanalytics/dbt:0.16.1
USER root
# We need ps to see the spawned dbt rpc processes.
RUN apt-get update && apt-get install -y procps
WORKDIR /
RUN dbt init my_project
WORKDIR /my_project

On the host, cd into the directory that contains this Dockerfile and run:

# build the image:
docker build --no-cache -t dbt_issue_2484 .
# run the dbt rpc server, exposing port 8580 to the host
docker run --publish 8580:8580 -it dbt_issue_2484 dbt rpc

With the RPC server running, open another terminal, find the running Docker container, and open a bash shell inside it so we can watch the process list with ps:

[sudo] docker ps
<get container id>
[sudo] docker exec -it <container id> bash 

Now, inside the container, watch the dbt rpc processes. For now, there will be only one:

watch 'ps aux | grep rpc'

Now, back on the host machine, let’s hit the RPC server w/ multiple, serial compile_sql requests to show it works.

Create a file called my_first_query.json w/ the following contents (straight out of the official docs). Note the TASK_ID placeholder, which we will replace w/ a random number for each request, using sed, in a moment.

{
    "jsonrpc": "2.0",
    "method": "compile_sql",
    "id": "TASK_ID",
    "params": {
        "timeout": 60,
        "sql": "c2VsZWN0IHt7IDEgKyAxIH19IGFzIGlk",
        "name": "my_first_query"
    }
}
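
For readers who want to generate this payload rather than hand-edit it, here is a minimal Python sketch. The sql field above is just the base64 encoding of the query text. The helper name build_compile_request is hypothetical (it is not part of dbt); the field names are copied from the JSON above.

```python
import base64
import json
import random


def build_compile_request(sql: str, name: str, timeout: int = 60) -> dict:
    """Build a compile_sql JSON-RPC payload (hypothetical helper, not part of dbt)."""
    return {
        "jsonrpc": "2.0",
        "method": "compile_sql",
        # Any request id works; a random integer in bash's $RANDOM range is used here.
        "id": str(random.randint(0, 32767)),
        "params": {
            "timeout": timeout,
            # The query text is sent base64-encoded.
            "sql": base64.b64encode(sql.encode("utf-8")).decode("ascii"),
            "name": name,
        },
    }


payload = build_compile_request("select {{ 1 + 1 }} as id", "my_first_query")
print(json.dumps(payload, indent=4))
```

The encoded sql value produced for this query matches the string in my_first_query.json.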

Then create a poll.json:

{
    "jsonrpc": "2.0",
    "method": "poll",
    "id": "TASK_ID",
    "params": {
        "request_token": "TOKEN",
        "logs": false,
        "logs_start": 0
    }
}

And two shell scripts, one to compile_sql and the other to poll for results:

compile_sql.sh:

#!/bin/bash

id=$RANDOM
# get the contents of my_first_query and change TASK_ID to a random int.
query_json=$(cat my_first_query.json | sed --expression "s/TASK_ID/$id/g")
# we only really need the token back.
curl -s -X POST -H 'Content-Type: application/json' -d "$query_json" http://localhost:8580/jsonrpc | jq -r .result.request_token

poll.sh (which takes a token as its only arg):

#!/bin/bash

token=$1
id=$RANDOM

poll_json=$(cat poll.json | sed --expression "s/TOKEN/$token/g" | sed --expression "s/TASK_ID/$id/g")
# truncates response to just state and elapsed time.
curl -s -X POST -H 'Content-Type: application/json' -d "$poll_json" http://localhost:8580/jsonrpc | jq '.result | .state, .elapsed'

Now, chmod +x both scripts and try them:

> ./compile_sql.sh
1c5368d0-d610-4efd-99e7-dda256d3ede0
> ./poll.sh 1c5368d0-d610-4efd-99e7-dda256d3ede0
"success"
0.113002

Great! Also note that the process list still contains a single dbt rpc.

Now, let’s go parallel. Make a parallel.sh script w/ the following contents. This will invoke the compile_sql.sh script 20 times with max processes of 8.

#!/bin/bash

xargs -I % -P 8 ./compile_sql.sh \
< <(printf '%s\n' {1..20})

Then run it:

./parallel.sh
a28e2ceb-a6c5-41d7-bbc6-b03eed263f1e
da6339a3-37ed-4e66-9fd4-d51dd247812a
450fd50d-5a4f-4649-b7d7-9ac2dda89f67
1cc385d0-7a27-4193-bf11-126b3f9b0490
93039ba2-d2fb-4b75-a84a-d7e1bd27b9ab
02fa72e4-c75e-42fe-ac60-36d184399031
920c0796-eac3-4106-b1ed-e62f2d29e498
e45bed44-3908-45b9-9a03-bc55bb19fcdb
788bb3f9-5e6c-4001-8990-cd74450117bf
6e41947d-0b23-4f01-9edb-8b823e86644a
b699d934-07ba-4378-9b4e-61d6805eb629
e37b1a70-a63b-4142-a8eb-434e6bb8d947
e7a9101c-5625-48d3-a313-4438bd77bee0
5b1d98de-7d71-40c7-b0ff-4ce9296db977
...
...

Now, look at your ps output. It will likely contain child dbt rpc processes. This could be expected behaviour, we’re not sure, but we did find that once this happens, that’s when the polls go into a perpetual running state. If you do not see multiple dbt rpc processes, run parallel.sh again until you do.

Finally, now for the star of the show… try polling again. Try some of the tokens printed to your CLI:

./poll.sh 5b1d98de-7d71-40c7-b0ff-4ce9296db977
"running"
89.967127

Note that this is my_first_query, i.e. a simple select {{ 1 + 1 }} as id. It has been running for 89 seconds now, and it will never terminate or provide results.

In addition, the dbt rpc server is now toast. 🍞 Any subsequent compile_sql or poll requests will never return results.
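
To check every token from the parallel run instead of polling by hand, something like the following Python sketch could be used. It assumes the server is reachable at http://localhost:8580/jsonrpc, as in the shell scripts. The names post_json, build_poll_request and find_stuck_tokens are hypothetical, and the 60-second threshold simply mirrors the timeout in my_first_query.json. The transport is passed in as a parameter so the logic can be tried without a live server.

```python
import json
import random
import urllib.request

RPC_URL = "http://localhost:8580/jsonrpc"  # same endpoint the shell scripts use


def post_json(payload: dict, url: str = RPC_URL) -> dict:
    """POST a JSON-RPC payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def build_poll_request(token: str) -> dict:
    """Mirror poll.json with the token and a random request id filled in."""
    return {
        "jsonrpc": "2.0",
        "method": "poll",
        "id": str(random.randint(0, 32767)),
        "params": {"request_token": token, "logs": False, "logs_start": 0},
    }


def find_stuck_tokens(tokens, max_elapsed=60.0, post=post_json):
    """Return the tokens still in the 'running' state after max_elapsed seconds."""
    stuck = []
    for token in tokens:
        result = post(build_poll_request(token)).get("result", {})
        if result.get("state") == "running" and result.get("elapsed", 0.0) > max_elapsed:
            stuck.append(token)
    return stuck
```

Calling find_stuck_tokens with the list of tokens printed by parallel.sh would return the ones that never left the running state. If the server stops answering altogether, urlopen raises an error once its 10-second timeout expires.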


Sorry, that is a bit long-winded. It’s a specific workflow that needs to be followed to produce this behaviour.

Expected behavior

At any volume, dbt rpc should eventually complete compile_sql requests.

System information

Which database are you using dbt with?

  • redshift

Technically Redshift, but only because dbt init sets up a Redshift profiles.yml by default. We are only calling the compile_sql RPC method, so there are no real dbt runs happening on any warehouse.

The output of dbt --version:

docker run dbt_issue_2484 dbt --version
installed version: 0.16.1
   latest version: 0.16.1

Up to date!

The operating system you’re using:

The one that underlies the official dbt Docker image:

docker run dbt_issue_2484 lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 10 (buster)
Release:	10
Codename:	buster

The output of python --version:

docker run dbt_issue_2484 python --version
Python 3.8.1

Additional context

This is actually affecting a (small, but important) part of a production application for us. We’re happy to hot-patch files rather than wait until the next release, if possible. So, if you have any easy fixes, please let us know.

Possibly related to https://github.com/fishtown-analytics/dbt/issues/1848

Happy Friday 🎉

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 13 (7 by maintainers)

Top GitHub Comments

3 reactions
beckjake commented, Jun 16, 2020

I’ve spent some time on this and I figured I’d get it up here for you all to chew on! There are really two issues here:

One, there is definitely a deadlock going on here on process fork. This has to do with a number of apparently well-known issues in Python where forking a new process in a multithreaded environment is bad. I thought this was only true on macOS, but it’s actually a problem on all OSes that support fork(); macOS is just more obvious. Basically, if Python does a fork()-without-exec() in one thread while another thread is holding an important internal lock, the forked process will copy the memory, but not the other thread. Crucially, the lock will still be held in the child (because it lives in process memory, which is copied), but the thread that would unlock it exists only in the parent. There are assorted places this happens in Python, and recent releases of Python 3.x appear to have been a long game of whack-a-mole on the relevant bugs, culminating in a documentation note that threads forking processes is basically a terrible idea. Noted! This is a fundamental design flaw in the RPC server.
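
The hazard can be reproduced outside dbt in a few lines. The following is a minimal sketch, assuming a POSIX system where os.fork() is available. It is not the RPC server's code, and the function names are made up for illustration. A helper thread holds an ordinary lock while the main thread forks. The child inherits the lock in its locked state but not the thread that holds it, so the child can never acquire it.

```python
import os
import threading

lock = threading.Lock()


def hold_lock(acquired: threading.Event, release: threading.Event) -> None:
    """Take the lock and keep holding it until told to let go."""
    with lock:
        acquired.set()
        release.wait()


def child_can_acquire_after_fork(timeout: float = 1.0) -> bool:
    """Fork while another thread holds the lock; report whether the child got it."""
    acquired = threading.Event()
    release = threading.Event()
    holder = threading.Thread(target=hold_lock, args=(acquired, release))
    holder.start()
    acquired.wait()  # the lock is now definitely held by the helper thread

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here. The lock was copied in
        # its locked state and nothing in this process will ever release it.
        got_lock = False
        try:
            got_lock = lock.acquire(timeout=timeout)
        finally:
            os._exit(0 if got_lock else 1)

    # Parent: collect the child's verdict, then let the helper thread finish.
    _, status = os.waitpid(pid, 0)
    release.set()
    holder.join()
    return os.WEXITSTATUS(status) == 0
```

Calling child_can_acquire_after_fork() returns False: the child gives up only because the acquire has a timeout. Without the timeout it would block forever, which is the same shape as a task that never leaves the running state. Python 3.12 and later also emit a DeprecationWarning when fork() is called from a multithreaded process.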

We’re going to talk in the coming days about how exactly to mitigate this and come up with a timeline. As fork()'s copy-on-write semantics were really desirable (it’s much faster for large manifests!), I’m thinking of a custom fork-server-like model. The idea is that, before spinning up the webserver (and therefore threads!), we’d fork a process early whose only job is to fork new tasks for requests. We’d make that early-forked process control the manifest, so when it called fork() its children would receive it.

Two, the results are mismatching the inputs even when using spawn (and forkserver). I’ve only started on tracking this down, but here’s what I do know: Internally the server wraps every request/task/result in a RequestTaskHandler object, and it knows the arguments, and I’ve determined that it definitely does have the correct arguments for each! However, the results for each task are unique but reflect the “wrong argument” that it got, so the issue must happen somewhere between handling the http request and performing the task dispatch.

Update: And it does! The issue is that there’s a race where set_args is writing to the same object for each task, because it hasn’t yet kicked off the process. I’ll open a PR to fix this for 0.17.1, at least.
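
To illustrate the kind of race described in the update, here is a deliberately simplified Python sketch. It is not dbt's code: Args, SharedArgsHandler and CopiedArgsHandler are hypothetical names, and the race is made deterministic by creating all handlers first and running them afterwards, which stands in for the window between accepting a request and kicking off its process.

```python
from dataclasses import dataclass, replace


@dataclass
class Args:
    sql: str = ""


class SharedArgsHandler:
    """Buggy pattern: every handler writes into one shared Args object."""

    shared_args = Args()

    def __init__(self, sql: str) -> None:
        self.args = SharedArgsHandler.shared_args
        self.args.sql = sql  # overwrites what earlier handlers stored

    def run(self) -> str:
        return f"compiled: {self.args.sql}"


class CopiedArgsHandler:
    """Safer pattern: each handler takes its own copy before any work starts."""

    base_args = Args()

    def __init__(self, sql: str) -> None:
        self.args = replace(CopiedArgsHandler.base_args, sql=sql)

    def run(self) -> str:
        return f"compiled: {self.args.sql}"


queries = ["select 1", "select 2", "select 3"]

# Create every handler before running any of them.
shared = [SharedArgsHandler(q) for q in queries]
copied = [CopiedArgsHandler(q) for q in queries]

shared_results = [handler.run() for handler in shared]
copied_results = [handler.run() for handler in copied]

print(shared_results)
print(copied_results)
```

With the shared object, all three handlers report the last query written. With per-handler copies, each result matches its own input.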

