A stateless and fast-initializing dbt RPC server
Describe the feature
Hey folks!
We’re experimenting with dbt RPC on Cloud Run (Google’s serverless Docker-based container service) at our company.
However, the dbt RPC implementation has a couple of limitations that prevent it from being deployed there.
- When turning on the RPC server with `dbt rpc`, the server performs an initial compilation step. This step can be sluggish for large projects. While that is usually not an issue when dbt RPC runs on a VM (EC2, GCE, etc.), it becomes a problem on Cloud Run, because the abstraction Cloud Run provides involves spinning up new containers (and therefore new dbt RPC servers) when load spikes.
- The asynchronous nature of many of the dbt RPC server tasks/methods does not fit the Cloud Run stateless model. Because Cloud Run operates at a container-level abstraction, it cannot guarantee that polling requests will reach the same container that kicked off the job (a sketch of the polling flow follows this list). The assumption with Cloud Run is that request/response happens in a single transaction; usually folks use Cloud Run for serving RESTful APIs.
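To illustrate the second point: today a `compile_sql` request returns a `request_token`, and the result is fetched later via the `poll` method. A poll request looks roughly like the sketch below (the id and token values here are made up), and on Cloud Run it only succeeds if it happens to be routed back to the container that is holding that task's state:

```json
{
  "jsonrpc": "2.0",
  "method": "poll",
  "id": "0b8c6a51-1111-2222-3333-444444444444",
  "params": {
    "request_token": "f86926fa-5555-6666-7777-888888888888",
    "logs": false
  }
}
```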
Can I suggest some enhancements that would make a dbt RPC server deployable on this platform, and more widely deployable to other services that naturally breathe with incoming load by spinning up container clones?
- It would be great if we could compile our project upfront, once, for the RPC server: perhaps a `dbt compile` at Docker build time and a `dbt rpc --cached` flag that bootstraps the server from disk instead of compiling at `dbt rpc` runtime (see the Dockerfile sketch after the payload examples below). At least for our application, the models/macros do not change once a release is made, so a one-time project compile is actually safe.
- An additional field on the asynchronous tasks/methods that lets the user request synchronous operation. For our specific use case we only need the `compile_sql` task/method, but the idea could be extended to the other tasks/methods. Perhaps an additional field in the `params` object could indicate that you want a synchronous response.
Currently, the payloads look like:
```json
{
  "jsonrpc": "2.0",
  "method": "compile_sql",
  "id": "2db9a2fe-9a39-41ef-828c-25e04dd6b07d",
  "params": {
    "timeout": 60,
    "sql": "c2VsZWN0IHt7IDEgKyAxIH19IGFzIGlk",
    "name": "my_first_query"
  }
}
```
Could be:
```json
{
  "jsonrpc": "2.0",
  "method": "compile_sql",
  "id": "2db9a2fe-9a39-41ef-828c-25e04dd6b07d",
  "params": {
    "timeout": 60,
    "sql": "c2VsZWN0IHt7IDEgKyAxIH19IGFzIGlk",
    "name": "my_first_query",
    "synchronous": true
  }
}
```
At least for the `compile_sql` task, even very complex models/macros usually return in under 1 second for us, so asynchronous operation (polling) is usually overkill.
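To make the first suggestion concrete, here is a rough Dockerfile sketch of what we have in mind. The `--cached` flag does not exist today (it is the proposed behaviour), and the base image, package name, and paths are just illustrative assumptions:

```dockerfile
FROM python:3.8-slim

# Illustrative install; pin whatever dbt version the project actually uses.
RUN pip install dbt

WORKDIR /app
COPY . /app

# One-time compilation baked into the image at build time.
RUN dbt compile --profiles-dir /app/profiles

# Proposed flag: bootstrap from the compiled artifacts on disk
# instead of re-compiling the whole project at container start-up.
CMD ["dbt", "rpc", "--cached", "--host", "0.0.0.0", "--port", "8580"]
```

With something like this, a container that Cloud Run spins up under load could start answering `compile_sql` requests almost immediately.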
Describe alternatives you’ve considered
We currently run dbt RPC on Google Compute Engine, but it's more management overhead than we'd like.
Who will this benefit?
Data Engineers looking to deploy dbt RPC in a serverless Docker-based environment.
Are you interested in contributing this feature?
Personally, my python-fu is pretty weak, but we'd be super interested in testing and providing feedback.
Issue Analytics
- Created 3 years ago
- Reactions: 3
- Comments: 8 (3 by maintainers)
I’m going to close this issue. I will say that this topic (a fast-initializing, reliable, and “stateless” server) is something we’ve been thinking and talking about a lot lately, as we plan for the next-generation dbt Server.
@hugohjerten, sorry for the delay.
Yes and no. We were unsuccessful in getting dbt RPC running cleanly in Cloud Run.
We took a different approach instead. Rather than call the dbt RPC `compile_sql` method at runtime, we start the dbt RPC server while building our Docker image and pre-compile all the macro argument combinations, saving the results to files inside the image.
We then look up those pre-compiled templates at runtime by a conventional filename, built from the macro name and all of the key/value pairs that produced it.
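A rough sketch of that lookup, with all names and paths made up for illustration (not our actual code):

```python
from pathlib import Path

# Hypothetical location of the templates pre-compiled at image build time.
COMPILED_DIR = Path("/app/compiled_templates")

def template_filename(macro_name: str, args: dict) -> str:
    """Build the conventional filename: macro name plus each key/value pair."""
    parts = [macro_name] + [f"{key}={value}" for key, value in sorted(args.items())]
    return "__".join(parts) + ".sql"

def lookup_compiled_sql(macro_name: str, args: dict) -> str:
    """Read the pre-compiled SQL for this macro/argument combination."""
    return (COMPILED_DIR / template_filename(macro_name, args)).read_text()

# e.g. lookup_compiled_sql("revenue_by_day", {"country": "SE", "currency": "EUR"})
# reads /app/compiled_templates/revenue_by_day__country=SE__currency=EUR.sql
```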
It works for us!