DagsterUserCodeUnreachableError after 1500 successful jobs and local python repository
Summary
Hi!
We are running a Dagster trial by having it orchestrate a backfill of our entire dataset into a new database solution. Currently we just run a Dagster daemon with a QueuedRunCoordinator in a container on one node, and Dagit and PostgreSQL on another node.
The node that runs the daemon is very powerful: a bare-metal server with 64 cores and 128 threads. I am scheduling more than 2000 jobs on Dagster using the backfill feature, and each job can fan out into 50+ dynamic steps, since it processes Parquet files with the DynamicOutput feature.
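For reference, the jobs follow the standard DynamicOut fan-out pattern, roughly like the sketch below (op and file names are illustrative, not our actual code):
from dagster import DynamicOut, DynamicOutput, job, op

@op(out=DynamicOut())
def discover_parquet_files():
    # Illustrative: yield one DynamicOutput per parquet file so the
    # downstream op fans out into one step per file.
    for idx, path in enumerate(["part-0.parquet", "part-1.parquet"]):
        yield DynamicOutput(path, mapping_key=f"file_{idx}")

@op
def load_parquet_file(context, path):
    # Illustrative: load a single file into the new database.
    context.log.info(f"loading {path}")

@job
def backfill_job():
    discover_parquet_files().map(load_parquet_file)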
The first issue I ran into was that Postgres could not handle more than 100 connections at a time, while the QueuedRunCoordinator could establish well over 100 connections even with max_concurrent_runs: 2, since that setting only limits the number of concurrently running jobs, not the number of processes created and therefore not the number of connections made to Postgres. I solved this by putting pgbouncer in front of Postgres, configured with pool_mode transaction and allowing up to 5000 client connections.
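For completeness, the pgbouncer piece of the shared .env looks roughly like this (variable names as documented for the bitnami/pgbouncer image; treat this as a sketch, not our exact file):
# .env (pgbouncer excerpt)
POSTGRESQL_HOST=127.0.0.1          # hypothetical; compose runs with network_mode: host
POSTGRESQL_PORT=5432
POSTGRESQL_USERNAME=postgres
POSTGRESQL_PASSWORD=********
PGBOUNCER_POOL_MODE=transaction    # server connections are reused per transaction
PGBOUNCER_MAX_CLIENT_CONN=5000     # absorb the burst of Dagster processes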
My current issue is harder for me to understand. As I said, I schedule a large number of backfill operations and let them run overnight. Each job finishes quickly, usually within 10 seconds. Around 1500 jobs complete without issue; then, as the daemon moves on to the next scheduled backfill with another 1000 jobs, all of those jobs fail almost immediately with the error
This run has been marked as failed from outside the execution context.
and the error message:
dagster.core.errors.DagsterRepositoryLocationLoadError: Failure loading repositories.py: dagster.core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
Stack Trace:
  File "/usr/local/lib/python3.9/site-packages/dagster/daemon/workspace.py", line 107, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.9/site-packages/dagster/daemon/workspace.py", line 129, in _create_location_from_origin
    return GrpcServerRepositoryLocation(
  File "/usr/local/lib/python3.9/site-packages/dagster/core/host_representation/repository_location.py", line 576, in __init__
    self._external_repositories_data = sync_get_streaming_external_repositories_data_grpc(
  File "/usr/local/lib/python3.9/site-packages/dagster/api/snapshot_repository.py", line 21, in sync_get_streaming_external_repositories_data_grpc
    external_repository_chunks = list(
  File "/usr/local/lib/python3.9/site-packages/dagster/grpc/client.py", line 260, in streaming_external_repository
    for res in self._streaming_query(
  File "/usr/local/lib/python3.9/site-packages/dagster/grpc/client.py", line 124, in _streaming_query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
  status = StatusCode.UNKNOWN
  details = "Exception iterating responses: 'load_avg_local_partition_set'"
  debug_error_string = "{"created":"@1652739579.246271510","description":"Error received from peer unix:/tmp/tmpy86ka9fu","file":"src/core/lib/surface/call.cc","file_line":952,"grpc_message":"Exception iterating responses: 'load_avg_local_partition_set'","grpc_status":2}"
>
Stack Trace:
  File "/usr/local/lib/python3.9/site-packages/dagster/grpc/client.py", line 122, in _streaming_query
    yield from response_stream
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 826, in _next
    raise self
  File "/usr/local/lib/python3.9/site-packages/dagster/core/instance/__init__.py", line 1732, in launch_run
    self._run_launcher.launch_run(LaunchRunContext(pipeline_run=run, workspace=workspace))
  File "/usr/local/lib/python3.9/site-packages/dagster/core/launcher/default_run_launcher.py", line 107, in launch_run
    repository_location = context.workspace.get_location(
  File "/usr/local/lib/python3.9/site-packages/dagster/daemon/workspace.py", line 58, in get_location
    raise DagsterRepositoryLocationLoadError(
I do not understand how I could get this error when the repository loaded just fine for the previous 1500 jobs.
Additional Info about Your Environment
Dependencies
[tool.poetry.dependencies]
python = "3.9"
dagit = "0.14.15"
dagster-postgres = "0.14.15"
dagster = "0.14.15"
Dagit + Postgres Node
128 Cores, 256 GB Memory, NVME RAID 0 Storage
docker-compose.yaml
services:
  dagit:
    image: docker.company.com/department/dagster:latest
    network_mode: host
    restart: unless-stopped
    env_file:
      - .env
    depends_on:
      - "pgbouncer"
  postgres:
    image: postgres:13.3
    restart: unless-stopped
    env_file:
      - .env
    volumes:
      - ./.data:/var/lib/postgresql/data
  pgbouncer:
    image: bitnami/pgbouncer:1.17.0
    restart: unless-stopped
    ports:
      - 5432:6432
    env_file:
      - .env
    depends_on:
      - "postgres"
Daemon Node
128 Cores, 256 GB Memory, NVME RAID 0 Storage
docker-compose.yaml
services:
  daemon:
    image: docker.company.com/department/dagster:latest
    network_mode: host
    restart: unless-stopped
    env_file:
      - .env
    command: "dagster-daemon run"
Shared config
dagster.yaml
run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      username: postgres
      password:
        env: POSTGRES_PASSWORD
      hostname:
        env: POSTGRES_HOST
      db_name: postgres
      port: 5432
event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      username: postgres
      password:
        env: POSTGRES_PASSWORD
      hostname:
        env: POSTGRES_HOST
      db_name: postgres
      port: 5432
schedule_storage:
  module: dagster_postgres.schedule_storage
  class: PostgresScheduleStorage
  config:
    postgres_db:
      username: postgres
      password:
        env: POSTGRES_PASSWORD
      hostname:
        env: POSTGRES_HOST
      db_name: postgres
      port: 5432
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 2
workspace.yaml
load_from:
  - python_file:
      relative_path: "dagster_pipelines/repositories.py"
      working_directory: .
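repositories.py itself is a plain @repository definition along these lines (simplified sketch; the real job and import names are omitted here):
from dagster import repository

# Hypothetical import; the real module defines all of our backfill jobs.
from dagster_pipelines.jobs import backfill_job

@repository
def dagster_pipelines():
    return [backfill_job]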
Message from the maintainers:
Impacted by this bug? Give it a 👍. We factor engagement into prioritization.
Top GitHub Comments
Thanks! The attached PR should give better error handling with a stack trace when this happens, but I’m still having trouble figuring out how it could happen 😃 A look at the code could be really helpful.
Thanks Anton - I think that new stack trace gives us what we need to sort this out, will report back with what we find.