DagsterUserCodeUnreachableError after 1500 successful jobs and local python repository
Summary
Hi!
We are running a Dagster trial by having it orchestrate a backfill of our entire dataset into a new database solution. Currently we just run a Dagster daemon with a QueuedRunCoordinator in a container on one node, and Dagit and PostgreSQL on another node.
The node that runs the daemon is very powerful: a bare-metal server with 64 cores and 128 threads. I am scheduling more than 2000 jobs on Dagster using the backfill feature, and each job can fan out into 50+ dynamic steps, since it processes Parquet files with the DynamicOutput feature.
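For reference, the jobs follow the standard DynamicOut fan-out pattern, roughly like the sketch below (op and file names are illustrative, not our actual code):
from dagster import DynamicOut, DynamicOutput, job, op

@op(out=DynamicOut())
def discover_parquet_files():
    # Illustrative: yield one DynamicOutput per parquet file so the
    # downstream op fans out into one step per file.
    for idx, path in enumerate(["part-0.parquet", "part-1.parquet"]):
        yield DynamicOutput(path, mapping_key=f"file_{idx}")

@op
def load_parquet_file(context, path):
    # Illustrative: load a single file into the new database.
    context.log.info(f"loading {path}")

@job
def backfill_job():
    discover_parquet_files().map(load_parquet_file)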
The first issue I ran into was that Postgres could not handle more than 100 connections at a time, while the QueuedRunCoordinator could establish well over 100 connections even with max_concurrent_runs: 2, since that setting only limits the number of concurrently running jobs, not the number of processes created and therefore not the number of connections made to Postgres. I solved this by putting pgbouncer in front of Postgres, configured with pool_mode transaction and allowing up to 5000 client connections.
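For completeness, the pgbouncer piece of the shared .env looks roughly like this (variable names as documented for the bitnami/pgbouncer image; treat this as a sketch, not our exact file):
# .env (pgbouncer excerpt)
POSTGRESQL_HOST=127.0.0.1          # hypothetical; compose runs with network_mode: host
POSTGRESQL_PORT=5432
POSTGRESQL_USERNAME=postgres
POSTGRESQL_PASSWORD=********
PGBOUNCER_POOL_MODE=transaction    # server connections are reused per transaction
PGBOUNCER_MAX_CLIENT_CONN=5000     # absorb the burst of Dagster processes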
My current issue is harder for me to understand. As I said, I schedule a large number of backfill operations and let them run overnight. Each job finishes quickly, usually within 10 seconds. Around 1500 jobs complete without issue; then, as the daemon moves on to the next scheduled backfill with another 1000 jobs, all of those jobs fail almost immediately with the error
This run has been marked as failed from outside the execution context.
and the error message:
dagster.core.errors.DagsterRepositoryLocationLoadError: Failure loading repositories.py: dagster.core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
Stack Trace:
  File "/usr/local/lib/python3.9/site-packages/dagster/daemon/workspace.py", line 107, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.9/site-packages/dagster/daemon/workspace.py", line 129, in _create_location_from_origin
    return GrpcServerRepositoryLocation(
  File "/usr/local/lib/python3.9/site-packages/dagster/core/host_representation/repository_location.py", line 576, in __init__
    self._external_repositories_data = sync_get_streaming_external_repositories_data_grpc(
  File "/usr/local/lib/python3.9/site-packages/dagster/api/snapshot_repository.py", line 21, in sync_get_streaming_external_repositories_data_grpc
    external_repository_chunks = list(
  File "/usr/local/lib/python3.9/site-packages/dagster/grpc/client.py", line 260, in streaming_external_repository
    for res in self._streaming_query(
  File "/usr/local/lib/python3.9/site-packages/dagster/grpc/client.py", line 124, in _streaming_query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
  status = StatusCode.UNKNOWN
  details = "Exception iterating responses: 'load_avg_local_partition_set'"
  debug_error_string = "{"created":"@1652739579.246271510","description":"Error received from peer unix:/tmp/tmpy86ka9fu","file":"src/core/lib/surface/call.cc","file_line":952,"grpc_message":"Exception iterating responses: 'load_avg_local_partition_set'","grpc_status":2}"
>
Stack Trace:
  File "/usr/local/lib/python3.9/site-packages/dagster/grpc/client.py", line 122, in _streaming_query
    yield from response_stream
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 826, in _next
    raise self
  File "/usr/local/lib/python3.9/site-packages/dagster/core/instance/__init__.py", line 1732, in launch_run
    self._run_launcher.launch_run(LaunchRunContext(pipeline_run=run, workspace=workspace))
  File "/usr/local/lib/python3.9/site-packages/dagster/core/launcher/default_run_launcher.py", line 107, in launch_run
    repository_location = context.workspace.get_location(
  File "/usr/local/lib/python3.9/site-packages/dagster/daemon/workspace.py", line 58, in get_location
    raise DagsterRepositoryLocationLoadError(
I do not understand how I could get this error when the repository loaded just fine for the previous 1500 jobs.
Additional Info about Your Environment
Dependencies
[tool.poetry.dependencies]
python = "3.9"
dagit = "0.14.15"
dagster-postgres = "0.14.15"
dagster = "0.14.15"
Dagit + Postgres Node
128 Cores, 256 GB Memory, NVME RAID 0 Storage
docker-compose.yaml
services:
  dagit:
    image: docker.company.com/department/dagster:latest
    network_mode: host
    restart: unless-stopped
    env_file:
      - .env
    depends_on:
      - "pgbouncer"
  postgres:
    image: postgres:13.3
    restart: unless-stopped
    env_file:
      - .env
    volumes:
      - ./.data:/var/lib/postgresql/data
  pgbouncer:
    image: bitnami/pgbouncer:1.17.0
    restart: unless-stopped
    ports:
      - 5432:6432
    env_file:
      - .env
    depends_on:
      - "postgres"
Daemon Node
128 Cores, 256 GB Memory, NVME RAID 0 Storage
docker-compose.yaml
services:
  daemon:
    image: docker.company.com/department/dagster:latest
    network_mode: host
    restart: unless-stopped
    env_file:
      - .env
    command: "dagster-daemon run"
Shared config
dagster.yaml
run_storage:
  module: dagster_postgres.run_storage
  class: PostgresRunStorage
  config:
    postgres_db:
      username: postgres
      password:
        env: POSTGRES_PASSWORD
      hostname:
        env: POSTGRES_HOST
      db_name: postgres
      port: 5432
event_log_storage:
  module: dagster_postgres.event_log
  class: PostgresEventLogStorage
  config:
    postgres_db:
      username: postgres
      password:
        env: POSTGRES_PASSWORD
      hostname:
        env: POSTGRES_HOST
      db_name: postgres
      port: 5432
schedule_storage:
  module: dagster_postgres.schedule_storage
  class: PostgresScheduleStorage
  config:
    postgres_db:
      username: postgres
      password:
        env: POSTGRES_PASSWORD
      hostname:
        env: POSTGRES_HOST
      db_name: postgres
      port: 5432
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 2
workspace.yaml
load_from:
  - python_file:
      relative_path: "dagster_pipelines/repositories.py"
      working_directory: .
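repositories.py itself is a plain @repository definition along these lines (simplified sketch; the real job and import names are omitted here):
from dagster import repository

# Hypothetical import; the real module defines all of our backfill jobs.
from dagster_pipelines.jobs import backfill_job

@repository
def dagster_pipelines():
    return [backfill_job]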
Message from the maintainers:
Impacted by this bug? Give it a 👍. We factor engagement into prioritization.
Top GitHub Comments
Thanks! The attached PR should give better error handling with a stack trace when this happens, but I’m still having trouble figuring out how it could happen 😃 A look at the code could be really helpful.
Thanks Anton - I think that new stack trace gives us what we need to sort this out, will report back with what we find.