[Bug] [Jobs] Logs from separate jobs are mixed together
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Dashboard
What happened + What you expected to happen
- When running multiple jobs on a remote cluster, logs from the different jobs are mixed together.
- I’d expect each job’s logs to contain only the output of the code it runs plus the logs from the tasks/actors it spins up.
- Logs attached (note: actual IPs have been replaced with IP_ADDRESS)
Versions / Dependencies
Ray 1.9.2
Reproduction script
For running the jobs:
import time

from ray.dashboard.modules.job.sdk import JobSubmissionClient
from ray.dashboard.modules.job.common import JobStatus

address = "YOUR REMOTE RAY CLUSTER HERE"
client = JobSubmissionClient(address)

job_ids = []
for idx in range(3):
    job_id = client.submit_job(
        # Entrypoint shell command to execute
        entrypoint=f"python job.py {idx}",
    )
    job_ids.append(job_id)

def wait_until_finish(job_id):
    start = time.time()
    timeout = 1000
    while time.time() - start <= timeout:
        status_info = client.get_job_status(job_id)
        status = status_info.status
        print(f"status: {status}")
        if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
            break
        time.sleep(1)

for job_id in job_ids:
    wait_until_finish(job_id)
    logs = client.get_job_logs(job_id)
    print(f"Logs for {job_id}:\n\n{logs}")
The job itself (job.py):
import sys
import time

import ray

job_id = sys.argv[1]
print(f"This is job {job_id}")

# Issue happens both when actor has its own pod on k8s (num_cpus=1) and when it
# shares a pod with other actors (num_cpus <= .5)
@ray.remote(num_cpus=1)
class Printer:
    def f(self):
        for i in range(20):
            print(f"This is Printer in job {job_id}")
            time.sleep(1)

handle = Printer.remote()
ray.get(handle.f.remote())
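To make it easier to tell which driver a given log line actually came from, the actor could also tag its output with Ray's internal job ID. This is only a sketch, not part of the reproduction, and it assumes ray.get_runtime_context().job_id is available in the Ray version in use:

# Sketch only: a Printer variant that tags every line with the Ray-internal
# job ID of the worker process, so mixed lines can be attributed to a driver.
@ray.remote(num_cpus=1)
class TaggedPrinter:
    def f(self):
        internal_id = ray.get_runtime_context().job_id  # assumption: available in this Ray version
        for _ in range(20):
            print(f"[ray job {internal_id}] This is Printer in job {job_id}")
            time.sleep(1)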
Anything else
Running on a Ray Cluster deployed on k8s
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Top GitHub Comments
This is actually a Ray Core bug unrelated to Jobs. I was able to reproduce it with a similar script run directly on the head node. Unfortunately, it only happens sometimes (randomly), and I’m only able to reproduce it on a physical multi-node cluster. I’ll follow up with a new GitHub issue with more details.
My recollection is that I was also unable to reproduce it on a local machine, so that’s consistent with my experience.
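For reference, a Jobs-free reproduction along the lines described in the comment above (several concurrent drivers printing from actors on the head node) could look roughly like the sketch below. The structure, file name, and driver-index argument are assumptions, not the maintainer's actual script:

# repro_core.py -- illustrative sketch only; launch several copies
# concurrently on the head node, e.g.:
#   python repro_core.py 0 & python repro_core.py 1 & python repro_core.py 2 &
import sys
import time

import ray

driver_idx = sys.argv[1]
ray.init(address="auto")  # attach to the running cluster from the head node

@ray.remote
class Printer:
    def f(self, idx):
        for _ in range(20):
            # If the bug triggers, "driver N" lines from other copies of this
            # script show up in this copy's stdout.
            print(f"This is Printer in driver {idx}")
            time.sleep(1)

handle = Printer.remote()
ray.get(handle.f.remote(driver_idx))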