
[Bug] [Jobs] Logs from separate jobs are mixed together


Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Dashboard

What happened + What you expected to happen

  1. When running multiple jobs on a remote cluster, logs from different jobs are mixed together.
  2. I’d expect each job’s logs to be isolated to the code it runs plus the output of the tasks/actors it spins up.
  3. Logs are attached (note: actual IPs have been replaced with IP_ADDRESS).

logs.txt
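
For illustration (these lines are constructed from the reproduction script below, not copied from the attachment), the expected output of client.get_job_logs() for the job submitted with index 0 would contain only that job’s own lines:

This is job 0
This is Printer in job 0
This is Printer in job 0
...

Instead, lines printed by the other jobs, such as "This is Printer in job 1", appear interleaved in the same job’s log output.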

Versions / Dependencies

Ray 1.9.2

Reproduction script

For running the jobs:

import time

from ray.dashboard.modules.job.sdk import JobSubmissionClient
from ray.dashboard.modules.job.common import JobStatus

address = "YOUR REMOTE RAY CLUSTER HERE"
client = JobSubmissionClient(address)

job_ids = []
for idx in range(3):
    job_id = client.submit_job(
        # Entrypoint shell command to execute
        entrypoint=f"python job.py {idx}",
    )
    job_ids.append(job_id)


def wait_until_finish(job_id):
    """Poll the job's status until it reaches a terminal state or times out."""
    start = time.time()
    timeout = 1000
    while time.time() - start <= timeout:
        status_info = client.get_job_status(job_id)
        status = status_info.status
        print(f"status: {status}")
        if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
            break
        time.sleep(1)


# Wait for each job to finish, then fetch its logs. Each job's log output
# should contain only that job's own lines.
for job_id in job_ids:
    wait_until_finish(job_id)
    logs = client.get_job_logs(job_id)
    print(f"Logs for {job_id}:\n\n{logs}")

The job itself:

import sys
import time

import ray

job_id = sys.argv[1]
print(f"This is job {job_id}")

# Issue happens both when actor has its own pod on k8s (num_cpus=1) and when it
# shares a pod with other actors (num_cpus <= .5)
@ray.remote(num_cpus=1)
class Printer:
    def f(self):
        for i in range(20):
            print(f"This is Printer in job {job_id}")
            time.sleep(1)


# Start the actor and block until its printing loop finishes.
handle = Printer.remote()
ray.get(handle.f.remote())
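
Until the underlying issue is fixed, one possible client-side stopgap (a sketch, not an official Ray workaround) is to exploit the fact that every line job.py prints already embeds its job index, and post-filter the logs returned by get_job_logs:

# Sketch of a client-side stopgap, not an official Ray workaround.
# Assumes every line the job prints contains "job <idx>", as job.py above does.
def filter_job_logs(logs: str, idx: int) -> str:
    marker = f"job {idx}"  # naive match; "job 1" would also match "job 10"
    return "\n".join(
        line for line in logs.splitlines() if marker in line
    )

This only hides misattributed lines on the reader’s side; it does not stop the jobs’ outputs from being interleaved in the first place.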

Anything else

Running on a Ray cluster deployed on Kubernetes (k8s).

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
architkulkarni commented, Feb 2, 2022

This is actually a Ray Core bug unrelated to Jobs. I was able to reproduce it using a script of the form

from subprocess import Popen, PIPE
import time

# Launch 8 copies of the printer script concurrently, capturing their output.
processes = [
    Popen(['python', 'run_printer_actor.py', str(i)], stdout=PIPE, stderr=PIPE)
    for i in range(8)
]

# Give the actors a moment to start printing, then collect each
# process's output in submission order.
time.sleep(6)
for i in range(8):
    stdout, stderr = processes[i].communicate()
    print(stdout.decode("utf-8"))

on the head node. Sadly, it seems to happen only intermittently (at random), and I’m only able to reproduce it on a physical multi-node cluster. I’ll follow up with a new GitHub issue with more details.
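
(run_printer_actor.py is not included in the comment; presumably it is essentially the job script from the reproduction above, run as a plain driver on the head node. A minimal sketch under that assumption:)

# Hypothetical reconstruction of run_printer_actor.py -- not part of the
# original comment. Mirrors job.py above: attach to the running cluster
# and run a printing actor tagged with the script's argument.
import sys
import time

import ray

ray.init(address="auto")  # connect to the existing cluster from the head node
run_id = sys.argv[1]


@ray.remote
class Printer:
    def f(self):
        for _ in range(20):
            print(f"This is Printer in run {run_id}")
            time.sleep(1)


handle = Printer.remote()
ray.get(handle.f.remote())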

1 reaction
spolcyn commented, Jan 21, 2022

My recollection is that I was also unable to reproduce it on a local machine, so that’s consistent with my experience.
