[Bug] [Jobs] Logs from separate jobs are mixed together
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Dashboard
What happened + What you expected to happen
- When running multiple jobs on a remote cluster, logs from the different jobs are mixed together.
- I’d expect each job’s logs to contain only the output of the code it runs plus the logs from the tasks/actors it spins up.
- Logs attached (note: actual IPs have been replaced with IP_ADDRESS)
Versions / Dependencies
Ray 1.9.2
Reproduction script
For running the jobs:
import time

from ray.dashboard.modules.job.sdk import JobSubmissionClient
from ray.dashboard.modules.job.common import JobStatus

address = "YOUR REMOTE RAY CLUSTER HERE"
client = JobSubmissionClient(address)

job_ids = []
for idx in range(3):
    job_id = client.submit_job(
        # Entrypoint shell command to execute
        entrypoint=f"python job.py {idx}",
    )
    job_ids.append(job_id)

def wait_until_finish(job_id):
    start = time.time()
    timeout = 1000
    while time.time() - start <= timeout:
        status_info = client.get_job_status(job_id)
        status = status_info.status
        print(f"status: {status}")
        if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
            break
        time.sleep(1)

for job_id in job_ids:
    wait_until_finish(job_id)
    logs = client.get_job_logs(job_id)
    print(f"Logs for {job_id}:\n\n{logs}")
The job itself (job.py):
import sys
import time

import ray

job_id = sys.argv[1]
print(f"This is job {job_id}")

# Issue happens both when actor has its own pod on k8s (num_cpus=1) and when it
# shares a pod with other actors (num_cpus <= .5)
@ray.remote(num_cpus=1)
class Printer:
    def f(self):
        for i in range(20):
            print(f"This is Printer in job {job_id}")
            time.sleep(1)

handle = Printer.remote()
ray.get(handle.f.remote())
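To make it easier to tell which driver a given log line actually came from, the actor could also tag its output with Ray's internal job ID. This is only a sketch, not part of the reproduction, and it assumes ray.get_runtime_context().job_id is available in the Ray version in use:

# Sketch only: a Printer variant that tags every line with the Ray-internal
# job ID of the worker process, so mixed lines can be attributed to a driver.
@ray.remote(num_cpus=1)
class TaggedPrinter:
    def f(self):
        internal_id = ray.get_runtime_context().job_id  # assumption: available in this Ray version
        for _ in range(20):
            print(f"[ray job {internal_id}] This is Printer in job {job_id}")
            time.sleep(1)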
Anything else
Running on a Ray Cluster deployed on k8s
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Top GitHub Comments
This is actually a Ray Core bug unrelated to Jobs. I was able to reproduce it with a similar script run directly on the head node. Unfortunately, it only happens sometimes (randomly), and I’m only able to reproduce it on a physical multi-node cluster. I’ll follow up with a new GitHub issue with more details.
My recollection is that I was also unable to reproduce it on a local machine, so that’s consistent with my experience.
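For reference, a Jobs-free reproduction along the lines described in the comment above (several concurrent drivers printing from actors on the head node) could look roughly like the sketch below. The structure, file name, and driver-index argument are assumptions, not the maintainer's actual script:

# repro_core.py -- illustrative sketch only; launch several copies
# concurrently on the head node, e.g.:
#   python repro_core.py 0 & python repro_core.py 1 & python repro_core.py 2 &
import sys
import time

import ray

driver_idx = sys.argv[1]
ray.init(address="auto")  # attach to the running cluster from the head node

@ray.remote
class Printer:
    def f(self, idx):
        for _ in range(20):
            # If the bug triggers, "driver N" lines from other copies of this
            # script show up in this copy's stdout.
            print(f"This is Printer in driver {idx}")
            time.sleep(1)

handle = Printer.remote()
ray.get(handle.f.remote(driver_idx))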