EOFError: Ran out of input on Kubernetes Cluster


What is the problem?

I deployed a Kubernetes setup with Ray by following the documentation at https://docs.ray.io/en/master/cluster/kubernetes.html#interacting-with-a-ray-cluster. When I then submit a job with ray submit my-cluster.yaml myscript.py, it fails with EOFError: Ran out of input.

Stacktrace

2021-03-13 13:06:46,093 INFO command_runner.py:171 -- NodeUpdater: example-cluster-ray-head-mtw85: Running kubectl -n ray exec -it example-cluster-ray-head-mtw85 -- bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python ~/cartpole2.py)'
Traceback (most recent call last):
  File "/home/ray/cartpole2.py", line 20, in <module>
    agent = ppo.PPOTrainer(config, env=SELECT_ENV)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 121, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 513, in __init__
    super().__init__(config, logger_creator)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 98, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 607, in setup
    self.env_creator = _global_registry.get(ENV_CREATOR, env)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/registry.py", line 140, in get
    return pickle.loads(value)
EOFError: Ran out of input
command terminated with exit code 1
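
The traceback above ends in pickle.loads(value) inside ray/tune/registry.py. As a minimal sketch (not taken from the issue), the snippet below shows that Python's pickle raises exactly this message when it is handed an empty byte string. That would be consistent with the registry lookup returning an empty value for the environment creator, but this is an assumption and the issue does not confirm it. The helper name try_unpickle is hypothetical and used only for illustration.

```python
import pickle


def try_unpickle(value: bytes) -> str:
    """Hypothetical helper: report whether a byte payload can be unpickled."""
    try:
        pickle.loads(value)
        return "ok"
    except EOFError as exc:
        # An empty (or truncated) payload ends up here.
        return f"EOFError: {exc}"


# An empty payload reproduces the message seen in the traceback.
print(try_unpickle(b""))  # EOFError: Ran out of input

# A payload produced by pickle.dumps loads without error.
print(try_unpickle(pickle.dumps({"env": "CartPole-v0"})))  # ok
```

If the same check were applied to the bytes returned by the registry lookup, an empty result would point at the value never having been stored or fetched, rather than at a corrupted pickle.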

Reproduction (REQUIRED)

  1. Set up a Kubernetes cluster as documented in https://docs.ray.io/en/master/cluster/kubernetes.html#k8s-cluster-launcher
  2. Save the file below and run it with ray submit <yaml-step-1> <saved-file.py>
import ray
import ray.rllib.agents.ppo as ppo
import os
import shutil

ray.util.connect("127.0.0.1:10001")

CHECKPOINT_ROOT = "tmp/ppo/cart"
shutil.rmtree(CHECKPOINT_ROOT, ignore_errors=True, onerror=None)

ray_results = os.getenv("HOME") + "/ray_results/"
shutil.rmtree(ray_results, ignore_errors=True, onerror=None)

SELECT_ENV = "CartPole-v0"

config = ppo.DEFAULT_CONFIG.copy()
config["log_level"] = "WARN"

agent = ppo.PPOTrainer(config, env=SELECT_ENV)

N_ITER = 40
s = "{:3d} reward {:6.2f}/{:6.2f}/{:6.2f} len {:6.2f} saved {}"

for n in range(N_ITER):
    result = agent.train()
    file_name = agent.save(CHECKPOINT_ROOT)

    print(s.format(
        n + 1,
        result["episode_reward_min"],
        result["episode_reward_mean"],
        result["episode_reward_max"],
        result["episode_len_mean"],
        file_name,
    ))
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 15 (11 by maintainers)

Top GitHub Comments

AmeerHajAli commented, Jun 14, 2021 (1 reaction)

@DmitriGekhtman can you please follow up on this when you are back in office?

AmeerHajAli commented, Apr 26, 2021 (1 reaction)

@richardliaw / @sven1977 can you please answer Xavier’s question?


