Possible memory leak in Ape-X

See the original GitHub issue: https://github.com/ray-project/ray/issues/3452

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 16.04
  • Ray installed from (source or binary): binary
  • Ray version: 0.6.0
  • Python version: 2.7
  • Exact command to reproduce: rllib train -f crash.yaml

You can run this on any 64-core CPU machine:

crash.yaml:

apex:
    env:
        grid_search:
            - BreakoutNoFrameskip-v4
            - BeamRiderNoFrameskip-v4
            - QbertNoFrameskip-v4
            - SpaceInvadersNoFrameskip-v4
    run: APEX
    config:
        double_q: false
        dueling: false
        num_atoms: 1
        noisy: false
        n_step: 3
        lr: .0001
        adam_epsilon: .00015
        hiddens: [512]
        buffer_size: 1000000
        schedule_max_timesteps: 2000000
        exploration_final_eps: 0.01
        exploration_fraction: .1
        prioritized_replay_alpha: 0.5
        beta_annealing_fraction: 1.0
        final_prioritized_replay_beta: 1.0
        num_gpus: 0

        # APEX
        num_workers: 8
        num_envs_per_worker: 8
        sample_batch_size: 20
        train_batch_size: 1
        target_network_update_freq: 50000
        timesteps_per_iteration: 25000
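
For reference, the same experiment can also be launched programmatically. The sketch below is written against the tune.run_experiments API of the Ray 0.6 era, shows only a subset of the config above, and is not part of the original report.

import ray
from ray import tune

ray.init()
tune.run_experiments({
    "apex": {
        "run": "APEX",
        "config": {
            # grid_search over the four Atari environments, as in crash.yaml
            "env": {"grid_search": [
                "BreakoutNoFrameskip-v4",
                "BeamRiderNoFrameskip-v4",
                "QbertNoFrameskip-v4",
                "SpaceInvadersNoFrameskip-v4",
            ]},
            "double_q": False,
            "dueling": False,
            "n_step": 3,
            "lr": 0.0001,
            "buffer_size": 1000000,
            "num_workers": 8,
            "num_envs_per_worker": 8,
            "sample_batch_size": 20,
            "train_batch_size": 1,
            "target_network_update_freq": 50000,
            "timesteps_per_iteration": 25000,
        },
    },
})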

Describe the problem

Source code / logs

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/workers/default_worker.py", line 99, in <module>
    ray.worker.global_worker.main_loop()
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 1010, in main_loop
    self._wait_for_and_process_task(task)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 967, in _wait_for_and_process_task
    self._process_task(task, execution_info)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 865, in _process_task
    traceback_str)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 889, in _handle_process_task_failure
    self._store_outputs_in_object_store(return_object_ids, failure_objects)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 798, in _store_outputs_in_object_store
    self.put_object(object_ids[i], outputs[i])
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 411, in put_object
    self.store_and_register(object_id, value)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 346, in store_and_register
    self.task_driver_id))
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/utils.py", line 404, in _wrapper
    return orig_attr(*args, **kwargs)
  File "pyarrow/_plasma.pyx", line 534, in pyarrow._plasma.PlasmaClient.put
    buffer = self.create(target_id, serialized.total_bytes)
  File "pyarrow/_plasma.pyx", line 344, in pyarrow._plasma.PlasmaClient.create
    check_status(self.client.get().Create(object_id.data, data_size,
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
    raise ArrowIOError(message)
ArrowIOError: Broken pipe

  This error is unexpected and should not have happened. Somehow a worker
  crashed in an unanticipated way causing the main_loop to throw an exception,
  which is being caught in "python/ray/workers/default_worker.py".
  

The rest of the experiment keeps running, but the particular trial fails.
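
To check whether memory really grows without bound before the object store connection breaks, one option is to watch the resident memory of the Ray processes over the course of a run. The helper below is a hypothetical sketch (it assumes psutil is installed and simply matches process command lines containing "ray" or "plasma"); it is not part of the original report.

import time

import psutil

def log_ray_memory(interval_s=30):
    # Periodically print the combined RSS of all processes whose command line
    # mentions "ray" or "plasma" (workers, raylets, the plasma store, ...).
    while True:
        total_rss = 0
        for proc in psutil.process_iter(["cmdline", "memory_info"]):
            try:
                cmdline = " ".join(proc.info["cmdline"] or [])
                if "ray" in cmdline or "plasma" in cmdline:
                    total_rss += proc.info["memory_info"].rss
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        print("total Ray/plasma RSS: %.2f GiB" % (total_rss / float(2 ** 30)))
        time.sleep(interval_s)

Running this in a separate shell on the same machine while the trials train makes a steady upward trend, or a sudden drop when a process dies, easy to spot.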

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 15 (14 by maintainers)

Top GitHub Comments

stephanie-wang commented, Dec 5, 2018

@ericl and I determined that the error messages like "The output of an actor task is required, but the actor may still be alive. If the output has been evicted, the job may hang." are expected, but we should fix the backend so that the job doesn’t hang. I’m currently working on a PR to treat the task as failed if the object really has been evicted.
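
(Editorial illustration, not from the thread: until such a fix lands, one driver-side workaround is to poll with ray.wait and a timeout rather than blocking indefinitely in ray.get. The sketch assumes the Ray 0.6-era ray.wait signature, where the timeout is given in milliseconds.)

import ray

def get_or_fail(object_id, timeout_ms=60000):
    # Wait up to timeout_ms for the object; if it never becomes available
    # (for example because it was evicted), raise instead of hanging the job.
    ready, _ = ray.wait([object_id], num_returns=1, timeout=timeout_ms)
    if not ready:
        raise RuntimeError(
            "Object %s not ready after %d ms; it may have been evicted."
            % (object_id, timeout_ms))
    return ray.get(object_id)

This does not address the root cause, but it turns a silent hang into an explicit error.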

richardliaw commented, Dec 23, 2018

Oh just kidding this is single node

On Sat, Dec 22, 2018 at 8:55 PM Richard Liaw rich.liaw@gmail.com wrote:

Does that work? I think some small tweaks might be needed; on a similar cluster I’m getting: Exception: When connecting to an existing cluster, redis_max_memory must not be provided.

On Sat, Dec 22, 2018 at 3:36 PM Eric Liang notifications@github.com wrote:

Note: to re-run this case for QA in the future, use this YAML:

cluster_name: ppo
min_workers: 0
max_workers: 0
provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1a
auth:
    ssh_user: ubuntu
head_node:
    InstanceType: m4.16xlarge
    ImageId: ami-09edd5690cc795127
worker_nodes:
    InstanceType: m4.16xlarge
    ImageId: ami-09edd5690cc795127
file_mounts:
    "/home/ubuntu/crash.yaml": "~/Desktop/crash.yaml"
head_setup_commands: []
setup_commands:
    - echo ok
    - rm -rf /home/ubuntu/.local/lib/python2.7/site-packages/ray
    - rm -rf /tmp/ray
    - source activate tensorflow_p27 && pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.6.0-cp27-cp27mu-manylinux1_x86_64.whl --user
head_start_ray_commands:
    - source activate tensorflow_p27 && ray stop
    - source activate tensorflow_p27 && ulimit -c unlimited && ray start --head --redis-port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - source activate tensorflow_p27 && ray stop
    - source activate tensorflow_p27 && ulimit -c unlimited && ray start --redis-address=$RAY_HEAD_IP:6379

And this commandline:

ray exec apex.yaml "source activate tensorflow_p27 && rllib train -f crash.yaml --ray-redis-max-memory=5000000000" --start --tmux
