Possible memory leak in Ape-X

See the original GitHub issue: https://github.com/ray-project/ray/issues/3452

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 16.04
  • Ray installed from (source or binary): binary
  • Ray version: 0.6.0
  • Python version: 2.7
  • Exact command to reproduce: rllib train -f crash.yaml

You can run this on any 64-core CPU machine:

crash.yaml:

apex:
    env:
        grid_search:
            - BreakoutNoFrameskip-v4
            - BeamRiderNoFrameskip-v4
            - QbertNoFrameskip-v4
            - SpaceInvadersNoFrameskip-v4
    run: APEX
    config:
        double_q: false
        dueling: false
        num_atoms: 1
        noisy: false
        n_step: 3
        lr: .0001
        adam_epsilon: .00015
        hiddens: [512]
        buffer_size: 1000000
        schedule_max_timesteps: 2000000
        exploration_final_eps: 0.01
        exploration_fraction: .1
        prioritized_replay_alpha: 0.5
        beta_annealing_fraction: 1.0
        final_prioritized_replay_beta: 1.0
        num_gpus: 0

        # APEX
        num_workers: 8
        num_envs_per_worker: 8
        sample_batch_size: 20
        train_batch_size: 1
        target_network_update_freq: 50000
        timesteps_per_iteration: 25000
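
For reference, the same experiment can also be launched programmatically. The sketch below is written against the tune.run_experiments API of the Ray 0.6 era, shows only a subset of the config above, and is not part of the original report.

import ray
from ray import tune

ray.init()
tune.run_experiments({
    "apex": {
        "run": "APEX",
        "config": {
            # grid_search over the four Atari environments, as in crash.yaml
            "env": {"grid_search": [
                "BreakoutNoFrameskip-v4",
                "BeamRiderNoFrameskip-v4",
                "QbertNoFrameskip-v4",
                "SpaceInvadersNoFrameskip-v4",
            ]},
            "double_q": False,
            "dueling": False,
            "n_step": 3,
            "lr": 0.0001,
            "buffer_size": 1000000,
            "num_workers": 8,
            "num_envs_per_worker": 8,
            "sample_batch_size": 20,
            "train_batch_size": 1,
            "target_network_update_freq": 50000,
            "timesteps_per_iteration": 25000,
        },
    },
})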

Describe the problem

Source code / logs

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/workers/default_worker.py", line 99, in <module>
    ray.worker.global_worker.main_loop()
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 1010, in main_loop
    self._wait_for_and_process_task(task)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 967, in _wait_for_and_process_task
    self._process_task(task, execution_info)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 865, in _process_task
    traceback_str)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 889, in _handle_process_task_failure
    self._store_outputs_in_object_store(return_object_ids, failure_objects)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 798, in _store_outputs_in_object_store
    self.put_object(object_ids[i], outputs[i])
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 411, in put_object
    self.store_and_register(object_id, value)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/worker.py", line 346, in store_and_register
    self.task_driver_id))
  File "/home/ubuntu/.local/lib/python2.7/site-packages/ray/utils.py", line 404, in _wrapper
    return orig_attr(*args, **kwargs)
  File "pyarrow/_plasma.pyx", line 534, in pyarrow._plasma.PlasmaClient.put
    buffer = self.create(target_id, serialized.total_bytes)
  File "pyarrow/_plasma.pyx", line 344, in pyarrow._plasma.PlasmaClient.create
    check_status(self.client.get().Create(object_id.data, data_size,
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
    raise ArrowIOError(message)
ArrowIOError: Broken pipe

  This error is unexpected and should not have happened. Somehow a worker
  crashed in an unanticipated way causing the main_loop to throw an exception,
  which is being caught in "python/ray/workers/default_worker.py".
  

The rest of the experiment keeps running, but the particular trial fails.
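
To check whether memory really grows without bound before the object store connection breaks, one option is to watch the resident memory of the Ray processes over the course of a run. The helper below is a hypothetical sketch (it assumes psutil is installed and simply matches process command lines containing "ray" or "plasma"); it is not part of the original report.

import time

import psutil

def log_ray_memory(interval_s=30):
    # Periodically print the combined RSS of all processes whose command line
    # mentions "ray" or "plasma" (workers, raylets, the plasma store, ...).
    while True:
        total_rss = 0
        for proc in psutil.process_iter(["cmdline", "memory_info"]):
            try:
                cmdline = " ".join(proc.info["cmdline"] or [])
                if "ray" in cmdline or "plasma" in cmdline:
                    total_rss += proc.info["memory_info"].rss
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        print("total Ray/plasma RSS: %.2f GiB" % (total_rss / float(2 ** 30)))
        time.sleep(interval_s)

Running this in a separate shell on the same machine while the trials train makes a steady upward trend, or a sudden drop when a process dies, easy to spot.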

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 15 (14 by maintainers)

Top GitHub Comments

stephanie-wang commented, Dec 5, 2018

@ericl and I determined that the error messages like "The output of an actor task is required, but the actor may still be alive. If the output has been evicted, the job may hang." are expected, but we should fix the backend so that the job doesn’t hang. I’m currently working on a PR to treat the task as failed if the object really has been evicted.
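
(Editorial illustration, not from the thread: until such a fix lands, one driver-side workaround is to poll with ray.wait and a timeout rather than blocking indefinitely in ray.get. The sketch assumes the Ray 0.6-era ray.wait signature, where the timeout is given in milliseconds.)

import ray

def get_or_fail(object_id, timeout_ms=60000):
    # Wait up to timeout_ms for the object; if it never becomes available
    # (for example because it was evicted), raise instead of hanging the job.
    ready, _ = ray.wait([object_id], num_returns=1, timeout=timeout_ms)
    if not ready:
        raise RuntimeError(
            "Object %s not ready after %d ms; it may have been evicted."
            % (object_id, timeout_ms))
    return ray.get(object_id)

This does not address the root cause, but it turns a silent hang into an explicit error.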

richardliaw commented, Dec 23, 2018

Oh just kidding this is single node

On Sat, Dec 22, 2018 at 8:55 PM Richard Liaw rich.liaw@gmail.com wrote:

Does that work? I think some small tweaks might be needed; on a similar cluster I’m getting: Exception: When connecting to an existing cluster, redis_max_memory must not be provided.

On Sat, Dec 22, 2018 at 3:36 PM Eric Liang notifications@github.com wrote:

Note: to re-run this case for QA in the future, use this YAML:

cluster_name: ppo
min_workers: 0
max_workers: 0
provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1a
auth:
    ssh_user: ubuntu
head_node:
    InstanceType: m4.16xlarge
    ImageId: ami-09edd5690cc795127
worker_nodes:
    InstanceType: m4.16xlarge
    ImageId: ami-09edd5690cc795127
file_mounts:
    "/home/ubuntu/crash.yaml": "~/Desktop/crash.yaml"
head_setup_commands: []
setup_commands:
    - echo ok
    - rm -rf /home/ubuntu/.local/lib/python2.7/site-packages/ray
    - rm -rf /tmp/ray
    - source activate tensorflow_p27 && pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.6.0-cp27-cp27mu-manylinux1_x86_64.whl --user
head_start_ray_commands:
    - source activate tensorflow_p27 && ray stop
    - source activate tensorflow_p27 && ulimit -c unlimited && ray start --head --redis-port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - source activate tensorflow_p27 && ray stop
    - source activate tensorflow_p27 && ulimit -c unlimited && ray start --redis-address=$RAY_HEAD_IP:6379

And this commandline:

ray exec apex.yaml "source activate tensorflow_p27 && rllib train -f crash.yaml --ray-redis-max-memory=5000000000" --start --tmux
