
Global CUDA tensor causes fork bomb


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.3 LTS
  • Ray installed from (source or binary): pip install
  • Ray version: 0.7.3
  • Python version: 3.6.8
  • Exact command to reproduce: CUDA_VISIBLE_DEVICES=0 python forkbomb.py

Describe the problem

Running the following code causes dozens of ray_worker processes to be instantiated on the GPU. Usually, a long stream of CUDA out-of-memory errors is eventually printed (though sometimes this code doesn’t print any errors):

import ray
import torch

# Module-level CUDA tensor; because forkbomb() refers to it via `global`,
# it gets pickled along with the remote function.
tensor = torch.ones(1).cuda()

@ray.remote(num_gpus=1.0)
def forkbomb():
    global tensor
    tensor += 1
    return torch.zeros(1).cuda()

if __name__ == "__main__":
    ray.init(redis_password='forkforkfork')
    print(ray.get(forkbomb.remote()))

CC @kiddyboots216

Source code / logs

Log:

$ CUDA_VISIBLE_DEVICES=0 python forkbomb.py
2019-08-24 04:16:14,353 INFO node.py:498 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-08-24_04-16-14_352753_5738/logs.
2019-08-24 04:16:14,558 INFO services.py:409 -- Waiting for redis server at 127.0.0.1:16781 to respond...
2019-08-24 04:16:14,784 INFO services.py:409 -- Waiting for redis server at 127.0.0.1:19543 to respond...
2019-08-24 04:16:14,787 INFO services.py:809 -- Starting Redis shard with 10.0 GB max memory.
2019-08-24 04:16:14,989 INFO node.py:512 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-08-24_04-16-14_352753_5738/logs.
2019-08-24 04:16:14,990 WARNING services.py:1301 -- Warning: Capping object memory store to 20.0GB. To increase this further, specify `object_store_memory` when calling ray.init() or ray start.
2019-08-24 04:16:14,990 INFO services.py:1475 -- Starting the Plasma object store with 20.0 GB memory using /dev/shm.
2019-08-24 04:17:04,747 ERROR worker.py:1714 -- Failed to unpickle the remote function '__main__.forkbomb' with function ID c05395f2915bf720b1276f0a06d7aed013c42c85. Traceback:
Traceback (most recent call last):
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/ray/function_manager.py", line 424, in fetch_and_register_remote_function
    function = pickle.loads(serialized_function)
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/torch/storage.py", line 134, in _load_from_bytes
    return torch.load(io.BytesIO(b))
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 386, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 573, in _load
    result = unpickler.load()
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 536, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 119, in default_restore_location
    result = fn(storage, location)
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 99, in _cuda_deserialize
    return storage_type(obj.size())
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/torch/cuda/__init__.py", line 615, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory

Followed by the exact same error (failed to unpickle, with a CUDA OOM stack trace) many times.

This error also appears once among the OOM errors:

Traceback (most recent call last):
  File "forkbomb.py", line 14, in <module>
    print(ray.get(forkbomb.remote()))
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/ray/worker.py", line 2247, in get
    raise value
ray.exceptions.RayTaskError: ray_worker:__main__.forkbomb() (pid=35782, host=atlas)
Exception: This function was not imported properly.

If forkbomb() takes an argument (with no other changes), I get this error instead of the one just above:

Traceback (most recent call last):
  File "forkbomb.py", line 15, in <module>
    print(ray.get(forkbomb.remote(False)))
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/ray/worker.py", line 2247, in get
    raise value
ray.exceptions.RayTaskError: ray_worker:__main__.forkbomb() (pid=6196, host=atlas)
TypeError: f() takes 0 positional arguments but 1 was given

And then after all the errors, it prints the final answer correctly:

tensor([0.], device='cuda:0')
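
The traceback points at the mechanism: the remote function reaches each worker in pickled form, and because forkbomb() refers to the module-level CUDA tensor via `global`, the tensor is serialized together with the function. Every worker that unpickles the function then has to initialize CUDA and allocate its own copy of the tensor, which is why ray_worker processes keep piling up on the GPU until memory runs out. Below is a minimal sketch of that capture with no Ray involved; it assumes cloudpickle and a CUDA-enabled PyTorch build, and that the file is run as a script:

import cloudpickle
import torch

tensor = torch.ones(1).cuda()          # module-level CUDA tensor

def uses_global_tensor():
    global tensor                      # refers to the module-level tensor
    return tensor + 1

# cloudpickle serializes __main__ functions by value, together with the globals
# they reference, so the CUDA tensor is embedded in the byte string below.
blob = cloudpickle.dumps(uses_global_tensor)

# Any process that unpickles the function has to initialize CUDA and allocate a
# fresh copy of the tensor -- the step that fails with OOM in the log above.
restored = cloudpickle.loads(blob)
print(restored())                      # tensor([2.], device='cuda:0')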


Top GitHub Comments

ericl commented on Aug 25, 2019

How about we ban serialization of torch Tensors?

On Sat, Aug 24, 2019, 7:39 PM Ashwinee Panda notifications@github.com wrote:

> Ray has inconsistent behavior in serializing/deserializing torch tensors. The patch doesn’t fix this issue.
>
> Forcing the user to manually call .cpu().detach().numpy() -> torch.from_numpy().cuda() every time they want to communicate a tensor between CUDA devices doesn’t seem like a good solution.

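For clarity, the manual round-trip described above would look roughly like this (an illustrative sketch, not code from the thread; it assumes a CUDA-enabled PyTorch build):

import torch

t = torch.ones(3).cuda()            # tensor living on a CUDA device
arr = t.cpu().detach().numpy()      # copy to host memory and drop the autograd graph
t2 = torch.from_numpy(arr).cuda()   # rebuild a torch tensor and move it back onto a GPU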

richardliaw commented on Oct 14, 2020

torch tensor serialization now works.
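
For concreteness, a small round trip shows the kind of usage that now works (a sketch assuming a recent Ray release; `double` is a hypothetical task, and a CPU tensor is used so the example runs on any machine):

import ray
import torch

ray.init()

@ray.remote
def double(t):
    return t * 2                                 # the argument arrives as a torch.Tensor

print(ray.get(double.remote(torch.ones(3))))     # tensor([2., 2., 2.])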


