Global CUDA tensor causes fork bomb
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.3 LTS
- Ray installed from (source or binary): pip install
- Ray version: 0.7.3
- Python version: 3.6.8
- Exact command to reproduce: CUDA_VISIBLE_DEVICES=0 python forkbomb.py
Describe the problem
Running the following code causes dozens of ray_worker processes to be instantiated on the GPU. Usually, a long stream of CUDA out-of-memory errors is eventually printed (though sometimes this code doesn't print any errors):
import ray
import torch

tensor = torch.ones(1).cuda()

@ray.remote(num_gpus=1.0)
def forkbomb():
    global tensor
    tensor += 1
    return torch.zeros(1).cuda()

if __name__ == "__main__":
    ray.init(redis_password='forkforkfork')
    print(ray.get(forkbomb.remote()))
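My reading of the traceback below, offered as a sketch rather than a confirmed diagnosis: because forkbomb references the module-level CUDA tensor, cloudpickle serializes the tensor together with the function, so every worker that merely imports the function definition deserializes a CUDA tensor and allocates GPU memory. The mechanism can be reproduced without Ray (assuming cloudpickle is installed and this is run as a script so f lives in __main__ and is pickled by value; the name f is only for illustration):

import cloudpickle  # Ray serializes remote function definitions with (a vendored) cloudpickle
import torch

tensor = torch.ones(1).cuda()   # module-level CUDA tensor, as in the repro above

def f():
    global tensor
    return tensor + 1

blob = cloudpickle.dumps(f)   # the referenced CUDA tensor is pickled along with f
g = cloudpickle.loads(blob)   # unpickling re-creates it, allocating GPU memory again
print(g())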
Source code / logs
Log:
$ CUDA_VISIBLE_DEVICES=0 python forkbomb.py
2019-08-24 04:16:14,353 INFO node.py:498 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-08-24_04-16-14_352753_5738/logs.
2019-08-24 04:16:14,558 INFO services.py:409 -- Waiting for redis server at 127.0.0.1:16781 to respond...
2019-08-24 04:16:14,784 INFO services.py:409 -- Waiting for redis server at 127.0.0.1:19543 to respond...
2019-08-24 04:16:14,787 INFO services.py:809 -- Starting Redis shard with 10.0 GB max memory.
2019-08-24 04:16:14,989 INFO node.py:512 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-08-24_04-16-14_352753_5738/logs.
2019-08-24 04:16:14,990 WARNING services.py:1301 -- Warning: Capping object memory store to 20.0GB. To increase this further, specify `object_store_memory` when calling ray.init() or ray start.
2019-08-24 04:16:14,990 INFO services.py:1475 -- Starting the Plasma object store with 20.0 GB memory using /dev/shm.
2019-08-24 04:17:04,747 ERROR worker.py:1714 -- Failed to unpickle the remote function '__main__.forkbomb' with function ID c05395f2915bf720b1276f0a06d7aed013c42c85. Traceback:
Traceback (most recent call last):
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/ray/function_manager.py", line 424, in fetch_and_register_remote_function
    function = pickle.loads(serialized_function)
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/torch/storage.py", line 134, in _load_from_bytes
    return torch.load(io.BytesIO(b))
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 386, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 573, in _load
    result = unpickler.load()
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 536, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 119, in default_restore_location
    result = fn(storage, location)
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 99, in _cuda_deserialize
    return storage_type(obj.size())
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/torch/cuda/__init__.py", line 615, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory
This is followed by the exact same error (failed to unpickle, with a CUDA OOM stack trace) many times.
This error also appears once among the OOM errors:
Traceback (most recent call last):
  File "forkbomb.py", line 14, in <module>
    print(ray.get(forkbomb.remote()))
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/ray/worker.py", line 2247, in get
    raise value
ray.exceptions.RayTaskError: ray_worker:__main__.forkbomb() (pid=35782, host=atlas)
Exception: This function was not imported properly.
If forkbomb() takes an argument (with no other changes), I get this error instead of the one just above:
Traceback (most recent call last):
  File "forkbomb.py", line 15, in <module>
    print(ray.get(forkbomb.remote(False)))
  File "/data/drothchild/virtualenvs/pytorch/lib/python3.6/site-packages/ray/worker.py", line 2247, in get
    raise value
ray.exceptions.RayTaskError: ray_worker:__main__.forkbomb() (pid=6196, host=atlas)
TypeError: f() takes 0 positional arguments but 1 was given
And then after all the errors, it prints the final answer correctly:
tensor([0.], device='cuda:0')
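For completeness, a workaround sketch of my own (not part of the original report): create the CUDA tensor inside the task rather than at module level, so nothing CUDA-related is captured when the function is pickled:

import ray
import torch

@ray.remote(num_gpus=1.0)
def no_forkbomb():
    # Created inside the task: only the worker that actually runs the task
    # initializes CUDA, and the pickled function carries no GPU state.
    tensor = torch.ones(1).cuda()
    tensor += 1
    return torch.zeros(1).cuda()

if __name__ == "__main__":
    ray.init(redis_password='forkforkfork')
    print(ray.get(no_forkbomb.remote()))

Keeping a module-level tensor on the CPU and calling .cuda() inside the task should also work, since a CPU tensor deserializes without touching the GPU.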
Top GitHub Comments
How about we ban serialization of torch Tensors?
On Sat, Aug 24, 2019, 7:39 PM Ashwinee Panda notifications@github.com wrote:
torch tensor serialization now works.
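A rough sketch of what the proposed ban might look like for objects passed through Ray's object serializer. The register_custom_serializer signature below is an assumption for Ray 0.7.x, and this alone would not catch tensors captured in a pickled function's globals (those go through cloudpickle), so treat it as an illustration only:

import ray
import torch

def _refuse_serialize(tensor):
    # Fail loudly instead of silently shipping GPU state between processes.
    raise TypeError(
        "torch.Tensor serialization is disabled; call tensor.cpu() first "
        "or create the tensor inside the task.")

def _refuse_deserialize(data):
    raise TypeError("torch.Tensor deserialization is disabled.")

if __name__ == "__main__":
    ray.init()
    # Assumed API, as it existed around Ray 0.7.x.
    ray.register_custom_serializer(
        torch.Tensor,
        serializer=_refuse_serialize,
        deserializer=_refuse_deserialize)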