
Segmentation fault with ray 0.6.0 and PyTorch 0.4.1 and 1.0.0

See original GitHub issue

System information

  • OS Platform and Distribution: Ubuntu 16.04, Ubuntu 18.04, Amazon EC2 optimized Linux
  • Ray installed from (source or binary): pip
  • Ray version: 0.6.0
  • Python version: 3.7 & 3.6

A minimal example to reproduce:

import ray
import torch


class NeuralNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.l = torch.nn.Linear(2, 2)

    def forward(self, x):
        return self.l(x)


@ray.remote
class TestActor:
    def __init__(self):
        self.net = NeuralNet()
        self.crit = torch.nn.MSELoss()

    def train(self):
        p = self.net(torch.tensor([[1.0, 2.0]]))
        loss = self.crit(p, torch.tensor([[3.0, 4.0]]))
        self.net.zero_grad()
        loss.backward()
        return loss.item()


if __name__ == '__main__':
    ray.init()
    ac = TestActor.remote()
    print(ray.get(ac.train.remote()))

Problem Description

I did a clean install of PyTorch with conda on several operating systems and don't get segfaults when not using Ray. This may be a misunderstanding on my part, since this seems like an issue others would have hit minutes after it was introduced… Isn't this how Ray is supposed to be used when running locally?
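
For reference, the Ray-free baseline described above (which, per the report, does not crash) amounts to running the same forward/backward step directly in the driver process. A minimal sketch of that control experiment:

import torch


class NeuralNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.l = torch.nn.Linear(2, 2)

    def forward(self, x):
        return self.l(x)


if __name__ == '__main__':
    # Same computation as TestActor.train(), but without Ray.
    net = NeuralNet()
    crit = torch.nn.MSELoss()
    p = net(torch.tensor([[1.0, 2.0]]))
    loss = crit(p, torch.tensor([[3.0, 4.0]]))
    net.zero_grad()
    loss.backward()
    print(loss.item())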

Console Output

Process STDOUT and STDERR is being redirected to /tmp/ray/session_2018-12-12_10-51-32_500/logs.
Waiting for redis server at 127.0.0.1:31236 to respond...
Waiting for redis server at 127.0.0.1:56879 to respond...
Starting the Plasma object store with 13.470806835000001 GB memory using /dev/shm.
Failed to start the UI, you may need to run 'pip install jupyter'.
Fatal Python error: Segmentation fault

Stack (most recent call first):
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90 in backward
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/torch/tensor.py", line 93 in backward
  File "/home/eric/PycharmProjects/rayTorchTest/test.py", line 25 in train
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/function_manager.py", line 481 in actor_method_executor
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/worker.py", line 856 in _process_task
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/worker.py", line 967 in _wait_for_and_process_task
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/worker.py", line 1010 in main_loop
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/workers/default_worker.py", line 99 in <module>
A worker died or was killed while executing task 00000000d024403f9ddae404df35ac4a32625560.
Traceback (most recent call last):
  File "/home/eric/PycharmProjects/rayTorchTest/test.py", line 32, in <module>
    print(ray.get(ac.train.remote()))
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/worker.py", line 2366, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000d024403f9ddae404df35ac4a32625560). It was created by remote function <unknown> which failed with:

Remote function <unknown> failed with:

Invalid return value: likely worker died or was killed while executing the task; check previous logs or dmesg for errors.

Process finished with exit code 1
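
The "Fatal Python error: Segmentation fault" block with its "Stack (most recent call first)" listing looks like output from Python's faulthandler module, which the Ray worker appears to enable. As a minimal sketch (my addition, not part of the original report), the same handler can be enabled in a plain script to get a Python-level traceback if a native crash happens outside Ray:

import faulthandler

import torch

# Dump the Python stack of all threads to stderr if the process receives
# SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL.
faulthandler.enable()

net = torch.nn.Linear(2, 2)
loss = torch.nn.MSELoss()(net(torch.tensor([[1.0, 2.0]])),
                          torch.tensor([[3.0, 4.0]]))
loss.backward()  # if this crashed natively, faulthandler would print the stack
print(loss.item())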

Edit 1:

The same script runs successfully with the following dependencies:

python==3.6.0
torch==0.4.1
ray==0.4.0
redis==2.10.6

Upgrading to ray 0.5.3 throws another, very similar-looking error (a worker died), although it doesn't state that a segfault occurred. Upgrading to ray 0.6.0 produces the segfault shown above.
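
When bisecting version combinations like this, it can help to print what a given environment actually has loaded. A small diagnostic sketch (my addition, not part of the original report):

import sys

import ray
import redis
import torch

# Report the interpreter and library versions actually imported in this environment.
print("python", sys.version.split()[0])
print("torch", torch.__version__)
print("ray", ray.__version__)
print("redis", redis.__version__)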

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 17 (6 by maintainers)

Top GitHub Comments

1 reaction
pcmoritz commented, Dec 18, 2018

Thanks for your patience. I created a version of Ray that should work now (I could reproduce your crash, and it is not crashing any more with the fix): https://drive.google.com/open?id=1LLjYaqysbYg1Gz3RO91o3dqX77VF8u3g (this is built off https://github.com/pcmoritz/ray-1/commit/f8b75efc223bd306c1867effa468a4ee961839a4; the branch is a little messy right now and I'll clean it up).

At a high level, the fix is to remove std::future from Arrow and replace it with boost::future. This is a temporary workaround until we have figured out https://groups.google.com/a/tensorflow.org/d/topic/developers/TMqRaT-H2bI/discussion, which is the root cause of all of this.

Feel free to try it out if you want and report back; I'll create a PR to be included in 0.6.1.
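
If you do try a custom wheel like this, one quick way to confirm which Ray build the interpreter is actually importing (my suggestion, not from the comment itself) is to print the version and package path before re-running the reproduction:

import ray

# Confirm the patched build is the one being loaded rather than a previously
# installed release.
print(ray.__version__)
print(ray.__file__)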

0 reactions
robertnishihara commented, Dec 24, 2018

Fixed by #3574.
