Ray start crashes due to redis failing to start

System information

  • OS Platform and Distribution: CentOS Linux release 7.4.1708 (Core)
  • Ray installed from: pip install -U ray[debug]
  • Ray version: 0.7.6
  • Python version: 3.6.9
  • Exact command to reproduce: ray start --head; import ray; ray.init()

Describe the problem

I’ll preface by saying a) thanks in advance for any help and b) this issue surfaced on an HPC cluster so it’s possible there are some non-standard things about the cluster configuration. And I was able to get ray installed by building from source, so there is a workaround.

In short, pip-installed ray fails to launch the redis server and so crashes immediately. My hunch is that the subprocess call to redis-server is failing but I haven’t been able to reproduce this at the command line, or get more verbose exception info from services.py. Log files are unfortunately empty so I can only provide the output from runtime (see below).
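
One way to test the hunch about the subprocess call is to launch the binary from Python and inspect how it exits. The sketch below is not taken from Ray's services.py; `describe_exit` is a hypothetical helper, and the command to run (for example, the path to the redis-server binary inside the installed ray package) is left for the reader to supply. It relies on the documented behaviour of Python's subprocess module on POSIX, where a negative return code -N means the child was killed by signal N.

```python
import signal
import subprocess


def describe_exit(cmd, env=None, timeout=10):
    """Run cmd (a list of arguments) and report how the process ended.

    Hypothetical helper for illustration; it is not part of Ray.
    """
    try:
        proc = subprocess.run(
            cmd,
            env=env,
            timeout=timeout,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child when the timeout expires. A server
        # that was still alive at that point did not crash during startup.
        return "still running after {}s".format(timeout)
    if proc.returncode < 0:
        # On POSIX, a return code of -N means the child died from signal N.
        return "killed by signal {}".format(signal.Signals(-proc.returncode).name)
    return "exited with status {}".format(proc.returncode)
```

Calling `describe_exit([path_to_redis_server])` would return "killed by signal SIGSEGV" if the binary segfaults, which is information the empty log files cannot provide. Passing a modified `env` makes it possible to compare the result with and without a given environment variable.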

Source code / logs

Installation:

conda create -n ray python=3.6  # 3.6 for compatibility with other things
pip install -U ray[debug]   # also tried just "ray"

Reproducing error:

$ ray start --head --temp-dir=$LOCAL_SCRATCH
WARNING: Not monitoring node memory since `psutil` is not installed. Install this with `pip install psutil` (or ray[debug]) to enable debugging of memory-related crashes.
2019-11-12 10:04:23,529	INFO scripts.py:303 -- Using IP address 10.148.0.29 for this node.
2019-11-12 10:04:23,542	INFO resource_spec.py:205 -- Starting Ray with 62.16 GiB memory available for workers and up to 18.63 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2019-11-12 10:04:23,648	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:23,751	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:23,853	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:23,956	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:24,058	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:24,161	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:24,263	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:24,366	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:24,468	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:24,571	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:24,673	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:24,776	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:24,879	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:24,981	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:25,084	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:25,186	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:25,289	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:25,391	WARNING services.py:822 -- Redis failed to start, retrying now.
2019-11-12 10:04:25,493	WARNING services.py:822 -- Redis failed to start, retrying now.
Traceback (most recent call last):
  File "/home/dbiagion/.conda-envs/ray-test/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/dbiagion/.conda-envs/ray-test/lib/python3.6/site-packages/ray/scripts/scripts.py", line 808, in main
    return cli()
  File "/home/dbiagion/.local/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/dbiagion/.local/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/dbiagion/.local/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/dbiagion/.local/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/dbiagion/.local/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/dbiagion/.conda-envs/ray-test/lib/python3.6/site-packages/ray/scripts/scripts.py", line 314, in start
    node = ray.node.Node(ray_params, head=True, shutdown_at_exit=block)
  File "/home/dbiagion/.conda-envs/ray-test/lib/python3.6/site-packages/ray/node.py", line 149, in __init__
    self.start_head_processes()
  File "/home/dbiagion/.conda-envs/ray-test/lib/python3.6/site-packages/ray/node.py", line 571, in start_head_processes
    self.start_redis()
  File "/home/dbiagion/.conda-envs/ray-test/lib/python3.6/site-packages/ray/node.py", line 426, in start_redis
    include_java=self._ray_params.include_java)
  File "/home/dbiagion/.conda-envs/ray-test/lib/python3.6/site-packages/ray/services.py", line 660, in start_redis
    stderr_file=redis_stderr_file)
  File "/home/dbiagion/.conda-envs/ray-test/lib/python3.6/site-packages/ray/services.py", line 846, in _start_redis_instance
    stdout_file.name, stderr_file.name))
Exception: Couldn't start Redis. Check log files: /tmp/scratch/session_2019-11-12_10-04-23_529834_390815/logs/redis.out /tmp/scratch/session_2019-11-12_10-04-23_529834_390815/logs/redis.err

Empty log files:

$ cat /tmp/scratch/session_2019-11-12_10-04-23_529834_390815/logs/redis.err
$ cat /tmp/scratch/session_2019-11-12_10-04-23_529834_390815/logs/redis.out

Thank you!

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

davebiagioni commented, Dec 12, 2019 (6 reactions)

We figured out what was happening here. In case anyone else running on a cluster hits a similar issue: we traced the problem to a library called libxalt_init.so, which is used for program monitoring. For whatever reason, this library causes the redis binary to segfault when it is preloaded through LD_PRELOAD. Our fix was to unset the variable that enables this library:

unset LD_PRELOAD

I can imagine the libxalt library may live on different paths for different clusters, but hopefully this will get someone pointed in the right direction if encountering a similar issue!
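
When unsetting the variable in the shell is not convenient, the same idea can be applied from Python by launching the process with a cleaned environment. This is a minimal sketch rather than anything Ray provides: `env_without_preload` and its `needle` parameter are hypothetical names, and the optional filtering assumes that other preloaded libraries on the cluster should be kept.

```python
import os


def env_without_preload(environ=None, needle=None):
    """Return a copy of the environment with LD_PRELOAD removed or filtered.

    With needle=None, LD_PRELOAD is dropped entirely, like `unset LD_PRELOAD`.
    Otherwise only entries whose path contains `needle` are dropped.
    Hypothetical helper for illustration; it is not part of Ray.
    """
    env = dict(os.environ if environ is None else environ)
    preload = env.pop("LD_PRELOAD", "")
    if needle is not None:
        # The dynamic linker accepts colons or spaces as separators here.
        entries = preload.replace(":", " ").split()
        kept = [entry for entry in entries if needle not in entry]
        if kept:
            env["LD_PRELOAD"] = ":".join(kept)
    return env
```

The result can be passed as the `env` argument of `subprocess.run`, for example `subprocess.run(["ray", "start", "--head"], env=env_without_preload(needle="libxalt_init"))`, so that only the child process sees the change.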

davebiagioni commented, Nov 14, 2019 (0 reactions)

Thanks for the reply.

  1. ray stop (seems ok)
  2. ray start --head yields the same error as above: Redis failed to start, retrying now.