[Tune] RayTune issue with broken pipe while writing to socket

What is the problem?

Tuning trials fail to start with a broken pipe error when the data files are large.

Ray version and other system information (Python version, TensorFlow version, OS): ray 1.0.1, torch 1.4.0, Python 3, Cloudera workbench

Reproduction (REQUIRED)

Code example used to test here

Failures

ConnectionError: Error 32 while writing to socket. Broken pipe.
BrokenPipeError                           Traceback (most recent call last)
/home/cdsw/.local/lib/python3.6/site-packages/redis/connection.py in send_packed_command(self, command, check_health)
    699             for item in command:
--> 700                 sendall(self._sock, item)
    701         except socket.timeout:
/home/cdsw/.local/lib/python3.6/site-packages/redis/_compat.py in sendall(sock, *args, **kwargs)
      7 def sendall(sock, *args, **kwargs):
----> 8     return sock.sendall(*args, **kwargs)
      9 
BrokenPipeError: [Errno 32] Broken pipe
During handling of the above exception, another exception occurred:
ConnectionError                           Traceback (most recent call last)
in engine
     10     scheduler=pbt,
     11     reuse_actors=False,
---> 12     resume=False,
     13 )
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, search_alg, scheduler, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, verbose, progress_reporter, loggers, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, callbacks, ray_auto_init, run_errored_only, global_checkpoint_period, with_server, upload_dir, sync_to_cloud, sync_to_driver, sync_on_checkpoint)
    319                 export_formats=export_formats,
    320                 max_failures=max_failures,
--> 321                 restore=restore)
    322     else:
    323         logger.debug("Ignoring some parameters passed into tune.run.")
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/experiment.py in __init__(self, name, run, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, upload_dir, trial_name_creator, trial_dirname_creator, loggers, log_to_file, sync_to_driver, checkpoint_freq, checkpoint_at_end, sync_on_checkpoint, keep_checkpoints_num, checkpoint_score_attr, export_formats, max_failures, restore)
    136                     "checkpointable function. You can specify checkpoints "
    137                     "within your trainable function.")
--> 138         self._run_identifier = Experiment.register_if_needed(run)
    139         self.name = name or self._run_identifier
    140 
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/experiment.py in register_if_needed(cls, run_object)
    274                     "No name detected on trainable. Using {}.".format(name))
    275             try:
--> 276                 register_trainable(name, run_object)
    277             except (TypeError, PicklingError) as e:
    278                 msg = (
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/registry.py in register_trainable(name, trainable, warn)
     69         raise TypeError("Second argument must be convertable to Trainable",
     70                         trainable)
---> 71     _global_registry.register(TRAINABLE_CLASS, name, trainable)
     72 
     73 
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/registry.py in register(self, category, key, value)
    122         self._to_flush[(category, key)] = pickle.dumps(value)
    123         if _internal_kv_initialized():
--> 124             self.flush_values()
    125 
    126     def contains(self, category, key):
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/registry.py in flush_values(self)
    144     def flush_values(self):
    145         for (category, key), value in self._to_flush.items():
--> 146             _internal_kv_put(_make_key(category, key), value, overwrite=True)
    147         self._to_flush.clear()
    148 
/home/cdsw/.local/lib/python3.6/site-packages/ray/experimental/internal_kv.py in _internal_kv_put(key, value, overwrite)
     25 
     26     if overwrite:
---> 27         updated = worker.redis_client.hset(key, "value", value)
     28     else:
     29         updated = worker.redis_client.hsetnx(key, "value", value)
/home/cdsw/.local/lib/python3.6/site-packages/redis/client.py in hset(self, name, key, value)
   3002         Returns 1 if HSET created a new field, otherwise 0
   3003         """
-> 3004         return self.execute_command('HSET', name, key, value)
   3005 
   3006     def hsetnx(self, name, key, value):
/home/cdsw/.local/lib/python3.6/site-packages/redis/client.py in execute_command(self, *args, **options)
    875         conn = self.connection or pool.get_connection(command_name, **options)
    876         try:
--> 877             conn.send_command(*args)
    878             return self.parse_response(conn, command_name, **options)
    879         except (ConnectionError, TimeoutError) as e:
/home/cdsw/.local/lib/python3.6/site-packages/redis/connection.py in send_command(self, *args, **kwargs)
    719         "Pack and send a command to the Redis server"
    720         self.send_packed_command(self.pack_command(*args),
--> 721                                  check_health=kwargs.get('check_health', True))
    722 
    723     def can_read(self, timeout=0):
/home/cdsw/.local/lib/python3.6/site-packages/redis/connection.py in send_packed_command(self, command, check_health)
    711                 errmsg = e.args[1]
    712             raise ConnectionError("Error %s while writing to socket. %s." %
--> 713                                   (errno, errmsg))
    714         except:  # noqa: E722
    715             self.disconnect()
ConnectionError: Error 32 while writing to socket. Broken pipe.
Engine exited with status 1.

I have verified my script runs in a clean environment and reproduces the issue.
I have verified the issue also occurs with the latest wheels.

Issue Analytics

State:
Created 3 years ago
Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions

nihirv commented, Feb 17, 2021

I’m getting the same issue (Error 32). After making the changes described on the "handling large datasets" page, my error changed to: redis.exceptions.ConnectionError: Error 104 while writing to socket. Connection reset by peer.

I saw a similar GitHub thread, #2931. One comment on that issue (https://github.com/ray-project/ray/issues/2931#issuecomment-450418923) says to increase _redis_max_memory. After doing that, I am back to Error 32.
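For readers unfamiliar with the two numbers in these messages, they are operating system error codes. The stdlib-only sketch below looks up their symbolic names and descriptions; `describe` is a hypothetical helper, and the numeric values (32 and 104) are what Linux uses and may differ on other platforms.

```python
import errno
import os


def describe(code):
    """Return the numeric code, its symbolic name, and the OS description."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{code} {name}: {os.strerror(code)}"


# EPIPE is the "Broken pipe" error; ECONNRESET is "Connection reset by peer".
for code in (errno.EPIPE, errno.ECONNRESET):
    print(describe(code))
```

Both codes mean the other end of the socket went away while the client was still writing, so switching between them does not necessarily indicate a different root cause.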

1 reaction

richardliaw commented, Nov 21, 2020

@meechos have you taken a look at this?

https://docs.ray.io/en/master/tune/user-guide.html#handling-large-datasets

I think the problem is that you’re forcing the objects to be serialized and transferred through Redis, when you should be using the Ray Object Store instead.
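The traceback above shows registry.py calling pickle.dumps(value) on the trainable and then writing the result to Redis with HSET. A rough, stdlib-only way to see how quickly that payload grows is to compare the pickled size of state that includes a dataset with state that does not. This is only a sketch: `registry_payload_bytes`, the `captured_state` dictionaries, and the 8 MiB stand-in dataset are hypothetical, and Ray itself is not imported.

```python
import pickle


def registry_payload_bytes(captured_state):
    """Approximate the number of bytes that would be sent in one Redis write.

    Mirrors the pickle.dumps(value) step seen in the traceback, applied to
    whatever state travels along with the trainable.
    """
    return len(pickle.dumps(captured_state))


# Hypothetical stand-in for a large data file: 8 MiB of zero bytes.
big_data = bytes(8 * 1024 * 1024)

# State that carries the dataset itself versus state that carries only
# small hyperparameters.
heavy = registry_payload_bytes({"lr": 0.01, "data": big_data})
light = registry_payload_bytes({"lr": 0.01})

print(f"with dataset captured:    {heavy} bytes")
print(f"without dataset captured: {light} bytes")
```

If the first number is close to the size of your data, the dataset is probably travelling with the trainable. The guide linked above describes how to hand the data over through the object store instead.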
