[Tune] RayTune issue with broken pipe while writing to socket
What is the problem?
Tuning trials fail to start, with a broken pipe error, when the data files are large.
Ray version and other system information (Python version, TensorFlow version, OS): ray 1.0.1, torch 1.4.0, python 3, Cloudera workbench
Reproduction (REQUIRED)
Code example used to test here
Failures
ConnectionError: Error 32 while writing to socket. Broken pipe.
BrokenPipeError Traceback (most recent call last)
/home/cdsw/.local/lib/python3.6/site-packages/redis/connection.py in send_packed_command(self, command, check_health)
699 for item in command:
--> 700 sendall(self._sock, item)
701 except socket.timeout:
/home/cdsw/.local/lib/python3.6/site-packages/redis/_compat.py in sendall(sock, *args, **kwargs)
7 def sendall(sock, *args, **kwargs):
----> 8 return sock.sendall(*args, **kwargs)
9
BrokenPipeError: [Errno 32] Broken pipe
During handling of the above exception, another exception occurred:
ConnectionError Traceback (most recent call last)
in engine
10 scheduler=pbt,
11 reuse_actors=False,
---> 12 resume=False,
13 )
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, search_alg, scheduler, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, verbose, progress_reporter, loggers, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, callbacks, ray_auto_init, run_errored_only, global_checkpoint_period, with_server, upload_dir, sync_to_cloud, sync_to_driver, sync_on_checkpoint)
319 export_formats=export_formats,
320 max_failures=max_failures,
--> 321 restore=restore)
322 else:
323 logger.debug("Ignoring some parameters passed into tune.run.")
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/experiment.py in __init__(self, name, run, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, upload_dir, trial_name_creator, trial_dirname_creator, loggers, log_to_file, sync_to_driver, checkpoint_freq, checkpoint_at_end, sync_on_checkpoint, keep_checkpoints_num, checkpoint_score_attr, export_formats, max_failures, restore)
136 "checkpointable function. You can specify checkpoints "
137 "within your trainable function.")
--> 138 self._run_identifier = Experiment.register_if_needed(run)
139 self.name = name or self._run_identifier
140
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/experiment.py in register_if_needed(cls, run_object)
274 "No name detected on trainable. Using {}.".format(name))
275 try:
--> 276 register_trainable(name, run_object)
277 except (TypeError, PicklingError) as e:
278 msg = (
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/registry.py in register_trainable(name, trainable, warn)
69 raise TypeError("Second argument must be convertable to Trainable",
70 trainable)
---> 71 _global_registry.register(TRAINABLE_CLASS, name, trainable)
72
73
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/registry.py in register(self, category, key, value)
122 self._to_flush[(category, key)] = pickle.dumps(value)
123 if _internal_kv_initialized():
--> 124 self.flush_values()
125
126 def contains(self, category, key):
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/registry.py in flush_values(self)
144 def flush_values(self):
145 for (category, key), value in self._to_flush.items():
--> 146 _internal_kv_put(_make_key(category, key), value, overwrite=True)
147 self._to_flush.clear()
148
/home/cdsw/.local/lib/python3.6/site-packages/ray/experimental/internal_kv.py in _internal_kv_put(key, value, overwrite)
25
26 if overwrite:
---> 27 updated = worker.redis_client.hset(key, "value", value)
28 else:
29 updated = worker.redis_client.hsetnx(key, "value", value)
/home/cdsw/.local/lib/python3.6/site-packages/redis/client.py in hset(self, name, key, value)
3002 Returns 1 if HSET created a new field, otherwise 0
3003 """
-> 3004 return self.execute_command('HSET', name, key, value)
3005
3006 def hsetnx(self, name, key, value):
/home/cdsw/.local/lib/python3.6/site-packages/redis/client.py in execute_command(self, *args, **options)
875 conn = self.connection or pool.get_connection(command_name, **options)
876 try:
--> 877 conn.send_command(*args)
878 return self.parse_response(conn, command_name, **options)
879 except (ConnectionError, TimeoutError) as e:
/home/cdsw/.local/lib/python3.6/site-packages/redis/connection.py in send_command(self, *args, **kwargs)
719 "Pack and send a command to the Redis server"
720 self.send_packed_command(self.pack_command(*args),
--> 721 check_health=kwargs.get('check_health', True))
722
723 def can_read(self, timeout=0):
/home/cdsw/.local/lib/python3.6/site-packages/redis/connection.py in send_packed_command(self, command, check_health)
711 errmsg = e.args[1]
712 raise ConnectionError("Error %s while writing to socket. %s." %
--> 713 (errno, errmsg))
714 except: # noqa: E722
715 self.disconnect()
ConnectionError: Error 32 while writing to socket. Broken pipe.
Engine exited with status 1.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
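The traceback shows the registry pickling the trainable (`pickle.dumps(value)`) and then writing the result to Redis in a single HSET call. A rough way to check whether captured data is inflating that payload is to measure its pickled size. This is only a sketch: `pickled_size_mb` is a hypothetical helper, not part of Ray, and it uses the standard-library `pickle`, so the numbers are an approximation of what Ray's own serializer would produce.

```python
import pickle


def pickled_size_mb(obj):
    """Return the size of obj in MiB after pickling (hypothetical helper)."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return len(payload) / (1024 * 1024)


# Stand-in for a large dataset that a trainable might capture.
large_data = bytes(8 * 1024 * 1024)
# Stand-in for the small hyperparameter config.
small_config = {"lr": 0.01, "batch_size": 32}

print("data:   %.2f MiB" % pickled_size_mb(large_data))
print("config: %.6f MiB" % pickled_size_mb(small_config))
```

If the data dominates the size, it is likely being shipped along with the trainable rather than passed separately.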
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (2 by maintainers)

I’m getting the same issue (Error 32). After making changes based on what the handling large datasets page says, my error changed to:
redis.exceptions.ConnectionError: Error 104 while writing to socket. Connection reset by peer.
I saw another GitHub thread which was similar, #2931. One comment on that issue (https://github.com/ray-project/ray/issues/2931#issuecomment-450418923) says to increase the _redis_max_memory. After doing that, I am back to an Error 32.

@meechos have you taken a look at this?
https://docs.ray.io/en/master/tune/user-guide.html#handling-large-datasets
I think the problem is that you’re forcing the objects to be serialized and transferred through Redis (while instead you should use the Ray Object Store).
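A minimal sketch of what that suggestion could look like, assuming a Ray version that ships `tune.with_parameters` and the Ray 1.x function API (`tune.report`). The names `make_dataset`, `mse`, `train_fn`, and `run_tuning` are hypothetical and not taken from the original script. The point is that the training function no longer closes over the data: the data is handed to `tune.with_parameters`, which is meant to place it in the object store rather than serialize it with the trainable.

```python
import random


def make_dataset(n_rows, seed=0):
    """Build a small stand-in for the large dataset (hypothetical helper)."""
    rng = random.Random(seed)
    return [(x, 3.0 * x) for x in (rng.random() for _ in range(n_rows))]


def mse(weight, data):
    """Mean squared error of the model y = weight * x on (x, y) pairs."""
    return sum((weight * x - y) ** 2 for x, y in data) / len(data)


def train_fn(config, data=None):
    # `data` is supplied by tune.with_parameters, not captured from an
    # enclosing scope, so it is not pickled together with the function.
    from ray import tune  # imported lazily; the helpers above do not need Ray

    tune.report(loss=mse(config["weight"], data))


def run_tuning(data, num_samples=4):
    """Launch a small Tune run with the data passed via with_parameters."""
    import ray
    from ray import tune

    ray.init(ignore_reinit_error=True)
    trainable = tune.with_parameters(train_fn, data=data)
    return tune.run(
        trainable,
        config={"weight": tune.uniform(0.0, 6.0)},
        num_samples=num_samples,
    )
```

Calling `run_tuning(make_dataset(100_000))` would start the trials. Whether this removes the Error 32 in the reporter's environment is not confirmed in this thread.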