[Tune] RayTune issue with broken pipe while writing to socket

What is the problem?

Tuning trials fail to start with a broken pipe error when the data files are large.

Ray version and other system information (Python version, TensorFlow version, OS): ray 1.0.1, torch 1.4.0, Python 3, Cloudera workbench

Reproduction (REQUIRED)

Code example used to test here

Failures

ConnectionError: Error 32 while writing to socket. Broken pipe.
BrokenPipeError                           Traceback (most recent call last)
/home/cdsw/.local/lib/python3.6/site-packages/redis/connection.py in send_packed_command(self, command, check_health)
    699             for item in command:
--> 700                 sendall(self._sock, item)
    701         except socket.timeout:
/home/cdsw/.local/lib/python3.6/site-packages/redis/_compat.py in sendall(sock, *args, **kwargs)
      7 def sendall(sock, *args, **kwargs):
----> 8     return sock.sendall(*args, **kwargs)
      9 
BrokenPipeError: [Errno 32] Broken pipe
During handling of the above exception, another exception occurred:
ConnectionError                           Traceback (most recent call last)
in engine
     10     scheduler=pbt,
     11     reuse_actors=False,
---> 12     resume=False,
     13 )
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, search_alg, scheduler, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, verbose, progress_reporter, loggers, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, callbacks, ray_auto_init, run_errored_only, global_checkpoint_period, with_server, upload_dir, sync_to_cloud, sync_to_driver, sync_on_checkpoint)
    319                 export_formats=export_formats,
    320                 max_failures=max_failures,
--> 321                 restore=restore)
    322     else:
    323         logger.debug("Ignoring some parameters passed into tune.run.")
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/experiment.py in __init__(self, name, run, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, upload_dir, trial_name_creator, trial_dirname_creator, loggers, log_to_file, sync_to_driver, checkpoint_freq, checkpoint_at_end, sync_on_checkpoint, keep_checkpoints_num, checkpoint_score_attr, export_formats, max_failures, restore)
    136                     "checkpointable function. You can specify checkpoints "
    137                     "within your trainable function.")
--> 138         self._run_identifier = Experiment.register_if_needed(run)
    139         self.name = name or self._run_identifier
    140 
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/experiment.py in register_if_needed(cls, run_object)
    274                     "No name detected on trainable. Using {}.".format(name))
    275             try:
--> 276                 register_trainable(name, run_object)
    277             except (TypeError, PicklingError) as e:
    278                 msg = (
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/registry.py in register_trainable(name, trainable, warn)
     69         raise TypeError("Second argument must be convertable to Trainable",
     70                         trainable)
---> 71     _global_registry.register(TRAINABLE_CLASS, name, trainable)
     72 
     73 
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/registry.py in register(self, category, key, value)
    122         self._to_flush[(category, key)] = pickle.dumps(value)
    123         if _internal_kv_initialized():
--> 124             self.flush_values()
    125 
    126     def contains(self, category, key):
/home/cdsw/.local/lib/python3.6/site-packages/ray/tune/registry.py in flush_values(self)
    144     def flush_values(self):
    145         for (category, key), value in self._to_flush.items():
--> 146             _internal_kv_put(_make_key(category, key), value, overwrite=True)
    147         self._to_flush.clear()
    148 
/home/cdsw/.local/lib/python3.6/site-packages/ray/experimental/internal_kv.py in _internal_kv_put(key, value, overwrite)
     25 
     26     if overwrite:
---> 27         updated = worker.redis_client.hset(key, "value", value)
     28     else:
     29         updated = worker.redis_client.hsetnx(key, "value", value)
/home/cdsw/.local/lib/python3.6/site-packages/redis/client.py in hset(self, name, key, value)
   3002         Returns 1 if HSET created a new field, otherwise 0
   3003         """
-> 3004         return self.execute_command('HSET', name, key, value)
   3005 
   3006     def hsetnx(self, name, key, value):
/home/cdsw/.local/lib/python3.6/site-packages/redis/client.py in execute_command(self, *args, **options)
    875         conn = self.connection or pool.get_connection(command_name, **options)
    876         try:
--> 877             conn.send_command(*args)
    878             return self.parse_response(conn, command_name, **options)
    879         except (ConnectionError, TimeoutError) as e:
/home/cdsw/.local/lib/python3.6/site-packages/redis/connection.py in send_command(self, *args, **kwargs)
    719         "Pack and send a command to the Redis server"
    720         self.send_packed_command(self.pack_command(*args),
--> 721                                  check_health=kwargs.get('check_health', True))
    722 
    723     def can_read(self, timeout=0):
/home/cdsw/.local/lib/python3.6/site-packages/redis/connection.py in send_packed_command(self, command, check_health)
    711                 errmsg = e.args[1]
    712             raise ConnectionError("Error %s while writing to socket. %s." %
--> 713                                   (errno, errmsg))
    714         except:  # noqa: E722
    715             self.disconnect()
ConnectionError: Error 32 while writing to socket. Broken pipe.
Engine exited with status 1.
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

nihirv commented, Feb 17, 2021 (2 reactions)

I’m getting the same issue (Error 32). After making changes based on what the handling large datasets page says, my error changed to: redis.exceptions.ConnectionError: Error 104 while writing to socket. Connection reset by peer.

I saw another, similar GitHub thread: #2931. One comment on that issue (https://github.com/ray-project/ray/issues/2931#issuecomment-450418923) says to increase _redis_max_memory. After doing that, I am back to Error 32.

richardliaw commented, Nov 21, 2020 (1 reaction)

@meechos have you taken a look at this?

https://docs.ray.io/en/master/tune/user-guide.html#handling-large-datasets

I think the problem is that you’re forcing the objects to be serialized and transferred through Redis (while instead you should use the Ray Object Store).
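
The point above can be illustrated without Ray. The traceback shows Tune's registry calling pickle.dumps on the trainable and then writing the result to Redis with HSET, so anything bound to the trainable travels with it. The sketch below is only a stand-in under stated assumptions: it uses the standard library's pickle rather than whatever serializer Ray uses internally, and train_model, large_data, and the dataset path are hypothetical names, not taken from the issue's code.

```python
import functools
import pickle


def train_model(config, data=None):
    """Stand-in for a Tune trainable; only its pickled size matters here."""
    return len(data) if data is not None else 0


def pickled_size(obj):
    """Return how many bytes `obj` occupies once pickled."""
    return len(pickle.dumps(obj))


# Hypothetical 8 MiB dataset; real training data is often far larger.
large_data = bytes(8 * 1024 * 1024)

# Binding the data to the trainable drags the whole dataset into the payload.
bound_to_data = functools.partial(train_model, data=large_data)

# Binding a small handle (a made-up path here) keeps the payload tiny; the
# trainable would load or fetch the data itself once it is running.
bound_to_handle = functools.partial(train_model, data="/path/to/dataset")

print("bare function:    ", pickled_size(train_model), "bytes")
print("bound to data:    ", pickled_size(bound_to_data), "bytes")
print("bound to a handle:", pickled_size(bound_to_handle), "bytes")
```

If the "bound to data" figure runs to megabytes, the dataset is probably riding along with the trainable when it is registered. The guide linked above describes how to hand data to trials through the object store instead.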
