Should we use ZeroMQ?
I’ve been wrestling with network communication recently as our bottlenecks have become increasingly communication-focused. I’d like to get feedback from people like @minrk and @pitrou on whether or not it is worth exploring switching our communication stack over to ZeroMQ.
Our Communication Types
Dask.distributed engages in the following kinds of communication. Everything is point-to-point.
- Frequent small messages along long-running connections. We might dump in thousands of 100-byte messages in a second. Currently we batch manually. Presumably ZeroMQ would do this for us, allowing us to remove some of the `BatchedSend` infrastructure.
- Peer-to-peer data movement. These are small-to-large messages (1 kB to 10 GB) along ephemeral worker-to-worker connections. Sometimes these overlap. Here we’re concerned about not having this communication block or be blocked by other executing code, establishing connections quickly (or else caching connections in a way that respects open file limits), and possibly negotiating multiple messages at once on the same socket.
- Periodic small messages for heartbeats.
Advantages of ZeroMQ
- We get to remove `BatchedSend` and timeout handling
- We get zero-copy for free (though this is probably not that big of a deal)
- We maybe get Infiniband for free?
- We get pub-sub, though I’m not sure how we would use this currently?
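As a concrete illustration of the zero-copy point above, here is a minimal pyzmq sketch (assuming pyzmq is installed; the `inproc` address and variable names are arbitrary):

```python
import zmq

ctx = zmq.Context.instance()
a = ctx.socket(zmq.PAIR)
b = ctx.socket(zmq.PAIR)
a.bind("inproc://zero-copy-demo")
b.connect("inproc://zero-copy-demo")

payload = bytearray(b"x" * 1_000_000)
# copy=False hands libzmq a pointer to the Python-owned buffer instead of
# copying it; track=True returns a MessageTracker telling us when libzmq
# is finished with the memory and the buffer is safe to reuse.
tracker = a.send(payload, copy=False, track=True)
received = b.recv()
tracker.wait()  # blocks until the IO thread releases `payload`
assert received == payload

a.close()
b.close()
```

The bookkeeping cost mentioned later in the thread is exactly this tracker machinery, which is why `copy=True` can be faster for small messages.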
Questions
- How does ZeroMQ handle hard failures on the other end of the socket?
- We are sometimes bound by open file handles. Does one Request-Reply socket live on one TCP Socket?
- We currently suffer a delay of 10ms-100ms to establish TCP connections through Tornado. Is this likely to change with ZeroMQ? (Though in truth this is more likely something we’re doing wrong internally)
- Are there opportunities for ZeroMQ to hold on to “closed” connections in the background while still respecting open file limits? (I suppose we could also do this connection pooling in application code)
- Assuming that we’re using pyzmq’s Tornado connection, is there anything we should be concerned about if the event loop is blocked for a non-trivial amount of time? Obviously we can’t dump any new messages during this time, but is ZeroMQ handling things nicely in the background in a separate system thread or is it blocked?
- If we install pyzmq’s event loop can we still overlap with other Tornado applications like Bokeh?
Implementation
What is the right way to experiment with this? Presumably we change around `core.connect`, `core.read` and `core.write`, and we switch out all `BatchedSend` streams with normal streams. Is there anything else?
We might still want to keep some of our own batching because the serialization and compression parts of our communication pipeline will still benefit.
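To make the experiment concrete, here is a toy sketch of that read/write abstraction with an in-memory transport standing in for TCP or zmq; `InMemoryComm` and its method names are illustrative, not distributed’s actual API:

```python
import asyncio

class InMemoryComm:
    """Toy stand-in for the connect/read/write abstraction.

    A zmq-backed implementation would only need to provide these same
    two coroutines, leaving the rest of the codebase untouched."""

    def __init__(self):
        self._queue = asyncio.Queue()

    async def write(self, msg):
        await self._queue.put(msg)

    async def read(self):
        return await self._queue.get()

async def main():
    comm = InMemoryComm()
    await comm.write({"op": "ping"})
    return await comm.read()

reply = asyncio.run(main())
assert reply == {"op": "ping"}
```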
Issue Analytics

- Created: 7 years ago
- Reactions: 4
- Comments: 35 (34 by maintainers)
Top GitHub Comments
It’s definitely worth considering, but it’s not a given that it would be a win overall. I’ve been thinking on the IPython / Jupyter side that perhaps we shouldn’t have used zmq, as many of the potential advantages don’t really come up for Jupyter use cases. But it is IPython Parallel where the benefits are greatest.
> We maybe get Infiniband for free?

Sort of. You don’t get truly native IB, but you can use the TIPC transport. I think most use TCP-over-IB, which you would already have anyway.
> We get pub-sub, though I’m not sure how we would use this currently?

Other than instrumentation, I’m unsure how PUB-SUB would work for dask. In IPython Parallel, this is used to funnel output, which I thought was a cool idea enabling things like monitoring services, but I think nobody has ended up using them. It could enable you to do things like explode the Scheduler into a constellation of loosely connected processes, so that things like bokeh / analytics / instrumentation can’t slow down scheduling. This is perhaps the most successful aspect of IPython Parallel’s zmq redesign: ‘schedulers’ are lightweight distributors of messages, and everything heavy happens in a ‘Hub’ process that gets passively notified of everything that passes through the schedulers.
Diagramming message flow patterns is extremely helpful for designing zmq-connected applications, as you have to think a lot about what sockets are right to use where, and there can be several valid options.
> How does ZeroMQ handle hard failures on the other end of the socket?

In one sense, it doesn’t. ZeroMQ hides connect/disconnect events, which can be nice, but also frustrating. The main thing it does is ensure that messages are delivered completely or not at all - a message that is incompletely delivered is considered not at all delivered. Depending on the socket type, if you try to send to a peer that is gone, the message will either wait in memory to be sent, or be discarded as undeliverable.
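The per-socket-type difference described above can be observed directly; a small pyzmq sketch (assuming pyzmq is installed; the `inproc` addresses are arbitrary):

```python
import zmq

ctx = zmq.Context.instance()

# PUB silently discards messages when no subscriber is connected
# ("discarded as undeliverable"):
pub = ctx.socket(zmq.PUB)
pub.bind("inproc://pub-demo")
pub.send(b"nobody will see this")  # returns immediately; message dropped

# PUSH instead waits for a peer; with DONTWAIT it refuses rather than
# block ("wait in memory to be sent"):
push = ctx.socket(zmq.PUSH)
push.bind("inproc://push-demo")
try:
    push.send(b"pending", zmq.DONTWAIT)
    send_succeeded = True
except zmq.Again:
    send_succeeded = False  # no peer yet; a blocking send would wait

pub.close()
push.close()
```

So whether a hard peer failure loses messages or queues them is a property of the socket type you choose, not of ZeroMQ as a whole.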
> We are sometimes bound by open file handles. Does one Request-Reply socket live on one TCP Socket?

Each zmq socket corresponds to one or more FDs. Due to zmq’s internal use of file descriptors for inter-thread signaling, I would expect to hit fd limits sooner, not later, with zmq.
> We currently suffer a delay of 10ms-100ms to establish TCP connections through Tornado. Is this likely to change with ZeroMQ?

I wouldn’t think so, unless you use a transport other than tcp. There is one potential difference, in that the connection handshake all happens in a GIL-less IO thread in C++, so that latency may be hidden behind other work going on in Python.
> Are there opportunities for ZeroMQ to hold on to “closed” connections in the background while still respecting open file limits?

I’m not quite sure what you mean, but zeromq doesn’t generally keep ‘closed’ connections to be re-used on demand. You would have to use similar explicit connect/close calls to manage your FD usage.
> Is there anything we should be concerned about if the event loop is blocked for a non-trivial amount of time?

This situation should be improved with zmq. All of the actual network IO happens in one or more GIL-less C++ threads, so the only part that’s in contention with blocking Python is handing the memory off to libzmq, which is very quick.
A send from Python involves:

1. `Socket.send(msg)`, with either:
   - `copy=False`: builds a message with a pointer to Python-owned memory (no copies, but more bookkeeping)
   - `copy=True`: makes an in-memory copy of the message data, owned by C++
2. `zmq_send`, which is ~instant, since it is only passing a pointer around
3. if `copy=False`: the IO thread informs Python when it is done with the memory

Most of the work with pyzmq happens in a GIL-less C++ thread. A `zmq_send` is really handing a pointer to C++, so it generally completes extremely quickly, in a length of time independent of the message size.

> If we install pyzmq’s event loop can we still overlap with other Tornado applications like Bokeh?

Yes, absolutely. The Jupyter notebook is a regular tornado webapp that uses the zmq integration for communication with kernels. Call `ioloop.install()` to tell tornado to use pyzmq’s poller implementation.

I think you will want to keep some of this. One thing in particular is sending very large messages. Since zeromq delivers only whole messages, you will likely want to chunk very large messages into multiple zmq sends. I don’t have a good answer for how big those chunks should be, though. This is something Jupyter / IPython Parallel doesn’t support, and hasn’t really felt a need for.
IPython / Jupyter’s design makes it relatively simple to swap out transports. It’s been a bit since I looked at distributed’s implementation, but if you have the right key abstractions (connect, read, and write), experimenting with a zmq implementation shouldn’t be too disruptive.
@pitrou’s points:
pyzmq 15 adds a Futures-based API for both tornado and asyncio coroutines.
zeromq 4.2 does allow setting connect timeout, and you can monitor connect/disconnect events, though it’s generally recommended to not use this kind of thing for anything other than debugging.
This is certainly true, and if anything it could increase the overhead slightly. You can use the `zmq.FD` interface to hook into a faster event loop such as uv, but this is an edge-triggered FD that I find an absolute nightmare to work with.

Some notes while reviewing the ZeroMQ / pyzmq docs:
Things ZeroMQ can bring us:
A note about ZeroMQ’s batching: it only waits for the local network interface (or kernel buffer) to be ready for sending, which probably doesn’t mean much unless we’re saturating the network capacity. Our BatchedSend does more, as it has a built-in timer to coalesce messages.
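That timer-based coalescing can be sketched in a few lines; `CoalescingBuffer`, its method names, and the 2 ms interval are illustrative, not dask’s actual `BatchedSend` code:

```python
import time

class CoalescingBuffer:
    """Toy sketch of timer-based coalescing in the spirit of BatchedSend."""

    def __init__(self, send, interval=0.002, clock=time.monotonic):
        self.send = send          # callable receiving a list of messages
        self.interval = interval
        self.clock = clock
        self.buffer = []
        self.last_flush = clock()

    def put(self, msg):
        self.buffer.append(msg)
        # Flush only when the interval has elapsed, so thousands of tiny
        # messages per second collapse into a few larger sends.
        if self.clock() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()
        self.last_flush = self.clock()

# Usage with a fake clock to make the behavior deterministic:
now = [0.0]
sent = []
buf = CoalescingBuffer(sent.append, interval=0.002, clock=lambda: now[0])
buf.put("a"); buf.put("b")   # within the interval: buffered, not sent
now[0] += 0.005
buf.put("c")                 # interval elapsed: all three go out together
assert sent == [["a", "b", "c"]]
```

This is the behavior ZeroMQ’s interface-readiness batching does not replicate, which is the argument for keeping something like it in application code.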