Should we use ZeroMQ?
I’ve been wrestling with network communication recently as our bottlenecks have become increasingly communication-focused. I’d like to get feedback from people like @minrk and @pitrou on whether or not it is worth exploring switching our communication stack over to ZeroMQ.
Our Communication Types
Dask.distributed engages in the following kinds of communication. Everything is point-to-point.
- Frequent small messages along long-running connections. We might dump in thousands of 100-byte messages in a second. Currently we batch manually. Presumably ZeroMQ would do this for us, allowing us to remove some of the `BatchedSend` infrastructure.
- Peer-to-peer data movement. These are small-to-large messages (1 kB to 10 GB) along ephemeral worker-to-worker connections. Sometimes these overlap. Here we’re concerned about not having this communication block or be blocked by other executing code, establishing connections quickly (or else caching connections in a way that respects open file limits), and possibly negotiating multiple messages at once on the same socket.
- Periodic small messages for heartbeats.
Advantages of ZeroMQ
- We get to remove `BatchedSend` and timeout handling
- We get zero-copy for free (though this is probably not that big of a deal)
- We maybe get Infiniband for free?
- We get pub-sub, though I’m not sure how we would use this currently?
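As a concrete illustration of the zero-copy point above, here is a minimal pyzmq sketch (assuming pyzmq is installed; the `inproc` address and variable names are arbitrary):

```python
import zmq

ctx = zmq.Context.instance()
a = ctx.socket(zmq.PAIR)
b = ctx.socket(zmq.PAIR)
a.bind("inproc://zero-copy-demo")
b.connect("inproc://zero-copy-demo")

payload = bytearray(b"x" * 1_000_000)
# copy=False hands libzmq a pointer to the Python-owned buffer instead of
# copying it; track=True returns a MessageTracker telling us when libzmq
# is finished with the memory and the buffer is safe to reuse.
tracker = a.send(payload, copy=False, track=True)
received = b.recv()
tracker.wait()  # blocks until the IO thread releases `payload`
assert received == payload

a.close()
b.close()
```

The bookkeeping cost mentioned later in the thread is exactly this tracker machinery, which is why `copy=True` can be faster for small messages.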
Questions
- How does ZeroMQ handle hard failures on the other end of the socket?
- We are sometimes bound by open file handles. Does one Request-Reply socket live on one TCP Socket?
- We currently suffer a delay of 10ms-100ms to establish TCP connections through Tornado. Is this likely to change with ZeroMQ? (Though in truth this is more likely something we’re doing wrong internally)
- Are there opportunities for ZeroMQ to hold on to “closed” connections in the background while still respecting open file limits? (I suppose we could also do this connection pooling in application code)
- Assuming that we’re using pyzmq’s Tornado connection, is there anything we should be concerned about if the event loop is blocked for a non-trivial amount of time? Obviously we can’t dump any new messages during this time, but is ZeroMQ handling things nicely in the background in a separate system thread or is it blocked?
- If we install pyzmq’s event loop can we still overlap with other Tornado applications like Bokeh?
Implementation
What is the right way to experiment with this? Presumably we change around `core.connect`, `core.read` and `core.write`, and we switch out all `BatchedSend` streams with normal streams. Is there anything else?
We might still want to keep some of our own batching because the serialization and compression parts of our communication pipeline will still benefit.
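To make the experiment concrete, here is a toy sketch of that read/write abstraction with an in-memory transport standing in for TCP or zmq; `InMemoryComm` and its method names are illustrative, not distributed’s actual API:

```python
import asyncio

class InMemoryComm:
    """Toy stand-in for the connect/read/write abstraction.

    A zmq-backed implementation would only need to provide these same
    two coroutines, leaving the rest of the codebase untouched."""

    def __init__(self):
        self._queue = asyncio.Queue()

    async def write(self, msg):
        await self._queue.put(msg)

    async def read(self):
        return await self._queue.get()

async def main():
    comm = InMemoryComm()
    await comm.write({"op": "ping"})
    return await comm.read()

reply = asyncio.run(main())
assert reply == {"op": "ping"}
```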
Issue Analytics

- Created: 7 years ago
- Reactions: 4
- Comments: 35 (34 by maintainers)
Top GitHub Comments
It’s definitely worth considering, but it’s not a given that it would be a win overall. I’ve been thinking on the IPython / Jupyter side that perhaps we shouldn’t have used zmq, as many of the potential advantages don’t really come up for Jupyter use cases. But it is IPython Parallel where the benefits are greatest.
> We maybe get Infiniband for free?

Sort of. You don’t get truly native IB, but you can use the TIPC transport. I think most use TCP-over-IB, which you would already have anyway.
> We get pub-sub, though I’m not sure how we would use this currently?

Other than instrumentation, I’m unsure how PUB-SUB would work for dask. In IPython Parallel, this is used to funnel output, which I thought was a cool idea enabling things like monitoring services, but I think nobody has ended up using them. It could enable you to do things like explode the Scheduler into a constellation of loosely connected processes, so that things like bokeh / analytics / instrumentation can’t slow down scheduling. This is perhaps the most successful aspect of IPython Parallel’s zmq redesign: ‘schedulers’ are lightweight distributors of messages, and everything heavy happens in a ‘Hub’ process that gets passively notified of everything that passes through the schedulers.
Diagramming message flow patterns is extremely helpful for designing zmq-connected applications, as you have to think a lot about what sockets are right to use where, and there can be several valid options.
> How does ZeroMQ handle hard failures on the other end of the socket?

In one sense, it doesn’t. ZeroMQ hides connect/disconnect events, which can be nice, but also frustrating. The main thing it does is ensure that messages are delivered completely or not at all - a message that is incompletely delivered is considered not at all delivered. Depending on the socket type, if you try to send to a peer that is gone, the message will either wait in memory to be sent, or be discarded as undeliverable.
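The per-socket-type difference described above can be observed directly; a small pyzmq sketch (assuming pyzmq is installed; the `inproc` addresses are arbitrary):

```python
import zmq

ctx = zmq.Context.instance()

# PUB silently discards messages when no subscriber is connected
# ("discarded as undeliverable"):
pub = ctx.socket(zmq.PUB)
pub.bind("inproc://pub-demo")
pub.send(b"nobody will see this")  # returns immediately; message dropped

# PUSH instead waits for a peer; with DONTWAIT it refuses rather than
# block ("wait in memory to be sent"):
push = ctx.socket(zmq.PUSH)
push.bind("inproc://push-demo")
try:
    push.send(b"pending", zmq.DONTWAIT)
    send_succeeded = True
except zmq.Again:
    send_succeeded = False  # no peer yet; a blocking send would wait

pub.close()
push.close()
```

So whether a hard peer failure loses messages or queues them is a property of the socket type you choose, not of ZeroMQ as a whole.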
> We are sometimes bound by open file handles. Does one Request-Reply socket live on one TCP Socket?

Each zmq socket corresponds to one or more FDs. Due to zmq’s internal use of file descriptors for inter-thread signaling, I would expect to hit fd limits sooner, not later, with zmq.
> We currently suffer a delay of 10ms-100ms to establish TCP connections through Tornado. Is this likely to change with ZeroMQ?

I wouldn’t think so, unless you use a transport other than tcp. There is one potential difference, in that the connection handshake all happens in a GIL-less IO thread in C++, so that latency may be hidden behind other work going on in Python.
> Are there opportunities for ZeroMQ to hold on to “closed” connections in the background while still respecting open file limits?

I’m not quite sure what you mean, but zeromq doesn’t generally keep ‘closed’ connections to be re-used on demand. You would have to use similar explicit connect/close calls to manage your FD usage.
> Is there anything we should be concerned about if the event loop is blocked for a non-trivial amount of time?

This situation should be improved with zmq. All of the actual network IO happens in one or more GIL-less C++ threads, so the only part that’s in contention with blocking Python is handing the memory off to libzmq, which is very quick.
A send from Python involves:

1. `Socket.send(msg)`, with either:
   - `copy=False`: builds a message with a pointer to Python-owned memory (no copies, but more bookkeeping)
   - `copy=True`: makes an in-memory copy of the message data, owned by C++
2. `zmq_send`, which is ~instant, since it is only passing a pointer around
3. if `copy=False`: the IO thread informs Python when it is done with the memory

Most of the work with pyzmq happens in a GIL-less C++ thread. A `zmq_send` is really handing a pointer to C++, so it generally completes extremely quickly, in a length of time independent of the message size.

> If we install pyzmq’s event loop can we still overlap with other Tornado applications like Bokeh?

Yes, absolutely. The Jupyter notebook is a regular tornado webapp that uses the zmq integration for communication with kernels. Call `ioloop.install()` to tell tornado to use pyzmq’s poller implementation.

I think you will want to keep some of this. One thing in particular is sending very large messages. Since zeromq delivers only whole messages, you will likely want to chunk very large messages into multiple zmq sends. I don’t have a good answer for how big those chunks should be, though. This is something Jupyter / IPython Parallel doesn’t support, and hasn’t really felt a need for.
IPython / Jupyter’s design makes it relatively simple to swap out transports. It’s been a bit since I looked at distributed’s implementation, but if you have the right key abstractions (connect, read, and write), experimenting with a zmq implementation shouldn’t be too disruptive.
@pitrou’s points:
pyzmq 15 adds a Futures-based API for both tornado and asyncio coroutines.
zeromq 4.2 does allow setting connect timeout, and you can monitor connect/disconnect events, though it’s generally recommended to not use this kind of thing for anything other than debugging.
This is certainly true, and if anything it could increase the overhead slightly. You can use the `zmq.FD` interface to hook into a faster event loop such as uv, but this is an edge-triggered FD that I find an absolute nightmare to work with.

Some notes while reviewing the ZeroMQ / pyzmq docs:
Things ZeroMQ can bring us:
A note about ZeroMQ’s batching: it only waits for the local network interface (or kernel buffer) to be ready for sending, which probably doesn’t mean much unless we’re saturating the network capacity. Our BatchedSend does more, as it has a built-in timer to coalesce messages.
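That timer-based coalescing can be sketched in a few lines; `CoalescingBuffer`, its method names, and the 2 ms interval are illustrative, not dask’s actual `BatchedSend` code:

```python
import time

class CoalescingBuffer:
    """Toy sketch of timer-based coalescing in the spirit of BatchedSend."""

    def __init__(self, send, interval=0.002, clock=time.monotonic):
        self.send = send          # callable receiving a list of messages
        self.interval = interval
        self.clock = clock
        self.buffer = []
        self.last_flush = clock()

    def put(self, msg):
        self.buffer.append(msg)
        # Flush only when the interval has elapsed, so thousands of tiny
        # messages per second collapse into a few larger sends.
        if self.clock() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()
        self.last_flush = self.clock()

# Usage with a fake clock to make the behavior deterministic:
now = [0.0]
sent = []
buf = CoalescingBuffer(sent.append, interval=0.002, clock=lambda: now[0])
buf.put("a"); buf.put("b")   # within the interval: buffered, not sent
now[0] += 0.005
buf.put("c")                 # interval elapsed: all three go out together
assert sent == [["a", "b", "c"]]
```

This is the behavior ZeroMQ’s interface-readiness batching does not replicate, which is the argument for keeping something like it in application code.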