slow pmap allreduce
Related to @joschu's question about direct access to device arrays, I was curious how fast a pmap allreduce would be as an alternative to trying to use NCCL directly on GPU pointers.
This script (https://gist.github.com/christopherhesse/192d78f0f082d66dfb26cac112c5cf99) takes 10,000 ms per loop on 8 V100s, which surprises me because nccl-tests' all_reduce_perf takes about 5 ms for what I believe is the same operation. Is there an error in my script? I tried using .block_until_ready() instead of np.array(), but that failed with an exception, so there's an additional copy to host memory. Even accounting for that copy, it seems like it should be faster.
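For reference, here is a minimal sketch of the kind of pmap allreduce being benchmarked (this is a hypothetical reconstruction, not the gist's exact code): each device holds one slice of the leading axis, and lax.psum over the mapped axis performs the all-reduce. The timing helper uses block_until_ready() to wait for device work without copying to the host, which works on recent JAX versions even though it raised an exception on pmap outputs in the version discussed here.

```python
import time

import numpy as np
import jax
import jax.numpy as jnp
from jax import lax

n_dev = jax.local_device_count()

# pmap maps the function over the leading axis, one slice per device;
# lax.psum over the named axis "i" sums across all devices (an all-reduce).
allreduce = jax.pmap(lambda x: lax.psum(x, "i"), axis_name="i")

# One shard of 4 floats per device.
shards = jnp.arange(n_dev * 4, dtype=jnp.float32).reshape(n_dev, 4)
out = allreduce(shards)  # every device now holds the cross-device sum


def time_allreduce(iters=100):
    # Warm-up call so compilation time is excluded from the measurement.
    allreduce(shards).block_until_ready()
    t0 = time.perf_counter()
    for _ in range(iters):
        result = allreduce(shards)
    # Wait for the device computation to finish without a host copy.
    result.block_until_ready()
    return (time.perf_counter() - t0) / iters
```

Calling np.array(out) instead of block_until_ready() would also synchronize, but adds a device-to-host transfer to the measured time, which is the extra copy mentioned above.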
@jekbradbury commented on a similar issue here: https://github.com/google/jax/issues/606#issuecomment-485063016
I’m using jaxlib 0.1.21 and (I think) jax 1508405ce619e40f43c90f3c34d6af7d0a81ddd5.
Thanks! That should be enough information to investigate the difference in speed if it ends up impacting my application’s performance.
If we end up using NCCL directly, then I expect we won't have to copy much data to main memory, so this particular issue may not matter as much to me (especially if it is somehow pmap-specific).
Hrm not sure about the GCP thing. I can dig more into the config I’m using if that would be helpful, but the basics are:
- n1-standard-64 (64 vCPUs, 240 GB memory) in us-west-1b
- 300 GB SSD persistent disk
- 8 x NVIDIA Tesla V100
- "Deep Learning VM" image (maybe it's called tf-1-13-cu100-20190524)
- Miniconda (Anaconda) Python 3.7
- jax from GitHub master, jaxlib 0.1.21 from PyPI
cc @hawkinsp for someone who knows how computers are supposed to work. Any thoughts?