
slow pmap allreduce


Related to @joschu’s question about direct access to device arrays, I was curious how fast a pmap allreduce would be as an alternative to trying to use nccl directly on GPU pointers.

This script (https://gist.github.com/christopherhesse/192d78f0f082d66dfb26cac112c5cf99) takes 10,000 ms per loop on 8 V100s, which is surprising to me because nccl-tests’ all_reduce_perf takes about 5 ms to do what I think is the same operation. Is there an error in my script? I tried using .block_until_ready() instead of np.array(), but that failed with an exception, so the timing includes an additional copy to host memory; even with that copy, though, it seems like it should be faster.
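For context, here is a minimal sketch of the kind of pmap-based allreduce benchmark the gist describes, written against a current JAX API (jax.pmap plus lax.psum). The array size, loop count, and timing details are assumptions and may not match the original script.

```python
import time
import numpy as np
import jax
from jax import lax

n_devices = jax.local_device_count()

# One shard of data per device, e.g. ~4M float32 elements each (assumed size).
shards = np.random.randn(n_devices, 4 * 1024 * 1024).astype(np.float32)

# psum over the mapped axis performs the allreduce on-device.
allreduce = jax.pmap(lambda x: lax.psum(x, axis_name='i'), axis_name='i')

# Warm up so the first-call compilation cost is excluded from the timing.
np.asarray(allreduce(shards))

start = time.time()
for _ in range(10):
    out = allreduce(shards)
    result = np.asarray(out)  # forces a device-to-host copy of every shard
print(f"{(time.time() - start) / 10 * 1000:.1f} ms per loop")
```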

@jekbradbury commented on a similar issue here: https://github.com/google/jax/issues/606#issuecomment-485063016

I’m using jaxlib 0.1.21 and (I think) jax 1508405ce619e40f43c90f3c34d6af7d0a81ddd5.

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
christopherhesse commented, Jul 6, 2019

Thanks! That should be enough information to investigate the difference in speed if it ends up impacting my application’s performance.

If we end up using NCCL directly, then I expect we will not have to copy much data to main memory, so this particular issue may not matter as much to me (especially if it is somehow pmap-specific).

0 reactions
mattjj commented, Jul 5, 2019

Hrm not sure about the GCP thing. I can dig more into the config I’m using if that would be helpful, but the basics are:

  • n1-standard-64 (64 vCPUs, 240 GB memory) in us-west-1b
  • 300 GB SSD persistent disk
  • 8 x NVIDIA Tesla V100
  • “Deep Learning VM” image (maybe it’s called tf-1-13-cu100-20190524)
  • Miniconda (Anaconda) Python 3.7
  • jax from github master, jaxlib 0.1.21 from pypi

Does this mean that in my case it takes 9000 ms (and still 3400 ms in your case) just to copy the data to the host? That seems odd to me.

cc @hawkinsp for someone who knows how computers are supposed to work. Any thoughts?
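One way to probe that question would be to time the on-device reduction and the device-to-host copy separately. The sketch below reuses the hypothetical allreduce and shards names from the earlier example; on newer JAX versions the pmap output supports .block_until_ready(), whereas on jaxlib 0.1.21 it reportedly raised an exception, so this is illustrative only.

```python
import time
import numpy as np

# Time the on-device allreduce by itself (block_until_ready waits for the
# device computation to finish without copying the result to the host).
start = time.time()
out = allreduce(shards)
out.block_until_ready()
device_ms = (time.time() - start) * 1000

# Then time the device-to-host transfer of the already-computed result.
start = time.time()
host_result = np.asarray(out)
copy_ms = (time.time() - start) * 1000

print(f"on-device allreduce: {device_ms:.1f} ms, host copy: {copy_ms:.1f} ms")
```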


