
Avoid exposing functions overlapping with NCCL names

See original GitHub issue

CuPy currently exposes NCCL stub functions whose names overlap with those that NCCL itself provides. Which stubs are exposed depends on whether the CuPy build was CUDA-enabled and on which NCCL version was present at build time: if the latest NCCL version supported by CuPy was not available during the build, CuPy exposes stubs whose names collide with real NCCL symbols. This becomes a problem when CuPy is built against an older NCCL (e.g. 2.3) and installed on a system with a newer one (e.g. 2.4): both CuPy and NCCL then define the same symbols, and the clash can crash a user’s program.

To fix this issue, it would be helpful if CuPy renamed the stub functions in cupy_nccl.h with unique names not found in NCCL. A simple strategy would be to prefix every stub function with cupy_*. To simplify usability and maintainability, each cupy_* stub could either forward to the corresponding NCCL function or fall back to some default behavior (like returning ncclSuccess), depending on the NCCL version.

This is only one option, and there may be other viable approaches. In any event, it would be useful to avoid symbol clashes with NCCL.

cc @isuruf @leofang

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

2 reactions
kmaehashi commented, Oct 25, 2019

Thanks for pointing this issue out. The same can be said of cuDNN: we expose symbols whose names are the same as cuDNN’s, depending on the build-time cuDNN version, and cuDNN is linked as libcudnn.so.7 (not libcudnn.so.7.5).

0 reactions
kmaehashi commented, Jan 14, 2020

We discussed these ideas in the dev team and concluded that idea 2 sounds like a reasonable solution. Is anyone interested in working on this issue?

cc/ @pentschev @jekbradbury @anaruse
