
Avoid exposing functions overlapping with NCCL names

See original GitHub issue

CuPy currently exposes NCCL stub functions whose names overlap with those that NCCL itself provides. Which stubs are exposed depends on whether the CuPy build was CUDA-enabled and on which NCCL version was present at build time: if the latest NCCL version supported by CuPy was not available during the build, CuPy exposes stubs whose names collide with real NCCL symbols. This becomes a problem when CuPy is built against an older NCCL (e.g. 2.3) and installed on a system with a newer one (e.g. 2.4): both CuPy and NCCL then define the same symbols, and the clash can crash a user’s program.

To fix this issue, it would be helpful if CuPy renamed the stub functions in cupy_nccl.h with unique names not found in NCCL. A simple strategy would be to prefix every stub function with cupy_*. To simplify usability and maintainability, each cupy_* stub could either forward to the corresponding NCCL function or fall back to some default behavior (like returning ncclSuccess), depending on the NCCL version.

This is only one option, and there may be other viable approaches. In any event, it would be useful to avoid symbol clashes with NCCL.

cc @isuruf @leofang

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

2 reactions
kmaehashi commented, Oct 25, 2019

Thanks for pointing this issue out. The same can be said of cuDNN: we expose symbols whose names are the same as cuDNN’s, depending on the build-time cuDNN version, and cuDNN is linked as libcudnn.so.7 (not libcudnn.so.7.5).

0 reactions
kmaehashi commented, Jan 14, 2020

We discussed these ideas in the dev team and concluded that idea 2 sounds like a reasonable solution. Is anyone interested in working on this issue?

cc/ @pentschev @jekbradbury @anaruse
