Installing TorchRec in Nvidia PyTorch 22.07 container from NGC

Hello. I’m trying to install TorchRec inside the nvcr.io/nvidia/pytorch:22.07-py3 container, which comes with CUDA 11.7. The installation itself appears to succeed, but when I later run import torchrec in Python I get errors that appear to be related to the fbgemm_gpu package.

The simplest way to reproduce it is:

docker run nvcr.io/nvidia/pytorch:22.07-py3 bash -c 'pip install torchrec && python -c "import torchrec"'

The error message is:

libtorch_cuda_cpp.so: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/_ops.py", line 203, in __getattr__
    op, overload_names = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator fbgemm::jagged_2d_to_dense

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/torchrec/__init__.py", line 8, in <module>
    import torchrec.distributed  # noqa
  File "/opt/conda/lib/python3.8/site-packages/torchrec/distributed/__init__.py", line 36, in <module>
    from torchrec.distributed.model_parallel import DistributedModelParallel  # noqa
  File "/opt/conda/lib/python3.8/site-packages/torchrec/distributed/model_parallel.py", line 21, in <module>
    from torchrec.distributed.planner import (
  File "/opt/conda/lib/python3.8/site-packages/torchrec/distributed/planner/__init__.py", line 22, in <module>
    from torchrec.distributed.planner.planners import EmbeddingShardingPlanner  # noqa
  File "/opt/conda/lib/python3.8/site-packages/torchrec/distributed/planner/planners.py", line 16, in <module>
    from torchrec.distributed.planner.constants import BATCH_SIZE, MAX_SIZE
  File "/opt/conda/lib/python3.8/site-packages/torchrec/distributed/planner/constants.py", line 10, in <module>
    from torchrec.distributed.embedding_types import EmbeddingComputeKernel
  File "/opt/conda/lib/python3.8/site-packages/torchrec/distributed/embedding_types.py", line 14, in <module>
    from fbgemm_gpu.split_table_batched_embeddings_ops import EmbeddingLocation
  File "/opt/conda/lib/python3.8/site-packages/fbgemm_gpu/__init__.py", line 22, in <module>
    from . import _fbgemm_gpu_docs
  File "/opt/conda/lib/python3.8/site-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 18, in <module>
    torch.ops.fbgemm.jagged_2d_to_dense,
  File "/opt/conda/lib/python3.8/site-packages/torch/_ops.py", line 207, in __getattr__
    raise AttributeError(f"'_OpNamespace' object has no attribute '{op_name}'") from e
AttributeError: '_OpNamespace' object has no attribute 'jagged_2d_to_dense'

Some details on library versions (from pip freeze):

torch==1.13.0a0+08820cb
torchrec==0.3.1
fbgemm-gpu==0.3.0

Do you have any idea what is going wrong here?
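
The first line of the error says that libtorch_cuda_cpp.so could not be opened. One way to see which CUDA libraries the installed torch build actually ships is sketched below. This is a minimal sketch, assuming torch keeps its shared libraries in a lib directory next to its __init__.py (the usual wheel layout, which may differ in a custom build); the helper names package_lib_dir and find_libs are hypothetical and not part of torch or fbgemm_gpu.

```python
import importlib.util
from pathlib import Path
from typing import List, Optional


def package_lib_dir(package: str) -> Optional[Path]:
    """Return <package dir>/lib without importing the package, or None if not found."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    # Assumption: shared libraries live in a "lib" directory beside __init__.py.
    return Path(spec.origin).parent / "lib"


def find_libs(lib_dir: Optional[Path], pattern: str) -> List[str]:
    """List file names in lib_dir matching a glob pattern (empty if the dir is missing)."""
    if lib_dir is None or not lib_dir.is_dir():
        return []
    return sorted(p.name for p in lib_dir.glob(pattern))


if __name__ == "__main__":
    lib_dir = package_lib_dir("torch")
    print("torch lib dir:", lib_dir)
    # The error names libtorch_cuda_cpp.so; list whichever libtorch_cuda* files exist.
    print("CUDA libraries found:", find_libs(lib_dir, "libtorch_cuda*.so"))
```

If the file named in the error is missing from that list, the installed fbgemm_gpu binary was presumably built against a different torch build than the one in the container.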

Top GitHub Comments

samiwilf commented, Nov 29, 2022

@janekl I was able to reproduce the issue and resolve it.
Run:

pip uninstall torch
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cu117

That should resolve the issue. All three (torch, fbgemm_gpu, and torchrec) should be nightly versions.

One side note: the latest changes adding support for shuffle on Criteo load were added on 11.24, which is after the latest torchrec-nightly release, dated 2022.11.21. Checking out the prior commit in the facebookresearch/dlrm repo will resolve that.

YLGH commented, Nov 28, 2022

Please install the nightly version of fbgemm-gpu if you’re using torch-nightly. I think the following should work:

pip uninstall fbgemm-gpu
pip install fbgemm-gpu-nightly
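
Both replies come down to keeping torch, fbgemm_gpu, and torchrec on nightly builds together. A rough way to check that from Python, without importing any of the three, is sketched below. It is only a sketch under stated assumptions: that a nightly torch carries ".dev" in its version string, and that the nightly packages are installed under the distribution names fbgemm-gpu-nightly and torchrec-nightly, as the thread suggests. The function names are hypothetical.

```python
from importlib import metadata
from typing import Dict, List, Optional

# (release distribution, nightly distribution) pairs; names as used in this thread.
PAIRS = (("fbgemm-gpu", "fbgemm-gpu-nightly"), ("torchrec", "torchrec-nightly"))


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def collect_versions() -> Dict[str, Optional[str]]:
    """Look up torch plus the release and nightly distributions of each pair."""
    names = ["torch"] + [name for pair in PAIRS for name in pair]
    return {name: installed_version(name) for name in names}


def find_mismatches(versions: Dict[str, Optional[str]]) -> List[str]:
    """Describe every package that does not look like a nightly build."""
    problems = []
    torch_version = versions.get("torch")
    if torch_version is None:
        problems.append("torch is not installed")
    elif ".dev" not in torch_version:
        # Assumption: nightly torch versions contain ".dev" followed by a date.
        problems.append(f"torch {torch_version} does not look like a nightly build")
    for release, nightly in PAIRS:
        if versions.get(nightly) is None:
            found = versions.get(release)
            detail = f"found release {release}=={found}" if found else "no build found"
            problems.append(f"{nightly} is not installed ({detail})")
        elif versions.get(release) is not None:
            problems.append(f"both {release} and {nightly} are installed")
    return problems


if __name__ == "__main__":
    for line in find_mismatches(collect_versions()) or ["all three look like nightly builds"]:
        print(line)
```

With the versions from the original pip freeze output (torch 1.13.0a0+08820cb, torchrec 0.3.1, fbgemm-gpu 0.3.0), this would flag all three packages.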
