Installing TorchRec in Nvidia PyTorch 22.07 container from NGC

Hello. I’m trying to install TorchRec inside the nvcr.io/nvidia/pytorch:22.07-py3 container, which ships with CUDA 11.7. The installation itself appears to succeed, but when I later run import torchrec in Python, I get errors that appear to be related to the fbgemm_gpu package.

The simplest reproduction I can offer is:

docker run nvcr.io/nvidia/pytorch:22.07-py3 bash -c 'pip install torchrec && python -c "import torchrec"'

The error message is:

libtorch_cuda_cpp.so: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/_ops.py", line 203, in __getattr__
    op, overload_names = torch._C._jit_get_operation(qualified_op_name)
RuntimeError: No such operator fbgemm::jagged_2d_to_dense

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/torchrec/__init__.py", line 8, in <module>
    import torchrec.distributed  # noqa
  File "/opt/conda/lib/python3.8/site-packages/torchrec/distributed/__init__.py", line 36, in <module>
    from torchrec.distributed.model_parallel import DistributedModelParallel  # noqa
  File "/opt/conda/lib/python3.8/site-packages/torchrec/distributed/model_parallel.py", line 21, in <module>
    from torchrec.distributed.planner import (
  File "/opt/conda/lib/python3.8/site-packages/torchrec/distributed/planner/__init__.py", line 22, in <module>
    from torchrec.distributed.planner.planners import EmbeddingShardingPlanner  # noqa
  File "/opt/conda/lib/python3.8/site-packages/torchrec/distributed/planner/planners.py", line 16, in <module>
    from torchrec.distributed.planner.constants import BATCH_SIZE, MAX_SIZE
  File "/opt/conda/lib/python3.8/site-packages/torchrec/distributed/planner/constants.py", line 10, in <module>
    from torchrec.distributed.embedding_types import EmbeddingComputeKernel
  File "/opt/conda/lib/python3.8/site-packages/torchrec/distributed/embedding_types.py", line 14, in <module>
    from fbgemm_gpu.split_table_batched_embeddings_ops import EmbeddingLocation
  File "/opt/conda/lib/python3.8/site-packages/fbgemm_gpu/__init__.py", line 22, in <module>
    from . import _fbgemm_gpu_docs
  File "/opt/conda/lib/python3.8/site-packages/fbgemm_gpu/_fbgemm_gpu_docs.py", line 18, in <module>
    torch.ops.fbgemm.jagged_2d_to_dense,
  File "/opt/conda/lib/python3.8/site-packages/torch/_ops.py", line 207, in __getattr__
    raise AttributeError(f"'_OpNamespace' object has no attribute '{op_name}'") from e
AttributeError: '_OpNamespace' object has no attribute 'jagged_2d_to_dense'

Some details on library versions (from pip freeze):

torch==1.13.0a0+08820cb
torchrec==0.3.1 
fbgemm-gpu==0.3.0

Do you have any idea what is going wrong here?
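
Because import torchrec itself fails in this setup, the versions listed above can also be collected without importing any of the packages. The following is a minimal sketch using the standard-library importlib.metadata module; the helper name installed_versions is made up for illustration and is not part of torch, torchrec, or fbgemm_gpu.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}.

    Reads package metadata only, so it works even when importing the
    package raises an error (as `import torchrec` does in this issue).
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names as they appear in the pip freeze output above.
    for name, version in installed_versions(["torch", "torchrec", "fbgemm-gpu"]).items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Running this inside the container should print the same three entries that pip freeze reports, which makes it easy to include them in a bug report.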

Top GitHub Comments

samiwilf commented, Nov 29, 2022

@janekl I was able to reproduce the issue and resolve it.
Run:

pip uninstall torch
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cu117

That should resolve the issue. All three (torch, fbgemm_gpu, and torchrec) should be nightly versions.
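
To illustrate what "all three should be nightly versions" means in practice, here is a rough sketch that guesses a build channel for each pip freeze entry and flags a mix. The classification rules are an assumption made for this example (a "-nightly" distribution name or a ".dev" version segment means nightly; an a/b/rc segment means some other pre-release build), not an official rule from PyTorch, TorchRec, or FBGEMM, and the function names are hypothetical.

```python
import re


def channel(name, version):
    """Guess the build channel of one package (heuristic, assumption only)."""
    public = version.split("+")[0]  # drop a local suffix such as "+08820cb"
    if name.lower().endswith("-nightly") or ".dev" in public:
        return "nightly"
    if re.search(r"\d(a|b|rc)\d", public):
        return "pre-release"
    return "stable"


def channels(freeze_lines):
    """Map each 'name==version' line to its guessed channel."""
    result = {}
    for line in freeze_lines:
        name, _, version = line.strip().partition("==")
        if version:
            result[name] = channel(name, version)
    return result


# The versions reported in the original question.
reported = ["torch==1.13.0a0+08820cb", "torchrec==0.3.1", "fbgemm-gpu==0.3.0"]
found = channels(reported)
print(found)
if len(set(found.values())) > 1:
    print("Mixed build channels: binary mismatches like the one above are plausible.")
```

With the reported versions, torch is classified differently from torchrec and fbgemm-gpu, which is consistent with the advice to move all three to the same channel.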

One side note: the latest change adding support for shuffle on Criteo load was added on 11.24, which is after the latest torchrec-nightly release, dated 2022.11.21. Checking out the prior commit in the facebookresearch/dlrm repo will resolve that.

YLGH commented, Nov 28, 2022

Please install the nightly version of fbgemm-gpu if you’re using torch-nightly.

I think

pip uninstall fbgemm-gpu
pip install fbgemm-gpu-nightly

should work
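
After reinstalling, one way to confirm the result is to retry the import and report only the first error rather than the full traceback. This is a minimal sketch; check_import is a hypothetical helper name, not part of torchrec or fbgemm_gpu.

```python
import importlib


def check_import(module_name="torchrec"):
    """Try to import a module; return (ok, message).

    Catches any Exception because the failure in this issue surfaces as an
    AttributeError raised during import, not as an ImportError.
    """
    try:
        importlib.import_module(module_name)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"


if __name__ == "__main__":
    for module in ("fbgemm_gpu", "torchrec"):
        ok, message = check_import(module)
        print(f"{module}: {message}")
```

Checking fbgemm_gpu before torchrec helps narrow down which package is failing, since torchrec imports fbgemm_gpu during its own import.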
