
pin memory problem in dgx-a100

See original GitHub issue

❓ Questions and Help

I used a DGX A100 (8× A100) to train GraphSAGE with UnifiedTensor, but something seems to be going wrong. I have thought about it a lot but could not figure out the cause or how to solve it. What is the reason for this problem?

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 12

Top GitHub Comments

1 reaction
yaox12 commented, Aug 15, 2022

After some investigation, I think this issue is caused by the IOMMU being enabled in the OS. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#iommu-on-linux for more details.

To verify it, you can run the following code to see if the error still happens.

import torch

# Put a small CPU tensor's storage into shared memory.
x = torch.arange(10).reshape(5, 2)
x.share_memory_()

# Page-lock (pin) the shared memory in place with cudaHostRegister;
# this is the call that fails when the IOMMU interferes with pinning.
cudart = torch.cuda.cudart()
r = cudart.cudaHostRegister(x.data_ptr(), x.numel() * x.element_size(), 0)

assert x.is_shared()
assert x.is_pinned()
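
If that snippet fails the same way, the next step is to confirm whether the IOMMU is actually enabled on the host. Below is a minimal sketch of one way to check from Python, assuming a standard Linux layout where /proc/cmdline and /sys/class/iommu are readable; these paths are my assumption, not something from the issue thread.

import os
from pathlib import Path

# Kernel boot parameters often carry explicit IOMMU settings
# (e.g. "amd_iommu=on", "iommu=pt", "intel_iommu=off").
cmdline = Path("/proc/cmdline").read_text()
iommu_args = [tok for tok in cmdline.split() if "iommu" in tok.lower()]
print("IOMMU-related boot parameters:", iommu_args or "none")

# When the IOMMU is active, the kernel usually exposes at least one
# entry under /sys/class/iommu (e.g. ivhd0 on AMD EPYC hosts).
iommu_dir = "/sys/class/iommu"
iommu_devices = os.listdir(iommu_dir) if os.path.isdir(iommu_dir) else []
print("Active IOMMU devices:", iommu_devices or "none")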
0 reactions
zqj2333 commented, Aug 19, 2022

Cannot reproduce… I ran python train_sampling_multi_gpu.py --gpu 0,1 --dataset ogbn-papers100M --data-device uva and it works fine. Can you change the following two lines

        train_nfeat = dgl.contrib.UnifiedTensor(train_nfeat, device=device)
        train_labels = dgl.contrib.UnifiedTensor(train_labels, device=device)

to

        cudart = th.cuda.cudart()
        cudart.cudaHostRegister(train_nfeat.data_ptr(),
            train_nfeat.numel() * train_nfeat.element_size(), 0)
        cudart.cudaHostRegister(train_labels.data_ptr(),
            train_labels.numel() * train_labels.element_size(), 0)

It will be a bit slower, but it can help us figure out whether the error is caused by DGL or by the OS.

Hello! Could you give me a Docker image that is able to run python train_sampling_multi_gpu.py --gpu 0,1 --dataset ogbn-papers100M --data-device uva on an A100?
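
For reference, the change suggested above amounts to pinning (and later unpinning) the existing CPU tensors in place rather than wrapping them in dgl.contrib.UnifiedTensor. Here is a minimal sketch, assuming torch is imported as th as in the DGL example and that torch.cuda.cudart() exposes cudaHostUnregister alongside cudaHostRegister on your PyTorch build; the helper names and the demo tensor are illustrative, not from the thread.

import torch as th

def pin_in_place(tensor):
    # Page-lock an existing shared CPU tensor in place, mirroring the
    # cudaHostRegister calls suggested above.
    cudart = th.cuda.cudart()
    cudart.cudaHostRegister(tensor.data_ptr(),
        tensor.numel() * tensor.element_size(), 0)
    return tensor

def unpin_in_place(tensor):
    # Release the page-locked registration once the tensor is no longer needed.
    cudart = th.cuda.cudart()
    cudart.cudaHostUnregister(tensor.data_ptr())
    return tensor

# Illustrative usage with a small stand-in for train_nfeat.
feats = th.randn(1000, 128).share_memory_()
pin_in_place(feats)
assert feats.is_pinned()
unpin_in_place(feats)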

Read more comments on GitHub

