
pin memory problem in dgx-a100

See original GitHub issue

❓ Questions and Help

I used a DGX A100 (8× A100) to train GraphSAGE with UnifiedTensor, but something seems to be going wrong. I have thought about it a lot but could not figure out the cause or how to solve it. What is the reason for this problem?

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 12

Top GitHub Comments

1 reaction
yaox12 commented, Aug 15, 2022

After some investigation, I think this issue is caused by the IOMMU being enabled in the OS. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#iommu-on-linux for more details.

To verify it, you can run the following code to see if the error still happens.

import torch

# Put a small CPU tensor's storage into shared memory.
x = torch.arange(10).reshape(5, 2)
x.share_memory_()

# Page-lock (pin) the shared memory in place with cudaHostRegister;
# this is the call that fails when the IOMMU interferes with pinning.
cudart = torch.cuda.cudart()
r = cudart.cudaHostRegister(x.data_ptr(), x.numel() * x.element_size(), 0)

assert x.is_shared()
assert x.is_pinned()
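
If that snippet fails the same way, the next step is to confirm whether the IOMMU is actually enabled on the host. Below is a minimal sketch of one way to check from Python, assuming a standard Linux layout where /proc/cmdline and /sys/class/iommu are readable; these paths are my assumption, not something from the issue thread.

import os
from pathlib import Path

# Kernel boot parameters often carry explicit IOMMU settings
# (e.g. "amd_iommu=on", "iommu=pt", "intel_iommu=off").
cmdline = Path("/proc/cmdline").read_text()
iommu_args = [tok for tok in cmdline.split() if "iommu" in tok.lower()]
print("IOMMU-related boot parameters:", iommu_args or "none")

# When the IOMMU is active, the kernel usually exposes at least one
# entry under /sys/class/iommu (e.g. ivhd0 on AMD EPYC hosts).
iommu_dir = "/sys/class/iommu"
iommu_devices = os.listdir(iommu_dir) if os.path.isdir(iommu_dir) else []
print("Active IOMMU devices:", iommu_devices or "none")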
0 reactions
zqj2333 commented, Aug 19, 2022

Cannot reproduce… I ran python train_sampling_multi_gpu.py --gpu 0,1 --dataset ogbn-papers100M --data-device uva and it works fine. Can you change the following two lines

        train_nfeat = dgl.contrib.UnifiedTensor(train_nfeat, device=device)
        train_labels = dgl.contrib.UnifiedTensor(train_labels, device=device)

to

        cudart = th.cuda.cudart()
        cudart.cudaHostRegister(train_nfeat.data_ptr(),
            train_nfeat.numel() * train_nfeat.element_size(), 0)
        cudart.cudaHostRegister(train_labels.data_ptr(),
            train_labels.numel() * train_labels.element_size(), 0)

It will be a bit slower, but it can help us figure out whether the error is caused by DGL or by the OS.

Hello! Could you give me a Docker image that is able to run python train_sampling_multi_gpu.py --gpu 0,1 --dataset ogbn-papers100M --data-device uva on an A100?
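
For reference, the change suggested above amounts to pinning (and later unpinning) the existing CPU tensors in place rather than wrapping them in dgl.contrib.UnifiedTensor. Here is a minimal sketch, assuming torch is imported as th as in the DGL example and that torch.cuda.cudart() exposes cudaHostUnregister alongside cudaHostRegister on your PyTorch build; the helper names and the demo tensor are illustrative, not from the thread.

import torch as th

def pin_in_place(tensor):
    # Page-lock an existing shared CPU tensor in place, mirroring the
    # cudaHostRegister calls suggested above.
    cudart = th.cuda.cudart()
    cudart.cudaHostRegister(tensor.data_ptr(),
        tensor.numel() * tensor.element_size(), 0)
    return tensor

def unpin_in_place(tensor):
    # Release the page-locked registration once the tensor is no longer needed.
    cudart = th.cuda.cudart()
    cudart.cudaHostUnregister(tensor.data_ptr())
    return tensor

# Illustrative usage with a small stand-in for train_nfeat.
feats = th.randn(1000, 128).share_memory_()
pin_in_place(feats)
assert feats.is_pinned()
unpin_in_place(feats)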

Read more comments on GitHub

