pin memory problem in dgx-a100
❓ Questions and Help
I used a DGX A100 (8×A100) to train GraphSAGE with UnifiedTensor, but something seems to be going wrong with memory pinning. I have thought about it a lot but could not find the cause or a way to fix it. What is causing the problem?

Issue Analytics
- State:
- Created a year ago
- Comments:12

After some investigation, I think this issue is caused by the IOMMU being enabled in the OS. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#iommu-on-linux for more details.
To verify this, you can run the following code and check whether the error still occurs.
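The snippet this reply refers to is not preserved in the thread. As a separate, hedged sketch (not the author's code), the following Python script checks whether the IOMMU looks active on a Linux host. It assumes the standard locations /proc/cmdline and /sys/kernel/iommu_groups; the function names are made up for illustration. A non-empty groups directory only suggests that the IOMMU is in use, so treat the result as a hint rather than a diagnosis.

```python
from pathlib import Path
from typing import List


def iommu_cmdline_flags(cmdline: str) -> List[str]:
    """Return the kernel command-line tokens that mention the IOMMU."""
    return [tok for tok in cmdline.split() if "iommu" in tok.lower()]


def count_iommu_groups(groups_dir: str = "/sys/kernel/iommu_groups") -> int:
    """Count IOMMU groups; 0 if the directory is missing or empty."""
    path = Path(groups_dir)
    if not path.is_dir():
        return 0
    return sum(1 for entry in path.iterdir() if entry.is_dir())


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the kernel command line, or return '' if it is unavailable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    flags = iommu_cmdline_flags(read_cmdline())
    groups = count_iommu_groups()
    print("IOMMU-related kernel parameters:", flags or "none")
    print("IOMMU groups found:", groups)
    if groups > 0:
        # Assumption: a populated groups directory means the IOMMU is active.
        print("The IOMMU appears to be enabled on this host.")
    else:
        print("No IOMMU groups found; the IOMMU appears to be disabled.")
```

If the script reports that the IOMMU is enabled, the linked CUDA guide section describes the recommended settings for bare-metal systems.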
Hello, could you share a Docker image that can run
python train_sampling_multi_gpu.py --gpu 0,1 --dataset ogbn-papers100M --data-device uva
on an A100?