Running through Dockerfile broken

Describe the bug: When using an image based on the provided Dockerfile and running the quick start steps (download enron data, run deepy.py), execution crashes before training begins.

To Reproduce Steps to reproduce the behavior:

1. Build an image using the provided Dockerfile
2. Run said image, mounting 8 RTX8000 GPUs
3. Fetch enron data using the prepare_dataset.py script
4. Run ./deepy.py pretrain_gpt2.py -d configs small.yml local_configs.yml
5. The code crashes with a nondescript NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8

Expected behavior: Training starts or a specific error is provided.

Proposed solution: The NCCL error is typically a stand-in for a real issue that is not relayed back through multiprocessing. As a first step, it would be nice to know if this setup works out-of-the-box for others; in that case, it might be my resources or CUDA version.
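One way to look for the underlying cause is to raise NCCL's log level before launching. The sketch below is illustrative only: NCCL_DEBUG is a documented NCCL environment variable, but the helper name with_nccl_debug is hypothetical, and whether the extra logging reveals the real failure in this particular setup is an assumption.

```python
import os


def with_nccl_debug(env=None, level="INFO"):
    """Return a copy of `env` (default: the current environment) with NCCL_DEBUG set.

    The input mapping is not modified, so the caller's environment stays untouched.
    """
    new_env = dict(os.environ if env is None else env)
    new_env["NCCL_DEBUG"] = level
    return new_env


# Possible usage (not executed here); the command is the one from the steps above:
#   import subprocess
#   subprocess.run(
#       ["./deepy.py", "pretrain_gpt2.py", "-d", "configs", "small.yml", "local_configs.yml"],
#       env=with_nccl_debug(),
#   )
```

With the variable set, NCCL prints its own initialization and error messages, which may be more specific than the "unhandled system error" raised through PyTorch.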

Environment (please complete the following information):

GPUs: 8 RTX8000 GPUs
Configs: Ubuntu 20.04, CUDA 11.2

Issue Analytics

Created 2 years ago
Comments: 11 (10 by maintainers)

Top GitHub Comments

4 reactions

VHellendoorn commented, Oct 12, 2021

That worked! Adding --shm-size=1g --ulimit memlock=-1 to the nvidia-docker run command solved it (for others: note that this won’t work with docker run --runtime=nvidia). The default shared memory was 64 MB, which is evidently far too little.

Thanks so much for your help!
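For readers who want to check their own container, here is a minimal sketch of the fix described in this comment. The two flags and the nvidia-docker launcher come from the comment itself; the function names and the image tag gpt-neox:local are hypothetical, and /dev/shm is assumed to be where the container's shared memory is mounted.

```python
import shlex
import shutil

MIB = 1024 * 1024


def shm_total_mib(path: str = "/dev/shm") -> float:
    """Return the total size, in MiB, of the filesystem mounted at `path`."""
    return shutil.disk_usage(path).total / MIB


def docker_run_command(image: str, shm_size: str = "1g") -> str:
    """Build a run command carrying the two flags reported to fix the crash."""
    argv = [
        "nvidia-docker", "run",
        f"--shm-size={shm_size}",   # raise shared memory above the small default
        "--ulimit", "memlock=-1",   # remove the locked-memory limit
        image,                      # hypothetical image tag supplied by the caller
    ]
    return shlex.join(argv)
```

Calling docker_run_command("gpt-neox:local") yields a command string to paste into a shell, and running shm_total_mib() inside the container reports how much shared memory it actually received.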

0 reactions

PyxAI commented, Jun 28, 2022

That worked for me too. Just want to mention that the flags must be typed with two ASCII hyphens each: --shm-size=1g --ulimit memlock=-1. If the command is copied from a page that shows an en dash (–) in place of --, it will not work.
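The dash substitution mentioned here can be undone mechanically. This is a small sketch under the assumption that any en dash or em dash at the start of a whitespace-separated token was originally a double hyphen; the function name restore_long_flags is hypothetical.

```python
import re

# An en dash (U+2013) or em dash (U+2014) at the start of a token,
# i.e. at the start of the string or right after whitespace.
_TYPOGRAPHIC_DASH = re.compile(r"(?<!\S)[\u2013\u2014]")


def restore_long_flags(command: str) -> str:
    """Replace a leading typographic dash on each token with two ASCII hyphens."""
    return _TYPOGRAPHIC_DASH.sub("--", command)
```

Dashes inside a token, such as the minus sign in memlock=-1, are left alone because they are ordinary hyphens and do not sit at the start of a token.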