Running through Dockerfile broken
Describe the bug
When using an image based on the provided Dockerfile and running the quick start steps (download enron data, run deepy.py), execution crashes before training begins.
To Reproduce
Steps to reproduce the behavior:
- Build an image using the provided Dockerfile
- Run said image, mounting 8 RTX8000 GPUs
- Fetch enron data using the prepare_dataset.py script
- Run ./deepy.py pretrain_gpt2.py -d configs small.yml local_configs.yml
- The code crashes with a nondescript error:
NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
Expected behavior
Training starts or a specific error is provided.
Proposed solution
The NCCL error is typically a stand-in for a real issue that is not relayed back through multiprocessing. As a first step, it would be nice to know if this setup works out-of-the-box for others; in that case, it might be my resources or CUDA version.
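One way to get more detail out of an "unhandled system error" is NCCL's own debug logging, which is enabled through the NCCL_DEBUG environment variable. The issue itself does not say this was tried; the sketch below is an assumption about how one might do it, and run_with_nccl_debug is a hypothetical helper name, not something the repository provides.

```shell
# Sketch: run any command with NCCL's debug logging enabled.
# run_with_nccl_debug is a hypothetical helper, not part of the repository.
run_with_nccl_debug() {
    # The assignment applies only to the command that follows it.
    NCCL_DEBUG=INFO "$@"
}

# Usage, reusing the launch command from the reproduction steps:
# run_with_nccl_debug ./deepy.py pretrain_gpt2.py -d configs small.yml local_configs.yml
```

With the variable set, NCCL logs its initialization steps and warnings, which may point at the underlying system call that failed.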
Environment (please complete the following information):
- GPUs: 8 RTX8000 GPUs
- Configs: Ubuntu 20.04, CUDA 11.2
Issue Analytics
- State:
- Created 2 years ago
- Comments:11 (10 by maintainers)
Top GitHub Comments
That worked! Adding
--shm-size=1g --ulimit memlock=-1
to the nvidia-docker run command solved it (for others: note that this won't work with docker run --runtime=nvidia). The default shared memory was 64MB, which is evidently far too little.
Thanks so much for your help!
That worked for me too. Just want to mention that the dashes in the command need to be plain hyphens, not en dashes: --shm-size=1g --ulimit memlock=-1 (– != -).
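The fix from the comments above can be sketched as a small shell helper that assembles the full run command with ASCII double hyphens. The image tag my-neox-image and the helper name build_run_cmd are hypothetical placeholders, and the --rm -it flags are an assumption added for an interactive session; only --shm-size=1g and --ulimit memlock=-1 come from the thread.

```shell
# Sketch only: the image tag is a hypothetical placeholder, not from the issue.
IMAGE="${IMAGE:-my-neox-image}"

# Build the run command with the two extra flags from the comments above.
# --shm-size=1g raises /dev/shm from the 64MB default mentioned in the thread;
# --ulimit memlock=-1 removes the locked-memory limit.
build_run_cmd() {
    printf 'nvidia-docker run --rm -it --shm-size=1g --ulimit memlock=-1 %s\n' "$1"
}

# Print the command for inspection rather than executing it.
build_run_cmd "$IMAGE"
```

Once inside a container started this way, running df -h /dev/shm is a quick way to confirm that the shared memory size actually changed.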