Running through Dockerfile broken

See original GitHub issue

Describe the bug
When using an image built from the provided Dockerfile and running the quick-start steps (download the Enron data, run deepy.py), execution crashes before training begins.

To Reproduce
Steps to reproduce the behavior:

  1. Build an image using the provided Dockerfile
  2. Run said image, mounting 8 RTX 8000 GPUs
  3. Fetch enron data using the prepare_dataset.py script
  4. Run ./deepy.py pretrain_gpt2.py -d configs small.yml local_configs.yml
  5. The code crashes with a nondescript NCCL error: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
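
For reference, the steps above correspond roughly to the commands below. The image tag, container mount, and dataset arguments are illustrative placeholders, not taken from the issue; the Dockerfile, prepare_dataset.py, and deepy.py come from the repository itself.

    # 1-2. Build the image and run it with the GPUs exposed (tag/mount are placeholders)
    docker build -t gpt-neox .
    nvidia-docker run -it -v $(pwd):/workspace gpt-neox bash

    # 3. Fetch the Enron data (exact arguments as described in the repo's quick start)
    python prepare_dataset.py

    # 4. Launch training via the DeepSpeed wrapper
    ./deepy.py pretrain_gpt2.py -d configs small.yml local_configs.yml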

Expected behavior
Training starts, or a specific error is provided.

Proposed solution
The NCCL error is typically a stand-in for a real issue that is not relayed back through multiprocessing. As a first step, it would be nice to know whether this setup works out of the box for others; if it does, the problem is likely on my end (resources or CUDA version).
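
One way to surface the underlying failure (a standard NCCL debugging knob, not something suggested in the thread) is to re-run with NCCL's debug logging enabled:

    # NCCL_DEBUG=INFO prints the real cause (e.g. a failed shared-memory allocation)
    # instead of the generic "unhandled system error".
    NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL ./deepy.py pretrain_gpt2.py -d configs small.yml local_configs.yml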

Environment:

  • GPUs: 8× RTX 8000
  • OS / CUDA: Ubuntu 20.04, CUDA 11.2

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

4 reactions
VHellendoorn commented, Oct 12, 2021

That worked! Adding --shm-size=1g --ulimit memlock=-1 to the nvidia-docker run command solved it (for others: note that this won’t work with docker run --runtime=nvidia). The default shared memory was 64 MB, which is evidently far too little.
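
For anyone copy-pasting, a complete invocation with the fix might look like the following; the image name and mount are placeholders, and only the two memory-related flags come from this thread.

    # Enlarge /dev/shm and remove the locked-memory limit so NCCL's
    # shared-memory transport between the 8 GPUs has room to work.
    nvidia-docker run -it \
      --shm-size=1g \
      --ulimit memlock=-1 \
      -v $(pwd):/workspace \
      gpt-neox bash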

Thanks so much for your help!

0 reactions
PyxAI commented, Jun 28, 2022

That worked for me too. Just want to mention that the dashes in the command need to be double hyphens: --shm-size=1g --ulimit memlock=-1 (– is not the same as --).
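
To confirm the flag took effect, the shared-memory size can be checked from inside the running container (this check is an addition, not from the thread):

    # Should report ~1G instead of the 64M Docker default
    df -h /dev/shm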
