
Abrupt exit when training a model

See original GitHub issue

Hi all,

Thanks for putting this package up! I really love the idea behind it and can’t wait to integrate it more tightly with my workflow!

I’m trying to integrate Caliban with one of my smaller projects I’m working on here, but I’m having some trouble getting things to run. I added the requirements.txt file as instructed, but when I run the training script, I don’t see any visible error and the process exits abruptly.

I’m using a Mac, and my data is stored at /Users/dilip.thiagarajan/data. Here’s exactly what I did:

  • In that repository, I first tried running:
caliban run --nogpu --docker_run_args "--volume /Users/dilip.thiagarajan/data:/data" train.py -- --model_name resnet18 --projection_dim 64 --fast_dev_run True --download --data_dir /data

When I run this from the terminal, I see the following output:

dilip.thiagarajan simclr_pytorch % caliban run --nogpu --docker_run_args "--volume /Users/dilip.thiagarajan/data:/data" train.py -- --model_name resnet18 --projection_dim 64 --fast_dev_run True --download --data_dir /data                    
I0624 22:07:53.246673 4578139584 docker.py:614] Running command: docker build --rm -f- /Users/dilip.thiagarajan/code/simclr_pytorch
Sending build context to Docker daemon  110.6kB

Step 1/11 : FROM gcr.io/blueshift-playground/blueshift:cpu
 ---> fafdb20241ad
Step 2/11 : RUN [ $(getent group 20) ] || groupadd --gid 20 20
 ---> Using cache
 ---> 6b724e6c1e38
Step 3/11 : RUN useradd --no-log-init --no-create-home -u 502 -g 20 --shell /bin/bash dilip.thiagarajan
 ---> Using cache
 ---> 251bdcb68ec9
Step 4/11 : RUN mkdir -m 777 /usr/app /.creds /home/dilip.thiagarajan
 ---> Using cache
 ---> d2952e2052e3
Step 5/11 : ENV HOME=/home/dilip.thiagarajan
 ---> Using cache
 ---> d8c700640045
Step 6/11 : WORKDIR /usr/app
 ---> Using cache
 ---> 8d6fd0c9f3f4
Step 7/11 : USER 502:20
 ---> Using cache
 ---> 293fcdb3733f
Step 8/11 : COPY --chown=502:20 requirements.txt /usr/app
 ---> Using cache
 ---> 9074b050a5de
Step 9/11 : RUN /bin/bash -c "pip install --no-cache-dir -r requirements.txt"
 ---> Using cache
 ---> 60f28d41deb9
Step 10/11 : COPY --chown=502:20 . /usr/app/.
 ---> 74b6d6b6d42f
Step 11/11 : ENTRYPOINT ["python", "train.py"]
 ---> Running in 54a219fe9826
Removing intermediate container 54a219fe9826
 ---> 081b2c362108
Successfully built 081b2c362108
I0624 22:07:54.054889 4578139584 util.py:710] Restoring pure python logging
I0624 22:07:54.057392 4578139584 docker.py:707]                                                                                                                                                
I0624 22:07:54.057760 4578139584 docker.py:708] Job 1 - Experiment args: ['--model_name', 'resnet18', '--projection_dim', '64', '--fast_dev_run', 'True', '--download', '--data_dir', '/data'] 
I0624 22:07:54.057989 4578139584 docker.py:787] Running command: docker run --ipc host --volume /Users/dilip.thiagarajan/data:/data 081b2c362108 --model_name resnet18 --projection_dim 64 --fast_dev_run True --download --data_dir /data
Executing:   0%|                                                                                                                                                 | 0/1 [00:00<?, ?experiment/s]Downloading: "https://download.pytorch.org/models/resnet18-5c106cde.pth" to /home/dilip.thiagarajan/.cache/torch/checkpoints/resnet18-5c106cde.pth
100%|██████████| 44.7M/44.7M [00:00<00:00, 52.8MB/s]
Running in fast_dev_run mode: will run a full train, val and test loop using a single batch
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
/opt/conda/envs/caliban/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:25: RuntimeWarning: You have defined a `val_dataloader()` and have defined a `validation_step()`, you may also want to define `validation_epoch_end()` for accumulating stats.
  warnings.warn(*args, **kwargs)

  | Name            | Type            | Params
----------------------------------------------------
0 | model           | Sequential      | 11 M  
1 | projection_head | Linear          | 32 K  
2 | loss            | NTXEntCriterion | 0     
Files already downloaded and verified                                                                                                                                                          
Files already downloaded and verified                                                                                                                                                          
Training: 0it [00:00, ?it/s]                                                                                                                                                                   
Training:   0%|          | 0/2 [00:00<?, ?it/s]                                                                                                                                                
E0624 22:08:09.984529 4578139584 docker.py:747] Job 1 failed with return code 137.                                                                                                             
E0624 22:08:09.984878 4578139584 docker.py:750] Failing args for job 1: ['--model_name', 'resnet18', '--projection_dim', '64', '--fast_dev_run', 'True', '--download', '--data_dir', '/data']  
Executing: 100%|#########################################################################################################################################| 1/1 [00:15<00:00, 15.93s/experiment]
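
One note on that log: a return code of 137 follows the usual shell convention of 128 plus the signal number, so the process inside the container was killed with signal 9 (SIGKILL) rather than exiting on its own; on Docker for Mac that is often the out-of-memory killer, though that is only a guess for this particular run. A quick way to decode such codes (the snippet is just an illustration, not part of Caliban):

# Illustration only: decode an exit status above 128 into the signal
# that terminated the process (137 - 128 = 9 = SIGKILL).
import signal

code = 137
sig = signal.Signals(code - 128)
print(sig.name)   # SIGKILL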

When I instead redirect the output to a log file with

caliban run --nogpu --docker_run_args "--volume /Users/dilip.thiagarajan/data:/data" train.py -- --model_name resnet18 --projection_dim 64 --fast_dev_run True --download --data_dir /data &> caliban_run.log &

I see the following in my trace:

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Users/dilip.thiagarajan/.pyenv/versions/3.7.3/lib/python3.7/logging/__init__.py", line 2039, in shutdown
    h.close()
  File "/Users/dilip.thiagarajan/.pyenv/versions/3.7.3/lib/python3.7/site-packages/absl/logging/__init__.py", line 864, in close
    self.stream.close()
AttributeError: 'TqdmFile' object has no attribute 'close'
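
For what it's worth, that AttributeError can be reproduced outside of Caliban and tqdm entirely: logging.shutdown() is registered with atexit and calls close() on every handler, and absl's handler (per the traceback) closes its underlying stream, so any stream-like object that implements write() and flush() but not close() triggers exactly this error at interpreter exit. A minimal sketch, with stand-in class names made up for illustration:

# Minimal sketch: a handler whose close() closes its stream, pointed at a
# stream-like object with no close() method, mirrors the traceback above.
import logging

class TqdmFileLike:
    # Stand-in for the TqdmFile proxy: it can write and flush, but not close.
    def write(self, msg):
        print(msg, end="")
    def flush(self):
        pass

class ClosingStreamHandler(logging.StreamHandler):
    # Mimics absl's PythonHandler, whose close() closes self.stream.
    def close(self):
        self.stream.close()   # raises AttributeError: no attribute 'close'
        super().close()

logging.getLogger().addHandler(ClosingStreamHandler(stream=TqdmFileLike()))
# At interpreter exit, atexit runs logging.shutdown(), which calls close()
# on the handler and surfaces "Error in atexit._run_exitfuncs" as above.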

Is this a problem with some interaction between logging and tqdm? Or is it something I’m doing incorrectly when mounting my data directory?

The following works properly for me locally: python3 train.py --model_name resnet18 --projection_dim 64 --fast_dev_run True --data_dir ~/data --download

Thanks for your help!

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7

Top GitHub Comments

1 reaction
sritchie commented, Jun 26, 2020

@dthiagarajan The details here are:

  • tqdm uses carriage returns, like \r, to rewrite the current line. Python doesn’t pass those through without some work when you’re running another Python job in a subprocess.
  • Python buffers its output, which is a mess here, because tqdm uses both stdout and stderr to write its outputs.
  • Docker doesn’t have a COLUMNS or LINES variable internally when you run a container in non-interactive mode!

#31 tackles each of these. It’s not perfect — I suspect if you nest progress bars, you may run into trouble, but maybe not. If you have a tqdm process and write a bunch of output inside the loop, that might trigger a newline as well.

But this solves most of the issues we’d seen, and I think you’ll be happier with the result for sure.
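
To make those three points concrete, here is a rough sketch of the kind of wrapper they imply. It is not the actual change in #31, and the function name is made up, but it shows the three knobs involved: forwarding output byte by byte so \r gets through, disabling Python's buffering, and passing COLUMNS/LINES into the otherwise size-less container.

# Rough sketch (not the actual #31 change) of running a container so that
# tqdm progress bars render sensibly on the host terminal.
import shutil
import subprocess
import sys

def run_in_docker(image, *args):
    cols, lines = shutil.get_terminal_size()
    cmd = [
        "docker", "run",
        "-e", "PYTHONUNBUFFERED=1",   # don't let Python buffer stdout/stderr
        "-e", f"COLUMNS={cols}",      # non-interactive containers have no size
        "-e", f"LINES={lines}",
        image, *args,
    ]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, bufsize=0)
    # Forward output a byte at a time so carriage returns rewrite the
    # current line instead of waiting for a newline.
    while True:
        chunk = proc.stdout.read(1)
        if not chunk:
            break
        sys.stdout.buffer.write(chunk)
        sys.stdout.buffer.flush()
    return proc.wait()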

1 reaction
sritchie commented, Jun 25, 2020

This was a world-class bug report, by the way! Thanks for the care it took to write.

Read more comments on GitHub >

Top Results From Across the Web

  • Sudden drop in loss while training a model and stuck ... - Reddit
    Sudden drop in loss while training a model and stuck at the same loss and accuracy for the last 5 epochs. r/deeplearning -...
  • Mask R-CNN stops abruptly while training using custom coco ...
    My custom dataset only has one class. The training just stops suddenly after saving model.step-0.tlt without any error or warning. It is ...
  • Sudden exit on relatively small model gpu - PyTorch Forums
    this is one of my first times using pytorch so it's likely something small and stupid. I have a very small model...
  • Process finished with exit code 137 in PyCharm - Stack Overflow
    this is because of Memory issue. When I try to train ml model using sklearn fit with full data set, it abruptly...
  • My Unsteady HEC-RAS Model is Unstable…Why?
    It could be a sudden increase/decrease in flow. It could be a sudden increase/decrease in stage. Whatever steps you take to try to...
