Training gets stuck when running train.py

See original GitHub issue

OS: Linux version 2.6.32-696.6.3.el6.x86_64 (Red Hat 4.4.7-18)
CUDA: 9.1
cuDNN: 8.0

I have compiled PyTorch and fairseq successfully on my machine, and I have also run the preprocess command on my data. But when I tried to run train.py, it got stuck.

python train.py data-bin/de_en --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 --arch fconv_iwslt_de_en --save-dir checkpoints/fconv

Above is my command.

Output of top:

| PID |USER | PR | NI | VIRT | RES | SHR | S | %CPU | %MEM | TIME+ | COMMAND |
|16409 | suxia | 20 | 0 | 85.4g | 100m| 36m |R | 99.9 | 0.2 | 2:59.75 |python |

The python process above is the fairseq training process. It is trying to allocate almost 85 GB of virtual memory (VIRT), but only 100 MB is resident (RES).

What might be wrong? Thanks!
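
For reference, the VIRT and RES columns in top correspond to the VmSize and VmRSS fields of /proc/<pid>/status on Linux. VIRT counts reserved address space, not memory in use, so a large VIRT next to a small RES does not by itself indicate a problem. The following is a minimal sketch for reading both values; the function name `parse_vm_fields` and the sample text are illustrative assumptions, with numbers chosen only to resemble the top output above.

```python
import re


def parse_vm_fields(status_text):
    """Return (VmSize, VmRSS) in kB parsed from the text of /proc/<pid>/status.

    Either value is None if the field is not present in the text.
    """
    fields = {}
    for line in status_text.splitlines():
        match = re.match(r"(VmSize|VmRSS):\s+(\d+)\s+kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields.get("VmSize"), fields.get("VmRSS")


# Illustrative sample only, not captured from the reporter's machine.
sample = "Name:\tpython\nVmSize:\t89548390 kB\nVmRSS:\t  102400 kB\n"
virt_kb, res_kb = parse_vm_fields(sample)
print(f"VIRT ~ {virt_kb / 1024**2:.1f} GiB, RES ~ {res_kb / 1024:.0f} MiB")
```

On a live Linux system, the same function could be fed the contents of the status file for the process of interest, for example `open("/proc/16409/status").read()` for the PID shown by top.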

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

2 reactions
yangsuxia commented, Dec 26, 2017

My problem is solved. The cause was a version mismatch.

Thank you for your reply.

0 reactions
edunov commented, Dec 22, 2017

@yangsuxia I’m not sure about your setup, but since you use CUDA 9, do you by any chance have magma-cuda80 installed? E.g. if you use conda, you can check with this command:

conda list

If yes, you need to uninstall it (because you use CUDA 9) and update to magma-cuda90, e.g. like this:

conda install magma-cuda90 -c pytorch
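
The check described above could be scripted roughly as follows. This is a sketch under stated assumptions, not part of the original comment: the function name `find_magma_mismatches` is hypothetical, the sample `conda list` text (including its version and build columns) is made up for illustration, and the package suffix is assumed to encode the CUDA version as major digits followed by one minor digit (80, 90, and so on). Only the major version is compared, in line with the suggestion to use magma-cuda90 with CUDA 9.

```python
import re


def find_magma_mismatches(conda_list_output, cuda_version):
    """Return magma-cuda packages whose CUDA major version differs from cuda_version."""
    cuda_major = cuda_version.split(".")[0]
    mismatched = []
    for line in conda_list_output.splitlines():
        # Skip the comment header and blank lines printed by `conda list`.
        if line.startswith("#") or not line.strip():
            continue
        name = line.split()[0]
        match = re.fullmatch(r"magma-cuda(\d+)", name)
        # Assumption: suffix is <major digits><one minor digit>, e.g. 80 or 90.
        if match and match.group(1)[:-1] != cuda_major:
            mismatched.append(name)
    return mismatched


# Made-up sample output; real environments will list different packages.
sample_output = """# packages in environment at /path/to/env:
magma-cuda80              2.3.0                         1    pytorch
numpy                     1.13.3                   py36_0
"""
print(find_magma_mismatches(sample_output, "9.1"))
```

To run it against a real environment, the text could be obtained with `subprocess.run(["conda", "list"], capture_output=True, text=True).stdout`, provided conda is on the PATH.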


