Training gets stuck when running train.py
OS: Linux version 2.6.32-696.6.3.el6.x86_64 (Red Hat 4.4.7-18)
CUDA: 9.1
CUDNN: 8.0
I have compiled PyTorch and fairseq successfully on my machine, and have also run the preprocess command on my data. But when I try to run train.py, I hit this problem.
python train.py data-bin/de_en --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 --arch fconv_iwslt_de_en --save-dir checkpoints/fconv
The above is the command I ran.
Output of top:
| PID | USER | PR | NI | VIRT | RES | SHR | S | %CPU | %MEM | TIME+ | COMMAND |
| 16409 | suxia | 20 | 0 | 85.4g | 100m | 36m | R | 99.9 | 0.2 | 2:59.75 | python |
The python COMMAND above is the fairseq training process. It is trying to reserve almost 85 GB of virtual memory (VIRT), but only 100 MB is resident (RES).
What might be wrong? Thanks!
Issue Analytics
- State:
- Created: 6 years ago
- Comments: 9 (4 by maintainers)
Top GitHub Comments
My problem is solved; it was a version mismatch.
Thank you for your reply.
@yangsuxia I’m not sure about your setup, but since you use CUDA 9, do you by any chance have magma-cuda80 installed? E.g. if you use conda, you can check with this command:
conda list
If yes, you need to uninstall it (since you are on CUDA 9) and install magma-cuda90 instead, e.g. like this:
conda install magma-cuda90 -c pytorch
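To make the version check concrete, here is a minimal sketch of the rule being applied above: the magma-cudaXX conda package should be built for the same CUDA major version as the toolkit PyTorch was compiled against. The helper name and the major-version comparison are my own illustration, not fairseq or conda code:

```python
# Hypothetical helper (not part of fairseq or conda) illustrating the
# mismatch: magma-cuda80 is built for CUDA 8, so it does not match a
# CUDA 9.1 toolkit; magma-cuda90 does.
def magma_matches_cuda(package_name: str, cuda_version: str) -> bool:
    """Return True if e.g. 'magma-cuda90' matches CUDA '9.1' by major version."""
    suffix = package_name.split("magma-cuda")[-1]  # e.g. "90" or "80"
    pkg_major = suffix[0]                          # first digit is the CUDA major
    cuda_major = cuda_version.split(".")[0]
    return pkg_major == cuda_major

print(magma_matches_cuda("magma-cuda80", "9.1"))  # False: built for CUDA 8
print(magma_matches_cuda("magma-cuda90", "9.1"))  # True: CUDA 9 build
```

Running `conda list` and eyeballing the magma package suffix against `nvcc --version` amounts to the same check by hand.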