Training gets stuck when running train.py
OS: Linux version 2.6.32-696.6.3.el6.x86_64 (Red Hat 4.4.7-18)
CUDA: 9.1
CUDNN: 8.0
I have compiled PyTorch and fairseq successfully on my machine, and have also run the preprocess command on my data. But when I try to run train.py, I hit this problem.
python train.py data-bin/de_en --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 --arch fconv_iwslt_de_en --save-dir checkpoints/fconv
The above is the command I ran.
Output of top:
| PID | USER | PR | NI | VIRT | RES | SHR | S | %CPU | %MEM | TIME+ | COMMAND |
| 16409 | suxia | 20 | 0 | 85.4g | 100m | 36m | R | 99.9 | 0.2 | 2:59.75 | python |
The python COMMAND above is the fairseq training process. It is trying to reserve almost 85 GB of virtual memory (VIRT), but only 100 MB is resident (RES).
What might be wrong? Thanks!
Issue Analytics
- State:
- Created: 6 years ago
- Comments: 9 (4 by maintainers)
Top GitHub Comments
My problem is solved; it was a version mismatch.
Thank you for your reply.
@yangsuxia I’m not sure about your setup, but since you use CUDA 9, do you by any chance have magma-cuda80 installed? E.g. if you use conda, you can check with this command:
conda list
If yes, you need to uninstall it (since you are on CUDA 9) and install magma-cuda90 instead, e.g. like this:
conda install magma-cuda90 -c pytorch
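To make the version check concrete, here is a minimal sketch of the rule being applied above: the magma-cudaXX conda package should be built for the same CUDA major version as the toolkit PyTorch was compiled against. The helper name and the major-version comparison are my own illustration, not fairseq or conda code:

```python
# Hypothetical helper (not part of fairseq or conda) illustrating the
# mismatch: magma-cuda80 is built for CUDA 8, so it does not match a
# CUDA 9.1 toolkit; magma-cuda90 does.
def magma_matches_cuda(package_name: str, cuda_version: str) -> bool:
    """Return True if e.g. 'magma-cuda90' matches CUDA '9.1' by major version."""
    suffix = package_name.split("magma-cuda")[-1]  # e.g. "90" or "80"
    pkg_major = suffix[0]                          # first digit is the CUDA major
    cuda_major = cuda_version.split(".")[0]
    return pkg_major == cuda_major

print(magma_matches_cuda("magma-cuda80", "9.1"))  # False: built for CUDA 8
print(magma_matches_cuda("magma-cuda90", "9.1"))  # True: CUDA 9 build
```

Running `conda list` and eyeballing the magma package suffix against `nvcc --version` amounts to the same check by hand.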