Problem when running train.py
When I run train.py, I get an error. What is the problem? The error message is as follows:
| epoch 001: 0%| | 0/820 [00:00<?, ?it/s]/home/suxia/anaconda3/envs/python36/lib/python3.6/site-packages/torch/autograd/function.py:41: UserWarning: mark_shared_storage is deprecated. Tensors with shared storages are automatically tracked. Note that calls to set_() are not tracked
'mark_shared_storage is deprecated. '
THCudaCheck FAIL file=/home/suxia/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
| WARNING: ran out of memory, skipping batch
Traceback (most recent call last):
  File "train.py", line 29, in <module>
    main(args)
  File "train.py", line 23, in main
    singleprocess_main(args)
  File "/home/suxia/fairseq-LM-0522/singleprocess_train.py", line 80, in main
    train(args, trainer, dataset, epoch, batch_offset)
  File "/home/suxia/fairseq-LM-0522/singleprocess_train.py", line 146, in train
    log_output = trainer.train_step(sample)
  File "/home/suxia/fairseq-LM-0522/fairseq/trainer.py", line 103, in train_step
    grad_norm, ooms_bwd = self._backward_and_opt(loss, grad_denom)
  File "/home/suxia/fairseq-LM-0522/fairseq/trainer.py", line 189, in _backward_and_opt
    p.grad.data.div_(grad_denom)
AttributeError: 'NoneType' object has no attribute 'data'
Looking forward to your reply, thank you!
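For context, the AttributeError at the bottom of the traceback usually means that some parameter's .grad is still None when the trainer rescales gradients, e.g. because the backward pass was skipped after the out-of-memory warning above. Below is a minimal sketch of a guard, not the actual fairseq code; model and grad_denom are assumed to come from the surrounding training loop:

# Sketch only: skip parameters whose gradient was never populated
# (p.grad stays None if backward never ran for them, e.g. after an OOM skip).
for p in model.parameters():
    if p.grad is not None:
        p.grad.data.div_(grad_denom)  # in-place rescale, mirrors the failing call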
Top GitHub Comments
@myleott Important notice! When I install NCCL (https://developer.nvidia.com/nccl/nccl-download) first and then build PyTorch and install fairseq, the dual GPUs work well. Otherwise, if NCCL is installed after building PyTorch, the result is an error like "RuntimeError: the distributed NCCL backend is not available; try to recompile the THD package with CUDA and NCCL 2+ support at /home/z/pytorch/torch/lib/THD/process_group/General.cpp:17".
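As a quick check of whether the installed PyTorch build actually picked up NCCL (assuming a PyTorch version that exposes torch.distributed.is_nccl_available()), something like the following can be run; if the last line prints False, PyTorch was likely built before NCCL was installed and needs to be rebuilt:

# Sanity check: was this PyTorch build compiled with CUDA and the NCCL backend?
import torch
import torch.distributed as dist

print(torch.cuda.is_available())      # CUDA support present?
print(dist.is_available())            # torch.distributed compiled in?
print(dist.is_nccl_available())       # NCCL backend compiled in?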
Also, please make sure your dictionary size is not too big, say no bigger than 50k tokens.
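If it helps, here is a small sketch for checking the dictionary size before training; it assumes a fairseq-style dict file (one "token count" pair per line) at a hypothetical path, which you would adjust to your setup:

# Count entries in a fairseq-style dictionary file (hypothetical path).
dict_path = "data-bin/dict.txt"
with open(dict_path, encoding="utf-8") as f:
    num_tokens = sum(1 for _ in f)
print("dictionary size:", num_tokens, "tokens")  # aim for <= 50k per the advice above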