The problem of train

When I run train.py, I get an error. What is the problem? The error message is as follows:

| epoch 001: 0%| | 0/820 [00:00<?, ?it/s]
/home/suxia/anaconda3/envs/python36/lib/python3.6/site-packages/torch/autograd/function.py:41: UserWarning: mark_shared_storage is deprecated. Tensors with shared storages are automatically tracked. Note that calls to set_() are not tracked
  'mark_shared_storage is deprecated. '
THCudaCheck FAIL file=/home/suxia/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
| WARNING: ran out of memory, skipping batch
Traceback (most recent call last):
  File "train.py", line 29, in <module>
    main(args)
  File "train.py", line 23, in main
    singleprocess_main(args)
  File "/home/suxia/fairseq-LM-0522/singleprocess_train.py", line 80, in main
    train(args, trainer, dataset, epoch, batch_offset)
  File "/home/suxia/fairseq-LM-0522/singleprocess_train.py", line 146, in train
    log_output = trainer.train_step(sample)
  File "/home/suxia/fairseq-LM-0522/fairseq/trainer.py", line 103, in train_step
    grad_norm, ooms_bwd = self._backward_and_opt(loss, grad_denom)
  File "/home/suxia/fairseq-LM-0522/fairseq/trainer.py", line 189, in _backward_and_opt
    p.grad.data.div_(grad_denom)
AttributeError: 'NoneType' object has no attribute 'data'

Looking forward to your reply, thank you!

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 14 (5 by maintainers)

Top GitHub Comments

2 reactions
sankuniu commented, Jun 8, 2018

@myleott Important notice: when I install NCCL (https://developer.nvidia.com/nccl/nccl-download) first, then build PyTorch and install fairseq, training on two GPUs works well. If I instead install NCCL after building PyTorch, I get an error like “RuntimeError: the distributed NCCL backend is not available; try to recompile the THD package with CUDA and NCCL 2+ support at /home/z/pytorch/torch/lib/THD/process_group/General.cpp:17”

2 reactions
edunov commented, May 29, 2018

Also, please make sure your dictionary is not too large, say no more than 50k tokens.
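
One way to act on this suggestion is to count the entries in the dictionary file before training. The sketch below is only an illustration, not part of fairseq: it assumes the dictionary is a plain-text file with one token entry per line, and the helper names (count_dictionary_entries, dictionary_too_large) and the sample file are hypothetical.

```python
import os
import tempfile

# Threshold taken from the comment above ("no more than 50k tokens").
MAX_TOKENS = 50_000


def count_dictionary_entries(path):
    """Count non-empty lines, ASSUMING one token entry per line."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def dictionary_too_large(path, limit=MAX_TOKENS):
    """Return True if the dictionary has more entries than `limit`."""
    return count_dictionary_entries(path) > limit


# Small demonstration on a throwaway file (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "dict.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("the 120\ncat 7\nsat 3\n")
    print("entries:", count_dictionary_entries(sample_path))
    print("too large:", dictionary_too_large(sample_path))
```

If the count is well above the limit, a smaller vocabulary may reduce memory use, since the output layer of a language model typically grows with the vocabulary size.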
