RuntimeError: Overflow when unpacking long
Environment info
- Machine: Google Cloud TPU VM version v2-alpha
- transformers: 4.18.0
- accelerate: 0.9.0.dev0 (the same error happens with 0.8.0.dev0)
Script
I am training a GPT-2 model using the PyTorch run_clm_no_trainer.py script.
Error
The error below happens while the model is saving checkpoints, but it seems to occur only at the second or third checkpoint.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
_start_fn(index, pf_cfg, fn, args)
File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
fn(gindex, *args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/launch.py", line 55, in __call__
self.launcher(*args)
File "/home/nguyenhuuthuat09/gpt2/train_v1.py", line 553, in main
accelerator.save_state(output_dir) <---- this is line 564 in the original run_clm_no_trainer.py
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 799, in save_state
save_location = save_accelerator_state(
File "/usr/local/lib/python3.8/dist-packages/accelerate/checkpointing.py", line 105, in save_accelerator_state
states["xm_seed"] = torch.tensor(xm.get_rng_state())
RuntimeError: Overflow when unpacking long
Exception in device=TPU:0: Overflow when unpacking long
Environment variables
export XRT_TPU_CONFIG="localservice;0;localhost:51011"
- I ran accelerate config and used accelerate launch to run the code.
- After the error happened, I tried the two commands below, but they didn't help:
export XLA_USE_BF16=1
export XLA_TENSOR_ALLOCATOR_MAXSIZE=100000000
Related issues
- I think this issue might be related to: https://github.com/huggingface/transformers/issues/10212
- I think the problem is on this line: https://github.com/huggingface/accelerate/blob/main/src/accelerate/checkpointing.py#L105
- I guess that changing
states["xm_seed"] = torch.tensor(xm.get_rng_state())
to
states["xm_seed"] = torch.tensor(xm.get_rng_state(), dtype=torch.float32)
may help?
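The overflow can be sketched without a TPU. torch.tensor() infers int64 for Python ints, so any value that only fits an unsigned 64-bit range fails the signed-long check. The snippet below is a minimal pure-Python sketch of that check, under the assumption that xm.get_rng_state() can return a Python int in the uint64 range; unpack_long and the example seed value are hypothetical stand-ins, not torch APIs.

```python
# Sketch of the int64 range check that scalar "unpacking" performs.
# Assumption: the XLA RNG state is an unsigned 64-bit value, so it can
# exceed the signed 64-bit maximum that a default integer tensor holds.
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def unpack_long(value: int) -> int:
    """Reject values that do not fit a signed 64-bit long."""
    if not (INT64_MIN <= value <= INT64_MAX):
        raise OverflowError("Overflow when unpacking long")
    return value

seed = 2**64 - 1  # hypothetical uint64 RNG state from the XLA runtime
try:
    unpack_long(seed)
except OverflowError as e:
    print(e)  # Overflow when unpacking long
```

Values at or below 2**63 - 1 pass the check, which would explain why the error appears only intermittently: it depends on whether the RNG state at checkpoint time happens to land above the signed range.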
Thank you for the great library!
Issue Analytics
- Created: a year ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
The seed is an int, not a float, @nguyenhuuthuat09; you won't be able to reload that RNG state if you save it as a float.
The proper fix is to just remove torch.tensor here. Not sure why it's wrapped inside a Tensor in the first place, @muellerzr?
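The maintainer's point can be illustrated without torch. A float32 cast silently changes large 64-bit seeds (floats near 2**63 are spaced 2**40 apart), so restoring the RNG state from it would restore a different seed; serializing the raw Python int round-trips exactly, and pickle-based checkpoint formats can store arbitrary Python ints without a tensor wrapper. The seed value below is a hypothetical example, and plain pickle stands in for the checkpoint serializer.

```python
import pickle
import struct

seed = 2**63 + 12345  # hypothetical uint64 seed just outside int64 range

# Casting to a 32-bit float (as the issue suggests) loses precision:
as_f32 = struct.unpack("f", struct.pack("f", float(seed)))[0]
print(int(as_f32) == seed)  # False: the reloaded seed would differ

# Storing the raw Python int round-trips exactly, no tensor needed:
blob = pickle.dumps({"xm_seed": seed})
print(pickle.loads(blob)["xm_seed"] == seed)  # True
```

This is why dropping the torch.tensor wrapper is the safer fix: it avoids both the overflow on save and the silent corruption on load.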