
RuntimeError: Overflow when unpacking long

See original GitHub issue

Environment info

  • Machine: Google Cloud TPU VM version v2-alpha
  • transformers: 4.18.0
  • accelerate: 0.9.0.dev0 (the same error happens with 0.8.0.dev0)

Script

I am training a GPT2 model using the PyTorch example script run_clm_no_trainer.py.

Error

The error below happens while the model is saving a checkpoint, but it seems to occur only at the second or third checkpoint.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/launch.py", line 55, in __call__
    self.launcher(*args)
  File "/home/nguyenhuuthuat09/gpt2/train_v1.py", line 553, in main
    accelerator.save_state(output_dir)      <---- this is line 564 in the original run_clm_no_trainer.py
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 799, in save_state
    save_location = save_accelerator_state(
  File "/usr/local/lib/python3.8/dist-packages/accelerate/checkpointing.py", line 105, in save_accelerator_state
    states["xm_seed"] = torch.tensor(xm.get_rng_state())
RuntimeError: Overflow when unpacking long
Exception in device=TPU:0: Overflow when unpacking long
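The failure in the traceback can be reproduced without a TPU: torch.tensor() unpacks a Python int into a signed 64-bit long, so any value of 2**63 or larger overflows. A plausible trigger is that xm.get_rng_state() on XLA returns a seed outside that range; the value below is a hypothetical stand-in for such a seed, not one observed in the issue.

```python
import torch

# torch.tensor() converts a Python int by unpacking it into a signed
# 64-bit long, so any value outside [-2**63, 2**63 - 1] overflows.
# 2**63 here is a hypothetical stand-in for an out-of-range TPU seed.
seed = 2**63

try:
    torch.tensor(seed)
except RuntimeError as exc:
    print(f"RuntimeError: {exc}")
```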

Environment variables

  • export XRT_TPU_CONFIG="localservice;0;localhost:51011"
  • I run accelerate config and use accelerate launch to run the code.
  • After the error happened, I tried the two commands below, but they didn't help:
    export XLA_USE_BF16=1
    export XLA_TENSOR_ALLOCATOR_MAXSIZE=100000000

Related issue

Thank you for the great library!

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments:5 (2 by maintainers)

Top GitHub Comments

sgugger commented on May 13, 2022 (3 reactions)

The seed is an int, not a float @nguyenhuuthuat09, you won’t be able to reload that RNG state if you save it as float.

The proper fix is to just remove torch.tensor here.
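A minimal sketch of why storing the raw int works: a plain Python int survives torch.save/torch.load exactly even when it does not fit in int64, whereas a float cast silently loses the low bits of a large seed. The seed value and the "xm_seed" key below are illustrative, mirroring the traceback, not accelerate's actual patched code.

```python
import io
import torch

# Hypothetical out-of-range seed, as xm.get_rng_state() might return on TPU.
seed = 2**64 - 1

# Casting to float silently loses precision for large ints...
assert int(float(seed)) != seed

# ...while a plain Python int round-trips through torch.save exactly,
# which is why dropping the torch.tensor(...) wrapper fixes the crash.
buffer = io.BytesIO()
torch.save({"xm_seed": seed}, buffer)
buffer.seek(0)
restored = torch.load(buffer)
assert restored["xm_seed"] == seed
```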

sgugger commented on May 13, 2022 (1 reaction)

Not sure why it’s wrapped inside a Tensor in the first place, @muellerzr ?


Top Results From Across the Web

  • python 3.x - Overflow when unpacking long - Pytorch: Since torch.empty() gives uninitialized memory, you may or may not get a large value from it. Try x = torch.rand(5, 3); print(x)...
  • Overflow when unpacking long, during FX mode calibration: Hello, I am following the FX mode post-training static quantization tutorial, and got RuntimeError: Overflow when unpacking long...
  • Loading a Pytorch model? by Joe Bastulli - QuantConnect.com: I keep getting a runtime error when I try to load the path. RuntimeError: Overflow when unpacking long...
  • DeepStability - A Database of Numerical Methods for Deep Learning: an index of numerical-stability fixes in PyTorch and other libraries, listed by commit hash.
  • [Example code]-PyTorch can't use a float type but only long: I am trying to run this very basic neural network: import os; os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE" import torch import torchvision import torch.nn as ...
