
RuntimeError: Overflow when unpacking long

See original GitHub issue

Environment info

  • Machine: Google Cloud TPU VM version v2-alpha
  • transformers: 4.18.0
  • accelerate: 0.9.0.dev0 (the same error happens with 0.8.0.dev0)

Script

I am training a GPT2 model using the PyTorch example script run_clm_no_trainer.py.

Error

The error below happens while the model is saving a checkpoint, but it seems to occur only at the second or third checkpoint.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/launch.py", line 55, in __call__
    self.launcher(*args)
  File "/home/nguyenhuuthuat09/gpt2/train_v1.py", line 553, in main
    accelerator.save_state(output_dir)      <---- this is line 564 in the original run_clm_no_trainer.py
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 799, in save_state
    save_location = save_accelerator_state(
  File "/usr/local/lib/python3.8/dist-packages/accelerate/checkpointing.py", line 105, in save_accelerator_state
    states["xm_seed"] = torch.tensor(xm.get_rng_state())
RuntimeError: Overflow when unpacking long
Exception in device=TPU:0: Overflow when unpacking long
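The failure in the traceback can be reproduced without a TPU: torch.tensor() unpacks a Python int into a signed 64-bit long, so any value of 2**63 or larger overflows. A plausible trigger is that xm.get_rng_state() on XLA returns a seed outside that range; the value below is a hypothetical stand-in for such a seed, not one observed in the issue.

```python
import torch

# torch.tensor() converts a Python int by unpacking it into a signed
# 64-bit long, so any value outside [-2**63, 2**63 - 1] overflows.
# 2**63 here is a hypothetical stand-in for an out-of-range TPU seed.
seed = 2**63

try:
    torch.tensor(seed)
except RuntimeError as exc:
    print(f"RuntimeError: {exc}")
```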

Environment variables

  • export XRT_TPU_CONFIG="localservice;0;localhost:51011"
  • I run accelerate config and use accelerate launch to run the code.
  • After the error happened, I tried the two commands below, but they didn't help:
    export XLA_USE_BF16=1
    export XLA_TENSOR_ALLOCATOR_MAXSIZE=100000000

Related issue

Thank you for the great library!

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments:5 (2 by maintainers)

Top GitHub Comments

sgugger commented on May 13, 2022 (3 reactions)

The seed is an int, not a float @nguyenhuuthuat09, you won’t be able to reload that RNG state if you save it as float.

The proper fix is to just remove torch.tensor here.
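A minimal sketch of why storing the raw int works: a plain Python int survives torch.save/torch.load exactly even when it does not fit in int64, whereas a float cast silently loses the low bits of a large seed. The seed value and the "xm_seed" key below are illustrative, mirroring the traceback, not accelerate's actual patched code.

```python
import io
import torch

# Hypothetical out-of-range seed, as xm.get_rng_state() might return on TPU.
seed = 2**64 - 1

# Casting to float silently loses precision for large ints...
assert int(float(seed)) != seed

# ...while a plain Python int round-trips through torch.save exactly,
# which is why dropping the torch.tensor(...) wrapper fixes the crash.
buffer = io.BytesIO()
torch.save({"xm_seed": seed}, buffer)
buffer.seek(0)
restored = torch.load(buffer)
assert restored["xm_seed"] == seed
```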

sgugger commented on May 13, 2022 (1 reaction)

Not sure why it’s wrapped inside a Tensor in the first place, @muellerzr ?


Top Results From Across the Web

  • python 3.x - Overflow when unpacking long - Pytorch: Since torch.empty() gives uninitialized memory, you may or may not get a large value from it. Try x = torch.rand(5, 3); print(x)...
  • Overflow when unpacking long, during FX mode calibration: Hello, I am following the FX mode post-training static quantization tutorial, and got RuntimeError: Overflow when unpacking long...
  • Loading a Pytorch model? by Joe Bastulli - QuantConnect.com: I keep getting a runtime error when I try to load the path. RuntimeError: Overflow when unpacking long...
  • DeepStability - A Database of Numerical Methods for Deep Learning: an index of numerical-stability fixes in PyTorch and other libraries, listed by commit hash.
  • [Example code]-PyTorch can't use a float type but only long: I am trying to run this very basic neural network: import os; os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE" import torch import torchvision import torch.nn as ...
