torch_xla/csrc/tensor_methods.cpp:880 : Check failed: xla::ShapeUtil::Compatible(shapes.back(), tensor_shape)

See original GitHub issue

Environment info

  • transformers version: 4.5.0
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.8.1+cu101 (False)
  • Tensorflow version (GPU?): 2.4.1 (False)
  • Using GPU in script?: TPU
  • Using distributed or parallel set-up in script?:

Who can help

@patrickvonplaten

Information

I am using BigBirdForSequenceClassification and BigBirdTokenizer for a simple text classification problem on a Google Colab TPU.

The problem arises when using:

  • my own modified scripts (script shared below): if I use the BigBirdForSequenceClassification model on TPU, I start getting the errors shown under "To reproduce".

from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir == "neg" else 1)  # "neg" -> 0, "pos" -> 1

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

from transformers import BigBirdTokenizer
tokenizer = BigBirdTokenizer.from_pretrained('google/bigbird-roberta-base')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

from transformers import BigBirdForSequenceClassification, Trainer, TrainingArguments
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.core.xla_model as xm

def main():
  training_args = TrainingArguments(
      output_dir='./results',          # output directory
      num_train_epochs=1,              # total number of training epochs
      per_device_train_batch_size=1,  # batch size per device during training
      per_device_eval_batch_size=1,   # batch size for evaluation
      warmup_steps=500,                # number of warmup steps for learning rate scheduler
      weight_decay=0.01,               # strength of weight decay
      logging_dir='./logs',            # directory for storing logs
      logging_steps=10,
  )

  model = BigBirdForSequenceClassification.from_pretrained('google/bigbird-roberta-base')

  trainer = Trainer(
      model=model,                         # the instantiated 🤗 Transformers model to be trained
      args=training_args,                  # training arguments, defined above
      train_dataset=train_dataset,         # training dataset
      eval_dataset=val_dataset             # evaluation dataset
  )

  trainer.train()

def _mp_fn(index):
  main()

xmp.spawn(_mp_fn, args=(), nprocs=1, start_method='fork')
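
Note: torch_xla compiles a separate graph for every distinct tensor shape it sees, and BigBird (in its default block-sparse attention mode) additionally pads its inputs internally to a multiple of its block size (64 by default); that matches the pad=(0, 19, 0, 0) op in the trace below, which pads the sequence from 2349 to 2368 = 37 × 64 tokens. As a sketch of a possible workaround (not a confirmed fix for this issue; max_length=1024 is an arbitrary multiple of 64 chosen for illustration), the inputs could be tokenized to one fixed, block-aligned length so XLA always sees the same shape and the model has nothing left to pad:

# Workaround sketch only: pad every example to a fixed, block-aligned length so that
# XLA always sees the same input shape and BigBird's internal block padding is a no-op.
fixed_len = 1024  # arbitrary choice for illustration; any multiple of the block size (64) behaves the same
train_encodings = tokenizer(train_texts, truncation=True, padding='max_length', max_length=fixed_len)
val_encodings = tokenizer(val_texts, truncation=True, padding='max_length', max_length=fixed_len)
test_encodings = tokenizer(test_texts, truncation=True, padding='max_length', max_length=fixed_len)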

The task I am working on is:

  • my own task or dataset: Using the IMDB Dataset for Text Classification

To reproduce

Steps to reproduce the behavior:

  1. Set up the PyTorch/XLA TPU client on Google Colab: !pip install cloud-tpu-client https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.8-cp37-cp37m-linux_x86_64.whl
  2. Download and extract the dataset: !wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz followed by !tar -xf aclImdb_v1.tar.gz
  3. Execute the script shared above.
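
As a sanity check before step 3, it can help to confirm that torch_xla actually sees the Colab TPU runtime (xm.xla_device() is the standard torch_xla call for this; the exact device name printed may differ):

import torch_xla.core.xla_model as xm
print(xm.xla_device())  # prints an XLA device such as xla:1 when the TPU runtime is attached

Running the script then fails with the following traceback:
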
RuntimeError                              Traceback (most recent call last)
<ipython-input-14-38fb8a22e1a3> in <module>()
----> 1 xmp.spawn(_mp_fn, args=(), nprocs=1, start_method='fork')

7 frames
/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    384   pf_cfg = _pre_fork_setup(nprocs)
    385   if pf_cfg.num_devices == 1:
--> 386     _start_fn(0, pf_cfg, fn, args)
    387   else:
    388     return torch.multiprocessing.start_processes(

/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py in _start_fn(index, pf_cfg, fn, args)
    321   # environment must be fully setup before doing so.
    322   _setup_replication()
--> 323   fn(gindex, *args)
    324 
    325 

<ipython-input-12-0ed5b032dbf1> in _mp_fn(index)
     32 
     33 def _mp_fn(index):
---> 34   main()

<ipython-input-12-0ed5b032dbf1> in main()
     29   )
     30 
---> 31   trainer.train()
     32 
     33 def _mp_fn(index):

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
   1099             self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
   1100 
-> 1101             for step, inputs in enumerate(epoch_iterator):
   1102 
   1103                 # Skip past any already trained steps if resuming training

/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/parallel_loader.py in __next__(self)
     32 
     33   def __next__(self):
---> 34     return self.next()
     35 
     36   def __len__(self):

/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/parallel_loader.py in next(self)
     44       if self._mark_step_batch_count <= self._batches_yielded:
     45         self._batches_yielded = 0
---> 46         xm.mark_step()
     47       else:
     48         self._batches_yielded += 1

/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py in mark_step()
    716   torch_xla._XLAC._xla_step_marker(
    717       torch_xla._XLAC._xla_get_default_device(), [],
--> 718       wait=xu.getenv_as('XLA_SYNC_WAIT', bool, False))
    719   # Only emit metrics from the first local device index, to avoid emitting the
    720   # same values from different threads.

RuntimeError: Error while lowering: s64[1,2368]{1,0} aten::copysign, pad=(0, 19, 0, 0), value=0
Error: /pytorch/xla/torch_xla/csrc/helpers.h:100 : Check failed: scalar_value.isIntegral() 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace()
	torch_xla::XlaHelpers::ScalarValue(c10::Scalar, xla::PrimitiveType, xla::XlaBuilder*)
	
	torch_xla::ir::ops::ConstantPadNd::Lower(torch_xla::ir::LoweringContext*) const
	torch_xla::ir::LoweringContext::LowerNode(torch_xla::ir::Node const*)
	torch_xla::ir::LoweringContext::LoweringContext(std::string const&, torch_xla::Device, absl::lts_2020_02_25::Span<torch_xla::ir::Node const* const>, std::unordered_map<torch_xla::ir::Node const*, torch_xla::ir::Util::EmitStatus, std::hash<torch_xla::ir::Node const*>, std::equal_to<torch_xla::ir::Node const*>, std::allocator<std::pair<torch_xla::ir::Node const* const, torch_xla::ir::Util::EmitStatus> > >)
	torch_xla::XLATensor::Compile(std::vector<torch_xla::XLATensor, std::allocator<torch_xla::XLATensor> > const&, absl::lts_2020_02_25::Span<std::string const>, torch_xla::XLATensor::SyncTensorCollection const&, torch_xla::XLATensor::PostOrderData*)
	torch_xla::XLATensor::SyncTensorsGraphInternal(std::vector<torch_xla::XLATensor, std::allocator<torch_xla::XLATensor> >*, absl::lts_2020_02_25::Span<std::string const>, torch_xla::XLATensor::SyncTensorsConfig const&)
	torch_xla::XLATensor::SyncTensorsGraph(std::vector<torch_xla::XLATensor, std::allocator<torch_xla::XLATensor> >*, absl::lts_2020_02_25::Span<std::string const>, bool, bool)
	torch_xla::XLATensor::SyncLiveTensorsGraph(torch_xla::Device const*, absl::lts_2020_02_25::Span<std::string const>, bool)
	
	
	_PyMethodDef_RawFastCallKeywords
	_PyCFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyObject_FastCall_Prepend
	
	
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallDict
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	PyEval_EvalCode
	
	_PyMethodDef_RawFastCallKeywords
	_PyCFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyObject_Call_Prepend
	PyObject_Call
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallDict
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallDict
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyObject_Call_Prepend
	PyObject_Call
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyObject_Call_Prepend
	_PyObject_FastCallKeywords
	
	_PyMethodDef_RawFastCallDict
	PyCFunction_Call
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	PyEval_EvalCode
	
	_PyMethodDef_RawFastCallKeywords
	_PyCFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallDict
	
	
	_Py_UnixMain
	__libc_start_main
	_start
*** End stack trace ***
Scalar type not supported
Python Frames:

Similarly, on one run I got the following error instead:

RuntimeError: torch_xla/csrc/tensor_methods.cpp:880 : Check failed: xla::ShapeUtil::Compatible(shapes.back(), tensor_shape) 
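
Both failures are raised by PyTorch/XLA's own lowering checks rather than by the classification code itself. The first one (Check failed: scalar_value.isIntegral() in helpers.h, surfacing as "Scalar type not supported") fires when a non-integral scalar has to be lowered into an integer (s64) constant, and the ConstantPadNd frame in the stack trace indicates this happens while lowering a constant-pad op, presumably BigBird's internal block-size padding. The second one is an XLA shape-compatibility check: two tensors that the lowering expects to have compatible shapes do not. A minimal sketch of the first failure mode in isolation (an illustration of the check, not a confirmed root cause for BigBird; the sizes are taken from the trace above):

import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm

device = xm.xla_device()
ids = torch.zeros(1, 2349, dtype=torch.long, device=device)  # an s64 tensor, as in the trace
padded = F.pad(ids, (0, 19), value=0.0)  # non-integral pad value applied to an integer tensor
xm.mark_step()  # the graph is lowered here; this is where the isIntegral() check can fail

If that is indeed the path BigBird takes on TPU, casting the pad value to an integer inside the model, or keeping the inputs at a fixed block-aligned length as sketched after the script above, should sidestep the check.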

Expected behavior

Model training should start, but instead the errors above are raised.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
mabdullah1994 commented, May 2, 2021

Hi @vasudevgupta7 . Thanks for the update. Please let us know when this is fixed. Need this kind of urgently. Thanks!

1 reaction
patrickvonplaten commented, Apr 23, 2021

We didn’t check yet whether BigBird works on TPU. We should put it on the roadmap (cc @vasudevgupta7).

