torch_xla/csrc/tensor_methods.cpp:880 : Check failed: xla::ShapeUtil::Compatible(shapes.back(), tensor_shape)
Environment info
- transformers version: 4.5.0
- Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.10
- PyTorch version (GPU?): 1.8.1+cu101 (False)
- Tensorflow version (GPU?): 2.4.1 (False)
- Using GPU in script?: No (using a TPU)
- Using distributed or parallel set-up in script?:
Who can help
Information
I am using BigBirdForSequenceClassification and BigBirdTokenizer for a simple text classification problem on a Google Colab TPU.
The problem arises when using:
- my own modified scripts (shared below): when I use the BigBirdForSequenceClassification model on TPU, I get the errors shown under "To reproduce".
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            # Use == for string comparison; `is` checks identity, not equality.
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
from transformers import BigBirdTokenizer
tokenizer = BigBirdTokenizer.from_pretrained('google/bigbird-roberta-base')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
from transformers import BigBirdForSequenceClassification, Trainer, TrainingArguments
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.core.xla_model as xm

def main():
    training_args = TrainingArguments(
        output_dir='./results',          # output directory
        num_train_epochs=1,              # total number of training epochs
        per_device_train_batch_size=1,   # batch size per device during training
        per_device_eval_batch_size=1,    # batch size for evaluation
        warmup_steps=500,                # number of warmup steps for learning rate scheduler
        weight_decay=0.01,               # strength of weight decay
        logging_dir='./logs',            # directory for storing logs
        logging_steps=10,
    )
    model = BigBirdForSequenceClassification.from_pretrained('google/bigbird-roberta-base')
    trainer = Trainer(
        model=model,                  # the instantiated 🤗 Transformers model to be trained
        args=training_args,           # training arguments, defined above
        train_dataset=train_dataset,  # training dataset
        eval_dataset=val_dataset      # evaluation dataset
    )
    trainer.train()

def _mp_fn(index):
    main()

xmp.spawn(_mp_fn, args=(), nprocs=1, start_method='fork')
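For reference, nprocs=1 runs _mp_fn on a single TPU core. If the intent were to use all eight Colab TPU cores, the same entry point could be spawned with nprocs=8 (a hedged usage note based on standard torch_xla practice of that era, not part of the original script):

# Assumed variant, not from the original report: spawn across all eight
# Colab TPU cores instead of one.
xmp.spawn(_mp_fn, args=(), nprocs=8, start_method='fork')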
The task I am working on is:
- my own task/dataset: text classification on the IMDB dataset
To reproduce
Steps to reproduce the behavior:
- Set up the TPU client on Google Colab: !pip install cloud-tpu-client https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.8-cp37-cp37m-linux_x86_64.whl
- Download and extract the dataset:
  a. !wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
  b. !tar -xf aclImdb_v1.tar.gz
- Execute the given script
RuntimeError Traceback (most recent call last)
<ipython-input-14-38fb8a22e1a3> in <module>()
----> 1 xmp.spawn(_mp_fn, args=(), nprocs=1, start_method='fork')
/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
384 pf_cfg = _pre_fork_setup(nprocs)
385 if pf_cfg.num_devices == 1:
--> 386 _start_fn(0, pf_cfg, fn, args)
387 else:
388 return torch.multiprocessing.start_processes(
/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py in _start_fn(index, pf_cfg, fn, args)
321 # environment must be fully setup before doing so.
322 _setup_replication()
--> 323 fn(gindex, *args)
324
325
<ipython-input-12-0ed5b032dbf1> in _mp_fn(index)
32
33 def _mp_fn(index):
---> 34 main()
<ipython-input-12-0ed5b032dbf1> in main()
29 )
30
---> 31 trainer.train()
32
33 def _mp_fn(index):
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
1099 self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
1100
-> 1101 for step, inputs in enumerate(epoch_iterator):
1102
1103 # Skip past any already trained steps if resuming training
/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/parallel_loader.py in __next__(self)
32
33 def __next__(self):
---> 34 return self.next()
35
36 def __len__(self):
/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/parallel_loader.py in next(self)
44 if self._mark_step_batch_count <= self._batches_yielded:
45 self._batches_yielded = 0
---> 46 xm.mark_step()
47 else:
48 self._batches_yielded += 1
/usr/local/lib/python3.7/dist-packages/torch_xla/core/xla_model.py in mark_step()
716 torch_xla._XLAC._xla_step_marker(
717 torch_xla._XLAC._xla_get_default_device(), [],
--> 718 wait=xu.getenv_as('XLA_SYNC_WAIT', bool, False))
719 # Only emit metrics from the first local device index, to avoid emitting the
720 # same values from different threads.
RuntimeError: Error while lowering: s64[1,2368]{1,0} aten::copysign, pad=(0, 19, 0, 0), value=0
Error: /pytorch/xla/torch_xla/csrc/helpers.h:100 : Check failed: scalar_value.isIntegral()
*** Begin stack trace ***
tensorflow::CurrentStackTrace()
torch_xla::XlaHelpers::ScalarValue(c10::Scalar, xla::PrimitiveType, xla::XlaBuilder*)
torch_xla::ir::ops::ConstantPadNd::Lower(torch_xla::ir::LoweringContext*) const
torch_xla::ir::LoweringContext::LowerNode(torch_xla::ir::Node const*)
torch_xla::ir::LoweringContext::LoweringContext(std::string const&, torch_xla::Device, absl::lts_2020_02_25::Span<torch_xla::ir::Node const* const>, std::unordered_map<torch_xla::ir::Node const*, torch_xla::ir::Util::EmitStatus, std::hash<torch_xla::ir::Node const*>, std::equal_to<torch_xla::ir::Node const*>, std::allocator<std::pair<torch_xla::ir::Node const* const, torch_xla::ir::Util::EmitStatus> > >)
torch_xla::XLATensor::Compile(std::vector<torch_xla::XLATensor, std::allocator<torch_xla::XLATensor> > const&, absl::lts_2020_02_25::Span<std::string const>, torch_xla::XLATensor::SyncTensorCollection const&, torch_xla::XLATensor::PostOrderData*)
torch_xla::XLATensor::SyncTensorsGraphInternal(std::vector<torch_xla::XLATensor, std::allocator<torch_xla::XLATensor> >*, absl::lts_2020_02_25::Span<std::string const>, torch_xla::XLATensor::SyncTensorsConfig const&)
torch_xla::XLATensor::SyncTensorsGraph(std::vector<torch_xla::XLATensor, std::allocator<torch_xla::XLATensor> >*, absl::lts_2020_02_25::Span<std::string const>, bool, bool)
torch_xla::XLATensor::SyncLiveTensorsGraph(torch_xla::Device const*, absl::lts_2020_02_25::Span<std::string const>, bool)
_PyMethodDef_RawFastCallKeywords
_PyCFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyObject_FastCall_Prepend
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyFunction_FastCallDict
_PyEval_EvalFrameDefault
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
PyEval_EvalCode
_PyMethodDef_RawFastCallKeywords
_PyCFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyObject_Call_Prepend
PyObject_Call
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_FastCallDict
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_FastCallDict
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyObject_Call_Prepend
PyObject_Call
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyObject_Call_Prepend
_PyObject_FastCallKeywords
_PyMethodDef_RawFastCallDict
PyCFunction_Call
_PyEval_EvalFrameDefault
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
PyEval_EvalCode
_PyMethodDef_RawFastCallKeywords
_PyCFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_FastCallKeywords
_PyEval_EvalFrameDefault
_PyEval_EvalCodeWithName
_PyFunction_FastCallDict
_Py_UnixMain
__libc_start_main
_start
*** End stack trace ***
Scalar type not supported
Python Frames:
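For context, the trace shows ConstantPadNd::Lower failing because the scalar pad value is not integral while the tensor being padded is integer-typed (s64). A hypothetical minimal sketch of the same failure mode, assuming a torch_xla 1.8 TPU runtime (the shapes and the float pad value are illustrative assumptions, not taken from the report):

import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm

device = xm.xla_device()
input_ids = torch.ones(1, 2349, dtype=torch.int64, device=device)  # s64 tensor
# Padding an integer tensor with a float scalar: the XLA lowering of
# constant_pad_nd checks scalar_value.isIntegral() and rejects 0.0.
padded = F.pad(input_ids, (0, 19), value=0.0)
xm.mark_step()  # the graph is lowered here, surfacing the RuntimeError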
Similarly, on another run I got the following error:
RuntimeError: torch_xla/csrc/tensor_methods.cpp:880 : Check failed: xla::ShapeUtil::Compatible(shapes.back(), tensor_shape)
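Both failures involve padded sequence lengths that differ from batch to batch: padding=True pads only to the longest sequence in each batch, while XLA strongly prefers static shapes. A hedged mitigation sketch, not verified against this issue, is to pad every split to one fixed length (max_length=4096 is BigBird's maximum sequence length and an illustrative choice here):

# Assumption: one static padded length avoids per-batch shape changes on TPU.
train_encodings = tokenizer(train_texts, truncation=True,
                            padding="max_length", max_length=4096)
val_encodings = tokenizer(val_texts, truncation=True,
                          padding="max_length", max_length=4096)
test_encodings = tokenizer(test_texts, truncation=True,
                           padding="max_length", max_length=4096)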
Expected behavior
Model training should start, but instead it fails with the errors above.
Top GitHub Comments
Hi @vasudevgupta7. Thanks for the update. Please let us know when this is fixed; we need this rather urgently. Thanks!
We haven't yet checked whether BigBird works on TPU. We should put it on the roadmap (cc @vasudevgupta7).