
Error in HuggingFace Course "Fine-tuning a pretrained model"

See original GitHub issue

I'm new to Hugging Face and just going through your newly posted course.

To reproduce

Open a Google Colab notebook.

Run

!pip install transformers[sentencepiece]
!pip install datasets

Then follow the steps in this chapter of the Hugging Face course: https://huggingface.co/course/chapter3/3?fw=pt
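
For reference, the code that leads up to that call looks roughly like this. This is a condensed sketch of the chapter's setup (GLUE/MRPC, bert-base-uncased), not a verbatim copy, so double-check against the course page itself:

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# MRPC paraphrase pairs, the dataset used throughout the chapter
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    # Truncation is enabled here; padding is left to the data collator
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
training_args = TrainingArguments("test-trainer")

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)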

At the step where you are told to call trainer.train(), you see this error:

***** Running training *****
  Num examples = 3668
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1377
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    698                 if not is_tensor(value):
--> 699                     tensor = as_tensor(value)
    700 

ValueError: too many dimensions 'str'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
8 frames
<ipython-input-50-3435b262f1ae> in <module>()
----> 1 trainer.train()

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
   1241             self.control = self.callback_handler.on_epoch_begin(args, self.state, self.control)
   1242 
-> 1243             for step, inputs in enumerate(epoch_iterator):
   1244 
   1245                 # Skip past any already trained steps if resuming training

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    519             if self._sampler_iter is None:
    520                 self._reset()
--> 521             data = self._next_data()
    522             self._num_yielded += 1
    523             if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
    559     def _next_data(self):
    560         index = self._next_index()  # may raise StopIteration
--> 561         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    562         if self._pin_memory:
    563             data = _utils.pin_memory.pin_memory(data)

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     45         else:
     46             data = self.dataset[possibly_batched_index]
---> 47         return self.collate_fn(data)

/usr/local/lib/python3.7/dist-packages/transformers/data/data_collator.py in __call__(self, features)
    121             max_length=self.max_length,
    122             pad_to_multiple_of=self.pad_to_multiple_of,
--> 123             return_tensors="pt",
    124         )
    125         if "label" in batch:

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in pad(self, encoded_inputs, padding, max_length, pad_to_multiple_of, return_attention_mask, return_tensors, verbose)
   2700                 batch_outputs[key].append(value)
   2701 
-> 2702         return BatchEncoding(batch_outputs, tensor_type=return_tensors)
   2703 
   2704     def create_token_type_ids_from_sequences(

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __init__(self, data, encoding, tensor_type, prepend_batch_axis, n_sequences)
    202         self._n_sequences = n_sequences
    203 
--> 204         self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
    205 
    206     @property

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    714                     )
    715                 raise ValueError(
--> 716                     "Unable to create tensor, you should probably activate truncation and/or padding "
    717                     "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
    718                 )

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

Expected behavior

I expected it to start training. The error message seems incorrect, since padding and truncation are already set to True.
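
For what it's worth, the inner error ("too many dimensions 'str'") is what you typically see when a raw string value reaches the data collator. One way to narrow this down, borrowing the manual-collation check from the previous section of the course (the column names assume the MRPC dataset used in the chapter):

# Feed a handful of tokenized rows to the collator by hand, dropping the
# raw string columns first, and see whether it can build a padded batch.
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
batch = data_collator(samples)
print({k: v.shape for k, v in batch.items()})

If this succeeds but trainer.train() still fails, string columns are somehow making it into the training batches (the Trainer normally strips them because remove_unused_columns defaults to True), which points at a stale or mismatched transformers install rather than at the course code.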

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
grantdelozier commented, Jul 19, 2021

My apologies for the spurious issue. I did a factory reset of my runtime today and was unable to replicate the error. Thanks for the awesome library!

0 reactions
Boodhayana commented, May 12, 2022

Quoting the maintainer's earlier reply:

    I am unable to reproduce. Are you sure you don’t have an old version of transformers in your Colab runtime for some reason? Could you run

    ! transformers-cli env

    in a cell and paste the output here?

@sgugger

This is the output:

- `transformers` version: 4.18.0
- Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- Huggingface_hub version: 0.6.0
- PyTorch version (GPU?): 1.11.0+cu113 (True)
- Tensorflow version (GPU?): 2.8.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

I am trying to use ‘bert-base-cased’ for text classification, and I got the same error as the OP.
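
Given that the original poster's error went away after a factory reset of the runtime, a sensible first step when hitting the same traceback in Colab is to rule out stale packages. A minimal sketch using standard pip commands (not something prescribed in the thread):

# Upgrade the relevant libraries in the current Colab runtime, then restart
# the runtime (Runtime -> Restart runtime) so the new versions get imported.
!pip install -U transformers datasets
!transformers-cli env

Re-running transformers-cli env after the restart confirms which versions the notebook actually picked up.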

Read more comments on GitHub >

Top Results From Across the Web

  • What to do when you get an error - Hugging Face Course
    In this section we'll look at some common errors that can occur when you're trying to generate predictions from your freshly tuned Transformer...
  • Fine-tune a pretrained model - Hugging Face
    You will see a warning about some of the pretrained weights not being used and some weights being randomly initialized. Don't worry, this...
  • Error while training a custom pretrained model - Beginners
    Hi, I trained a model as follows: checkpoint = “bert-base-uncased” tokenizer = AutoTokenizer.from_pretrained(checkpoint)
  • Fine-tuning a pretrained model - Hugging Face
    In this tutorial, we will show you how to fine-tune a pretrained model from the Transformers library. In TensorFlow, models can be directly...
  • DataCollatorWithPadding: TypeError - Hugging Face Forums
    Hi, I am following the course. I am now at Fine-tuning a pretrained model - Hugging Face Course.
