
Error in HuggingFace Course "Fine-tuning a pretrained model"

See original GitHub issue

I'm new to Hugging Face and just going through your newly posted course.

To reproduce

Open a Google Colab notebook.

Run

!pip install transformers[sentencepiece]
!pip install datasets

Then follow the steps in this chapter of the Hugging Face course: https://huggingface.co/course/chapter3/3?fw=pt
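
For reference, the code that leads up to that call looks roughly like this. This is a condensed sketch of the chapter's setup (GLUE/MRPC, bert-base-uncased), not a verbatim copy, so double-check against the course page itself:

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# MRPC paraphrase pairs, the dataset used throughout the chapter
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    # Truncation is enabled here; padding is left to the data collator
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
training_args = TrainingArguments("test-trainer")

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)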

At the step where you are told to call trainer.train(), you see this error:

***** Running training *****
  Num examples = 3668
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1377
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    698                 if not is_tensor(value):
--> 699                     tensor = as_tensor(value)
    700 

ValueError: too many dimensions 'str'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
8 frames
<ipython-input-50-3435b262f1ae> in <module>()
----> 1 trainer.train()

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
   1241             self.control = self.callback_handler.on_epoch_begin(args, self.state, self.control)
   1242 
-> 1243             for step, inputs in enumerate(epoch_iterator):
   1244 
   1245                 # Skip past any already trained steps if resuming training

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    519             if self._sampler_iter is None:
    520                 self._reset()
--> 521             data = self._next_data()
    522             self._num_yielded += 1
    523             if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
    559     def _next_data(self):
    560         index = self._next_index()  # may raise StopIteration
--> 561         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    562         if self._pin_memory:
    563             data = _utils.pin_memory.pin_memory(data)

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     45         else:
     46             data = self.dataset[possibly_batched_index]
---> 47         return self.collate_fn(data)

/usr/local/lib/python3.7/dist-packages/transformers/data/data_collator.py in __call__(self, features)
    121             max_length=self.max_length,
    122             pad_to_multiple_of=self.pad_to_multiple_of,
--> 123             return_tensors="pt",
    124         )
    125         if "label" in batch:

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in pad(self, encoded_inputs, padding, max_length, pad_to_multiple_of, return_attention_mask, return_tensors, verbose)
   2700                 batch_outputs[key].append(value)
   2701 
-> 2702         return BatchEncoding(batch_outputs, tensor_type=return_tensors)
   2703 
   2704     def create_token_type_ids_from_sequences(

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __init__(self, data, encoding, tensor_type, prepend_batch_axis, n_sequences)
    202         self._n_sequences = n_sequences
    203 
--> 204         self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
    205 
    206     @property

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    714                     )
    715                 raise ValueError(
--> 716                     "Unable to create tensor, you should probably activate truncation and/or padding "
    717                     "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
    718                 )

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

Expected behavior

I expected it to start training. The error message seems incorrect, since padding and truncation are already set to True.
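
For what it's worth, the inner error ("too many dimensions 'str'") is what you typically see when a raw string value reaches the data collator. One way to narrow this down, borrowing the manual-collation check from the previous section of the course (the column names assume the MRPC dataset used in the chapter):

# Feed a handful of tokenized rows to the collator by hand, dropping the
# raw string columns first, and see whether it can build a padded batch.
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
batch = data_collator(samples)
print({k: v.shape for k, v in batch.items()})

If this succeeds but trainer.train() still fails, string columns are somehow making it into the training batches (the Trainer normally strips them because remove_unused_columns defaults to True), which points at a stale or mismatched transformers install rather than at the course code.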

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
grantdelozier commented, Jul 19, 2021

My apologies for the spurious issue. I did a factory reset of my runtime today and was unable to replicate the error. Thanks for the awesome library!

0 reactions
Boodhayana commented, May 12, 2022

Quoting the maintainer's earlier reply:

    I am unable to reproduce. Are you sure you don’t have an old version of transformers in your Colab runtime for some reason? Could you run

    ! transformers-cli env

    in a cell and paste the output here?

@sgugger

This is the output:

- `transformers` version: 4.18.0
- Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- Huggingface_hub version: 0.6.0
- PyTorch version (GPU?): 1.11.0+cu113 (True)
- Tensorflow version (GPU?): 2.8.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

I am trying to use ‘bert-base-cased’ for text classification, and I got the same error as the OP.
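
Given that the original poster's error went away after a factory reset of the runtime, a sensible first step when hitting the same traceback in Colab is to rule out stale packages. A minimal sketch using standard pip commands (not something prescribed in the thread):

# Upgrade the relevant libraries in the current Colab runtime, then restart
# the runtime (Runtime -> Restart runtime) so the new versions get imported.
!pip install -U transformers datasets
!transformers-cli env

Re-running transformers-cli env after the restart confirms which versions the notebook actually picked up.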

Read more comments on GitHub >

Top Results From Across the Web

  • What to do when you get an error - Hugging Face Course
    In this section we'll look at some common errors that can occur when you're trying to generate predictions from your freshly tuned Transformer...
  • Fine-tune a pretrained model - Hugging Face
    You will see a warning about some of the pretrained weights not being used and some weights being randomly initialized. Don't worry, this...
  • Error while training a custom pretrained model - Beginners
    Hi, I trained a model as follows: checkpoint = “bert-base-uncased” tokenizer = AutoTokenizer.from_pretrained(checkpoint)
  • Fine-tuning a pretrained model - Hugging Face
    In this tutorial, we will show you how to fine-tune a pretrained model from the Transformers library. In TensorFlow, models can be directly...
  • DataCollatorWithPadding: TypeError - Hugging Face Forums
    Hi, I am following the course. I am now at Fine-tuning a pretrained model - Hugging Face Course.
