
[Possible Bug] Getting IndexError: list index out of range when fine-tuning custom LM model


Environment info

`transformers` version: 4.3.3
Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.7.10
PyTorch version (GPU?): 1.7.1+cu101 (False)
Tensorflow version (GPU?): 2.4.1 (False)
Using GPU in script?: True/False
Using distributed or parallel set-up in script?: False

Who can help

Information

Model I am using (Bert, XLNet …): Longformer

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Hi, I am trying to train a language model on a custom dataset (plain text spread over multiple lines). My choice was the Longformer, and I am using the exact code provided officially, with just a few modifications.

When I fine-tune it on a custom dataset, I get this error:

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-54-2f2d9c2c00fc> in <module>()
     45     )
     46 
---> 47 train_results = trainer.train()

6 frames

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
   1032             self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
   1033 
-> 1034             for step, inputs in enumerate(epoch_iterator):
   1035 
   1036                 # Skip past any already trained steps if resuming training

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    515             if self._sampler_iter is None:
    516                 self._reset()
--> 517             data = self._next_data()
    518             self._num_yielded += 1
    519             if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
    555     def _next_data(self):
    556         index = self._next_index()  # may raise StopIteration
--> 557         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    558         if self._pin_memory:
    559             data = _utils.pin_memory.pin_memory(data)

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

<ipython-input-53-5e4959dcf50c> in __getitem__(self, idx)
      7 
      8     def __getitem__(self, idx):
----> 9         item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
     10         item['labels'] = torch.tensor(self.labels[idx])
     11         return item

<ipython-input-53-5e4959dcf50c> in <dictcomp>(.0)
      7 
      8     def __getitem__(self, idx):
----> 9         item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
     10         item['labels'] = torch.tensor(self.labels[idx])
     11         return item

IndexError: list index out of range

Most probably it is a tokenization problem, but I can't seem to locate it. I made sure the tokenizer in the LM accepts an appropriate maximum length (even though it is larger than I need):

tokenizer = LongformerTokenizerFast.from_pretrained("./ny_model", max_len=3500)
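
(As a side note, max_len is the legacy name for model_max_length; assuming the deprecated argument is still mapped in this version, a quick print should confirm the tokenizer picked it up:)

print(tokenizer.model_max_length)  # expected: 3500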

For fine-tuning, I made sure the tokenizer truncates and pads, though none of my data samples are long enough to be truncated:

train_encodings = tokenizer(list(train_text), truncation=True, padding=True, max_length=3500)
val_encodings = .....
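
For completeness, the custom dataset class follows the same pattern as the official fine-tuning tutorial; I am reconstructing it here from the traceback (the class name and the __len__ body are my assumptions, the __getitem__ lines are verbatim from above):

import torch

class NYADataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # the IndexError above fires inside this dict comprehension,
        # i.e. val[idx] fails for one of the encoding lists
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        # assumed; if this is based on labels while the encodings hold
        # fewer samples, the DataLoader will request indices the
        # encodings cannot serve, which is exactly this IndexError
        return len(self.labels)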

Finally, I tried some fixed-length dummy data like this:

train_text = ['a', 'b']
val_text = ['c', 'd']

This rules out most tokenization errors. I am fine-tuning in accordance with the official scripts, something I have done before. The LM itself looks good to me and tokenizes individual samples correctly as well, so I have no reason to suspect it.
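
One quick check along the same lines (a sketch, using the hypothetical names from the dataset sketch above; adapt to the real variables): every field the dataset indexes must report the same number of samples.

print(len(train_encodings['input_ids']))  # number of encoded samples
print(len(train_labels))                  # must match the line above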

I am attaching my LM pre-training code:

!pip install -q git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'

%%time 
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files='./NYA.txt', vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

!mkdir ny_model
tokenizer.save_model("ny_model")
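
# (Added check, not part of the original notebook:) save_model writes the
# two files a byte-level BPE tokenizer needs
!ls ny_model  # should list vocab.json and merges.txt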

from transformers import LongformerConfig

config = LongformerConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=2,
    num_hidden_layers=1,
    type_vocab_size=1,
)

from transformers import LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("./ny_model", max_len=3500)

from transformers import LongformerForMaskedLM

model = LongformerForMaskedLM(config=config)
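
# (Added sanity check, not part of the original notebook:) with one hidden
# layer and two attention heads this config is deliberately tiny
print(f"{model.num_parameters():,} parameters")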

%%time
from transformers import LineByLineTextDataset

# each non-empty line of the file becomes one training example,
# truncated to block_size tokens
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./NYA.txt",
    block_size=128,
)

from transformers import DataCollatorForLanguageModeling

# randomly mask 15% of tokens for the MLM objective
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    learning_rate=1e-5,
    logging_steps=50,
    fp16=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator
)
trainer.train()
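
One more thing worth checking when reproducing: the config above sets max_position_embeddings=514 while the tokenizer is created with max_len=3500, so a long enough sample could outgrow the position embeddings. A sketch of that check (assuming the train_encodings from the fine-tuning step above):

longest = max(len(ids) for ids in train_encodings['input_ids'])
print(longest, model.config.max_position_embeddings)  # the first should stay below the second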

And as said before, the fine-tuning part is just like the official scripts, save for the tokenizer arguments and some simple training args.

I believe this code with a simple dummy dataset could reproduce the bug. I can provide further help on a gist if someone can create one for full reproducibility. If there is some obvious mistake I have made, please don't hesitate to point it out.

Any ideas what the problem might be?

Cheers

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, Apr 9, 2021

You should use the forums instead of Stack Overflow; there will be more people to answer your questions there.

0 reactions
github-actions[bot] commented, May 9, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
