[Possible Bug] Getting IndexError: list index out of range when fine-tuning custom LM model
Environment info
`transformers` version: 4.3.3
Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.7.10
PyTorch version (GPU?): 1.7.1+cu101 (False)
Tensorflow version (GPU?): 2.4.1 (False)
Using GPU in script?: True/False
Using distributed or parallel set-up in script?: False
Who can help
- longformer, reformer, transfoxl, xlnet: @patrickvonplaten
- tokenizers: @LysandreJik
- trainer: @sgugger
Information
Model I am using (Bert, XLNet …): LongFormer
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Hi, I am trying to train a language model (LM) on a custom dataset, which is simply text spread over multiple lines. I chose the Longformer, and I am using the officially provided code with just a few modifications.
When I fine-tune it on a custom dataset, I get this error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-54-2f2d9c2c00fc> in <module>()
45 )
46
---> 47 train_results = trainer.train()
6 frames
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
1032 self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
1033
-> 1034 for step, inputs in enumerate(epoch_iterator):
1035
1036 # Skip past any already trained steps if resuming training
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
515 if self._sampler_iter is None:
516 self._reset()
--> 517 data = self._next_data()
518 self._num_yielded += 1
519 if self._dataset_kind == _DatasetKind.Iterable and \
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
555 def _next_data(self):
556 index = self._next_index() # may raise StopIteration
--> 557 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
558 if self._pin_memory:
559 data = _utils.pin_memory.pin_memory(data)
/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
<ipython-input-53-5e4959dcf50c> in __getitem__(self, idx)
7
8 def __getitem__(self, idx):
----> 9 item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
10 item['labels'] = torch.tensor(self.labels[idx])
11 return item
<ipython-input-53-5e4959dcf50c> in <dictcomp>(.0)
7
8 def __getitem__(self, idx):
----> 9 item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
10 item['labels'] = torch.tensor(self.labels[idx])
11 return item
IndexError: list index out of range
It is most probably a tokenization problem, but I can’t seem to locate it. I made sure that the tokenizer for the LM accepts an appropriate length (even if it is considerably larger than I need):
tokenizer = LongformerTokenizerFast.from_pretrained("./ny_model", max_len=3500)
For fine-tuning, I made sure the tokenizer truncates and pads, though none of my data samples are long enough to be truncated:
train_encodings = tokenizer(list(train_text), truncation=True, padding=True, max_length=3500)
val_encodings = .....
Finally, I tried with some dummy data with fixed length like this:
train_text = ['a', 'b']
val_text = ['c', 'd']
That rules out most tokenization errors. I am fine-tuning in accordance with the official scripts, which is something I have done before. The LM looks good to me and tokenizes individual inputs as expected, so I have no reason to suspect it.
I am attaching my LM code:
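One way to narrow this down, sketched under assumptions: the last frame of the traceback fails at val[idx] inside the dictionary comprehension, which suggests the sampler produced an index that at least one list in self.encodings does not have. That can happen if __len__ reports more items than the encodings contain. __len__ is not shown in the traceback and is assumed here to return len(self.labels). The names find_length_mismatches and CheckedDataset are hypothetical, and plain lists stand in for the tensors so the check runs without torch.

```python
from typing import Dict, List, Sequence


def find_length_mismatches(encodings: Dict[str, Sequence], labels: Sequence) -> List[str]:
    """Describe every encodings field whose row count differs from len(labels)."""
    problems = []
    for key, values in encodings.items():
        if len(values) != len(labels):
            problems.append(f"{key}: {len(values)} rows, but labels has {len(labels)}")
    return problems


class CheckedDataset:
    """Hypothetical stand-in for the custom dataset class seen in the traceback."""

    def __init__(self, encodings, labels):
        # Fail early, with a readable message, instead of inside the DataLoader.
        problems = find_length_mismatches(encodings, labels)
        if problems:
            raise ValueError("encodings/labels length mismatch: " + "; ".join(problems))
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        # Assumption: the original class also derives its length from the labels.
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item


# Dummy data: two encoded texts but three labels (for example, labels from another split).
encodings = {"input_ids": [[0, 5, 2], [0, 6, 2]], "attention_mask": [[1, 1, 1], [1, 1, 1]]}
mismatches = find_length_mismatches(encodings, labels=[0, 1, 1])
print(mismatches)
```

Running the same check on train_encodings and val_encodings against their label lists would show whether the two sides were built from different splits.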
!pip install -q git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
%%time
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()
# Customize training
tokenizer.train(files='./NYA.txt', vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
!mkdir ny_model
tokenizer.save_model("ny_model")
from transformers import LongformerConfig
config = LongformerConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=2,
    num_hidden_layers=1,
    type_vocab_size=1,
)
from transformers import LongformerTokenizerFast
tokenizer = LongformerTokenizerFast.from_pretrained("./ny_model", max_len=3500)
from transformers import LongformerForMaskedLM
model = LongformerForMaskedLM(config=config)
%%time
from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./NYA.txt",
    block_size=128,
)
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    learning_rate=1e-5,
    logging_steps=50,
    fp16=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,
)
trainer.train()
As said, the fine-tuning part is just like the official scripts, apart from the tokenizer arguments and a few simple training args.
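A separate point that may be worth checking, though it is not what the traceback above shows: the config sets max_position_embeddings=514, while the fine-tuning tokenizer call allows up to max_length=3500. The sketch below assumes the RoBERTa-style convention in which position ids start two slots in, leaving max_position_embeddings - 2 usable tokens. That offset, and the helper names max_usable_tokens and rows_exceeding_limit, are assumptions for illustration and not part of the original code.

```python
from typing import List, Sequence


def max_usable_tokens(max_position_embeddings: int, position_offset: int = 2) -> int:
    """Longest sequence the position embeddings can index.

    Assumes RoBERTa-style position ids that are shifted by `position_offset`.
    """
    return max_position_embeddings - position_offset


def rows_exceeding_limit(batch_input_ids: Sequence[Sequence[int]],
                         max_position_embeddings: int) -> List[int]:
    """Return the indices of rows that are longer than the usable length."""
    limit = max_usable_tokens(max_position_embeddings)
    return [i for i, ids in enumerate(batch_input_ids) if len(ids) > limit]


# Values taken from the config and tokenizer calls in the post.
limit = max_usable_tokens(514)
# One row at block_size=128 and one padded or truncated to max_length=3500.
too_long = rows_exceeding_limit([[0] * 128, [0] * 3500], max_position_embeddings=514)
print(limit, too_long)
```

With padding=True the tokenizer pads only to the longest sample in the call, so this would matter only if some encoded text exceeds the usable length.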
I believe that this code with a simple dummy dataset could reproduce the bug. I can provide further help on the gist if someone can create one for full reproducibility. If there is some idiotic mistake I have made, please don’t hesitate to point that out.
Any ideas what the problem might be?
Cheers
Issue Analytics
- State:
- Created 2 years ago
- Comments:8 (3 by maintainers)
Top GitHub Comments
You should use the forums instead of stack overflow, there will be more people to answer your questions there.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.