
Improve the documentation for TrainingArguments.label_names, and if possible raise an error if users misinterpret this attribute like I did


Original Issue Title: Possible typo in trainer.py: prediction_step(), forgetting to exclude loss item of outputs dict when assigning logits

Update: I determined that the root cause of my error was an incorrect assignment of TrainingArguments.label_names. There is no typo in Trainer.prediction_step(), as I suggested below. However, there is still an issue: see my comment for elaboration.

I was using the Trainer to fine-tune KB-Bert-Base-Swedish-Cased for multi-class SequenceClassification when I got an IndexError: tuple index out of range during the evaluation stage (I had set up the Trainer to evaluate after each epoch).
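For context, the misconfiguration that triggers this looks roughly like the following. This is a minimal sketch: the model is the one from the issue (hub id assumed to be "KB/bert-base-swedish-cased"), while num_labels and the class names are illustrative.

from transformers import AutoModelForSequenceClassification, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "KB/bert-base-swedish-cased", num_labels=3  # num_labels is illustrative
)

# The mistake: label_names is NOT the list of class names. It is the list
# of keys in the input batch that hold the labels (default: ["labels"]).
args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="epoch",                    # evaluate after each epoch
    label_names=["culture", "sports", "politics"],  # wrong - illustrative class names
)

# Since no batch key matches these names, Trainer decides has_labels=False
# during evaluation and takes the code path discussed below.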

I started PDB and paused at this line in the evaluation phase:

https://github.com/huggingface/transformers/blob/6bc89ed9295443e5a3ee236ad544101752563917/src/transformers/trainer.py#L1805

With the debugger, I saw that loss=None, labels=None, and logits is actually a tuple with two items. The first item is the prediction loss, and the second is the actual output logits from the model's forward pass.
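The state at that line looks roughly like this in a pdb session (the tensor values here are illustrative, not from the actual run):

(Pdb) p loss
None
(Pdb) p labels
None
(Pdb) p logits
(tensor(0.6931), tensor([[ 0.4102, -1.2876,  0.8311]]))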

I think this strange assignment of the local logits variable is coming from here, inside prediction_step:

https://github.com/huggingface/transformers/blob/6bc89ed9295443e5a3ee236ad544101752563917/src/transformers/trainer.py#L1933

As the outputs dict includes the loss, and “loss” is not in ignore_keys, the loss value in outputs gets baked into logits.

I'm fairly sure it's a typo: comparing it with the similar line a few lines above (which is executed when has_labels=True):

https://github.com/huggingface/transformers/blob/6bc89ed9295443e5a3ee236ad544101752563917/src/transformers/trainer.py#L1922

The above links are all from version 4.4.2, but this possible typo is still present on master:

https://github.com/huggingface/transformers/blob/9856c9213dfe9f8355fe00dd6cd0fa1ceae4fa5a/src/transformers/trainer.py#L1966

I haven't been able to read and grasp the code in depth, but it looks to me like either we're forgetting to ignore the "loss" key in outputs, or the return statement of prediction_step should somehow unpack the logits tuple, so that its two items end up in loss and logits respectively:

https://github.com/huggingface/transformers/blob/6bc89ed9295443e5a3ee236ad544101752563917/src/transformers/trainer.py#L1947
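Side by side, the asymmetry between the two code paths is easier to see. The following is a paraphrased sketch of the relevant part of prediction_step in v4.4.2, not a verbatim copy:

# Paraphrased from Trainer.prediction_step (v4.4.2):
if has_labels:
    loss, outputs = self.compute_loss(model, inputs, return_outputs=True)
    loss = loss.mean().detach()
    if isinstance(outputs, dict):
        # here "loss" IS excluded from the logits tuple ...
        logits = tuple(v for k, v in outputs.items() if k not in ignore_keys + ["loss"])
else:
    loss = None
    outputs = model(**inputs)
    if isinstance(outputs, dict):
        # ... but here it is NOT. If the model was still given labels (because
        # label_names was misconfigured, so has_labels came out False even
        # though the batch contains labels), outputs has a "loss" key, and it
        # leaks in as the first element of logits.
        logits = tuple(v for k, v in outputs.items() if k not in ignore_keys)
        # The fix suggested above would be to exclude it here as well:
        # logits = tuple(v for k, v in outputs.items() if k not in ignore_keys + ["loss"])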

For clarity, this is the stack trace of how I encounter the tuple index error from the typo described above:

In the evaluation phase, prediction_loop runs over all the batches in my dev dataset. It gets the model output/prediction of each dev batch here:

https://github.com/huggingface/transformers/blob/6bc89ed9295443e5a3ee236ad544101752563917/src/transformers/trainer.py#L1805

Later in prediction_loop, we concatenate each prediction batch with the previous predictions here, calling the function nested_concat:

https://github.com/huggingface/transformers/blob/6bc89ed9295443e5a3ee236ad544101752563917/src/transformers/trainer.py#L1810

Inside nested_concat, in the line below, new_tensors is the above-mentioned "logits" tuple:

https://github.com/huggingface/transformers/blob/6bc89ed9295443e5a3ee236ad544101752563917/src/transformers/trainer_pt_utils.py#L95

The above line makes a recursive call to nested_concat, and we arrive at the line below:

https://github.com/huggingface/transformers/blob/6bc89ed9295443e5a3ee236ad544101752563917/src/transformers/trainer_pt_utils.py#L97

Which calls this:

https://github.com/huggingface/transformers/blob/6bc89ed9295443e5a3ee236ad544101752563917/src/transformers/trainer_pt_utils.py#L58

And I get an index error, as the code is trying to index into what is actually the loss tensor.
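The failure is easy to reproduce in isolation: torch_pad_and_concatenate indexes into the tensor's shape tuple, and a 0-dimensional loss tensor has an empty shape. A minimal sketch (the conditional is paraphrased from trainer_pt_utils.py; the loss value is illustrative):

import torch

loss = torch.tensor(0.6931)   # a 0-dim scalar tensor, like a per-batch loss
print(loss.shape)             # torch.Size([]) - an empty shape tuple

# torch_pad_and_concatenate checks (roughly):
#     if len(tensor1.shape) == 1 or tensor1.shape[1] == tensor2.shape[1]: ...
# For a scalar tensor, len(loss.shape) is 0, so the second operand is
# evaluated, and indexing the empty shape tuple fails:
loss.shape[1]                 # IndexError: tuple index out of range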

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

2 reactions
goerlitz commented, Dec 15, 2021

I ran into exactly the same issue today. I also thought that the parameter label_names in TrainingArguments refers to data["train"].features["label"].names. The error message IndexError: tuple index out of range was not helpful at all, and I only found the problem by trial and error.

Actually, I was not able to find a description of label_names in the documentation at all, only in the linked source code.

Besides, I don’t even understand what “The list of keys in your dictionary of inputs that correspond to the labels.” should mean.

What “dictionary of inputs” and what “list of keys”?

My dataset looks like this:

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 9245
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1028
    })
})

The only dictionaries I see are the DatasetDict with keys "train" and "test", and each Dataset with keys "features" and "num_rows".
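For the record, the "dictionary of inputs" is the batch that the data collator hands to the model after tokenization, not the DatasetDict. A minimal sketch of the distinction (the tensor contents and class names are illustrative):

import torch

# What the model actually receives per batch, after tokenization/collation:
batch = {
    "input_ids":      torch.tensor([[101, 2023, 102]]),  # token ids (illustrative)
    "attention_mask": torch.tensor([[1, 1, 1]]),
    "labels":         torch.tensor([2]),                 # class ids
}

# label_names lists the keys of THIS dict that hold the labels; the default
# is ["labels"], which is what the standard collators produce:
#     TrainingArguments(..., label_names=["labels"])

# It is NOT the list of class names from the dataset schema:
#     data["train"].features["label"].names  ->  e.g. ["culture", "sports", ...]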

It would be really helpful if the description of the parameter label_names and the error message could be improved.

0 reactions
github-actions[bot] commented, Apr 25, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

