
Multiclass evaluation not working

See original GitHub issue

Hello,

I am new to the Transformers library and I'm trying to do Sequence Classification. I have 24 labels and I am getting the following error:

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].

Even though I already added the averaging method as such:

metric = load_metric('precision', average='weighted')

Would someone kindly point me towards whatever I'm doing wrong? I'm able to fine-tune the pre-trained BERT model if I use 'accuracy' as the metric, but somehow my average='weighted' argument isn't being accepted.
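
For context, the error can be reproduced outside the Trainer in a couple of lines; the toy labels below are illustrative, not from the original setup:

from datasets import load_metric

# The 'average' keyword passed to load_metric is not forwarded to the
# compute step, so precision still uses its default average="binary".
metric = load_metric('precision', average='weighted')
metric.compute(predictions=[0, 1, 2], references=[2, 1, 0])
# ValueError: Target is multiclass but average='binary'. ...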

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

4 reactions
dveni commented, Apr 12, 2022

Hi there!

You are passing the average argument when you load the metric; instead, you should pass it to the compute method, like this:

metric = load_metric('precision')
metric.compute(predictions=[0,1,2,3,4,4,4,4], references=[2,2,2,3,4,1,1,4], average="weighted")

Output:
----------------------
>>> {'precision': 0.625}
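
For reference, the precision metric script is a thin wrapper around sklearn.metrics.precision_score (as the traceback further down also shows), so the result can be sanity-checked directly; "weighted" averages the per-class precisions by their support in the references:

from sklearn.metrics import precision_score

# Same toy data as above; note the argument order (references, i.e. y_true, first).
precision_score([2, 2, 2, 3, 4, 1, 1, 4], [0, 1, 2, 3, 4, 4, 4, 4], average="weighted")
# -> 0.625, i.e. (2*0 + 3*1 + 1*1 + 2*0.5) / 8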

Hope this helps!
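
Applied to the Trainer setup from the original post, the fix amounts to moving the keyword into compute_metrics; a minimal sketch (the metric name and the "weighted" choice follow the report):

import numpy as np
from datasets import load_metric

metric = load_metric('precision')  # no averaging kwarg here

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Pass the averaging mode at compute time, not at load time.
    return metric.compute(predictions=predictions, references=labels, average="weighted")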

2 reactions
puifais commented, Apr 7, 2022

No problem. Sorry, I wasn't sure if this was a bug or my own mistake, so I didn't use the bug template. Here you go:

Environment info

  • transformers version: 4.17.0
  • Platform: Linux-4.14.252-131.483.amzn1.x86_64-x86_64-with-glibc2.9
  • Python version: 3.6.13
  • PyTorch version (GPU?): 1.10.2+cu102 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

  1. @sgugger
  2. @LysandreJik
  3. @sgugger

Information

Model I am using (Bert, XLNet …): bert-base-cased

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Load my own dataset using Dataset.from_pandas(my_data)
  2. Tokenize it with AutoTokenizer.from_pretrained('bert-base-cased')
  3. Create training arguments, metric, and Trainer object to start training.
import numpy as np
from datasets import load_metric
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=24)
training_args = TrainingArguments(output_dir='./checkpoints/my_model', evaluation_strategy="epoch")

metric = load_metric('precision', average='weighted')
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# train_dataset and eval_dataset come from steps 1 and 2 above
trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset=eval_dataset,
                  compute_metrics=compute_metrics)
trainer.train()

and I got this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-46-5aef28bcb00d> in <module>
     14                   eval_dataset=eval_dataset,
     15                   compute_metrics=compute_metrics)
---> 16 trainer.train()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1488 
   1489             self.control = self.callback_handler.on_epoch_end(args, self.state, self.control)
-> 1490             self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
   1491 
   1492             if DebugOption.TPU_METRICS_DEBUG in self.args.debug:

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/trainer.py in _maybe_log_save_evaluate(self, tr_loss, model, trial, epoch, ignore_keys_for_eval)
   1600         metrics = None
   1601         if self.control.should_evaluate:
-> 1602             metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
   1603             self._report_to_hp_search(trial, epoch, metrics)
   1604 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/trainer.py in evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
   2262             prediction_loss_only=True if self.compute_metrics is None else None,
   2263             ignore_keys=ignore_keys,
-> 2264             metric_key_prefix=metric_key_prefix,
   2265         )
   2266 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/trainer.py in evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
   2503         # Metrics!
   2504         if self.compute_metrics is not None and all_preds is not None and all_labels is not None:
-> 2505             metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
   2506         else:
   2507             metrics = {}

<ipython-input-46-5aef28bcb00d> in compute_metrics(eval_pred)
      7     logits, labels = eval_pred
      8     predictions = np.argmax(logits, axis=-1)
----> 9     return metric.compute(predictions=predictions, references=labels)
     10 
     11 trainer = Trainer(model=model,

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/datasets/metric.py in compute(self, predictions, references, **kwargs)
    428             inputs = {input_name: self.data[input_name] for input_name in self.features}
    429             with temp_seed(self.seed):
--> 430                 output = self._compute(**inputs, **compute_kwargs)
    431 
    432             if self.buf_writer is not None:

~/.cache/huggingface/modules/datasets_modules/metrics/precision/bfadb1cf35fe89242263de7dc028b248827c08ba075659c0e812d0fc6e5237c9/precision.py in _compute(self, predictions, references, labels, pos_label, average, sample_weight)
    116     def _compute(self, predictions, references, labels=None, pos_label=1, average="binary", sample_weight=None):
    117         score = precision_score(
--> 118             references, predictions, labels=labels, pos_label=pos_label, average=average, sample_weight=sample_weight
    119         )
    120         return {"precision": float(score) if score.size == 1 else score}

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sklearn/metrics/_classification.py in precision_score(y_true, y_pred, labels, pos_label, average, sample_weight, zero_division)
   1660                                                  warn_for=('precision',),
   1661                                                  sample_weight=sample_weight,
-> 1662                                                  zero_division=zero_division)
   1663     return p
   1664 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sklearn/metrics/_classification.py in precision_recall_fscore_support(y_true, y_pred, beta, labels, pos_label, average, warn_for, sample_weight, zero_division)
   1463         raise ValueError("beta should be >=0 in the F-beta score")
   1464     labels = _check_set_wise_labels(y_true, y_pred, average, labels,
-> 1465                                     pos_label)
   1466 
   1467     # Calculate tp_sum, pred_sum, true_sum ###

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sklearn/metrics/_classification.py in _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
   1294             raise ValueError("Target is %s but average='binary'. Please "
   1295                              "choose another average setting, one of %r."
-> 1296                              % (y_type, average_options))
   1297     elif pos_label not in (None, 1):
   1298         warnings.warn("Note that pos_label (set to %r) is ignored when "

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].

Expected behavior

I expect the model to complete training without an error. I am able to do this if I use metric = load_metric('accuracy'), but not with precision, recall, or f1.
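
With the average keyword passed at compute time (per the accepted answer above), precision, recall, and f1 can all be reported from one compute_metrics; a sketch under that assumption, with the merged-dict pattern being illustrative rather than from the thread:

import numpy as np
from datasets import load_metric

precision = load_metric('precision')
recall = load_metric('recall')
f1 = load_metric('f1')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    results = {}
    # Each compute() call returns a one-entry dict, e.g. {'precision': ...}; merge them.
    results.update(precision.compute(predictions=predictions, references=labels, average="weighted"))
    results.update(recall.compute(predictions=predictions, references=labels, average="weighted"))
    results.update(f1.compute(predictions=predictions, references=labels, average="weighted"))
    return results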

Read more comments on GitHub >

Top Results From Across the Web

Evaluating multiclass imbalanced problem per class
For a multiclass imbalanced problem, accuracy is not a good metric to evaluate model performance. Equally, accuracy is a global metric, ...
Read more >
Comprehensive Guide on Multiclass Classification Metrics
Using these metrics, you can evaluate the performance of any classifier and compare them to each other. Here is a final cheat-sheet to...
Read more >
Evaluation measures for multiclass problems - Gabriele Lanaro
Evaluation measures for multiclass problems · Confusion matrix · Precision · Recall · F1-score · Micro and macro averages · Accuracy · Cross...
Read more >
Evaluating Multi-Class Classifiers | by Harsha Goonewardana
Best practice methodology for model selection for a multi-class classification problem is to use a basket of metrics. Then the appropriate ...
Read more >
Performance Measures for Multi-Class Problems
How to calculate performance for multi-class problems? ... To evaluate a scoring classifier at multiple cutoffs, these quantities can be ...
Read more >
