
Add NLP-specific metrics

See original GitHub issue

@mattdangerw and the keras-nlp team:

For standard classification metrics (AUC, F1, Precision, Recall, Accuracy, etc.), keras.metrics can be used. But there are several NLP-specific metrics which can be implemented here, i.e., we can expose native APIs for these metrics.
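
To make the proposal a bit more concrete, here is a rough sketch (not a proposed final API; the class name and the `score_fn` callable are placeholders) of how such a metric could be exposed as a stateful `keras.metrics.Metric` that accumulates a running mean of per-example scores across batches:

```python
import tensorflow as tf


class MeanNLPScore(tf.keras.metrics.Metric):
    """Running mean of a per-example score (e.g. ROUGE or BLEU) across batches."""

    def __init__(self, score_fn, name="mean_nlp_score", **kwargs):
        super().__init__(name=name, **kwargs)
        # score_fn is a hypothetical callable mapping (y_true, y_pred) to a
        # per-example score tensor of shape (batch_size,).
        self.score_fn = score_fn
        self.total = self.add_weight(name="total", initializer="zeros")
        self.count = self.add_weight(name="count", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        scores = self.score_fn(y_true, y_pred)
        self.total.assign_add(tf.reduce_sum(scores))
        self.count.assign_add(tf.cast(tf.size(scores), self.dtype))

    def result(self):
        return tf.math.divide_no_nan(self.total, self.count)
```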

I would like to take this up. I can start with the popular ones first and open PRs. Let me know if this is something the team is looking to add!

I’ve listed a few metrics (this list is, by no means, comprehensive):

  • Perplexity

  • ROUGE (paper): a pretty standard metric for text generation. We can implement all the variations: ROUGE-N, ROUGE-L, ROUGE-W, etc. A rough sketch follows this list.

  • BLEU (paper): another standard text generation metric. Note: we can also implement SacreBLEU.

  • BERTScore (paper, code)

  • BLEURT (paper, code)

  • chrF and chrF++ (character n-gram F-score) (paper, code)

  • COMET (paper, code)

  • Character Error Rate, Word Error Rate, etc. (paper)

  • Pearson Coefficient and Spearman Coefficient: it looks like keras.metrics does not have these two. They are not NLP-specific metrics, so it may make more sense to implement them in Keras than here.
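
As an illustration of what one of these would compute, here is a plain-Python sketch of ROUGE-N (shown for N=1) on whitespace-tokenized strings. The function names are illustrative only, not the eventual keras-nlp API:

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """ROUGE-N precision/recall/F1 from clipped n-gram overlap."""
    ref_counts = ngrams(reference.split(), n)
    cand_counts = ngrams(candidate.split(), n)
    overlap = sum((ref_counts & cand_counts).values())  # clipped matches
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


print(rouge_n("the cat sat on the mat", "the cat is on the mat"))
# {'precision': 0.833..., 'recall': 0.833..., 'f1': 0.833...}
```

ROUGE-L would follow the same shape, but with longest-common-subsequence length in place of n-gram overlap.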

Thank you!

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 16 (4 by maintainers)

Top GitHub Comments

1 reaction
abheesht17 commented on Mar 13, 2022

@aflah02, good point. Will do!

1 reaction
aflah02 commented on Mar 13, 2022

@abheesht17 I’d suggest adding perplexity as well, since it’s one of the trickier metrics to use. In my experience it often gives inconsistent results that vary hugely (by orders of magnitude) across different implementations in existing libraries.
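
For reference, a minimal sketch of one common definition: corpus perplexity as the exponential of the mean cross-entropy over non-padding tokens (assuming probabilities rather than logits, and a float padding mask). Choices like log base, padding handling, and per-token vs. per-sentence averaging are exactly where implementations diverge:

```python
import tensorflow as tf


def perplexity(y_true, y_pred, padding_mask):
    # y_true: (batch, seq_len) integer token ids
    # y_pred: (batch, seq_len, vocab_size) predicted probabilities
    # padding_mask: (batch, seq_len) float mask, 1.0 for real tokens, 0.0 for padding
    xent = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
    mean_nll = tf.reduce_sum(xent * padding_mask) / tf.reduce_sum(padding_mask)
    return tf.exp(mean_nll)
```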


Top Results From Across the Web

Evaluate predictions - Hugging Face
In addition to metrics, you can find more tools for evaluating models and datasets. Datasets provides various common and NLP-specific metrics for you...

Adding The Evaluation Metrics For Image Captioning
Still, I'm looking for a way to know how good this works with my dataset. Is there any chance to implement the Evaluation...

A global analysis of metrics used for measuring performance ...
Measuring the performance of natural language processing models is challenging. Traditionally used metrics, such as BLEU and ROUGE, orig-...

[PAPER] A critical analysis of metrics used for measuring ...
Other NLP-specific metrics that can be seen as special variants of precision and recall include the BLEU, NIST, ROUGE, and METEOR scores. Due...

Measuring Reproducibility in PyTorch - TorchMetrics
(Arxiv, n.d.) recently added a code and data section that links both official and community ... 2020) for calculating NLP-specific metrics.
