
Add NLP-specific metrics

See original GitHub issue

@mattdangerw and the keras-nlp team:

For standard classification metrics (AUC, F1, Precision, Recall, Accuracy, etc.), keras.metrics can be used. But there are several NLP-specific metrics which can be implemented here, i.e., we can expose native APIs for these metrics.
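
To make the proposal a bit more concrete, here is a rough sketch (not a proposed final API; the class name and the `score_fn` callable are placeholders) of how such a metric could be exposed as a stateful `keras.metrics.Metric` that accumulates a running mean of per-example scores across batches:

```python
import tensorflow as tf


class MeanNLPScore(tf.keras.metrics.Metric):
    """Running mean of a per-example score (e.g. ROUGE or BLEU) across batches."""

    def __init__(self, score_fn, name="mean_nlp_score", **kwargs):
        super().__init__(name=name, **kwargs)
        # score_fn is a hypothetical callable mapping (y_true, y_pred) to a
        # per-example score tensor of shape (batch_size,).
        self.score_fn = score_fn
        self.total = self.add_weight(name="total", initializer="zeros")
        self.count = self.add_weight(name="count", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        scores = self.score_fn(y_true, y_pred)
        self.total.assign_add(tf.reduce_sum(scores))
        self.count.assign_add(tf.cast(tf.size(scores), self.dtype))

    def result(self):
        return tf.math.divide_no_nan(self.total, self.count)
```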

I would like to take this up. I can start with the popular ones first and open PRs. Let me know if this is something the team is looking to add!

I’ve listed a few metrics (this list is, by no means, comprehensive):

  • Perplexity

  • ROUGE (paper): a pretty standard metric for text generation. We can implement all the variations: ROUGE-N, ROUGE-L, ROUGE-W, etc. A rough sketch follows this list.

  • BLEU (paper): another standard text generation metric. Note: we can also implement SacreBLEU.

  • BERTScore (paper, code)

  • BLEURT (paper, code)

  • chrF and chrF++ (character n-gram F-score) (paper, code)

  • COMET (paper, code)

  • Character Error Rate, Word Error Rate, etc. (paper)

  • Pearson Coefficient and Spearman Coefficient: it looks like keras.metrics does not have these two. They are not NLP-specific metrics, so it may make more sense to implement them in Keras than here.
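
As an illustration of what one of these would compute, here is a plain-Python sketch of ROUGE-N (shown for N=1) on whitespace-tokenized strings. The function names are illustrative only, not the eventual keras-nlp API:

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """ROUGE-N precision/recall/F1 from clipped n-gram overlap."""
    ref_counts = ngrams(reference.split(), n)
    cand_counts = ngrams(candidate.split(), n)
    overlap = sum((ref_counts & cand_counts).values())  # clipped matches
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


print(rouge_n("the cat sat on the mat", "the cat is on the mat"))
# {'precision': 0.833..., 'recall': 0.833..., 'f1': 0.833...}
```

ROUGE-L would follow the same shape, but with longest-common-subsequence length in place of n-gram overlap.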

Thank you!

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 16 (4 by maintainers)

Top GitHub Comments

1 reaction
abheesht17 commented on Mar 13, 2022

@aflah02, good point. Will do!

1 reaction
aflah02 commented on Mar 13, 2022

@abheesht17 I’d suggest adding perplexity as well, since it’s one of the trickier metrics to use. In my experience it often gives inconsistent results that vary hugely (by orders of magnitude) across different implementations in existing libraries.
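
For reference, a minimal sketch of one common definition: corpus perplexity as the exponential of the mean cross-entropy over non-padding tokens (assuming probabilities rather than logits, and a float padding mask). Choices like log base, padding handling, and per-token vs. per-sentence averaging are exactly where implementations diverge:

```python
import tensorflow as tf


def perplexity(y_true, y_pred, padding_mask):
    # y_true: (batch, seq_len) integer token ids
    # y_pred: (batch, seq_len, vocab_size) predicted probabilities
    # padding_mask: (batch, seq_len) float mask, 1.0 for real tokens, 0.0 for padding
    xent = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
    mean_nll = tf.reduce_sum(xent * padding_mask) / tf.reduce_sum(padding_mask)
    return tf.exp(mean_nll)
```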


Top Results From Across the Web

Evaluate predictions - Hugging Face
In addition to metrics, you can find more tools for evaluating models and datasets. Datasets provides various common and NLP-specific metrics for you...

Adding The Evaluation Metrics For Image Captioning
Still, I'm looking for a way to know how good this works with my dataset. Is there any chance to implement the Evaluation...

A global analysis of metrics used for measuring performance ...
Measuring the performance of natural language processing models is challenging. Traditionally used metrics, such as BLEU and ROUGE, orig-...

[PAPER] A critical analysis of metrics used for measuring ...
Other NLP-specific metrics that can be seen as special variants of precision and recall include the BLEU, NIST, ROUGE, and METEOR scores. Due...

Measuring Reproducibility in PyTorch - TorchMetrics
(Arxiv, n.d.) recently added a code and data section that links both official and community ... 2020) for calculating NLP-specific metrics.
