Unicode normalisation

After running into https://github.com/kermitt2/grobid-quantities/issues/83, at a first quick glance it looks like the grobid classes FeatureVectorCitation (whose data comes from the PDF, as LayoutTokens) and FeatureVectorName (fed by AuthorParser) may not get Unicode-normalised input when called alone.

Not sure, though; I haven’t had time to look into it too deeply. 🍶 (I flagged this issue as a question 😅)

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
kermitt2 commented, Feb 21, 2019

Hi Luca! It’s fine for grobid: Unicode normalisation is called on the raw reference string and on the raw author string at an early stage, and anything that comes from the PDF is normalised too. At the time I introduced that (summer 2017, I think!) I forgot to propagate the changes to older external sub-modules like grobid-quantities.
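
To illustrate what normalising a raw string "at an early stage" can look like, here is a minimal Python sketch. It is not grobid's Java implementation: the helper name `normalise_text` is hypothetical, and the choice of the NFKC form is an assumption, since the thread does not say which form grobid applies.

```python
import unicodedata


def normalise_text(raw: str) -> str:
    """Hypothetical helper: normalise a raw string before feature extraction.

    NFKC is an assumed choice of form; the thread does not name one.
    """
    return unicodedata.normalize("NFKC", raw)


# A ligature and a fullwidth hyphen-minus collapse to their ASCII equivalents.
print(normalise_text("\ufb01nite"))        # U+FB01 LATIN SMALL LIGATURE FI
print(normalise_text("10\uff0d20"))        # U+FF0D FULLWIDTH HYPHEN-MINUS

# A decomposed "e" + combining acute accent is composed into one code point.
print(len(normalise_text("e\u0301")))

# U+2212 MINUS SIGN has no decomposition, so NFKC leaves it untouched.
print(normalise_text("\u2212") == "\u2212")
```

Note the last case: standard normalisation forms alone do not turn a mathematical minus sign into an ASCII hyphen-minus, so code that relies on a hyphen feature may still need extra handling.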

0 reactions
lfoppiano commented, Jun 17, 2019

In grobid-quantities I’ve added an additional `if` in the feature vector class that creates the hyphen feature for the minus sign as well… but I don’t know how many types of minus we can encounter…
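
One way to avoid enumerating every minus variant by hand is to lean on Unicode general categories. The Python sketch below is an illustration only, not the code added to grobid-quantities: the names `is_hyphen_like`, `normalise_hyphens` and `EXTRA_MINUS` are hypothetical, and the set of extra code points is an assumption that is probably not exhaustive.

```python
import unicodedata

# Minus-like characters that are NOT in the dash punctuation category (Pd).
# Assumed, non-exhaustive list:
#   U+2212 MINUS SIGN, U+207B SUPERSCRIPT MINUS,
#   U+208B SUBSCRIPT MINUS, U+02D7 MODIFIER LETTER MINUS SIGN
EXTRA_MINUS = {"\u2212", "\u207b", "\u208b", "\u02d7"}


def is_hyphen_like(ch: str) -> bool:
    """Return True if a single character is a dash or a minus variant.

    Category Pd covers hyphen-minus, hyphen, en dash, em dash and similar,
    but also wave dashes, which may be broader than wanted.
    """
    return unicodedata.category(ch) == "Pd" or ch in EXTRA_MINUS


def normalise_hyphens(text: str) -> str:
    """Replace every hyphen-like character with ASCII hyphen-minus."""
    return "".join("-" if is_hyphen_like(ch) else ch for ch in text)


print(normalise_hyphens("\u22125 \u2013 10"))  # minus sign and en dash
```

With a check like this, the hyphen feature can be computed from `is_hyphen_like(token)` instead of a growing chain of `if` conditions, at the cost of also matching long dashes.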
