Unicode normalisation

After running into https://github.com/kermitt2/grobid-quantities/issues/83: at first glance, it looks like the grobid classes FeatureVectorCitation (whose data comes from the PDF via LayoutTokens) and FeatureVectorName (fed by AuthorParser) may not apply Unicode normalisation when called alone.

I'm not sure, though; I haven’t had time to look into it in depth. 🍶 (I flagged this issue as a question 😅)

Issue Analytics

Created 5 years ago
Comments: 5 (2 by maintainers)

Top GitHub Comments

kermitt2 commented, Feb 21, 2019

Hi Luca! It’s fine for grobid: Unicode normalisation is called on the raw reference string and on the raw author string at an early stage, and anything that comes from the PDF is normalised too. When I introduced that (summer 2017, I think!), I forgot to propagate the changes to older external sub-modules like grobid-quantities.
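To illustrate the idea of normalising a raw string once, at an early stage, here is a minimal sketch. It is written in Python rather than grobid's Java, it is not grobid's actual code, and the helper name normalise_raw and the choice of NFKC as the default form are assumptions made for the example.

```python
import unicodedata


def normalise_raw(text: str, form: str = "NFKC") -> str:
    """Normalise a raw string once, before tokenisation or feature extraction.

    Hypothetical helper for illustration; NFKC is an assumed default.
    """
    return unicodedata.normalize(form, text)


# "e" followed by a combining acute accent, plus the "fi" ligature (U+FB01)
raw = "Cafe\u0301 \ufb01eld"
clean = normalise_raw(raw)

# The combining sequence is composed and the ligature is expanded.
print(clean == "Caf\u00e9 field")

# The mathematical minus sign (U+2212) has no decomposition,
# so standard normalisation leaves it untouched.
print(normalise_raw("\u2212") == "\u2212")
```

Standard normalisation forms do not fold U+2212 into the ASCII hyphen-minus, so a downstream feature that tests for "-" would still need its own handling for that character.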

lfoppiano commented, Jun 17, 2019

In grobid-quantities I’ve added an additional if in the feature vector class that creates the hyphen feature for the minus sign as well, but I don’t know how many types of minus we can encounter…
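One possible way to avoid enumerating every minus variant by hand is to rely on the Unicode general category. The sketch below is in Python, not the Java of grobid-quantities, and is not the check that was actually added there; the names is_hyphen_like and fold_hyphens are hypothetical. It treats every character in the "Pd" (dash punctuation) category as hyphen-like and adds U+2212 MINUS SIGN explicitly, because that character is classified as a math symbol ("Sm").

```python
import unicodedata

MINUS_SIGN = "\u2212"  # general category Sm, so it is not covered by "Pd"


def is_hyphen_like(ch: str) -> bool:
    """Return True for the minus sign and for any dash punctuation character."""
    return ch == MINUS_SIGN or unicodedata.category(ch) == "Pd"


def fold_hyphens(text: str) -> str:
    """Replace every hyphen-like character with the ASCII hyphen-minus."""
    return "".join("-" if is_hyphen_like(ch) else ch for ch in text)


print(fold_hyphens("10\u221212"))       # minus sign between digits
print(fold_hyphens("a\u2013b\u2014c"))  # en dash and em dash
```

Whether folding all dashes is appropriate depends on the model: an en dash in a numeric range and a minus sign before a value carry different meanings, so a feature could also keep them distinct and only use is_hyphen_like as a boolean flag.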
