Unicode normalisation
After facing the issue https://github.com/kermitt2/grobid-quantities/issues/83, at first glance it looks like the grobid classes FeatureVectorCitation (whose data come from the PDF, as LayoutTokens) and FeatureVectorName (from AuthorParser) might have the same problem when called alone.
Not sure, though; I haven’t had time to look into it too deeply. 🍶 (I flagged this issue as a question 😅)
Issue Analytics
- Created 5 years ago
- Comments:5 (2 by maintainers)
Top GitHub Comments
Hi Luca! It’s fine for grobid: Unicode normalisation is called on the raw reference string and on the raw author string at an early stage, and anything that comes from the PDF is normalised too. When I introduced that (summer 2017, I think!) I forgot to propagate the changes to older external sub-modules like grobid-quantities.
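To illustrate what normalising a raw string at an early stage can look like, here is a minimal Python sketch. It is not grobid's actual code (grobid is written in Java): the helper name `normalise_raw` is hypothetical, and NFKC is an assumed choice of normalisation form.

```python
import unicodedata


def normalise_raw(text: str) -> str:
    """Hypothetical helper: apply NFKC to a raw string before computing features."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # Fullwidth digits and the "fi" ligature fold to their plain equivalents.
    print(normalise_raw("\uff11\uff12 \ufb01g"))
    # MINUS SIGN (U+2212) has no compatibility mapping, so NFKC leaves it unchanged.
    print(normalise_raw("\u2212") == "\u2212")
```

Note that standard normalisation forms alone do not turn a minus sign into an ASCII hyphen, so dash-like characters need separate handling if a feature depends on them.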
In grobid-quantities I’ve added an additional if in the feature vector class that creates the hyphen feature for the minus sign as well… but I don’t know how many types of minus we can encounter…
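One possible way to approach the "how many types of minus" question is to rely on Unicode character categories instead of listing characters by hand. The Python sketch below is an illustration only, not the grobid-quantities implementation: the names `is_dash_like`, `to_ascii_hyphen`, `all_dash_punctuation` and the `EXTRA_MINUS` set are hypothetical, and which extra minus-like characters to include is an assumption.

```python
import sys
import unicodedata

# Assumption: minus-like characters that are NOT in the dash punctuation
# category (Pd) and therefore have to be listed explicitly.
EXTRA_MINUS = {
    "\u2212",  # MINUS SIGN
    "\u207b",  # SUPERSCRIPT MINUS
    "\u208b",  # SUBSCRIPT MINUS
}


def is_dash_like(ch: str) -> bool:
    """True if the single character ch is dash punctuation (Pd) or a listed minus."""
    return unicodedata.category(ch) == "Pd" or ch in EXTRA_MINUS


def to_ascii_hyphen(text: str) -> str:
    """Replace every dash-like character with the ASCII hyphen-minus."""
    return "".join("-" if is_dash_like(ch) else ch for ch in text)


def all_dash_punctuation() -> list[str]:
    """List every code point in category Pd for the running Python's Unicode data."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Pd"
    ]


if __name__ == "__main__":
    print(to_ascii_hyphen("5\u221210 \u2013 x"))
    for ch in all_dash_punctuation():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}")
```

The exact list printed depends on the Unicode version bundled with the interpreter, which is one reason to prefer a category check over a fixed set of characters.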