[unicodedata] script and script_extension return names in different format


I just noticed that the script function returns long script names (e.g. “Bengali”), whereas script_extension returns short ones (e.g. {"Beng"}). This reflects the fact that Scripts.txt uses the long names, whereas ScriptExtensions.txt uses the short ones.

The problem is that the current implementation of script_extension delegates to script whenever a codepoint has not been explicitly assigned a script extension. In that case it returns a set containing a single script name in the long format, which is inconsistent with the other results from script_extension.

So we also need to parse PropertyValueAliases.txt, load the mapping between long and short script names, and return one or the other, but not a mix of the two.
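To make the mapping step concrete, here is a minimal sketch of how the short/long script aliases could be read from PropertyValueAliases.txt. It assumes the usual semicolon-separated layout in which script lines start with the property alias "sc"; the SAMPLE text is an abbreviated excerpt, and parse_script_aliases is a hypothetical helper name, not something taken from the library.

```python
# Abbreviated excerpt in the style of PropertyValueAliases.txt (assumed layout:
# property alias ; short value ; long value ; optional extra aliases).
SAMPLE = """\
# PropertyValueAliases.txt (excerpt)
sc ; Beng                             ; Bengali
sc ; Copt                             ; Coptic                           ; Qaac
sc ; Zinh                             ; Inherited                        ; Qaai
gc ; Lu                               ; Uppercase_Letter
"""


def parse_script_aliases(text):
    """Return a dict mapping short script codes to long script names.

    Hypothetical helper: only lines for the Script property ("sc") are kept.
    """
    short_to_long = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 3 or fields[0] != "sc":
            continue
        # fields[1] is the short code, fields[2] the long name;
        # any further fields are additional aliases and are ignored here.
        short_to_long[fields[1]] = fields[2]
    return short_to_long


short_to_long = parse_script_aliases(SAMPLE)
# Reverse mapping, useful for converting the long names from Scripts.txt.
long_to_short = {long: short for short, long in short_to_long.items()}
```

With both dictionaries available, either function can be normalized to whichever format is chosen.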

Would you prefer that both script and script_extension functions return the long version, or that both return the short version?

Or should we keep the difference, with script returning the long name (“Bengali”) and script_extension returning the short one (“Beng”)?

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

twardoch commented, Nov 22, 2017 (1 reaction)

I think it would be best if the script and script_extension functions always returned the Unicode script codes (4-letter names). The Unicode script codes should be a reference for other functions.

  • Scripts.py should get a NAMES dictionary which maps the script codes to the longer names (with _ kept, so it’s closer to UnicodeData)
  • a script_name function could take a script code and return the long name (with _ replaced by a space); this would in a sense mirror the unicodedata.name function, which returns the character name for a Unicode codepoint. People could query Scripts.NAMES for a unique list of the script codes

anthrotype commented, Nov 22, 2017 (0 reactions)

Thanks Adam. Ok, I’ll have both functions return the four-character script codes and add a script_name function as well.
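
The approach settled on in this thread can be sketched as follows. This is an illustration only, not the library's actual code: the NAMES excerpt and the per-character tables _SCRIPTS and _SCRIPT_EXTENSIONS are tiny hypothetical stand-ins for data that would be generated from the Unicode files, and the extension set shown for U+0964 is deliberately truncated.

```python
# Hypothetical excerpt of a code -> long name table (underscores kept).
NAMES = {
    "Beng": "Bengali",
    "Deva": "Devanagari",
    "Ital": "Old_Italic",
    "Latn": "Latin",
    "Zyyy": "Common",
    "Zzzz": "Unknown",
}

# Toy per-character data, standing in for tables built from Scripts.txt
# and ScriptExtensions.txt. Only a few characters are covered.
_SCRIPTS = {"a": "Latn", "\u0995": "Beng", "\u0964": "Zyyy"}
_SCRIPT_EXTENSIONS = {"\u0964": {"Beng", "Deva"}}  # truncated for brevity


def script(char):
    """Return the four-letter script code for a character."""
    return _SCRIPTS.get(char, "Zzzz")


def script_extension(char):
    """Return a set of four-letter script codes for a character.

    Falls back to the character's own script code, so the result never
    mixes short codes with long names.
    """
    return _SCRIPT_EXTENSIONS.get(char, {script(char)})


def script_name(code):
    """Return the long script name for a code, with '_' replaced by a space.

    Raises KeyError if the code is not in NAMES.
    """
    return NAMES[code].replace("_", " ")
```

Here both script and script_extension return codes in the same format, and the long names remain reachable through script_name or by iterating over NAMES.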


