[unicodedata] script and script_extension return names in different format

I just noticed that the script function returns long script names (e.g. “Bengali”), whereas script_extension returns short ones (e.g. {"Beng"}). This reflects the fact that Scripts.txt uses the long names, whereas ScriptExtensions.txt uses the short ones.

The problem is in the current implementation of script_extension, which delegates to script whenever a codepoint has not been explicitly assigned a script extension. In that case it returns a set containing a single script name, but in the long format, which is inconsistent with the other results from script_extension.
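The inconsistency can be reproduced with a minimal sketch. The two lookup tables below are tiny hypothetical stand-ins for the data parsed from Scripts.txt and ScriptExtensions.txt, and the function bodies are an assumption about the shape of the fallback, not the actual implementation.

```python
# Hypothetical, heavily trimmed tables used only for illustration.
SCRIPTS = {0x0985: "Bengali", 0x0964: "Common"}   # long names, as in Scripts.txt
SCRIPT_EXTENSIONS = {0x0964: {"Beng", "Deva"}}    # short codes, as in ScriptExtensions.txt


def script(codepoint):
    """Return the long script name for a codepoint."""
    return SCRIPTS.get(codepoint, "Unknown")


def script_extension(codepoint):
    """Return the explicit script extensions, falling back to script()."""
    if codepoint in SCRIPT_EXTENSIONS:
        return SCRIPT_EXTENSIONS[codepoint]
    # Fallback: wraps the *long* name in a set, mixing the two formats.
    return {script(codepoint)}


explicit = script_extension(0x0964)   # short codes
fallback = script_extension(0x0985)   # a long name
```

Here `explicit` holds four-letter codes while `fallback` holds a long name, which is the mix described above.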

So we also need to parse PropertyValueAliases.txt, load the mapping between long and short script names, and return one or the other, but not a mix of the two.
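A rough sketch of that parsing step, assuming the usual semicolon-separated layout of PropertyValueAliases.txt (property; short name; long name) with `#` comments. The sample text is a trimmed, illustrative excerpt, and `parse_script_aliases` is a hypothetical helper name.

```python
# Trimmed, illustrative excerpt in the PropertyValueAliases.txt layout.
SAMPLE = """\
# property ; short name ; long name
gc ; Lu                               ; Uppercase_Letter
sc ; Beng                             ; Bengali
sc ; Deva                             ; Devanagari
sc ; Zyyy                             ; Common
"""


def parse_script_aliases(text):
    """Map short script codes to long script names (the 'sc' property only)."""
    short_to_long = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 3 or fields[0] != "sc":
            continue                            # keep only script aliases
        short_to_long[fields[1]] = fields[2]
    return short_to_long


SHORT_TO_LONG = parse_script_aliases(SAMPLE)
LONG_TO_SHORT = {long: short for short, long in SHORT_TO_LONG.items()}
```

With both dictionaries available, either function can be normalized to a single format before returning.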

Would you prefer that both script and script_extension functions return the long version, or that both return the short version?

Or do we keep the difference, so that script returns the long name (“Bengali”) whereas script_extension returns the short name (“Beng”)?

Issue Analytics

Created 6 years ago
Comments:5 (5 by maintainers)

Top GitHub Comments

twardoch commented, Nov 22, 2017

I think it would be best if the script and script_extension functions always returned the Unicode script codes (4-letter names). The Unicode script codes should be a reference for other functions.

Scripts.py should get a NAMES dictionary which maps the script codes to the longer names (with _ kept, so it’s closer to UnicodeData).
A script_name function could take a script code and return the long name (with _ replaced by a space); this would in a sense mirror the unicodedata.name function, which returns the character name for a Unicode codepoint. People could query Scripts.NAMES for a unique list of the script codes.
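The proposal could look roughly like the sketch below. The NAMES dictionary here is a small hand-written sample rather than the generated table, and the signature of script_name is an assumption based on the description above.

```python
# Hand-written sample; the real table would be generated from the Unicode data files.
NAMES = {
    "Beng": "Bengali",
    "Cans": "Canadian_Aboriginal",
    "Zyyy": "Common",
}


def script_name(code):
    """Return the human-readable long name for a four-letter script code."""
    return NAMES[code].replace("_", " ")


readable = script_name("Cans")
all_codes = sorted(NAMES)   # the keys double as the list of known script codes
```

Keeping the underscore in NAMES and replacing it only in script_name leaves the raw data close to the Unicode files while giving callers a display-friendly string.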

anthrotype commented, Nov 22, 2017

Thanks Adam. Ok, I’ll have both functions return the four-character script codes and add a script_name function as well.