[unicodedata] script and script_extension return names in different format
I just noticed that the `script` function returns long script names (e.g. “Bengali”), whereas `script_extension` returns short ones (e.g. {"Beng"}). This reflects the fact that Scripts.txt uses the long names, whereas ScriptExtensions.txt uses the short ones.
The problem is with the current implementation of `script_extension`, which delegates to `script` whenever a codepoint has not been explicitly assigned a script extension: it returns a set containing a single script name, but in the long format, which is inconsistent with the other results from `script_extension`.
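A minimal sketch of the inconsistency, using toy tables in place of the data generated from Scripts.txt and ScriptExtensions.txt. The names `SCRIPTS` and `SCRIPT_EXTENSIONS`, the truncated entries, and the fallback logic are assumptions for illustration, not the library's actual code:

```python
# Toy stand-ins for the generated data tables (hypothetical and truncated).
SCRIPTS = {0x0985: "Bengali", 0x0964: "Common"}  # long names, as in Scripts.txt
SCRIPT_EXTENSIONS = {0x0964: {"Beng", "Deva"}}   # short codes, as in ScriptExtensions.txt


def script(char):
    """Return the long script name for a single character."""
    return SCRIPTS.get(ord(char), "Unknown")


def script_extension(char):
    """Return the explicit script extensions, else fall back to script()."""
    try:
        return SCRIPT_EXTENSIONS[ord(char)]
    except KeyError:
        # The fallback puts a long name into a set that otherwise holds short codes.
        return {script(char)}


explicit = script_extension("\u0964")  # DEVANAGARI DANDA: has an explicit entry
fallback = script_extension("\u0985")  # BENGALI LETTER A: no explicit entry
```

Here `explicit` holds four-letter codes while `fallback` holds the long name, so callers cannot compare the two results directly.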
So we also need to parse PropertyValueAliases.txt, load the mapping between long and short script names, and return one or the other, but not a mix of the two.
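One way the alias mapping could be loaded is sketched below. It assumes the semicolon-separated layout of PropertyValueAliases.txt, where script entries start with the property alias `sc`; the function name `parse_script_aliases` and the inline `SAMPLE` excerpt are hypothetical and stand in for reading the real file:

```python
def parse_script_aliases(lines):
    """Map short script codes to long names from PropertyValueAliases.txt lines."""
    short_to_long = {}
    for line in lines:
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        # Script aliases look like "sc ; Beng ; Bengali"; skip other properties.
        if len(fields) < 3 or fields[0] != "sc":
            continue
        # Any further fields are extra aliases and are ignored here.
        short_to_long[fields[1]] = fields[2]
    return short_to_long


# Small excerpt in the file's format, used instead of the real file.
SAMPLE = """\
# Script (sc)
gc ; Lu                               ; Uppercase_Letter
sc ; Beng                             ; Bengali
sc ; Copt                             ; Coptic                           ; Qaac
sc ; Zyyy                             ; Common
"""

SHORT_TO_LONG = parse_script_aliases(SAMPLE.splitlines())
LONG_TO_SHORT = {long_name: code for code, long_name in SHORT_TO_LONG.items()}
```

With both dictionaries available, either function's result can be converted to whichever format is chosen.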
Would you prefer that both the `script` and `script_extension` functions return the long version, or that both return the short version?
Or do we keep the difference, so that `script` returns the long version (“Bengali”), whereas `script_extension` returns the short name (“Beng”)?
Issue Analytics
- Created 6 years ago
- Comments:5 (5 by maintainers)
Top GitHub Comments
I think it would be best if the `script` and `script_extension` functions always returned the Unicode script codes (4-letter names). The Unicode script codes should be a reference for other functions.
- `Scripts.py` should get a `NAMES` dictionary which maps the script codes to the longer names (with `_` kept, so it’s closer to UnicodeData)
- a `script_name` function could take a script code and would return the long name (with `_` replaced by a space), similar to the `unicodedata.name` function, which returns the character name for a Unicode codepoint; people could query `Scripts.NAMES` for a unique list of the script codes

Thanks Adam. Ok, I’ll have both functions return the four-character script codes and add a `script_name` function as well.
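
The proposal in the comments above could look roughly like the following. This is only a sketch: the three `NAMES` entries are a hand-picked excerpt, and replacing the underscore with a space is an assumption about the intended display form:

```python
# Hypothetical excerpt of the proposed mapping: script code -> long name,
# with underscores kept as they appear in the Unicode data files.
NAMES = {
    "Beng": "Bengali",
    "Ital": "Old_Italic",
    "Zyyy": "Common",
}


def script_name(code):
    """Return the human-readable long name for a four-letter script code."""
    # Assumed display form: underscores become spaces.
    return NAMES[code].replace("_", " ")


# The keys of NAMES double as the unique list of known script codes.
ALL_CODES = sorted(NAMES)
```

Keeping the four-letter codes as the canonical return value means `script` and `script_extension` results are directly comparable, and the long name is looked up only when needed for display.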