Bringing idna into the Python core library
I am looking at bringing native IDNA2008 support into the Python core library, and had a long conversation with Nathaniel J. Smith (@njsmith) and Christian Heimes (@tiran) about the work required. The end goal of this work is for Python to handle IDNs natively as first-class citizens, reusing as much existing code as possible.
To summarize the current conversation, the first step would be implementing a new codec in Python’s core library, and then extending the standard library to handle IDNs natively, so that the following code snippet could work:
#!/usr/bin/env python3
import urllib
import urllib.request
req = urllib.request.Request('http://fuß.standcore.com')
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode(encoding='utf-8'))
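For context, here is a minimal sketch of why the snippet above does not do the right thing today. It uses only the standard library: the built-in idna codec follows IDNA 2003, which folds ß to ss, whereas under IDNA2008 ß is a valid character that is kept and Punycode-encoded. Building the xn-- label by hand with the punycode codec is purely illustrative and skips every IDNA2008 validity check.

```python
host = "fuß.standcore.com"

# The built-in "idna" codec implements IDNA 2003 (RFC 3490).
# Its nameprep step maps ß to "ss", so a different host is looked up.
legacy = host.encode("idna")
print(legacy)  # b'fuss.standcore.com'

# Under IDNA2008, ß stays in the label, which is then Punycode-encoded
# and given the "xn--" ACE prefix. Illustration only: no validation here.
label_2008 = b"xn--" + "fuß".encode("punycode")
print(label_2008)  # b'xn--fu-hia'
```

The two results name different hosts, which is the incompatibility this issue is about.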
From the conversation on Zulip, the first step would be implementing idna2008 as a new encoding codec, and then work on modifying the core library to be able to accept and interoperate with IDNs seamlessly.
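As a rough idea of what "a new encoding codec" could look like from Python code, the sketch below registers a toy codec through codecs.register. The codec name idna2008sketch and the helper functions are hypothetical, invented for this example. It only splits on dots and Punycode-encodes non-ASCII labels; a real IDNA2008 codec would also need the RFC 5892 code point checks, contextual and bidi rules, and length limits.

```python
import codecs

ACE_PREFIX = b"xn--"

def _encode(text, errors="strict"):
    # Encode each dot-separated label; no IDNA2008 validation is done.
    labels = []
    for label in text.split("."):
        if label.isascii():
            labels.append(label.lower().encode("ascii"))
        else:
            labels.append(ACE_PREFIX + label.encode("punycode"))
    return b".".join(labels), len(text)

def _decode(data, errors="strict"):
    raw = bytes(data)
    labels = []
    for label in raw.split(b"."):
        if label.lower().startswith(ACE_PREFIX):
            labels.append(label[len(ACE_PREFIX):].decode("punycode"))
        else:
            labels.append(label.decode("ascii"))
    return ".".join(labels), len(raw)

def _search(name):
    # Hypothetical codec name, chosen only for this sketch.
    if name == "idna2008sketch":
        return codecs.CodecInfo(name=name, encode=_encode, decode=_decode)
    return None

codecs.register(_search)

encoded = "fuß.standcore.com".encode("idna2008sketch")
decoded = encoded.decode("idna2008sketch")
```

Once a codec is reachable this way, str.encode and bytes.decode can use it by name, which is the hook the rest of the standard library would build on.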
I’m willing to do much of the legwork required to get code integrated into CPython. My first question is what blockers (if any) would make it difficult to bring an implementation into CPython, along with any tips or suggestions to help move things forward. Right now, I’m just trying to get the ball rolling on a solid plan so that, hopefully, Python 3.8 can treat IDNs as first-class citizens.
Top GitHub Comments
As a coda to this thread, having discovered it a year late due to some GitHub notification issues:
I would say having native IDNA 2008 support in Python’s core is probably a natural evolution of this work, if it is considered by the core maintainers to be in-scope and they are willing to maintain timely updates against new versions of Unicode. I think the status quo of having a deprecated, incompatible version of IDNA in the core, and the current version not in the core, is the worst of both worlds. Either update the core to the modern spec, or deprecate the core’s IDNA codec for the old standard.
I’m not sure of the current status of the work by @NCommander, but if I can be of assistance I am happy to help.
My biggest concern with the current implementation of the idna module is the size of the UTS46 mapping table. The uts46data data file is almost 200 kB. Importing the module consumes about 1.5 MB of RSS:
The uts46_remap method is fairly straightforward. It’s basically just a bisect search plus a couple of checks. The lookup table can be implemented in C easily and added to the unicodedata module. This would avoid boxing all ints and strs as Python objects and reduce RSS.
Here is some code for https://github.com/kjd/idna/blob/master/tools/idna-data to dump the table to a header file:
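To illustrate the lookup itself (this is not the dump script mentioned above), here is a minimal sketch of a bisect search over a sorted range table. The table ROWS is a tiny, made-up ASCII-only stand-in, not the real UTS46 data, and the names lookup and remap are hypothetical. Each row marks the start of a range that runs until the next row begins.

```python
import bisect

# Toy table: (first code point of range, status, replacement).
# Made-up data for illustration; the real UTS46 table is far larger.
ROWS = (
    (0x0000, "disallowed", None),
    (0x0030, "valid", None),       # 0-9
    (0x003A, "disallowed", None),
    (0x0041, "mapped", "a"),       # A
    (0x0042, "mapped", "b"),       # B
    (0x0043, "disallowed", None),  # remaining uppercase omitted here
    (0x0061, "valid", None),       # a-z
    (0x007B, "disallowed", None),  # everything above 'z' in this toy table
)
STARTS = [row[0] for row in ROWS]

def lookup(ch):
    # Find the last row whose start is <= the code point.
    index = bisect.bisect_right(STARTS, ord(ch)) - 1
    return ROWS[index]

def remap(text):
    out = []
    for ch in text:
        _, status, replacement = lookup(ch)
        if status == "valid":
            out.append(ch)
        elif status == "mapped":
            out.append(replacement)
        else:
            raise ValueError(f"code point U+{ord(ch):04X} not allowed")
    return "".join(out)

print(remap("Ab1"))  # ab1
```

In pure Python every start value and replacement is a separate object, which is where the memory overhead comes from; a C array of plain integers searched the same way would avoid that boxing.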