Bringing idna into the Python core library

I am looking at bringing native IDNA2008 support into the Python core library, and had a long conversation with Nathaniel J. Smith (@njsmith) and Christian Heimes (@tiran) about the work required. The end goal of this work is to have Python natively handle IDNs as first-class citizens, and to reuse as much code as possible.

To summarize the current conversation, the first step would be implementing a new codec in Python’s core library, and then extending the standard library to handle IDNs natively, so that the following code snippet could work:

#!/usr/bin/env python3
import urllib.request

req = urllib.request.Request('http://fuß.standcore.com')
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode(encoding='utf-8'))

From the conversation on Zulip, the first step would be implementing idna2008 as a new encoding codec, and then modifying the core library to accept and interoperate with IDNs seamlessly.
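
To make the gap concrete, here is a minimal standard-library sketch of what the existing built-in `idna` codec (IDNA 2003) does with the hostname from the snippet above. The helper `naive_alabel` is hypothetical and only for illustration: it Punycode-encodes a label and adds the `xn--` prefix, skipping every validity check that a real IDNA 2008 implementation would perform.

```python
# Standard library only. The built-in "idna" codec implements IDNA 2003,
# whose Nameprep step maps "ß" to "ss" before encoding.
host = "fuß.standcore.com"
idna2003 = host.encode("idna")


def naive_alabel(label):
    """Illustration only: Punycode one label and add the ACE prefix.

    This performs none of the IDNA 2008 validity checks.
    """
    if label.isascii():
        return label.encode("ascii")
    return b"xn--" + label.encode("punycode")


# Under IDNA 2008, "ß" is a valid character and is kept, so the label
# gets its own Punycode form instead of collapsing to "fuss".
idna2008_like = b".".join(naive_alabel(part) for part in host.split("."))

print(idna2003)       # b'fuss.standcore.com'
print(idna2008_like)  # b'xn--fu-hia.standcore.com'
```

The two results name different hosts, which is why swapping the codec underneath existing code is not a purely cosmetic change.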

I’m willing to do much of the legwork required to get code integrated into CPython. My first question is what (if any) blockers exist in implementation that would make it difficult to bring into CPython, and any tips or suggestions to help bring things forward. Right now, I’m just trying to get the ball rolling on figuring out a solid plan on hopefully having Python 3.8 be able to treat IDNs as first class citizens.

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 15 (1 by maintainers)

Top GitHub Comments

3 reactions
kjd commented, Feb 25, 2020

As a coda to this thread, having discovered it a year late due to some GitHub notification issues:

I would say having native IDNA 2008 support in Python’s core is probably a natural evolution of this work, if the core maintainers consider it in scope and are willing to keep it updated promptly against new versions of Unicode. I think the status quo, with a deprecated, incompatible version of IDNA in the core and the current version outside it, is the worst of both worlds. Either update the core to the modern spec, or deprecate the core’s IDNA codec for the old standard.

I’m not sure of the current status of the work by @NCommander, but I am happy to assist if I can.

1 reaction
tiran commented, Jan 13, 2019

My biggest concern with the current implementation of the idna module is the size of the UTS46 mapping table. The uts46data data file is almost 200 kB. Importing the module consumes about 1.5 MB of RSS:

>>> import psutil, os
>>> p = psutil.Process(os.getpid())
>>> p.memory_info()
pmem(rss=15568896, vms=237490176, shared=8200192, text=8192, lib=0, data=7569408, dirty=0)
>>> import idna.uts46data
>>> p.memory_info()
pmem(rss=17170432, vms=240320512, shared=8368128, text=8192, lib=0, data=9420800, dirty=0)
>>> (17170432 - 15568896) // 1024
1564
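
The session above relies on the third-party psutil package. As a rough standard-library alternative, the sketch below uses `tracemalloc` to estimate how much memory an import retains. The function name `import_cost` is an assumption made for this example, and the figure is not comparable to RSS: `tracemalloc` only sees allocations made through Python's memory allocators.

```python
import importlib
import tracemalloc


def import_cost(module_name):
    """Return the bytes of traced allocations retained by importing a module.

    The result is only meaningful in a fresh interpreter: if the module is
    already in sys.modules, the import is a no-op and the cost is near zero.
    """
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    try:
        before, _peak = tracemalloc.get_traced_memory()
        importlib.import_module(module_name)
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        # Only stop tracing if this function was the one that started it.
        if started_here:
            tracemalloc.stop()
    return after - before
```

With the idna package installed, `import_cost("idna.uts46data")` in a fresh interpreter would give a figure to set beside the RSS delta measured above.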

The uts46_remap method is fairly straightforward. It’s basically just a bisect search plus a couple of checks. The lookup table can be implemented in C easily and added to the unicodedata module. This would avoid boxing all the ints and strs as Python objects and reduce RSS.
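
The "bisect search plus a couple of checks" shape can be sketched in a few lines of Python. Everything here is a toy under stated assumptions: `TOY_TABLE` is a made-up five-row table, not the real UTS46 data, and `lookup` and `remap` are hypothetical names rather than the idna package's API. The status letters follow the UTS46 convention (V valid, M mapped, D deviation, I ignored, X disallowed).

```python
import bisect

# Toy range table, NOT the real UTS46 data. Each row opens a range that
# runs until the next row's starting code point.
TOY_TABLE = [
    (0x0000, "X", None),  # disallowed
    (0x0061, "V", None),  # 'a'..'z': valid
    (0x007B, "X", None),  # disallowed
    (0x00DF, "D", "ss"),  # 'ß': deviation, with a transitional mapping
    (0x00E0, "V", None),  # valid from here on (toy assumption)
]
STARTS = [row[0] for row in TOY_TABLE]


def lookup(code_point):
    """Find the row whose range contains code_point using a bisect search."""
    index = bisect.bisect_right(STARTS, code_point) - 1
    return TOY_TABLE[index]


def remap(text, transitional=False):
    """Apply the status checks character by character."""
    out = []
    for ch in text:
        _start, status, mapping = lookup(ord(ch))
        if status == "V" or (status == "D" and not transitional):
            out.append(ch)
        elif status == "M" or (status == "D" and transitional):
            out.append(mapping)
        elif status == "I":
            continue  # ignored characters are dropped
        else:
            raise ValueError(f"code point U+{ord(ch):04X} is not allowed")
    return "".join(out)
```

With this toy table, `remap("fuß")` keeps the string unchanged, while `remap("fuß", transitional=True)` returns `"fuss"`. A C version would keep the same sorted-array-plus-binary-search structure but store the rows as plain structs instead of Python tuples.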

Here is some code for https://github.com/kjd/idna/blob/master/tools/idna-data to dump the table to a header file:

def uts46_cranges(ucdata):
    # Yield one C initializer per range start. Above U+00FF, a code point
    # whose (status, mapping) matches the previous row is folded into it.
    last = (None, None)
    for cp in ucdata.codepoints():
        fields = cp.uts46_data
        if not fields:
            continue
        status, mapping = UTS46_STATUSES[fields[0]]
        if mapping:
            mapping = "".join(chr(int(codepoint, 16)) for codepoint in fields[1].split())
            mapping = mapping.replace("\\", "\\\\").replace("'", "\\'")
        else:
            mapping = None
        if cp.value > 255 and (status, mapping) == last:
            continue
        last = (status, mapping)

        # Emit the mapping as a C string literal of hex-escaped UTF-8 bytes.
        if mapping:
            mapping = ''.join("\\x{:02X}".format(c) for c in mapping.encode('utf-8'))
            mapping = '"' + mapping + '"'
        else:
            mapping = 'NULL'

        yield "{{0x{0:X}, '{1}', {2}}}".format(cp.value, status, mapping)

def uts46_cdata(ucdata):

    yield "/* This file is automatically generated by tools/idna-data"
    yield " * vim: set fileencoding=utf-8 :\n"
    yield " * IDNA Mapping Table from UTS46."
    yield "*/ \n\n"

    yield "#include <stddef.h>"
    yield "typedef struct {long cp; char status; const char* mapping;} uts46_map_t;"
    yield "const uts46_map_t uts46_map[] = {"

    for row in uts46_cranges(ucdata):
        yield "    {0},".format(row)
    yield "};\n"

def make_cdata(args, ucdata):
    dest_dir = args.dir or '.'
    target_filename = os.path.join(dest_dir, 'uts46data.h')
    with open(target_filename, 'wb') as target:
        for line in uts46_cdata(ucdata):
            target.write((line + "\n").encode('utf-8'))