[Long shot] Decoupling the library code from the language db
I was wondering whether there is a way of decoupling the code of langcodes from the actual language db or, more precisely, of packaging langcodes with a subset of the language db.
My use case is the following: I want the machinery provided by langcodes (in particular, the fuzzy matching of languages from a user-supplied string, and the hashable Language object), but on an extremely reduced subset of languages, say only 100 of them.
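For concreteness, this is roughly the kind of usage I have in mind (illustrative only; the exact langcodes API may differ):

```python
import langcodes

# Fuzzy lookup of a language from a user-supplied name
lang = langcodes.find("french")
print(lang)  # a Language object for 'fr'

# Language objects are hashable, so they can key dicts or live in sets
supported = {langcodes.Language.get("fr"), langcodes.Language.get("en")}
print(langcodes.Language.get("fr") in supported)  # True
```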
Currently, if I use langcodes in my application, I force the end-user to get 30+ MB of data from PyPI.
For example, for a project I am working on right now I coded this: https://github.com/pettarin/lachesis/blob/master/lachesis/language.py but I would be much, much happier if I could use langcodes (without the 30+ MB of data) instead.
One way to achieve this could be the following:
- add a “download” function to the package, able to fetch a language db from the Internet;
- add a “register” function that adds the data for the recognized languages;
- put some logic in setup.py, so that:
  - `pip install langcodes` installs the langcodes “code” and downloads all the CLDR data (e.g. from GitHub);
  - `pip install langcodes[nodb]` installs the langcodes “code” but does not download the data.
In the second case, the client library/application would call the “register” function at runtime, providing the data for the languages it needs to recognize, say the subset of the CLDR data of interest to that client.
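A very rough sketch of what the “register” side could look like (the function names and data layout here are hypothetical, not part of the current langcodes API):

```python
# Hypothetical registration machinery; names and data layout are only illustrative.
import json
import urllib.request

_LANGUAGE_DB = {}  # tag -> record with names, likely subtags, etc.

def register(data):
    """Add (or override) entries for the languages the client cares about."""
    _LANGUAGE_DB.update(data)

def download(url):
    """Fetch a language db (e.g. a CLDR subset published on GitHub) and register it."""
    with urllib.request.urlopen(url) as response:
        register(json.load(response))

# A client that installed the "nodb" variant registers only its own subset:
register({
    "en": {"name": "English"},
    "it": {"name": "Italian"},
})
```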
This is how langcodes 3.0 works now. Thanks for the suggestion.
I am somewhat interested in this.
I believe that the big trie refactor that I just did should help reduce the “heaviness” of langcodes as a library. It now has only 7.4 MB of data, and doesn’t require a database. But there’s still more that could be done – 6.5 MB of that is in the code-to-name trie, because that one still covers every CLDR language. There is probably room for a lighter version of it.
I definitely believe there could be an advantage to having one library for parsing and matching language tags (with no concept of what the things being matched are named), and another library for names.
That advantage might only be realized if the parsing-and-matching part is really fast – perhaps if it’s in a lower-level language than Python. On the side, I’ve been working on Rust code that matches language codes and is unconcerned with names. Maybe one day it could be packaged up and wrapped in Python. But this is mostly an excuse for me to practice programming in Rust and it may not amount to anything.