[Long shot] Decoupling the library code from the language db
I was wondering whether there is a way of decoupling the code of langcodes from the actual language db or, more precisely, of packaging langcodes with a subset of the language db.
My use case is the following: I want the machinery provided by langcodes (in particular, the fuzzy matching of languages from a user-supplied string, and the hashable Language object), but on an extremely reduced subset of languages, say only 100 of them.
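For concreteness, this is roughly the kind of usage I have in mind (illustrative only; the exact langcodes API may differ):

```python
import langcodes

# Fuzzy lookup of a language from a user-supplied name
lang = langcodes.find("french")
print(lang)  # a Language object for 'fr'

# Language objects are hashable, so they can key dicts or live in sets
supported = {langcodes.Language.get("fr"), langcodes.Language.get("en")}
print(langcodes.Language.get("fr") in supported)  # True
```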
Currently, if I use langcodes in my application, I force the end-user to get 30+ MB of data from PyPI.
For example, for a project I am working on right now I coded this: https://github.com/pettarin/lachesis/blob/master/lachesis/language.py but I would be much, much happier if I could use langcodes (without the 30+ MB of data) instead.
One way to achieve this could be the following:
- add a “download” function to the package, able to fetch a language db from the Internet;
- add a “register” function that adds the data for the recognized languages;
- put some logic in setup.py, so that:
  - `pip install langcodes` installs the langcodes “code” and downloads all the CLDR data (e.g. from GitHub);
  - `pip install langcodes[nodb]` installs the langcodes “code” but does not download the data.
In the second case, the client library/application would call the “register” function at runtime, providing the data for the languages it needs to recognize, say the subset of the CLDR data of interest to that client.
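A very rough sketch of what the “register” side could look like (the function names and data layout here are hypothetical, not part of the current langcodes API):

```python
# Hypothetical registration machinery; names and data layout are only illustrative.
import json
import urllib.request

_LANGUAGE_DB = {}  # tag -> record with names, likely subtags, etc.

def register(data):
    """Add (or override) entries for the languages the client cares about."""
    _LANGUAGE_DB.update(data)

def download(url):
    """Fetch a language db (e.g. a CLDR subset published on GitHub) and register it."""
    with urllib.request.urlopen(url) as response:
        register(json.load(response))

# A client that installed the "nodb" variant registers only its own subset:
register({
    "en": {"name": "English"},
    "it": {"name": "Italian"},
})
```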
This is how langcodes 3.0 works now. Thanks for the suggestion.
I am somewhat interested in this.
I believe that the big trie refactor that I just did should help reduce the “heaviness” of langcodes as a library. It now has only 7.4 MB of data, and doesn’t require a database. But there’s still more that could be done – 6.5 MB of that is in the code-to-name trie, because that one still covers every CLDR language. There is probably room for a lighter version of it.
I definitely believe there could be an advantage to having one library for parsing and matching language tags (with no concept of what the things being matched are named), and another library for names.
That advantage might only be realized if the parsing-and-matching part is really fast – perhaps if it’s in a lower-level language than Python. On the side, I’ve been working on Rust code that matches language codes and is unconcerned with names. Maybe one day it could be packaged up and wrapped in Python. But this is mostly an excuse for me to practice programming in Rust and it may not amount to anything.