spaCy-models: Please Consider Distributing via PyPI

Feature Summary

Release spaCy models via PyPI

Feature Description

We use spaCy in an enterprise setting. For security, the hosts that build our production Docker images cannot connect to the external internet. This complicates installing packages like spacy-models, where the recommended installation method is either to install from a GitHub release (which requires a connection to github.com) or to vendor the package (which avoids networking issues but bloats individual repos).
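
To make the GitHub-release route concrete, here is a minimal sketch of the pip command it implies. The URL template is an assumption about how spacy-models release archives are laid out, the version 2.3.1 is only an illustrative choice, and release_install_command is a hypothetical helper rather than part of spaCy. The command is built but not executed.

```python
import sys

# Assumed layout of spacy-models release archives on GitHub.
RELEASE_URL = (
    "https://github.com/explosion/spacy-models/releases/download/"
    "{name}-{version}/{name}-{version}.tar.gz"
)

def release_install_command(name, version):
    """Build (but do not run) a pip command that installs a model archive."""
    url = RELEASE_URL.format(name=name, version=version)
    return [sys.executable, "-m", "pip", "install", url]

# Illustrative model name and version.
cmd = release_install_command("en_core_web_sm", "2.3.1")
print(" ".join(cmd[1:]))
```

Because the final argument is a github.com URL, this command fails on a build host without external network access, which is the problem described above.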

Publishing the models through PyPI would mean that spacy-models is installed the same way as any other package. It would also let us benefit from the safeguards that PyPI provides (e.g., the ability to mirror the package index on our internal network and the assurance that package versions are immutable).
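
As a rough illustration of the workflow being requested, the sketch below builds a pip command that resolves a pinned model from an internal mirror. The mirror URL and mirrored_install_command are hypothetical, the version is illustrative, and nothing here assumes the models are actually published on PyPI today.

```python
import sys

# Hypothetical internal mirror of the package index (not a real host).
INTERNAL_INDEX = "https://pypi.internal.example/simple"

def mirrored_install_command(package, version, index_url=INTERNAL_INDEX):
    """Build (but do not run) a pip command that uses only the mirror."""
    return [
        sys.executable, "-m", "pip", "install",
        "--index-url", index_url,   # replace the default index with the mirror
        f"{package}=={version}",    # pin one exact version
    ]

cmd = mirrored_install_command("en_core_web_sm", "2.3.1")
print(" ".join(cmd[1:]))
```

With this shape, a model would be just another pinned requirement, and the build host would only ever contact the internal index.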

Perhaps you could start by adding the small models to PyPI, as they would not run into the default package size restrictions. PyPI allows package authors to request an increase to the maximum allowable package size; the increased limits would easily support the medium models. There is also precedent for size limits that would allow distributing the large models as well.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

adrianeboyd commented, Aug 25, 2020 (1 reaction)

Having the packages download the models from GitHub wouldn’t help with the security restrictions mentioned above.

The model packages are standard pip packages with longer names like en_core_web_sm. If you install the package from a .tar.gz downloaded from spacy-models, or with spacy download en_core_web_sm, you’ll just have en_core_web_sm and no en shortcut.

In contrast, spacy download en does several things: 1) maps the shortcut name en to the package en_core_web_sm, 2) downloads and installs the en_core_web_sm package with pip, and 3) adds a symlink from en to en_core_web_sm. The symlink is a separate step that doesn’t involve pip or how the model package is installed.
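
The three steps above can be sketched as a small planning function. This is not spaCy’s actual implementation: plan_download and the SHORTCUTS table are hypothetical, and only the en entry comes from the comment itself. The function describes the steps without running pip or touching the filesystem.

```python
# Hypothetical shortcut table; only the "en" entry is taken from the thread.
SHORTCUTS = {"en": "en_core_web_sm"}

def plan_download(name):
    """Return the resolved package name and the steps a download would take."""
    package = SHORTCUTS.get(name, name)       # step 1: resolve the shortcut
    steps = [("pip install", package)]        # step 2: install with pip
    if package != name:
        # step 3: only a shortcut name gets a symlink, separate from pip
        steps.append(("symlink", f"{name} -> {package}"))
    return package, steps

print(plan_download("en"))
print(plan_download("en_core_web_sm"))
```

Calling it with the full package name yields only the pip step, which matches the point that installing en_core_web_sm directly leaves no en shortcut.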

We’ve realized that the symlinks cause a number of headaches, so we don’t recommend them anymore and are planning to remove them in spaCy v3. After that, you will only be able to use the full package names like en_core_web_sm with spacy.load().
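
A minimal sketch of loading by full package name follows. It assumes the model, being a standard pip package, is importable under its full name; model_is_installed is a hypothetical helper, and the spaCy import is guarded so the snippet also runs where the model is absent.

```python
import importlib.util

def model_is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None

MODEL = "en_core_web_sm"

if model_is_installed(MODEL):
    import spacy  # imported only when the model package is present
    nlp = spacy.load(MODEL)  # full package name, no "en" shortcut
    print(nlp.pipe_names)
else:
    print(f"{MODEL} is not installed")
```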

github-actions[bot] commented, Nov 1, 2021 (0 reactions)

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
