Error Using SpaCy in Async Threads ([E050] Can't find model 'en_core_web_md.vectors')

The Problem

I am loading the en_core_web_md spaCy model in the main thread and passing it as an argument to async threads. When I call doc=nlp(model) in one of those threads, I get this error message: OSError: [E050] Can't find model 'en_core_web_md.vectors'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

As far as I know, spaCy should be thread safe, and this error only occurs with models that include word vectors. Indeed, it does not arise when using the en_core_web_sm model, but it persists when loading the en_core_web_md or en_core_web_lg models, even with the vectors=False parameter set.

Code Example:

import spacy

# Load the model once in the main process
spacy_nlp = spacy.load("en_core_web_md", vectors=False)
for i, file in enumerate(files):
    # Hand the loaded model to each pool task as an argument
    application.pool.apply_async(
        my_function,
        args=[file, i, spacy_nlp],
        kwds=configs,
        callback=success_callback_factory,
        error_callback=error_callback_factory)

Observations

The error does not appear when I use en_core_web_sm (which has no word vectors). However, it still occurs when I load the model with nlp = spacy.load(spacy_model, vectors=False), and I get the same problem with the large model.

Environment

  • Operating System: Windows 10
  • Python Version Used: 3.7.6
  • spaCy Version Used: 2.2.4 (originally 2.2.3; the problem persists even after upgrading spaCy)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 16 (7 by maintainers)

Top GitHub Comments

2 reactions
adrianeboyd commented, Apr 24, 2020

In terms of the speed, spawn is the problem more than spacy. There are some known issues with the vocab/vectors (which you just solved), but otherwise it’s just very slow to start child processes with spawn, which is the only option for Windows. In Linux the default is fork, which is much faster to start and doesn’t have the vectors issues because more of the global state is shared with the child processes. Once the child processes have started, I think the differences between fork and spawn may not be that large, though. Try it out with a longer-running scenario to see how well it works?
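To check which start method applies on a given machine, a minimal standard-library sketch like the following may help. The helper name describe_start_methods is made up for illustration; the defaults it reports depend on the platform and the Python version, so treat the output as something to inspect rather than assume.

```python
import multiprocessing as mp


def describe_start_methods():
    """Return the default start method and every method this platform supports."""
    # get_start_method() reports the method a plain Pool() would use here.
    default = mp.get_start_method()
    # get_all_start_methods() lists what is available; Windows only offers spawn.
    available = mp.get_all_start_methods()
    return default, available


default_method, available = describe_start_methods()
print(f"default: {default_method}; available: {available}")
```

If the default turns out to be spawn, module-level state set in the parent process is not carried into the children, which is the situation described in the comment above.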

nlp.pipe() will be much faster if you’re processing multiple texts in one request. If it’s just one text at a time, then nlp() won’t be any different from nlp.pipe(), though. It depends on the model size and how long your texts / batches are, so I’d run some timing tests there, too, to see what works best.

For the global vector state issue, I don’t think there’s going to be a better solution in spacy v2 than the load_nlp workaround above, but there should be improvements in v3.

1 reaction
gsevrodrigues commented, Oct 20, 2020

Hi, I am using en_core_web_lg 2.3.1 and got the same problem. Can you share your solution with me? Thanks.

Hey, what worked for me was this solution from @adrianeboyd:

Ah, I realized what I overlooked initially. We fixed this for nlp.pipe(), but if you’re using nlp with your own multiprocessing that uses spawn, it’s still not going to work. You’ll need to basically do the same thing as in that patch in your own method: pass load_nlp.VECTORS and restore it in the method with load_nlp.VECTORS = vectors:

https://github.com/explosion/spaCy/pull/5081/files

(Be aware that multiprocessing with spawn and larger spacy models is probably going to be rather slow.)

Basically, I had to pass load_nlp.VECTORS from the place where nlp was originally loaded to the function that runs in the spawned process, and restore it inside that function with load_nlp.VECTORS = vectors. The linked code shows how to do it.
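The pass-and-restore pattern described above can be sketched with the standard library alone. This is an illustration of the mechanism, not spaCy's or thinc's actual code: the VECTORS dict, lookup, worker and run_in_spawned_pool are all hypothetical stand-ins, with VECTORS playing the role that load_nlp.VECTORS plays in the comment.

```python
import multiprocessing as mp

# Hypothetical stand-in for a module-level registry such as load_nlp.VECTORS.
# Under spawn, each child re-imports this module, so it starts out empty there.
VECTORS = {}


def lookup(name):
    """Simulate a library call that fails when the registry lacks an entry."""
    if name not in VECTORS:
        raise OSError(f"Can't find model '{name}'")
    return VECTORS[name]


def worker(name, vectors):
    """Restore the parent's registry in this process, then do the real work."""
    global VECTORS
    VECTORS = vectors
    return len(lookup(name))


def run_in_spawned_pool(name):
    """Pass the registry explicitly, since spawn will not carry it over."""
    ctx = mp.get_context("spawn")
    with ctx.Pool(1) as pool:
        return pool.apply_async(worker, args=(name, VECTORS)).get()
```

In a script, run_in_spawned_pool would be called from under an if __name__ == "__main__": guard, which spawn requires. In the real spaCy v2 case, the value passed as the extra argument would be load_nlp.VECTORS, as in the linked patch.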


