Error Using spaCy in Async Threads ([E050] Can't find model 'en_core_web_md.vectors')
The Problem
I am loading the en_core_web_md spaCy model in the main thread and passing it as an argument to async threads. Then, when I try doc = nlp(model) in one of those threads, I get the error message: OSError: [E050] Can't find model 'en_core_web_md.vectors'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
As far as I know, spaCy should be thread-safe, and this error message only occurs when using models that have word vectors. Indeed, the error does not arise when using the en_core_web_sm model, but it persists even when loading the en_core_web_md or en_core_web_lg models with the vectors=False parameter set.
Code Example:
import spacy

spacy_nlp = spacy.load("en_core_web_md", vectors=False)
for i, file in enumerate(files):
    application.pool.apply_async(
        my_function,
        args=[file, i, spacy_nlp],
        kwds=configs,
        callback=success_callback_factory,
        error_callback=error_callback_factory,
    )
Observations
- This error is not present when I use en_core_web_sm (which has no word vectors).
- However, it still occurs when I load the model with nlp = spacy.load(spacy_model, vectors=False).
- I get the same problem when trying to use the large model (en_core_web_lg).
Environment
- Operating System: Windows 10
- Python Version Used: 3.7.6
- spaCy Version Used: 2.2.4, upgraded from 2.2.3 (the problem persists even after upgrading spaCy)
Top GitHub Comments
In terms of speed, spawn is the problem more than spaCy. There are some known issues with the vocab/vectors (which you just solved), but otherwise it's just very slow to start child processes with spawn, which is the only option on Windows. On Linux the default is fork, which is much faster to start and doesn't have the vectors issue, because more of the global state is shared with the child processes. Once the child processes have started, though, I think the differences between fork and spawn may not be that large. Try it out with a longer-running scenario to see how well it works.

nlp.pipe() will be much faster if you're processing multiple texts in one request. If it's just one text at a time, then nlp() won't be any different from nlp.pipe(). It depends on the model size and how long your texts/batches are, so I'd run some timing tests there, too, to see what works best.

For the global vector state issue, I don't think there's going to be a better solution in spaCy v2 than the load_nlp workaround above, but there should be improvements in v3.

Hey, what worked for me was that solution from @adrianeboyd:
Basically I had to pass load_nlp.VECTORS from the place where the original nlp was loaded to the newly spawned function, and restore it in that method with load_nlp.VECTORS = vectors. The linked code shows how to do it; a sketch of the idea follows below.
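A minimal sketch of that workaround, assuming spaCy v2.x with thinc 7.x, where the global vector table lives in thinc.extra.load_nlp.VECTORS; the function and variable names here are illustrative, not taken from the original report:

import multiprocessing

import spacy
from thinc.extra import load_nlp  # assumption: thinc 7.x, bundled with spaCy v2


def my_function(text, nlp, vectors):
    # Hypothetical worker: restore the global vector table in the child.
    # With the "spawn" start method, this module-level state is not
    # inherited from the parent, which is what triggers the E050 error.
    load_nlp.VECTORS = vectors
    doc = nlp(text)
    return [token.lemma_ for token in doc]


if __name__ == "__main__":
    spacy_nlp = spacy.load("en_core_web_md")
    # Capture the vector table in the main process, after loading the model.
    vectors = load_nlp.VECTORS

    texts = ["First document.", "Second document."]
    # get_context("spawn") makes the start method explicit; on Windows
    # spawn is the only option anyway, while Linux defaults to fork.
    with multiprocessing.get_context("spawn").Pool(2) as pool:
        results = [
            pool.apply_async(my_function, args=(text, spacy_nlp, vectors))
            for text in texts
        ]
        for result in results:
            print(result.get())

Passing the table explicitly and restoring it in the worker is what makes the vectors visible to the child process, since spawn starts from a fresh interpreter rather than a copy of the parent's state.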
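Separately, on the nlp.pipe() advice above, a short comparison sketch; the texts and batch_size here are placeholders:

import spacy

nlp = spacy.load("en_core_web_md")
texts = ["First document.", "Second document.", "Third document."]

# One call per text: each call pays the full per-call pipeline overhead.
docs_one_by_one = [nlp(text) for text in texts]

# nlp.pipe() streams texts through the pipeline in batches, which is
# usually much faster when processing many texts at once.
docs_batched = list(nlp.pipe(texts, batch_size=50))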