Weird vectors loading bug when trying to distribute NLP processing with Dask
Dear spaCy team,
I am struggling with a bug when trying to distribute NLP processing with Dask.
Here is a small example of code that triggers the bug:
## How to reproduce the behaviour
import dask.distributed
import dask.delayed
import spacy

text = '''
Abstract Though necessary to slow the spread of the novel Coronavirus (Covid-19), actions such as social-distancing,
sheltering in-place, restricted travel, and closures of key community foundations are likely to dramatically increase
the risk for family violence around the globe. In fact many countries are already indicating a dramatic increase in
reported cases of domestic violence. While no clear precedent for the current crisis exists in academic literature,
exploring the impact of natural disasters on family violence reports may provide important insight for family violence
victim-serving professionals. Improving collaborations between human welfare and animal welfare agencies, expanding
community partnerships, and informing the public of the great importance of reporting any concerns of abuse are all
critical at this time.'''

def extract_tokens(nlp):
    return [tok.lemma_ for tok in nlp(text) if not tok.is_punct]

# Original code, which works
if __name__ == '__main__':
    nlp = spacy.load('en_core_sci_lg')
    tokens = extract_tokens(nlp)
    print('Expected result:')
    print(' '.join(tokens))
    print()

# Distributed code, which triggers the spaCy vector loading bug
if __name__ == '__main__':
    client = dask.distributed.Client(n_workers=1, processes=True)
    nlp = spacy.load('en_core_sci_lg')
    nlp_future = client.scatter(nlp)
    lambda_future = client.submit(extract_tokens, nlp_future)
    print('Try to distribute NLP processing...')
    tokens = client.gather(lambda_future)
    print(' '.join(tokens))
Here is the output and the stack trace of the error:
Expected result:
abstract though necessary to slow the spread of the novel coronavirus covid-19 action such as social-distancing
shelter in-place restricted travel and closure of key community foundation be likely to dramatically increase
the risk for family violence around the globe in fact many country be already indicate a dramatic increase in
report case of domestic violence while no clear precedent for the current crisis exist in academic literature
explore the impact of natural disaster on family violence report may provide important insight for family violence
victim-serving professional improve collaboration between human welfare and animal welfare agency expand
community partnership and inform the public of the great importance of report any concern of abuse be all
critical at this time
Try to distribute NLP processing...
distributed.worker - WARNING - Compute Failed
Function: extract_tokens
args: (<spacy.lang.en.English object at 0x7f4bf5958d50>)
kwargs: {}
Exception: OSError("[E050] Can't find model 'en_core_sci_lg_vectors'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.")
Traceback (most recent call last):
File "src/spacy_load_vectors_bug.py", line 52, in <module>
tokens = client.gather(lambda_future)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 1967, in gather
asynchronous=asynchronous,
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 816, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 347, in sync
raise exc.with_traceback(tb)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 331, in f
result[0] = yield future
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 1826, in _gather
raise exception.with_traceback(traceback)
File "src/spacy_load_vectors_bug.py", line 32, in extract_tokens
return [tok.lemma_ for tok in nlp(text) if not tok.is_punct]
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/spacy/language.py", line 439, in __call__
doc = proc(doc, **component_cfg.get(name, {}))
File "pipes.pyx", line 396, in spacy.pipeline.pipes.Tagger.__call__
File "pipes.pyx", line 415, in spacy.pipeline.pipes.Tagger.predict
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 167, in __call__
return self.predict(x)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 40, in predict
X = layer(X)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 167, in __call__
return self.predict(x)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 310, in predict
X = layer(layer.ops.flatten(seqs_in, pad=pad))
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 167, in __call__
return self.predict(x)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 40, in predict
X = layer(X)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 167, in __call__
return self.predict(x)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 131, in predict
y, _ = self.begin_update(X, drop=None)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 379, in uniqued_fwd
Y_uniq, bp_Y_uniq = layer.begin_update(X_uniq, drop=drop)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 163, in begin_update
values = [fwd(X, *a, **k) for fwd in forward]
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 163, in <listcomp>
values = [fwd(X, *a, **k) for fwd in forward]
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 256, in wrap
output = func(*args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 163, in begin_update
values = [fwd(X, *a, **k) for fwd in forward]
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 163, in <listcomp>
values = [fwd(X, *a, **k) for fwd in forward]
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 256, in wrap
output = func(*args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 163, in begin_update
values = [fwd(X, *a, **k) for fwd in forward]
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 163, in <listcomp>
values = [fwd(X, *a, **k) for fwd in forward]
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 256, in wrap
output = func(*args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 163, in begin_update
values = [fwd(X, *a, **k) for fwd in forward]
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 163, in <listcomp>
values = [fwd(X, *a, **k) for fwd in forward]
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 256, in wrap
output = func(*args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/static_vectors.py", line 60, in begin_update
vector_table = self.get_vectors()
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/static_vectors.py", line 55, in get_vectors
return get_vectors(self.ops, self.lang)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/extra/load_nlp.py", line 26, in get_vectors
nlp = get_spacy(lang)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/extra/load_nlp.py", line 14, in get_spacy
SPACY_MODELS[lang] = spacy.load(lang, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/spacy/__init__.py", line 30, in load
return util.load_model(name, **overrides)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/spacy/util.py", line 169, in load_model
raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_sci_lg_vectors'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
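Reading the traceback, thinc's `StaticVectors` layer resolves its vector table through a process-level cache in `thinc.extra.load_nlp`; in the fresh Dask worker that cache is empty, so thinc falls back to `spacy.load('en_core_sci_lg_vectors')`, which is not an installed package. One possible workaround (a sketch only, not a fix confirmed by this thread, and assuming thinc keys its cache by the model's vectors name, as the traceback suggests) is to pre-register the unpickled model in each worker before calling the pipeline:

```python
def register_model_for_thinc(nlp, cache=None):
    """Sketch of a possible workaround: seed thinc's process-level model
    cache with the already-unpickled nlp object, so that
    StaticVectors.get_vectors() finds it instead of calling spacy.load()
    on the vectors name.

    Assumes thinc looks models up by nlp.vocab.vectors.name; `cache` is
    injectable so the seeding logic can be checked without thinc installed.
    """
    if cache is None:
        from thinc.extra.load_nlp import SPACY_MODELS as cache
    cache[nlp.vocab.vectors.name] = nlp
    return cache
```

In the repro above, this would be called at the top of `extract_tokens`, before `nlp(text)`, so the seeding happens inside each worker process.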
## Your Environment
## Info about spaCy
* **spaCy version:** 2.2.4
* **Platform:** Linux-4.15.0-101-generic-x86_64-with-debian-buster-sid
* **Python version:** 3.7.6
* **Environment:** Python 3.7
* Latest dask from conda-forge:
  `$ conda install -c conda-forge dask`
* spaCy installed with the following command:
  `$ pip install -U spacy[cuda100]`
* scispacy installed with the following command:
  `$ pip install scispacy`
* NLP model installed with the following command:
  `$ pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz`
The problem does not seem to be platform specific as I can reproduce the same error on macOS 10.15.
Best Regards.
Issue Analytics
- State:
- Created: 3 years ago
- Comments: 6 (3 by maintainers)
## Top GitHub Comments
We could centralise those globals to a singleton object which could be shared among processes.
The workaround is fine enough for now. Thank you for your help.
Let’s close this ticket.
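The workaround accepted in the comments above is not spelled out in the thread. A common pattern for this class of serialization bug, sketched here under the assumption that loading the model inside the worker (rather than scattering the pickled `nlp` object) populates thinc's cache correctly, is to submit the raw text and memoize `spacy.load` once per process:

```python
# Sketch of an assumed pattern, not the exact fix from this thread:
# keep a per-process cache so each Dask worker calls spacy.load() itself,
# at most once, instead of receiving a pickled nlp object.
_MODELS = {}

def get_model(name, loader=None):
    """Return a cached model, loading it on first use in this process.

    `loader` defaults to spacy.load; it is injectable so the caching
    behaviour can be exercised without the model installed.
    """
    if loader is None:
        import spacy  # lazy import: runs inside the worker process
        loader = spacy.load
    if name not in _MODELS:
        _MODELS[name] = loader(name)
    return _MODELS[name]

def extract_tokens(text):
    nlp = get_model('en_core_sci_lg')
    return [tok.lemma_ for tok in nlp(text) if not tok.is_punct]

# Usage: submit the text, not the model --
#   client = dask.distributed.Client(n_workers=1, processes=True)
#   tokens = client.gather(client.submit(extract_tokens, text))
```

The trade-off is that each worker pays the model-loading cost once, but no spaCy object ever crosses the wire, which sidesteps the vector-name lookup entirely.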
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.