
Weird vectors loading bug when trying to distribute NLP processing with Dask


Dear spaCy team,

I am struggling with a bug when trying to distribute NLP processing with Dask.

Here is a small example of code that triggers the bug:

How to reproduce the behaviour

import dask.distributed
import dask.delayed
import spacy


text = '''
Abstract Though necessary to slow the spread of the novel Coronavirus (Covid-19), actions such as social-distancing,
sheltering in-place, restricted travel, and closures of key community foundations are likely to dramatically increase
the risk for family violence around the globe. In fact many countries are already indicating a dramatic increase in
reported cases of domestic violence. While no clear precedent for the current crisis exists in academic literature,
exploring the impact of natural disasters on family violence reports may provide important insight for family violence
victim-serving professionals. Improving collaborations between human welfare and animal welfare agencies, expanding
community partnerships, and informing the public of the great importance of reporting any concerns of abuse are all
critical at this time.'''


def extract_tokens(nlp):
    return [tok.lemma_ for tok in nlp(text) if not tok.is_punct]


if __name__ == '__main__':
    # Original code, which works
    nlp = spacy.load('en_core_sci_lg')
    tokens = extract_tokens(nlp)
    print('Expected result:')
    print(' '.join(tokens))
    print()

    # Distributed code, which triggers the spaCy vector loading bug
    client = dask.distributed.Client(n_workers=1, processes=True)

    nlp = spacy.load('en_core_sci_lg')
    nlp_future = client.scatter(nlp)
    lambda_future = client.submit(extract_tokens, nlp_future)
    print('Try to distribute NLP processing...')
    tokens = client.gather(lambda_future)
    print(' '.join(tokens))

Here is the output and the stack trace of the error:

Expected result:

 abstract though necessary to slow the spread of the novel coronavirus covid-19 action such as social-distancing 
 shelter in-place restricted travel and closure of key community foundation be likely to dramatically increase 
 the risk for family violence around the globe in fact many country be already indicate a dramatic increase in 
 report case of domestic violence while no clear precedent for the current crisis exist in academic literature 
 explore the impact of natural disaster on family violence report may provide important insight for family violence 
 victim-serving professional improve collaboration between human welfare and animal welfare agency expand 
 community partnership and inform the public of the great importance of report any concern of abuse be all 
 critical at this time

Try to distribute NLP processing...
distributed.worker - WARNING -  Compute Failed
Function:  extract_tokens
args:      (<spacy.lang.en.English object at 0x7f4bf5958d50>)
kwargs:    {}
Exception: OSError("[E050] Can't find model 'en_core_sci_lg_vectors'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.")

Traceback (most recent call last):
  File "src/spacy_load_vectors_bug.py", line 52, in <module>
    tokens = client.gather(lambda_future)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 1967, in gather
    asynchronous=asynchronous,
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 816, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 347, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 331, in f
    result[0] = yield future
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 1826, in _gather
    raise exception.with_traceback(traceback)
  File "src/spacy_load_vectors_bug.py", line 32, in extract_tokens
    return [tok.lemma_ for tok in nlp(text) if not tok.is_punct]
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/spacy/language.py", line 439, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "pipes.pyx", line 396, in spacy.pipeline.pipes.Tagger.__call__
  File "pipes.pyx", line 415, in spacy.pipeline.pipes.Tagger.predict
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 167, in __call__
    return self.predict(x)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 40, in predict
    X = layer(X)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 167, in __call__
    return self.predict(x)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 310, in predict
    X = layer(layer.ops.flatten(seqs_in, pad=pad))
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 167, in __call__
    return self.predict(x)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 40, in predict
    X = layer(X)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 167, in __call__
    return self.predict(x)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 131, in predict
    y, _ = self.begin_update(X, drop=None)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 379, in uniqued_fwd
    Y_uniq, bp_Y_uniq = layer.begin_update(X_uniq, drop=drop)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 163, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 163, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 256, in wrap
    output = func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 163, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 163, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 256, in wrap
    output = func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 163, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 163, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 256, in wrap
    output = func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 163, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 163, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/api.py", line 256, in wrap
    output = func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/static_vectors.py", line 60, in begin_update
    vector_table = self.get_vectors()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/neural/_classes/static_vectors.py", line 55, in get_vectors
    return get_vectors(self.ops, self.lang)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/extra/load_nlp.py", line 26, in get_vectors
    nlp = get_spacy(lang)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/thinc/extra/load_nlp.py", line 14, in get_spacy
    SPACY_MODELS[lang] = spacy.load(lang, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/spacy/__init__.py", line 30, in load
    return util.load_model(name, **overrides)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/spacy/util.py", line 169, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_sci_lg_vectors'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
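
Reading the trace, thinc appears to cache static vectors in a module-level table keyed by the vectors name (here 'en_core_sci_lg_vectors'): when the pickled pipeline is rebuilt inside the Dask worker process, that cache is empty, so thinc falls back to spacy.load() with the vectors name, which is not an installed package. If that reading is right, a minimal workaround sketch is to load the model inside the task instead of scattering it, so each worker process populates its own cache (extract_tokens_local is a hypothetical variant of the function above):

import dask.distributed
import spacy

text = 'Social distancing is likely to increase the risk for family violence.'  # any input text


def extract_tokens_local(text):
    # Load the model inside the task: this runs in the worker process
    # and fills thinc's module-level vector table there, instead of
    # relying on state pickled in the parent process.
    nlp = spacy.load('en_core_sci_lg')
    return [tok.lemma_ for tok in nlp(text) if not tok.is_punct]


if __name__ == '__main__':
    client = dask.distributed.Client(n_workers=1, processes=True)
    future = client.submit(extract_tokens_local, text)
    print(' '.join(client.gather(future)))

This reloads the model on every task; caching one pipeline per worker process would avoid the repeated load.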

Your Environment

## Info about spaCy

* **spaCy version:** 2.2.4
* **Platform:** Linux-4.15.0-101-generic-x86_64-with-debian-buster-sid
* **Python version:** 3.7.6
Environment information:

Python 3.7

Latest dask from conda-forge:
$ conda install -c conda-forge dask

spaCy installed with the following command:
$ pip install -U spacy[cuda100]

scispaCy installed with the following command:
$ pip install scispacy

NLP model installed with the following command:
$ pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz

The problem does not seem to be platform specific as I can reproduce the same error on macOS 10.15.

Best regards.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
bu2 commented, Jun 2, 2020

We could centralise those globals to a singleton object which could be shared among processes.

The workaround is fine enough for now. Thank you for your help.

Let’s close this ticket.
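
For reference, a minimal sketch of that per-process singleton idea (hypothetical helper name get_nlp; it assumes the model package is installed on every worker): cache one loaded pipeline per Dask worker, so thinc's vector globals are populated locally rather than travelling through pickle.

import spacy
from dask.distributed import get_worker


def get_nlp(model='en_core_sci_lg'):
    # One pipeline per worker process; loading it locally also fills
    # thinc's module-level vector table in that process.
    worker = get_worker()
    if not hasattr(worker, '_nlp'):
        worker._nlp = spacy.load(model)
    return worker._nlp


def extract_tokens(text):
    nlp = get_nlp()
    return [tok.lemma_ for tok in nlp(text) if not tok.is_punct]

With this, tasks can be submitted as client.submit(extract_tokens, text) with no scatter step, and the model is loaded at most once per worker process.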

0 reactions
github-actions[bot] commented, Nov 5, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

