
Use in Apache Spark / English() object cannot be pickled

See original GitHub issue

For spaCy to work out of the box with Apache Spark, the language models need to be picklable, so that they can be initialised on the master node and then sent to the workers.

This currently doesn’t work with plain pickle, failing as follows:

>>> from __future__ import unicode_literals, print_function
>>> from spacy.en import English
>>> import pickle
>>> nlp = English()
>>> nlpp = pickle.dumps(nlp)
Traceback (most recent call last):
[...]
TypeError: can't pickle Vocab objects

Apache Spark ships with a package called cloudpickle, which is meant to support a wider set of Python constructs. Serialisation with cloudpickle also fails, however, resulting in a segmentation fault:

>>> from pyspark import cloudpickle
>>> pickled_nlp = cloudpickle.dumps(nlp)
>>> nlpp = pickle.loads(pickled_nlp)
>>> nlpp('test text')
Segmentation fault

By default Apache Spark uses pickle, but it can be told to use cloudpickle instead.
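
PySpark exposes this through the serializer argument to SparkContext. A minimal sketch, assuming a PySpark release (such as the 1.x versions this issue dates from) where that argument and CloudPickleSerializer are both available:

from pyspark import SparkContext
from pyspark.serializers import CloudPickleSerializer

# Serialise data with cloudpickle instead of the default plain pickle.
sc = SparkContext(appName="spacy-pickling", serializer=CloudPickleSerializer())

Note that this only swaps the serialiser; as shown above, cloudpickle still cannot round-trip the English() object.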

Currently, a feasible workaround is to load the language models lazily on the worker nodes:

nlp = None  # loaded lazily, once per worker process

def lazyloaded_nlp(s):
    global nlp
    if nlp is None:
        # First call in this process: build the model here rather than
        # shipping it from the master, since English() cannot be pickled.
        nlp = English()
    return nlp(s)
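
For context, the workaround might be used from PySpark as follows; texts_rdd (an RDD of raw strings) is an illustrative assumption, and the tokens are projected down to plain strings so that the results themselves stay picklable:

# texts_rdd is a hypothetical RDD of raw strings.
token_lists = texts_rdd.map(lambda s: [t.orth_ for t in lazyloaded_nlp(s)])
print(token_lists.take(5))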

The above works. Nevertheless, I wonder whether it would be possible to make the English() object picklable. If it’s not too difficult on your end, having picklable language models would provide a better out-of-the-box experience for Apache Spark users.
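
One common way to get that behaviour without serialising the full model state is a __reduce__ hook that tells pickle to reconstruct the object by reloading it. The following is a minimal sketch of the idea, not spaCy’s actual implementation, and the class name is invented:

import pickle

class PicklableEnglish(object):
    """Pickles as a recipe for reloading, not as the model's binary state."""
    def __init__(self):
        self._nlp = English()  # expensive load from disk

    def __call__(self, text):
        return self._nlp(text)

    def __reduce__(self):
        # Unpickling calls PicklableEnglish() again, which reloads
        # the model on whichever worker does the loading.
        return (PicklableEnglish, ())

nlp = PicklableEnglish()
restored = pickle.loads(pickle.dumps(nlp))  # reloads rather than deserialises

The obvious caveat is that any state mutated after loading is silently dropped, a point the maintainer raises below.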

Issue Analytics

  • State: closed
  • Created: 8 years ago
  • Reactions: 2
  • Comments: 53 (38 by maintainers)

Top GitHub Comments

1 reaction
mikepb commented, Apr 7, 2016

I’ve had success using the Pickleless workaround in https://github.com/spacy-io/spaCy/issues/125#issuecomment-185881231

The wrapper class essentially tells Python not to pickle the English object. Instead, the Pickleless English object is reloaded and cached per process when defined as a global variable.

Remember that custom vocabulary will not transfer between Python processes using this method.
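
The linked comment’s exact code is not reproduced here, but the pattern it describes can be sketched as follows; the class and cache names are illustrative:

_NLP_CACHE = None  # one model per process, reloaded rather than unpickled

class PicklelessEnglish(object):
    def __getstate__(self):
        return {}  # ship nothing when pickled

    def __setstate__(self, state):
        pass  # nothing to restore; the model is reloaded lazily below

    def __call__(self, text):
        global _NLP_CACHE
        if _NLP_CACHE is None:
            _NLP_CACHE = English()  # first use in this process
        return _NLP_CACHE(text)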

1 reaction
honnibal commented, Oct 6, 2015

I’ve spent a little time looking into this now.

The workflow that’s a little bit tricky to support is something like this:

  • Create an English() instance
  • Change the state of some binary data, e.g. modify the lexicon
  • Send it to workers, with new state preserved

Now, when I say “a little bit tricky”… if this is a requirement, we can do it. It’ll mean writing out all the state to binary data strings, shipping ~1 GB to each worker, and then loading from the strings. The patch will touch every class, and it might be fiddly, especially to keep efficiency nice. But there’s no real problem.
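
For readers hitting this today: later spaCy versions grew exactly this kind of byte-level serialisation. A sketch assuming a modern spaCy install with the to_bytes/from_bytes API (not the API under discussion in 2015):

import spacy

nlp = spacy.load("en_core_web_sm")  # master: load and possibly mutate state
payload = nlp.to_bytes()            # full pipeline state as a byte string

# Worker: construct the same pipeline, then overwrite its state so
# mutations made on the master are preserved.
worker_nlp = spacy.load("en_core_web_sm")
worker_nlp.from_bytes(payload)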

The question is whether this workflow is really important. I would’ve thought that the better way to do things was to divide the documents on the master node, and then send a reference to a function like this:


def do_work(batch_of_texts):
    # Each worker builds its own English() instead of receiving a pickle.
    nlp = English()
    for text in batch_of_texts:
        doc = nlp(text)
        # Stuff

# distribute() is pseudocode for whatever scheduling primitive is in use.
distribute(texts, do_work, n_workers=10)

Does PySpark not work this way?
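
For reference, PySpark’s mapPartitions covers essentially the pattern sketched above: the supplied function runs once per partition, so each worker task can build the model a single time without ever pickling it. A sketch, again assuming texts_rdd is an RDD of raw strings:

def process_partition(texts):
    nlp = English()  # built once per partition, never pickled
    for text in texts:
        doc = nlp(text)
        yield [t.orth_ for t in doc]  # return plain data, not Doc objects

results = texts_rdd.mapPartitions(process_partition)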
