
Use in Apache Spark / English() object cannot be pickled

See original GitHub issue

For spaCy to work out of the box with Apache Spark, the language models need to be picklable, so that they can be initialised on the master node and then sent to the workers.

This currently doesn’t work with plain pickle, failing as follows:

>>> from __future__ import unicode_literals, print_function
>>> from spacy.en import English
>>> import pickle
>>> nlp = English()
>>> nlpp = pickle.dumps(nlp)
Traceback (most recent call last):
[...]
TypeError: can't pickle Vocab objects

Apache Spark ships with a package called cloudpickle, which is meant to support a wider set of Python constructs. Serialisation with cloudpickle also fails, however, resulting in a segmentation fault:

>>> from pyspark import cloudpickle
>>> pickled_nlp = cloudpickle.dumps(nlp)
>>> nlpp = pickle.loads(pickled_nlp)
>>> nlpp('test text')
Segmentation fault

By default Apache Spark uses pickle, but it can be told to use cloudpickle instead.
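
PySpark exposes this through the serializer argument to SparkContext. A minimal sketch, assuming a PySpark release (such as the 1.x versions this issue dates from) where that argument and CloudPickleSerializer are both available:

from pyspark import SparkContext
from pyspark.serializers import CloudPickleSerializer

# Serialise data with cloudpickle instead of the default plain pickle.
sc = SparkContext(appName="spacy-pickling", serializer=CloudPickleSerializer())

Note that this only swaps the serialiser; as shown above, cloudpickle still cannot round-trip the English() object.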

Currently, a feasible workaround is to load the language models lazily on the worker nodes:

nlp = None  # loaded lazily, once per worker process

def lazyloaded_nlp(s):
    global nlp
    if nlp is None:
        # First call in this process: build the model here rather than
        # shipping it from the master, since English() cannot be pickled.
        nlp = English()
    return nlp(s)
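
For context, the workaround might be used from PySpark as follows; texts_rdd (an RDD of raw strings) is an illustrative assumption, and the tokens are projected down to plain strings so that the results themselves stay picklable:

# texts_rdd is a hypothetical RDD of raw strings.
token_lists = texts_rdd.map(lambda s: [t.orth_ for t in lazyloaded_nlp(s)])
print(token_lists.take(5))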

The above works. Nevertheless, I wonder whether it would be possible to make the English() object picklable. If it’s not too difficult on your end, having picklable language models would provide a better out-of-the-box experience for Apache Spark users.
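
One common way to get that behaviour without serialising the full model state is a __reduce__ hook that tells pickle to reconstruct the object by reloading it. The following is a minimal sketch of the idea, not spaCy’s actual implementation, and the class name is invented:

import pickle

class PicklableEnglish(object):
    """Pickles as a recipe for reloading, not as the model's binary state."""
    def __init__(self):
        self._nlp = English()  # expensive load from disk

    def __call__(self, text):
        return self._nlp(text)

    def __reduce__(self):
        # Unpickling calls PicklableEnglish() again, which reloads
        # the model on whichever worker does the loading.
        return (PicklableEnglish, ())

nlp = PicklableEnglish()
restored = pickle.loads(pickle.dumps(nlp))  # reloads rather than deserialises

The obvious caveat is that any state mutated after loading is silently dropped, a point the maintainer raises below.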

Issue Analytics

  • State: closed
  • Created: 8 years ago
  • Reactions: 2
  • Comments: 53 (38 by maintainers)

Top GitHub Comments

1 reaction
mikepb commented, Apr 7, 2016

I’ve had success using the Pickleless workaround in https://github.com/spacy-io/spaCy/issues/125#issuecomment-185881231

The wrapper class essentially tells Python not to pickle the English object. Instead, the Pickleless English object is reloaded and cached per process when defined as a global variable.

Remember that custom vocabulary will not transfer between Python processes using this method.
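
The linked comment’s exact code is not reproduced here, but the pattern it describes can be sketched as follows; the class and cache names are illustrative:

_NLP_CACHE = None  # one model per process, reloaded rather than unpickled

class PicklelessEnglish(object):
    def __getstate__(self):
        return {}  # ship nothing when pickled

    def __setstate__(self, state):
        pass  # nothing to restore; the model is reloaded lazily below

    def __call__(self, text):
        global _NLP_CACHE
        if _NLP_CACHE is None:
            _NLP_CACHE = English()  # first use in this process
        return _NLP_CACHE(text)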

1 reaction
honnibal commented, Oct 6, 2015

I’ve spent a little time looking into this now.

The workflow that’s a little bit tricky to support is something like this:

  • Create an English() instance
  • Change the state of some binary data, e.g. modify the lexicon
  • Send it to workers, with new state preserved

Now, when I say “a little bit tricky”… if this is a requirement, we can do it. It’ll mean writing out all the state to binary data strings, shipping ~1 GB to each worker, and then loading from the strings. The patch will touch every class, and it might be fiddly, especially to keep efficiency nice. But there’s no real problem.
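
For readers hitting this today: later spaCy versions grew exactly this kind of byte-level serialisation. A sketch assuming a modern spaCy install with the to_bytes/from_bytes API (not the API under discussion in 2015):

import spacy

nlp = spacy.load("en_core_web_sm")  # master: load and possibly mutate state
payload = nlp.to_bytes()            # full pipeline state as a byte string

# Worker: construct the same pipeline, then overwrite its state so
# mutations made on the master are preserved.
worker_nlp = spacy.load("en_core_web_sm")
worker_nlp.from_bytes(payload)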

The question is whether this workflow is really important. I would’ve thought that the better way to do things was to divide the documents on the master node, and then send a reference to a function like this:


def do_work(batch_of_texts):
    # Each worker builds its own English() instead of receiving a pickle.
    nlp = English()
    for text in batch_of_texts:
        doc = nlp(text)
        # Stuff

# distribute() is pseudocode for whatever scheduling primitive is in use.
distribute(texts, do_work, n_workers=10)

Does PySpark not work this way?
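
For reference, PySpark’s mapPartitions covers essentially the pattern sketched above: the supplied function runs once per partition, so each worker task can build the model a single time without ever pickling it. A sketch, again assuming texts_rdd is an RDD of raw strings:

def process_partition(texts):
    nlp = English()  # built once per partition, never pickled
    for text in texts:
        doc = nlp(text)
        yield [t.orth_ for t in doc]  # return plain data, not Doc objects

results = texts_rdd.mapPartitions(process_partition)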
