Span is not serializable in abbreviations - figure out a better workaround
See original GitHub issue.

import spacy
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_sci_sm")
# Add the abbreviation pipe to the spacy pipeline.
nlp.add_pipe("abbreviation_detector")

test = ["Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily."]

print("Abbreviation", "\t", "Definition")
for doc in nlp.pipe(test, n_process=4):
    for abrv in doc._.abbreviations:
        print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")
Running that code leads to the traceback below. The error message doesn’t make a lot of sense; it could be because there are more processes than entries. Removing n_process solves the problem.
Abbreviation Definition
Abbreviation Definition
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Python38\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\Python38\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Python38\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\Python38\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Python38\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\alexd\Dropbox (UFL)\UFII_COVID19_RESEARCH_TOPICS\cord19\text_parsing_pipeline\test.py", line 13, in <module>
    for doc in nlp.pipe(test, n_process=4):
  File "C:\Python38\lib\site-packages\spacy\language.py", line 1475, in pipe
    for doc in docs:
  File "C:\Python38\lib\site-packages\spacy\language.py", line 1511, in _multiprocessing_pipe
    proc.start()
  File "C:\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Python38\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Python38\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Python38\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Python38\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
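The fix this RuntimeError itself suggests is to guard the script's entry point: on Windows, the spawn start method re-imports the main module in every worker, so an unguarded nlp.pipe(..., n_process=4) call starts processes recursively. A minimal sketch of the guarded version of the script above (this addresses only the bootstrapping error; the serialization problem reported below remains):

import spacy
from scispacy.abbreviation import AbbreviationDetector

def main():
    nlp = spacy.load("en_core_sci_sm")
    nlp.add_pipe("abbreviation_detector")
    test = ["Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily."]
    print("Abbreviation", "\t", "Definition")
    for doc in nlp.pipe(test, n_process=4):
        for abrv in doc._.abbreviations:
            print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

if __name__ == "__main__":
    # The guard keeps spawned workers (which re-import this module on
    # Windows) from re-running the pipeline and starting more processes.
    main()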
This is the error message from my main piece of code with more data. It makes somewhat more sense: I think it has something to do with how the multiprocessing pipe collects the results of the workers. The error pops up after a while, so the pipeline is definitely running before it fails.
Process Process-1:
Traceback (most recent call last):
  File "C:\Python38\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "C:\Python38\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Python38\lib\site-packages\spacy\language.py", line 1995, in _apply_pipes
    sender.send([doc.to_bytes() for doc in docs])
  File "C:\Python38\lib\site-packages\spacy\language.py", line 1995, in <listcomp>
    sender.send([doc.to_bytes() for doc in docs])
  File "spacy\tokens\doc.pyx", line 1237, in spacy.tokens.doc.Doc.to_bytes
  File "spacy\tokens\doc.pyx", line 1296, in spacy.tokens.doc.Doc.to_dict
  File "C:\Python38\lib\site-packages\spacy\util.py", line 1134, in to_dict
    serialized[key] = getter()
  File "spacy\tokens\doc.pyx", line 1293, in spacy.tokens.doc.Doc.to_dict.lambda18
  File "C:\Python38\lib\site-packages\srsly\_msgpack_api.py", line 14, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "C:\Python38\lib\site-packages\srsly\msgpack\__init__.py", line 55, in packb
    return Packer(**kwargs).pack(o)
  File "srsly\msgpack\_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "srsly\msgpack\_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "srsly\msgpack\_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "srsly\msgpack\_packer.pyx", line 264, in srsly.msgpack._packer.Packer._pack
  File "srsly\msgpack\_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'spacy.tokens.span.Span' object
Running spaCy 3.0 (the latest version) on Windows 10.
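For context, the second traceback points at the root cause: with n_process > 1, spaCy calls doc.to_bytes() in each worker to send results back, and srsly's msgpack packer cannot serialize the Span objects that the abbreviation detector stores in doc._.abbreviations. If that is right, the failure should be reproducible in a single process, without multiprocessing at all (a sketch, assuming the nlp pipeline from the first snippet):

# Single-process reproduction: Doc.to_bytes() msgpack-encodes user data,
# and a Span held in the custom doc._.abbreviations extension cannot be packed.
doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is inherited.")
doc.to_bytes()  # TypeError: can not serialize 'spacy.tokens.span.Span' object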
Top GitHub Comments
The other GitHub issue I linked to shows how you can convert the Span objects to serializable JSON (https://github.com/allenai/scispacy/issues/205#issuecomment-597273144). You would simply add this function as a final pipe in your scispacy pipeline. This would mean that your pipeline produces serializable documents, which should work fine with multiprocessing. See #368.
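The exact helper is in the linked comment; as a rough sketch of the idea (the component and extension names here are illustrative, not part of scispacy), a final pipe can copy each Span into plain Python types and clear the unserializable extension before the Doc is sent back from the worker:

from spacy.language import Language
from spacy.tokens import Doc

# Illustrative extension to hold plain data; the real helper lives in the
# linked comment on issue #205.
if not Doc.has_extension("abbreviations_json"):
    Doc.set_extension("abbreviations_json", default=[])

@Language.component("serialize_abbreviations")
def serialize_abbreviations(doc):
    # Copy each abbreviation Span into msgpack-friendly built-in types.
    doc._.abbreviations_json = [
        {
            "short_form": str(abrv),
            "start": abrv.start,
            "end": abrv.end,
            "long_form": str(abrv._.long_form),
        }
        for abrv in doc._.abbreviations
    ]
    # Drop the Span objects so Doc.to_bytes() can serialize the doc.
    doc._.abbreviations = []
    return doc

nlp.add_pipe("serialize_abbreviations", last=True)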