
Span is not serializable in abbreviations - figure out a better workaround

See original GitHub issue
import spacy

from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_sci_sm")

# Add the abbreviation pipe to the spacy pipeline.
nlp.add_pipe("abbreviation_detector")

test = ["Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily."]

print("Abbreviation", "\t", "Definition")
for doc in nlp.pipe(test, n_process=4):
    for abrv in doc._.abbreviations:
        print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

Running that code leads to the following. The error message doesn’t make a lot of sense; it could be because there are more processes than entries. Removing n_process makes the problem go away.

Abbreviation     Definition
Abbreviation     Definition
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Python38\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\Python38\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Python38\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\Python38\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Python38\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\alexd\Dropbox (UFL)\UFII_COVID19_RESEARCH_TOPICS\cord19\text_parsing_pipeline\test.py", line 13, in <module>
    for doc in nlp.pipe(test, n_process=4):
  File "C:\Python38\lib\site-packages\spacy\language.py", line 1475, in pipe
    for doc in docs:
  File "C:\Python38\lib\site-packages\spacy\language.py", line 1511, in _multiprocessing_pipe
    proc.start()
  File "C:\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Python38\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Python38\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Python38\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Python38\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
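
This first error is the standard Windows multiprocessing pitfall rather than anything scispacy-specific: Windows starts workers with spawn, which re-imports the main module, so the pipeline call has to be guarded exactly as the RuntimeError suggests. A minimal sketch of the same repro with that guard added (same model and pipe names as above):

import spacy

from scispacy.abbreviation import AbbreviationDetector

def main():
    nlp = spacy.load("en_core_sci_sm")

    # Add the abbreviation pipe to the spacy pipeline.
    nlp.add_pipe("abbreviation_detector")

    test = ["Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily."]

    print("Abbreviation", "\t", "Definition")
    for doc in nlp.pipe(test, n_process=4):
        for abrv in doc._.abbreviations:
            print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

# On Windows, each worker process re-imports this module, so the pipeline
# must only run when the file is executed directly, not on import.
if __name__ == "__main__":
    main()

Note that the guard only fixes the bootstrapping error; the Span serialization error below still occurs once the workers try to send their results back.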

This is the error message from my main piece of code with more data. It makes somewhat more sense: I think it has something to do with how the multiprocessing pipe collects the results from the workers. The error pops up after a while, so the pipeline is definitely running for a time.

Process Process-1:
Traceback (most recent call last):
  File "C:\Python38\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "C:\Python38\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Python38\lib\site-packages\spacy\language.py", line 1995, in _apply_pipes
    sender.send([doc.to_bytes() for doc in docs])
  File "C:\Python38\lib\site-packages\spacy\language.py", line 1995, in <listcomp>
    sender.send([doc.to_bytes() for doc in docs])
  File "spacy\tokens\doc.pyx", line 1237, in spacy.tokens.doc.Doc.to_bytes
  File "spacy\tokens\doc.pyx", line 1296, in spacy.tokens.doc.Doc.to_dict
  File "C:\Python38\lib\site-packages\spacy\util.py", line 1134, in to_dict
    serialized[key] = getter()
  File "spacy\tokens\doc.pyx", line 1293, in spacy.tokens.doc.Doc.to_dict.lambda18
  File "C:\Python38\lib\site-packages\srsly\_msgpack_api.py", line 14, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "C:\Python38\lib\site-packages\srsly\msgpack\__init__.py", line 55, in packb
    return Packer(**kwargs).pack(o)
  File "srsly\msgpack\_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "srsly\msgpack\_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "srsly\msgpack\_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "srsly\msgpack\_packer.pyx", line 264, in srsly.msgpack._packer.Packer._pack
  File "srsly\msgpack\_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'spacy.tokens.span.Span' object
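
The traceback shows where it breaks: each worker calls doc.to_bytes() to send its results back over a pipe, which msgpack-serializes the Doc's user data, and scispacy stores its abbreviations there as Span objects, which msgpack cannot pack. A minimal sketch that reproduces just the serialization failure, with no multiprocessing involved (assuming the same model and pipeline as above):

import spacy
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("abbreviation_detector")

doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an inherited disease.")

# doc._.abbreviations is a list of Span objects stored in doc.user_data;
# to_bytes() hands user_data to msgpack, which raises the same TypeError.
doc.to_bytes()  # TypeError: can not serialize 'spacy.tokens.span.Span' object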

Running spaCy 3.0 (the latest version) on Windows 10.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10

Top GitHub Comments

1 reaction
dakinggg commented, Mar 17, 2021

The other GitHub issue I linked to shows how you can convert the Span objects to serializable JSON (https://github.com/allenai/scispacy/issues/205#issuecomment-597273144). You would simply add this function as a final pipe in your scispacy pipeline. This would mean that your pipeline produces serializable documents, which should work fine with multiprocessing.
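
A hedged sketch of that approach, assuming spaCy 3.x: a custom component added as the final pipe that replaces each abbreviation Span with a plain dict, so the Doc's user data becomes msgpack-serializable and n_process works. The component name "serializable_abbreviations" and the dict keys are illustrative, not taken from the linked issue:

import spacy
from spacy.language import Language
from spacy.tokens import Doc
from scispacy.abbreviation import AbbreviationDetector

@Language.component("serializable_abbreviations")
def serializable_abbreviations(doc: Doc) -> Doc:
    # Replace Span objects with plain dicts of their text and token offsets.
    doc._.abbreviations = [
        {
            "short_text": abrv.text,
            "short_start": abrv.start,
            "short_end": abrv.end,
            "long_text": abrv._.long_form.text,
            "long_start": abrv._.long_form.start,
            "long_end": abrv._.long_form.end,
        }
        for abrv in doc._.abbreviations
    ]
    return doc

if __name__ == "__main__":
    nlp = spacy.load("en_core_sci_sm")
    nlp.add_pipe("abbreviation_detector")
    nlp.add_pipe("serializable_abbreviations", last=True)

    test = ["Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease."]
    for doc in nlp.pipe(test, n_process=4):
        for abrv in doc._.abbreviations:
            print(abrv["short_text"], "\t", abrv["long_text"])

Because the component is registered at module import time, the spawned worker processes pick it up when they re-import the main module, and the returned Docs now serialize cleanly.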

0 reactions
dakinggg commented, Jul 15, 2021

See #368

