Span is not serializable in abbreviations - figure out a better workaround
See original GitHub issue.

import spacy
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_sci_sm")
# Add the abbreviation pipe to the spacy pipeline.
nlp.add_pipe("abbreviation_detector")

test = ["Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily."]

print("Abbreviation", "\t", "Definition")
for doc in nlp.pipe(test, n_process=4):
    for abrv in doc._.abbreviations:
        print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")
Running that code leads to the traceback below. The error message doesn’t make a lot of sense; it could be because there are more processes than entries. Removing n_process solves the problem.
Abbreviation Definition
Abbreviation Definition
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Python38\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\Python38\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Python38\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\Python38\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Python38\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\alexd\Dropbox (UFL)\UFII_COVID19_RESEARCH_TOPICS\cord19\text_parsing_pipeline\test.py", line 13, in <module>
    for doc in nlp.pipe(test, n_process=4):
  File "C:\Python38\lib\site-packages\spacy\language.py", line 1475, in pipe
    for doc in docs:
  File "C:\Python38\lib\site-packages\spacy\language.py", line 1511, in _multiprocessing_pipe
    proc.start()
  File "C:\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Python38\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Python38\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Python38\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Python38\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
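The fix this RuntimeError itself suggests is to guard the script's entry point: on Windows, the spawn start method re-imports the main module in every worker, so an unguarded nlp.pipe(..., n_process=4) call starts processes recursively. A minimal sketch of the guarded version of the script above (this addresses only the bootstrapping error; the serialization problem reported below remains):

import spacy
from scispacy.abbreviation import AbbreviationDetector

def main():
    nlp = spacy.load("en_core_sci_sm")
    nlp.add_pipe("abbreviation_detector")
    test = ["Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease caused by the expansion of a polyglutamine tract within the androgen receptor (AR). SBMA can be caused by this easily."]
    print("Abbreviation", "\t", "Definition")
    for doc in nlp.pipe(test, n_process=4):
        for abrv in doc._.abbreviations:
            print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

if __name__ == "__main__":
    # The guard keeps spawned workers (which re-import this module on
    # Windows) from re-running the pipeline and starting more processes.
    main()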
This is the error message from my main piece of code with more data. It makes somewhat more sense: I think it has something to do with how the multiprocessing pipe collects the results of the workers. The error pops up after a while, so the pipeline is definitely running before it fails.
Process Process-1:
Traceback (most recent call last):
  File "C:\Python38\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "C:\Python38\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Python38\lib\site-packages\spacy\language.py", line 1995, in _apply_pipes
    sender.send([doc.to_bytes() for doc in docs])
  File "C:\Python38\lib\site-packages\spacy\language.py", line 1995, in <listcomp>
    sender.send([doc.to_bytes() for doc in docs])
  File "spacy\tokens\doc.pyx", line 1237, in spacy.tokens.doc.Doc.to_bytes
  File "spacy\tokens\doc.pyx", line 1296, in spacy.tokens.doc.Doc.to_dict
  File "C:\Python38\lib\site-packages\spacy\util.py", line 1134, in to_dict
    serialized[key] = getter()
  File "spacy\tokens\doc.pyx", line 1293, in spacy.tokens.doc.Doc.to_dict.lambda18
  File "C:\Python38\lib\site-packages\srsly\_msgpack_api.py", line 14, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "C:\Python38\lib\site-packages\srsly\msgpack\__init__.py", line 55, in packb
    return Packer(**kwargs).pack(o)
  File "srsly\msgpack\_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "srsly\msgpack\_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "srsly\msgpack\_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "srsly\msgpack\_packer.pyx", line 264, in srsly.msgpack._packer.Packer._pack
  File "srsly\msgpack\_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'spacy.tokens.span.Span' object
Running spaCy 3.0 (the latest version) on Windows 10.
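For context, the second traceback points at the root cause: with n_process > 1, spaCy calls doc.to_bytes() in each worker to send results back, and srsly's msgpack packer cannot serialize the Span objects that the abbreviation detector stores in doc._.abbreviations. If that is right, the failure should be reproducible in a single process, without multiprocessing at all (a sketch, assuming the nlp pipeline from the first snippet):

# Single-process reproduction: Doc.to_bytes() msgpack-encodes user data,
# and a Span held in the custom doc._.abbreviations extension cannot be packed.
doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is inherited.")
doc.to_bytes()  # TypeError: can not serialize 'spacy.tokens.span.Span' object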
Top GitHub Comments
The other GitHub issue I linked to shows how you can convert the Span objects to serializable JSON (https://github.com/allenai/scispacy/issues/205#issuecomment-597273144). You would simply add this function as a final pipe in your scispacy pipeline. This would mean that your pipeline produces serializable documents, which should work fine with multiprocessing. See #368.
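The exact helper is in the linked comment; as a rough sketch of the idea (the component and extension names here are illustrative, not part of scispacy), a final pipe can copy each Span into plain Python types and clear the unserializable extension before the Doc is sent back from the worker:

from spacy.language import Language
from spacy.tokens import Doc

# Illustrative extension to hold plain data; the real helper lives in the
# linked comment on issue #205.
if not Doc.has_extension("abbreviations_json"):
    Doc.set_extension("abbreviations_json", default=[])

@Language.component("serialize_abbreviations")
def serialize_abbreviations(doc):
    # Copy each abbreviation Span into msgpack-friendly built-in types.
    doc._.abbreviations_json = [
        {
            "short_form": str(abrv),
            "start": abrv.start,
            "end": abrv.end,
            "long_form": str(abrv._.long_form),
        }
        for abrv in doc._.abbreviations
    ]
    # Drop the Span objects so Doc.to_bytes() can serialize the doc.
    doc._.abbreviations = []
    return doc

nlp.add_pipe("serialize_abbreviations", last=True)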