Slow loading of the pipe `scispacy_linker`
Hi, loading a UMLS linker is particularly slow (~20-30s). It is a real issue when testing the code. I reported the profiler output below. Is there anything we can do to speed up the loading of the linker?
Profiler output
Ordered by: internal time
List reduced from 951 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
1 19.741 19.741 53.338 53.338 /Users/-/Library/Caches/pypoetry/virtualenvs/fz-openqa-rEqQaPFC-py3.8/lib/python3.8/site-packages/scispacy/linking_utils.py:55(__init__)
1 18.422 18.422 25.783 25.783 /Users/-/Library/Caches/pypoetry/virtualenvs/fz-openqa-rEqQaPFC-py3.8/lib/python3.8/site-packages/scispacy/candidate_generation.py:116(load_approximate_nearest_neighbours_index)
3359672 16.912 0.000 16.912 0.000 /Users/-/anaconda3/lib/python3.8/json/decoder.py:343(raw_decode)
3359672 3.847 0.000 24.272 0.000 /Users/-/anaconda3/lib/python3.8/json/decoder.py:332(decode)
4023 3.202 0.001 3.202 0.001 {method 'decompress' of 'zlib.Decompress' objects}
3359672 2.840 0.000 30.230 0.000 /Users/-/Library/Caches/pypoetry/virtualenvs/fz-openqa-rEqQaPFC-py3.8/lib/python3.8/site-packages/scispacy/linking_utils.py:65(<genexpr>)
3359672 2.818 0.000 28.086 0.000 /Users/-/anaconda3/lib/python3.8/json/__init__.py:299(loads)
6719602 2.603 0.000 2.603 0.000 {method 'match' of 're.Pattern' objects}
6 2.251 0.375 2.251 0.375 {method 'do_handshake' of '_ssl._SSLSocket' objects}
6 1.206 0.201 1.206 0.201 {method 'read' of '_ssl._SSLSocket' objects}
6 1.122 0.187 1.122 0.187 {method 'connect' of '_socket.socket' objects}
9300568 1.002 0.000 1.002 0.000 {method 'add' of 'set' objects}
3359671 0.867 0.000 1.565 0.000 <string>:1(__new__)
4033 0.763 0.000 0.763 0.000 {built-in method zlib.crc32}
3360030 0.704 0.000 0.704 0.000 {built-in method __new__ of type object at 0x10c379808}
3359928 0.679 0.000 0.679 0.000 {method 'startswith' of 'str' objects}
6719344 0.581 0.000 0.581 0.000 {method 'end' of 're.Match' objects}
2 0.525 0.262 0.525 0.262 {method 'astype' of 'numpy.ndarray' objects}
5 0.474 0.095 4.703 0.941 /Users/-/Library/Caches/pypoetry/virtualenvs/fz-openqa-rEqQaPFC-py3.8/lib/python3.8/site-packages/numpy/lib/format.py:699(read_array)
2 0.369 0.184 0.369 0.184 {method 'copy' of 'numpy.ndarray' objects}
Code to reproduce the above results
import cProfile
import pstats
from time import time
import spacy
from scispacy.abbreviation import AbbreviationDetector # type: ignore
from scispacy.linking import EntityLinker # type: ignore
def load_spacy_model(model_name: str):
    """Load a ScispaCy model"""
    model = spacy.load(
        model_name,
        disable=[
            "tok2vec",
            "tagger",
            "parser",
            "attribute_ruler",
            "lemmatizer",
        ],
    )
    return model
def add_scispacy_linker(model):
    """add the entity linker (slow loading)"""
    model.add_pipe(
        "scispacy_linker",
        config={"linker_name": "umls"},
    )
    return model
# load the model (ok loading time)
model = load_spacy_model(model_name="en_core_sci_sm")
# load the linker (slow loading time) + profiling
profiler = cProfile.Profile()
profiler.enable()
t0 = time()
model = add_scispacy_linker(model)
duration = time() - t0
profiler.disable()
stats = pstats.Stats(profiler).sort_stats("time")
stats.print_stats(20)
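For the testing use case mentioned above, one possible workaround (not a fix for the load time itself) is to build the pipeline once per process and reuse it. The sketch below uses `functools.lru_cache` for that. `slow_loader` and `get_pipeline` are hypothetical names introduced for illustration, and `slow_loader` only simulates the expensive step so the example runs without scispaCy installed.

```python
import time
from functools import lru_cache


def slow_loader(linker_name):
    """Hypothetical stand-in for the expensive pipeline construction."""
    time.sleep(0.05)  # simulates the slow load; the real one takes far longer
    return {"linker_name": linker_name}


@lru_cache(maxsize=None)
def get_pipeline(linker_name="umls"):
    """Build the pipeline on the first call; later calls return the same object."""
    return slow_loader(linker_name)


start = time.perf_counter()
first = get_pipeline("umls")   # pays the loading cost
first_call = time.perf_counter() - start

start = time.perf_counter()
second = get_pipeline("umls")  # served from the in-process cache
second_call = time.perf_counter() - start

print(f"first call: {first_call:.4f}s, second call: {second_call:.6f}s")
```

In real code, the body of `slow_loader` would be replaced by the `load_spacy_model` and `add_scispacy_linker` calls from the reproduction script. This only avoids repeated loads within one process; a fresh test run still pays the full cost once. With pytest, a session-scoped fixture achieves the same effect.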
Issue Analytics
- State:
- Created 2 years ago
- Comments:8 (4 by maintainers)
Top GitHub Comments
Hi @dakinggg, files are effectively cached, so it is simply about loading the UMLS index. @MichalMalyska, yes, this is approximately what I get (profiling output in the opening post).
The profiler shows that most of the time is spent decoding `json` objects.
I am wondering if there is a more efficient way to store, load and query the data. Furthermore, the current solution is very memory intensive (RAM usage spikes at 8GB when running the above example).
Two ideas for improvement are:
- `pyarrow` to store the alias list
- `faiss` to improve upon the current nearest neighbour search (at least in terms of speed)?
Those are only suggestions, as I don't know enough about the inner workings of `scispacy`. Regarding my project, this issue is not critical, but that might be a nice improvement for the library.
Hi @MichalMalyska, I am only getting started with faiss, so unfortunately I don't know about aliasing in faiss. But if I get a definite answer in the near future, I'll let you know here.
I am not an expert with nmslib either. So please take these suggestions for what they are: ideas and not recommendations.
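To make the storage idea from this thread concrete, here is a minimal sketch of caching already-decoded records in a binary file so that later loads skip JSON decoding. It uses the standard-library `pickle` rather than `pyarrow` to stay dependency-free, and it is not how scispaCy works today. It assumes the knowledge base is stored as one JSON object per line, which the ~3.36 million `json.loads` calls in the profile suggest but do not prove. The helper names (`load_jsonl`, `load_with_cache`) and the record fields are invented for the example.

```python
import json
import pickle
import tempfile
from pathlib import Path


def load_jsonl(path):
    """Decode one JSON object per line (the kind of work that dominates the profile)."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


def load_with_cache(jsonl_path, cache_path):
    """Return the parsed records, reusing a pickle cache when it is up to date."""
    jsonl_path, cache_path = Path(jsonl_path), Path(cache_path)
    cache_is_fresh = (
        cache_path.exists()
        and cache_path.stat().st_mtime >= jsonl_path.stat().st_mtime
    )
    if cache_is_fresh:
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    records = load_jsonl(jsonl_path)
    with open(cache_path, "wb") as handle:
        pickle.dump(records, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return records


# Synthetic stand-in data; the field names are made up for this example.
workdir = Path(tempfile.mkdtemp())
source = workdir / "kb.jsonl"
cache = workdir / "kb.pkl"
with open(source, "w", encoding="utf-8") as handle:
    for i in range(1000):
        record = {"concept_id": f"C{i:07d}", "aliases": [f"alias {i}"]}
        handle.write(json.dumps(record) + "\n")

first = load_with_cache(source, cache)   # decodes the JSON and writes the cache
second = load_with_cache(source, cache)  # reads the pickle instead
print(len(first), first == second)
```

Whether this actually helps for the UMLS data would need to be measured. Pickle files should only be loaded from trusted locations, and this approach does not address the memory usage mentioned above.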