Slow loading of the pipe `scispacy_linker`

Hi, loading a UMLS linker is particularly slow (~20-30s), which is a real problem when testing code. I have included the profiler output below. Is there anything we can do to speed up the loading of the linker?

Profiler output

   Ordered by: internal time
   List reduced from 951 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   19.741   19.741   53.338   53.338 /Users/-/Library/Caches/pypoetry/virtualenvs/fz-openqa-rEqQaPFC-py3.8/lib/python3.8/site-packages/scispacy/linking_utils.py:55(__init__)
        1   18.422   18.422   25.783   25.783 /Users/-/Library/Caches/pypoetry/virtualenvs/fz-openqa-rEqQaPFC-py3.8/lib/python3.8/site-packages/scispacy/candidate_generation.py:116(load_approximate_nearest_neighbours_index)
  3359672   16.912    0.000   16.912    0.000 /Users/-/anaconda3/lib/python3.8/json/decoder.py:343(raw_decode)
  3359672    3.847    0.000   24.272    0.000 /Users/-/anaconda3/lib/python3.8/json/decoder.py:332(decode)
     4023    3.202    0.001    3.202    0.001 {method 'decompress' of 'zlib.Decompress' objects}
  3359672    2.840    0.000   30.230    0.000 /Users/-/Library/Caches/pypoetry/virtualenvs/fz-openqa-rEqQaPFC-py3.8/lib/python3.8/site-packages/scispacy/linking_utils.py:65(<genexpr>)
  3359672    2.818    0.000   28.086    0.000 /Users/-/anaconda3/lib/python3.8/json/__init__.py:299(loads)
  6719602    2.603    0.000    2.603    0.000 {method 'match' of 're.Pattern' objects}
        6    2.251    0.375    2.251    0.375 {method 'do_handshake' of '_ssl._SSLSocket' objects}
        6    1.206    0.201    1.206    0.201 {method 'read' of '_ssl._SSLSocket' objects}
        6    1.122    0.187    1.122    0.187 {method 'connect' of '_socket.socket' objects}
  9300568    1.002    0.000    1.002    0.000 {method 'add' of 'set' objects}
  3359671    0.867    0.000    1.565    0.000 <string>:1(__new__)
     4033    0.763    0.000    0.763    0.000 {built-in method zlib.crc32}
  3360030    0.704    0.000    0.704    0.000 {built-in method __new__ of type object at 0x10c379808}
  3359928    0.679    0.000    0.679    0.000 {method 'startswith' of 'str' objects}
  6719344    0.581    0.000    0.581    0.000 {method 'end' of 're.Match' objects}
        2    0.525    0.262    0.525    0.262 {method 'astype' of 'numpy.ndarray' objects}
        5    0.474    0.095    4.703    0.941 /Users/-/Library/Caches/pypoetry/virtualenvs/fz-openqa-rEqQaPFC-py3.8/lib/python3.8/site-packages/numpy/lib/format.py:699(read_array)
        2    0.369    0.184    0.369    0.184 {method 'copy' of 'numpy.ndarray' objects}

Code to reproduce the above results

import cProfile
import pstats
from time import time

import spacy
from scispacy.abbreviation import AbbreviationDetector  # type: ignore
from scispacy.linking import EntityLinker  # type: ignore


def load_spacy_model(model_name: str):
   """Load a ScispaCy model"""
    model = spacy.load(
        model_name,
        disable=[
            "tok2vec",
            "tagger",
            "parser",
            "attribute_ruler",
            "lemmatizer",
        ],
    )

    return model


def add_scispacy_linker(model):
    """add the entity linker (slow loading)"""
    model.add_pipe(
        "scispacy_linker",
        config={"linker_name": "umls"},
    )
    return model

# load the model (ok loading time)
model = load_spacy_model(model_name="en_core_sci_sm")

# load the linker (slow loading time) + profiling
profiler = cProfile.Profile()
profiler.enable()
t0 = time()
model = add_scispacy_linker(model)
duration = time() - t0
profiler.disable()
stats = pstats.Stats(profiler).sort_stats("time")
stats.print_stats(20)
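
Since the slow load mainly hurts when running tests, one possible workaround (independent of any change inside scispaCy) is to build the pipeline once per process and reuse it. The sketch below uses `functools.lru_cache` for that. `build_pipeline` and `get_pipeline` are hypothetical names, and `build_pipeline` is a cheap stand-in so the example runs without the UMLS data; in a real suite it would wrap the `spacy.load` and `add_pipe` calls from the reproduction script.

```python
from functools import lru_cache


def build_pipeline(model_name: str, linker_name: str):
    """Hypothetical stand-in for the expensive loading step.

    In a real test suite this would call spacy.load(model_name) and
    add_pipe("scispacy_linker", config={"linker_name": linker_name}).
    """
    return {"model": model_name, "linker": linker_name}


@lru_cache(maxsize=None)
def get_pipeline(model_name: str = "en_core_sci_sm", linker_name: str = "umls"):
    """Build the pipeline on the first call, then return the same object."""
    return build_pipeline(model_name, linker_name)


first = get_pipeline()
second = get_pipeline()
print(first is second)  # the second call is served from the cache
```

This only amortizes the cost across tests in one process; the first call still pays the full loading time.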

Top GitHub Comments

2 reactions

vlievin commented, Oct 27, 2021

Hi @dakinggg, the files are indeed cached, so the time is spent purely on loading the UMLS index. @MichalMalyska, yes, this is approximately what I get (profiling output in the opening post).

The profiler shows that most of the time is spent decoding JSON objects:

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
3359672   16.912    0.000   16.912    0.000 .../python3.8/json/decoder.py:343(raw_decode)
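
To isolate rows like this one from a long profile, `pstats` accepts a regular expression as a restriction in `print_stats`. The sketch below is a minimal illustration on a synthetic workload (the records are made up and much smaller than the UMLS data); it only shows how to filter the report down to entries whose path or function name contains `json`.

```python
import cProfile
import io
import json
import pstats

# Synthetic workload: decode many small JSON lines, one json.loads call per
# line, similar in shape to the per-line decoding visible in the profile.
lines = [
    json.dumps({"concept_id": f"C{i:07d}", "aliases": ["a", "b"]})
    for i in range(20_000)
]

profiler = cProfile.Profile()
profiler.enable()
decoded = [json.loads(line) for line in lines]
profiler.disable()

# Keep only entries whose "filename:lineno(function)" string matches "json",
# sorted by internal time, and print at most 10 of them.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("tottime")
stats.print_stats("json", 10)
report = buffer.getvalue()
print(report)
```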

I am wondering if there is a more efficient way to store, load and query the data. Furthermore, the current solution is very memory intensive (RAM usage peaks at about 8 GB when running the above example).

Two ideas for improvement are:

- pyarrow to store the alias list
- faiss to improve upon the current nearest neighbour search (at least in terms of speed)?
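
Neither pyarrow nor faiss is shown here. As a rough, stdlib-only illustration of the principle behind the first idea (keep the data in a format that avoids per-line JSON decoding on every load), the sketch below parses a JSON-lines file once and caches the result with `pickle`. The function names, file names, and record layout are all hypothetical and are not scispaCy's; whether this is actually faster for the real knowledge base would need to be measured, and pickle files should only be loaded from trusted sources.

```python
import json
import pickle
import tempfile
from pathlib import Path


def load_jsonl(path: Path) -> list:
    """Decode one JSON object per line (the slow path in the profile)."""
    with path.open("r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


def load_with_cache(jsonl_path: Path, cache_path: Path) -> list:
    """Load from the pickle cache when it is fresh, else rebuild it."""
    if cache_path.exists() and cache_path.stat().st_mtime >= jsonl_path.stat().st_mtime:
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    records = load_jsonl(jsonl_path)
    with cache_path.open("wb") as handle:
        pickle.dump(records, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return records


# Tiny made-up knowledge-base file, only to exercise both code paths.
workdir = Path(tempfile.mkdtemp())
source = workdir / "kb.jsonl"
source.write_text(
    "\n".join(
        json.dumps({"concept_id": f"C{i:07d}", "aliases": [f"alias {i}"]})
        for i in range(1000)
    ),
    encoding="utf-8",
)
cache = workdir / "kb.pkl"

cold = load_with_cache(source, cache)  # parses the JSON lines, writes the cache
warm = load_with_cache(source, cache)  # reads the pickle instead
print(cold == warm, cache.exists())
```

The cache is rebuilt whenever the source file is newer than the pickle, so a refreshed download is not silently ignored.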

These are only suggestions, as I don’t know enough about the inner workings of scispaCy. For my project this issue is not critical, but it might be a nice improvement for the library.

1 reaction

vlievin commented, Oct 29, 2021

Hi @MichalMalyska, I am only getting started with faiss, so unfortunately I don’t know about aliasing in faiss. But if I get a definite answer in the near future, I’ll let you know here.

I am not an expert with nmslib either, so please take these suggestions for what they are: ideas, not recommendations.
