
DependencyMatcher has exponential time complexity

See original GitHub issue

TL;DR

Unfortunately, the current implementation of DependencyMatcher can take a very long time to match large documents ^^

How slow?

This is best illustrated with an example. Consider this dependency pattern:

[[
    {"RIGHT_ID": "is", "RIGHT_ATTRS": {"LEMMA": "be"}},
    {
        "LEFT_ID": "is",
        "REL_OP": ">",
        "RIGHT_ID": "subj",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
    {
        "LEFT_ID": "is",
        "REL_OP": ">",
        "RIGHT_ID": "adj",
        "RIGHT_ATTRS": {"POS": "ADJ"},
    },
]]

Now let’s match from 1 to 100 repetitions of the sentence "The dress is beautiful. " and observe how the match time increases with the size of the document. For comparison, let’s also benchmark the time taken to parse the document, the time taken by a simple Matcher, and the time taken by a simple Matcher with an on_match callback that inspects the dependency tree.

Here is the full benchmark script:

import spacy
from spacy.matcher import DependencyMatcher, Matcher
from time import time


nlp = spacy.load("en_core_web_sm")
text = "The dress is beautiful. "

dependency_matcher = DependencyMatcher(nlp.vocab)
dependency_matcher.add("test", [[
    {"RIGHT_ID": "is", "RIGHT_ATTRS": {"LEMMA": "be"}},
    {
        "LEFT_ID": "is",
        "REL_OP": ">",
        "RIGHT_ID": "subj",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
    {
        "LEFT_ID": "is",
        "REL_OP": ">",
        "RIGHT_ID": "adj",
        "RIGHT_ATTRS": {"POS": "ADJ"},
    },
]])

matcher = Matcher(nlp.vocab)
matcher.add("test", [[
    {"DEP": "nsubj"},
    {"LEMMA": "be"},
    {"POS": "ADJ"},
]])

def callback(matcher, doc, i, matches):
    _, start, _ = matches[i]
    # We are looking for a single token
    match = doc[start]
    subjs = []
    adjs = []
    for child in match.children:
        if child.dep_ == "nsubj":
            subjs.append(child)
        elif child.pos_ == "ADJ":
            adjs.append(child)

matcher_with_callback = Matcher(nlp.vocab)
matcher_with_callback.add("test", [[
    {"LEMMA": "be"},
]], on_match=callback)


def test(n):
    input_text = text*n

    # Benchmark pipeline
    start = time()
    doc = nlp(input_text)
    end = time()

    parse_time = end - start

    # Benchmark a simple matcher
    start = time()
    nb_matches_matcher = len(matcher(doc))
    end = time()

    matcher_time = end - start

    # Benchmark a matcher with on_match callback
    start = time()
    nb_matches_matcher_with_callback = len(matcher_with_callback(doc))
    end = time()

    matcher_with_callback_time = end - start

    # Benchmark dependency matcher
    start = time()
    nb_matches_dependency_matcher = len(dependency_matcher(doc))
    end = time()

    dependency_matcher_time = end - start

    print(
        f"{n}, {parse_time}, {nb_matches_matcher}, {matcher_time}, "
        f"{nb_matches_matcher_with_callback}, {matcher_with_callback_time}, "
        f"{nb_matches_dependency_matcher}, {dependency_matcher_time}"
    )

for n in range(1,100):
    test(n)

Let’s also enable cProfile to get some additional insights:

$ python -m cProfile -s cumtime benchmark.py


Here is a plot of the result:

[image: match time vs. number of repetitions, for each method]

And here is the same plot in log scale:

[image: same plot, log scale]

The dependency matcher (in yellow) takes almost 14 seconds to match 100 repetitions of the string “The dress is beautiful.” That is far too slow to process large amounts of data or to use DependencyMatcher in real-time applications. What’s worse, processing time grows exponentially with document size, which makes DependencyMatcher usable only for small documents.
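One quick way to sanity-check the growth order is to fit the slope of log(time) against log(n): a slope near 1 means linear growth, near 2 quadratic, near 3 cubic. The sketch below uses hypothetical timings (the real numbers come from the benchmark output above):

```python
import math

def growth_exponent(samples):
    """Least-squares slope of log(time) vs. log(n) over (n, time) pairs.

    A slope near 1 suggests linear growth, near 2 quadratic, near 3 cubic.
    """
    xs = [math.log(n) for n, _ in samples]
    ys = [math.log(t) for _, t in samples]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical timings that follow t = c * n**3 exactly:
samples = [(n, 1e-6 * n ** 3) for n in (10, 20, 40, 80)]
print(round(growth_exponent(samples), 2))  # 3.0
```

Feeding in the measured (n, dependency_matcher_time) pairs from the benchmark would tell you the effective exponent directly.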

Why so slow?

Here are the top entries of the cProfile report:

   126702727 function calls (101821149 primitive calls) in 331.038 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    907/1    0.004    0.000  331.045  331.045 {built-in method builtins.exec}
        1    0.000    0.000  331.045  331.045 test.py:1(<module>)
       99    0.025    0.000  330.145    3.335 test.py:52(test)
       99    0.021    0.000  327.809    3.311 dependencymatcher.pyx:242(__call__)
24835899/99  106.136    0.000  327.747    3.311 dependencymatcher.pyx:300(recurse)
 49338927    7.338    0.000  221.613    0.000 _asarray.py:14(asarray)
 49343983  214.279    0.000  214.279    0.000 {built-in method numpy.array}
       99    0.388    0.004    2.268    0.023 language.py:952(__call__)
      396    0.003    0.000    1.369    0.003 model.py:308(predict)

The benchmark script took 331 seconds to run. About 327 seconds (99% of the time) were spent inside DependencyMatcher.recurse, 221 seconds of which were used just to build numpy arrays.

The reason why DependencyMatcher.recurse is so slow lies in these lines:

https://github.com/explosion/spaCy/blob/87562e470d0a38d1919c60941afbda5765f97ef7/spacy/matcher/dependencymatcher.pyx#L262-L264

In practice, recurse is called with every possible combination of matched nodes, regardless of their location within the document. For example, consider two repetitions of the above sentence:

“The dress is beautiful. The dress is beautiful.”

spaCy will correctly try to match the first “is” with the subject and adjective from the first sentence; however, it will also try to match the first “is” with the subject and adjective from the second sentence, and likewise for the “is” in the second sentence and the tokens of the first. The same applies to every other combination. Multiply this over 100 repetitions of the sentence and you get 24835899 calls to recurse (which explains the 14-second runtime) 😛
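A simplified back-of-the-envelope model makes the blow-up concrete: with n repetitions of the sentence, each of the three pattern nodes has n candidate tokens, so crossing all of them gives roughly n³ combinations. Summed over the benchmark’s loop this lands remarkably close to the observed call count (note this is a rough model of the recursion, not its exact cost):

```python
# Model: with n sentence repetitions, each of the 3 pattern nodes
# ("is", "subj", "adj") has n candidate tokens, and the recursion
# crosses all of them document-wide: ~n**3 combinations.
def modelled_combinations(n_repetitions, n_pattern_nodes=3):
    return n_repetitions ** n_pattern_nodes

# The benchmark script loops over n = 1..99, so the modelled total is:
total = sum(modelled_combinations(n) for n in range(1, 100))
print(total)  # 24502500 -- close to the 24835899 recurse calls in the profile
```

For a fixed document, the exponent is the number of pattern nodes, which is why larger patterns hurt even more than larger documents.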

How can we fix this?

A way to largely improve performance would be to abandon the recurse method and instead use an iterative method that:

  • first, groups matches that belong to the same tree
  • then, matches tokens from each group separately

I believe this should execute in linear time with respect to the document size.

In addition, I would avoid the conversion to numpy arrays where it is not necessary, as it turns out to be quite expensive.
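A minimal sketch of that grouping idea (pure Python with hypothetical data structures, not spaCy’s actual implementation): collect candidate tokens per pattern node, bucket them by the sentence/tree they belong to, and only combine candidates within the same bucket. For k pattern nodes this replaces the document-wide n**k cross-product with a small per-sentence one:

```python
from collections import defaultdict
from itertools import product

def match_per_group(candidates):
    """candidates: dict of node_name -> list of (sent_id, token_id).

    Instead of crossing all candidates document-wide (O(n**k) for k
    pattern nodes), bucket them by sentence first, then only combine
    candidates within the same bucket.
    """
    by_sent = defaultdict(lambda: defaultdict(list))
    for node, toks in candidates.items():
        for sent_id, tok_id in toks:
            by_sent[sent_id][node].append(tok_id)

    matches = []
    for sent_id, groups in by_sent.items():
        if len(groups) < len(candidates):
            continue  # some pattern node has no candidate in this sentence
        for combo in product(*(groups[node] for node in candidates)):
            matches.append((sent_id, combo))
    return matches

# Two repetitions of "The dress is beautiful.": one candidate per
# node per sentence (token ids are hypothetical).
candidates = {
    "is":   [(0, 2), (1, 7)],
    "subj": [(0, 1), (1, 6)],
    "adj":  [(0, 3), (1, 8)],
}
print(match_per_group(candidates))
# [(0, (2, 1, 3)), (1, (7, 6, 8))] -- 2 matches instead of 2**3 = 8 combinations
```

With one candidate per node per sentence, the work is linear in the number of sentences; a relation-operator check per combination would still be needed, but only within each group.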

Let me know what you think 😃

PS: thank you very much for developing and maintaining this awesome tool ^^

Info about spaCy

  • spaCy version: 3.0.0rc2
  • Platform: Linux-5.10.3-arch1-1-x86_64-with-glibc2.2.5
  • Python version: 3.8.7
  • Pipelines: en_core_web_sm (3.0.0a0), en_core_web_trf (3.0.0a0)

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

werew commented, Jan 23, 2021 (1 reaction)

The fix has landed in the develop branch (#6744). Closing. To get the fix, use a pre-release version > v3.0.0rc3 or build from source.

skrcode commented, Jan 9, 2021 (1 reaction)

@werew I had mentioned this problem in the PRs. Maybe these discussions can be of help. https://github.com/explosion/spaCy/pull/2836 https://github.com/explosion/spaCy/pull/3465
