DependencyMatcher has exponential time complexity
TL;DR
Unfortunately, the current implementation of DependencyMatcher can take a very long time to match large documents ^^
How slow?
This is best illustrated with an example. Consider this dependency pattern:
[[
    {"RIGHT_ID": "is", "RIGHT_ATTRS": {"LEMMA": "be"}},
    {
        "LEFT_ID": "is",
        "REL_OP": ">",
        "RIGHT_ID": "subj",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
    {
        "LEFT_ID": "is",
        "REL_OP": ">",
        "RIGHT_ID": "adj",
        "RIGHT_ATTRS": {"POS": "ADJ"},
    },
]]
Now let’s try to match from 1 to 100 repetitions of the sentence "The dress is beautiful. " and observe how match time increases with the size of the document. For comparison, let’s also benchmark the time taken to parse the document, the time taken by a simple Matcher, and the time taken by a simple Matcher with an on_match callback that walks the dependency tree.
Here is the full benchmark script:
import spacy
from spacy.matcher import DependencyMatcher, Matcher
from time import time

nlp = spacy.load("en_core_web_sm")
text = "The dress is beautiful. "

dependency_matcher = DependencyMatcher(nlp.vocab)
dependency_matcher.add("test", [[
    {"RIGHT_ID": "is", "RIGHT_ATTRS": {"LEMMA": "be"}},
    {
        "LEFT_ID": "is",
        "REL_OP": ">",
        "RIGHT_ID": "subj",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
    {
        "LEFT_ID": "is",
        "REL_OP": ">",
        "RIGHT_ID": "adj",
        "RIGHT_ATTRS": {"POS": "ADJ"},
    },
]])

matcher = Matcher(nlp.vocab)
matcher.add("test", [[
    {"DEP": "nsubj"},
    {"LEMMA": "be"},
    {"POS": "ADJ"},
]])

def callback(matcher, doc, i, matches):
    _, start, _ = matches[i]
    # We are looking for a single token
    match = doc[start]
    subjs = []
    adjs = []
    for child in match.children:
        if child.dep_ == "nsubj":
            subjs.append(child)
        elif child.pos_ == "ADJ":
            adjs.append(child)

matcher_with_callback = Matcher(nlp.vocab)
matcher_with_callback.add("test", [[
    {"LEMMA": "be"},
]], on_match=callback)

def test(n):
    input_text = text * n

    # Benchmark pipeline
    start = time()
    doc = nlp(input_text)
    end = time()
    parse_time = end - start

    # Benchmark a simple matcher
    start = time()
    nb_matches_matcher = len(matcher(doc))
    end = time()
    matcher_time = end - start

    # Benchmark a matcher with on_match callback
    start = time()
    nb_matches_matcher_with_callback = len(matcher_with_callback(doc))
    end = time()
    matcher_with_callback_time = end - start

    # Benchmark dependency matcher
    start = time()
    nb_matches_dependency_matcher = len(dependency_matcher(doc))
    end = time()
    dependency_matcher_time = end - start

    print(
        f"{n}, {parse_time}, {nb_matches_matcher}, {matcher_time}, "
        f"{nb_matches_matcher_with_callback}, {matcher_with_callback_time}, "
        f"{nb_matches_dependency_matcher}, {dependency_matcher_time}"
    )

for n in range(1, 100):
    test(n)
Let’s also enable cProfile to get some additional insights:
$ python -m cProfile -s cumtime benchmark.py
Raw output:
- timing: http://ix.io/2L0K
- cProfile: http://ix.io/2L0O
Here is a plot of the results, followed by the same plot in log scale: [plots not reproduced]
The dependency matcher (in yellow) takes almost 14 seconds to match 100 repetitions of the string “The dress is beautiful.” That’s really a lot, and it makes it infeasible to use DependencyMatcher to process large amounts of data or to use it in real-time applications. What’s worse is that processing time grows exponentially, which makes DependencyMatcher usable only for small documents.
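One way to quantify the growth from the timing output above is to fit a straight line to the timings in log-log space: a slope of k suggests running time growing roughly like n**k, while an upward curve even in log-log space points at genuinely exponential growth. The helper below is just a sketch (`loglog_slope` is not part of spaCy), demonstrated on synthetic cubic timings:

```python
import math

def loglog_slope(ns, times):
    """Least-squares slope of log(time) against log(n).

    A straight line with slope k in log-log space suggests the
    running time grows roughly like n**k."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Sanity check on synthetic data: perfectly cubic timings give slope 3.
ns = list(range(1, 100))
times = [2e-6 * n ** 3 for n in ns]
print(round(loglog_slope(ns, times), 2))  # 3.0
```

Running the same fit on the dependency-matcher column of the benchmark output would show how the measured match time actually scales with document size.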
Why so slow?
Here are the top entries of the cProfile report:
126702727 function calls (101821149 primitive calls) in 331.038 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
907/1 0.004 0.000 331.045 331.045 {built-in method builtins.exec}
1 0.000 0.000 331.045 331.045 test.py:1(<module>)
99 0.025 0.000 330.145 3.335 test.py:52(test)
99 0.021 0.000 327.809 3.311 dependencymatcher.pyx:242(__call__)
24835899/99 106.136 0.000 327.747 3.311 dependencymatcher.pyx:300(recurse)
49338927 7.338 0.000 221.613 0.000 _asarray.py:14(asarray)
49343983 214.279 0.000 214.279 0.000 {built-in method numpy.array}
99 0.388 0.004 2.268 0.023 language.py:952(__call__)
396 0.003 0.000 1.369 0.003 model.py:308(predict)
The benchmark script took 331 seconds to run. About 327 seconds (99% of the time) were spent inside DependencyMatcher.recurse, 221 seconds of which were used just to build numpy arrays.
The reason why DependencyMatcher.recurse is so slow is that, in practice, recurse is called with every possible combination of matched nodes, no matter where they are located within the document. For example, consider two repetitions of the above string:
“The dress is beautiful. The dress is beautiful.”
spaCy will correctly try to match the first “is” with the subject and adjective from the first sentence; however, it will also try to match the first “is” with the subject and adjective from the second sentence, and likewise the second “is” with the tokens of the first sentence. The same applies to every other combination. Multiply this across 100 repetitions of the sentence and you get 24,835,899 calls to recurse (which explains the 14-second runtime) 😛
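The call count can be reproduced with a back-of-the-envelope model (plain Python, not spaCy code): with n copies of the sentence there are n candidates for each of the three pattern nodes, so a search that ignores tree boundaries explores on the order of n**3 combinations, and the benchmark loop does this for every document size from 1 to 99:

```python
# n copies of the sentence give n candidates each for "is", "subj" and
# "adj"; ignoring tree boundaries means trying all n * n * n combinations.
def naive_combinations(n):
    return n ** 3

# The benchmark loops over document sizes 1..99, so total work is the sum.
total = sum(naive_combinations(n) for n in range(1, 100))
print(total)  # 24502500 -- the same order as the 24835899 recurse calls
```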
How can we fix this?
A way to largely improve performance would be to abandon the recursive method and instead use an iterative method that:
- first, groups matches that belong to the same tree
- then, matches tokens from each group separately
I believe this should execute in linear time with respect to the document size.
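Here is a rough sketch of the grouping idea, using plain (sentence_id, text) tuples in place of spaCy tokens; in real code the group key could be something like the index of token.sent.root. With two sentences, the naive cross-product tries 8 combinations while the grouped version only tries 2:

```python
from collections import defaultdict
from itertools import product

# Toy stand-ins for the tokens matched by each pattern node, tagged with
# the id of the sentence (tree) they belong to.
is_hits   = [(0, "is"), (1, "is")]
subj_hits = [(0, "dress"), (1, "dress")]
adj_hits  = [(0, "beautiful"), (1, "beautiful")]

# Naive approach: try every combination across the whole document.
naive = list(product(is_hits, subj_hits, adj_hits))

# Proposed approach: bucket the candidates by tree first, then only
# combine candidates that live in the same tree.
groups = defaultdict(lambda: ([], [], []))
for hit in is_hits:
    groups[hit[0]][0].append(hit)
for hit in subj_hits:
    groups[hit[0]][1].append(hit)
for hit in adj_hits:
    groups[hit[0]][2].append(hit)

grouped = [combo for a, b, c in groups.values() for combo in product(a, b, c)]
print(len(naive), len(grouped))  # 8 2
```

The grouped count grows linearly with the number of sentences, while the naive count grows with the product of the per-node candidate counts.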
In addition, I would drop the conversion to numpy arrays wherever it isn’t necessary, as it turns out to be quite expensive.
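As a micro-benchmark sketch of that cost (this is not spaCy’s actual code, and absolute numbers are machine-dependent), compare a membership test on a plain list with the same test after wrapping the list in a fresh numpy array each time, which is roughly what a per-call np.asarray inside a hot loop amounts to:

```python
from timeit import timeit

import numpy as np

nodes = [3, 7, 11]  # a tiny candidate list, like those seen inside recurse

# Same logical operation; the second version pays for array construction
# on every single call, which dominates at this size.
t_list = timeit(lambda: 7 in nodes, number=100_000)
t_numpy = timeit(lambda: 7 in np.asarray(nodes), number=100_000)
print(f"plain list: {t_list:.3f}s, np.asarray per call: {t_numpy:.3f}s")
```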
Let me know what you think 😃
PS: thank you very much for developing and maintaining this awesome tool ^^
Info about spaCy
- spaCy version: 3.0.0rc2
- Platform: Linux-5.10.3-arch1-1-x86_64-with-glibc2.2.5
- Python version: 3.8.7
- Pipelines: en_core_web_sm (3.0.0a0), en_core_web_trf (3.0.0a0)
Issue Analytics
- Created: 3 years ago
- Reactions: 1
- Comments: 9 (9 by maintainers)
Top GitHub Comments
The fix has landed on the develop branch (#6744). Closing. To get the fix, use a pre-release version > v3.0.0rc3 or build from source.
@werew I had mentioned this problem in the PRs. Maybe these discussions can be of help. https://github.com/explosion/spaCy/pull/2836 https://github.com/explosion/spaCy/pull/3465