General match_intervals improvements
Description
In auditing match_intervals in preparation for numbafication (#157), I noticed a couple of undocumented / undefined behaviors that should be made explicit. I'm tagging these as "bug" because they are definitely not intended behavior, and could silently cause some strange behaviors down the road.
- What should happen when a query interval has no overlap with any of the targets? The current logic for this is here, which would amount to an argmax over an all-zeros vector. np.argmax breaks ties in favor of the first maximal element, which in this case would be 0 (the first target interval). This seems bad and wrong here.
Some suggested alternative behaviors:
- Use a null value. None seems reasonable, but it would require returning a list of ints instead of an ndarray(dtype=int). NaN also seems reasonable, but would require dtype=float, which would not work with fancy indexing.
- Throw an exception. At the very least, we should throw a warning when this happens.
- Map to the "closest" non-overlapping interval in the set sense.
- Maybe have a user-supplied option to switch between some of these modes?
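One possible shape for such a mode switch, as a hypothetical sketch: `match_events_strict`, the `strict` flag, and the `-1` sentinel below are illustrative names, not existing API. It assumes the matcher has already computed an (n_queries, n_targets) overlap-score matrix.

```python
import numpy as np

def match_events_strict(scores, strict=True, fill_value=-1):
    """Hypothetical sketch of a strict/non-strict mode switch.

    `scores` is an (n_queries, n_targets) overlap matrix; rows that are
    all zero correspond to queries that overlap no target.
    """
    matches = np.argmax(scores, axis=1)
    no_overlap = ~scores.any(axis=1)
    if strict and no_overlap.any():
        raise ValueError("Some query intervals match no target interval")
    # Non-strict mode: mark unmatched queries with an integer sentinel,
    # preserving dtype=int unlike None or NaN
    matches[no_overlap] = fill_value
    return matches

scores = np.array([[0.0, 0.5],
                   [0.0, 0.0]])  # second query overlaps nothing
print(match_events_strict(scores, strict=False))  # [ 1 -1]
```

Note that a `-1` sentinel keeps the int dtype, but it silently wraps around under fancy indexing (selecting the last target), so callers would still have to check for it; that is why an exception or at least a warning seems safer as the default.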
- How should we break ties? The current interval scoring uses the raw intersection between intervals, and then selects the first maximal target as the match. This seems wrong / under-determined when there are overlapping intervals. I think Jaccard similarity is a better selection criterion, being a proper metric and all. This would break ties in favor of the “tightest” matching interval, which I think is sensible.
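To make the difference concrete, here is a toy comparison of the two scoring rules (`interval_intersection` and `interval_jaccard` are illustrative helpers, not existing library functions):

```python
def interval_intersection(a, b):
    """Length of the overlap between intervals a=(start, end) and b."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def interval_jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| for two intervals."""
    inter = interval_intersection(a, b)
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

query = (1.0, 3.0)
wide = (0.0, 10.0)   # a large target that fully covers the query
tight = (1.0, 3.5)   # a nearly exact target
# Raw intersection cannot distinguish the two: both score 2.0
print(interval_intersection(query, wide), interval_intersection(query, tight))
# Jaccard prefers the tighter match: 0.2 vs. 0.8
print(interval_jaccard(query, wide), interval_jaccard(query, tight))
```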
A note on candidate filtering
@danagilliann and I worked out a slightly faster matching algorithm than what we currently have. It's still O(nm) in the worst case, but in the average case it should behave more like O(m log m + n log m).
Here's some simplified pseudo-code of the current method (brute-force quadratic search):

```python
output = []
for query in from_intervals:
    best_score = 0
    best_idx = 0
    for idx, candidate in enumerate(to_intervals):
        score = similarity(query, candidate)
        if score > best_score:
            best_score, best_idx = score, idx
    output.append(best_idx)
```
The acceleration idea is to quickly filter out any intervals which we know do not intersect the query. This can be done using two sorted arrays, one for the interval starts and one for the interval ends. Any interval with start > query_end will have score == 0; likewise for any interval with end < query_start. Therefore, we only need to search over intervals with start <= query_end and end >= query_start, each of which can be found by binary search in time O(log m).
The accelerated version of the algorithm looks as follows:

```python
start_index = np.argsort(to_intervals[:, 0])  # sort index of the interval starts
end_index = np.argsort(to_intervals[:, 1])    # sort index of the interval ends
start_sorted = to_intervals[start_index, 0]   # and sorted values of starts
end_sorted = to_intervals[end_index, 1]       # and ends

output = []
for query in from_intervals:
    # Position of the first interval that starts after our query ends
    after_query = np.searchsorted(start_sorted, query[1], side='right')
    # Position of the first interval that ends at or after our query begins
    before_query = np.searchsorted(end_sorted, query[0], side='left')
    candidates = set(start_index[:after_query]) & set(end_index[before_query:])
    # Proceed as before
    best_score = 0
    best_idx = 0
    for idx in candidates:
        score = similarity(query, to_intervals[idx])
        if score > best_score:
            best_score, best_idx = score, idx
    output.append(best_idx)
```
The speedup here comes from the fact that we expect the candidates set to be small most of the time (much smaller than the entire to_intervals set).
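For concreteness, here is a self-contained version of both matchers, taking `similarity` to be the raw intersection (the current scoring rule), plus a small example where they agree. Note that the last query in the example overlaps nothing and silently maps to index 0, which is exactly the degenerate argmax behavior discussed above.

```python
import numpy as np

def similarity(a, b):
    """Raw intersection length between intervals a=[start, end] and b."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def match_brute(from_intervals, to_intervals):
    """O(n*m) brute-force matcher, as in the current implementation."""
    output = []
    for query in from_intervals:
        best_score, best_idx = 0.0, 0
        for idx, candidate in enumerate(to_intervals):
            score = similarity(query, candidate)
            if score > best_score:
                best_score, best_idx = score, idx
        output.append(best_idx)
    return np.asarray(output)

def match_fast(from_intervals, to_intervals):
    """Candidate-filtered matcher using two sorted index arrays."""
    start_index = np.argsort(to_intervals[:, 0])
    end_index = np.argsort(to_intervals[:, 1])
    start_sorted = to_intervals[start_index, 0]
    end_sorted = to_intervals[end_index, 1]
    output = []
    for query in from_intervals:
        # Intervals outside [query_start, query_end] cannot intersect it
        after_query = np.searchsorted(start_sorted, query[1], side='right')
        before_query = np.searchsorted(end_sorted, query[0], side='left')
        candidates = set(start_index[:after_query]) & set(end_index[before_query:])
        best_score, best_idx = 0.0, 0
        for idx in candidates:
            score = similarity(query, to_intervals[idx])
            if score > best_score:
                best_score, best_idx = score, idx
        output.append(best_idx)
    return np.asarray(output)

to_intervals = np.array([[0.0, 2.0], [1.5, 4.0], [5.0, 7.0], [8.0, 9.0]])
from_intervals = np.array([[0.2, 1.8], [3.0, 5.5], [8.1, 8.9], [10.0, 11.0]])
print(match_brute(from_intervals, to_intervals))  # [0 1 3 0]
print(match_fast(from_intervals, to_intervals))   # [0 1 3 0]
```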
As a side note, we can re-use these sorted arrays to quickly find the closest interval when the candidate set is empty, if that fallback mode is desired.
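A sketch of that fallback (`nearest_fallback` is a hypothetical helper, not proposed API): when the candidate set is empty, every target either starts after the query ends or ends before the query starts, so the nearest interval in the set sense is either the first start to the right of the query or the last end to its left, both already available from the searchsorted positions.

```python
import numpy as np

def nearest_fallback(query, start_index, end_index, start_sorted, end_sorted):
    """Hypothetical fallback: when no target overlaps `query`, return the
    index of the target whose nearest edge is closest (set distance)."""
    # First target starting strictly after the query ends
    right = np.searchsorted(start_sorted, query[1], side='right')
    # Last target ending strictly before the query starts
    left = np.searchsorted(end_sorted, query[0], side='left') - 1
    best_idx, best_dist = None, np.inf
    if right < len(start_sorted):
        best_idx, best_dist = start_index[right], start_sorted[right] - query[1]
    if left >= 0:
        dist = query[0] - end_sorted[left]
        if dist < best_dist:
            best_idx, best_dist = end_index[left], dist
    return best_idx

to = np.array([[0.0, 1.0], [5.0, 6.0]])
si, ei = np.argsort(to[:, 0]), np.argsort(to[:, 1])
ss, es = to[si, 0], to[ei, 1]
# (2, 2.5) sits in the gap: interval 0 ends 1.0 away, interval 1 starts 2.5 away
print(nearest_fallback(np.array([2.0, 2.5]), si, ei, ss, es))  # 0
```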
Issue Analytics
- Created 5 years ago
- Comments: 6 (6 by maintainers)
Top GitHub Comments
(one more update)
Alternatively, we could just use Hausdorff distance to support overlapping and disjoint calculations generally. This might rule out our efficiency hack above, though.
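For 1-D intervals the Hausdorff distance reduces to the larger of the two endpoint differences, which makes it easy to sketch (`interval_hausdorff` is an illustrative helper):

```python
def interval_hausdorff(a, b):
    """Hausdorff distance between closed intervals a=(start, end) and b.

    On the real line this reduces to the larger endpoint difference, so it
    is defined for overlapping and disjoint pairs alike (and is 0 iff the
    intervals are identical).
    """
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

print(interval_hausdorff((1.0, 3.0), (1.0, 3.5)))   # 0.5: tight match
print(interval_hausdorff((1.0, 3.0), (0.0, 10.0)))  # 7.0: loose cover
print(interval_hausdorff((1.0, 3.0), (5.0, 6.0)))   # 4.0: disjoint
```

Matching would then minimize distance rather than maximize similarity; since disjoint pairs get a finite distance instead of a zero score, the start/end binary-search filter above no longer prunes any candidates, which is the efficiency concern mentioned.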
Thanks for the reply. It's true that versioning comes into the equation here for a project with as large a user base as librosa. I would say that any contribution that brings us closer to 1.0 is worth having sooner rather than later, but that's a matter of strategy on which I don't have much insight.