General match_intervals improvements
Description
In auditing match_intervals in preparation for numbafication (#157), I noticed a couple of undocumented / undefined behaviors that should be made explicit. I'm tagging these as "bug" because they are definitely not intended behavior, and could silently cause some strange behaviors down the road.
- What should happen when a query interval has no overlap with any of the targets? The current logic for this is here, which would amount to an argmax over an all-zeros vector. np.argmax breaks ties in favor of the first maximal element, which in this case would be 0 (the first target interval). This seems bad and wrong here.
Some suggested alternative behaviors:
- Use a null value. None seems reasonable, but it would require returning a list of ints instead of an ndarray(dtype=int). NaN also seems reasonable, but would require dtype=float, which would not work with fancy indexing.
- Throw an exception. At the very least, we should throw a warning when this happens.
- Map to the "closest" non-overlapping interval in the set sense.
- Maybe have a user-supplied option to switch between some of these modes?
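One possible shape for such a mode switch, as a hypothetical sketch: `match_events_strict`, the `strict` flag, and the `-1` sentinel below are illustrative names, not existing API. It assumes the matcher has already computed an (n_queries, n_targets) overlap-score matrix.

```python
import numpy as np

def match_events_strict(scores, strict=True, fill_value=-1):
    """Hypothetical sketch of a strict/non-strict mode switch.

    `scores` is an (n_queries, n_targets) overlap matrix; rows that are
    all zero correspond to queries that overlap no target.
    """
    matches = np.argmax(scores, axis=1)
    no_overlap = ~scores.any(axis=1)
    if strict and no_overlap.any():
        raise ValueError("Some query intervals match no target interval")
    # Non-strict mode: mark unmatched queries with an integer sentinel,
    # preserving dtype=int unlike None or NaN
    matches[no_overlap] = fill_value
    return matches

scores = np.array([[0.0, 0.5],
                   [0.0, 0.0]])  # second query overlaps nothing
print(match_events_strict(scores, strict=False))  # [ 1 -1]
```

Note that a `-1` sentinel keeps the int dtype, but it silently wraps around under fancy indexing (selecting the last target), so callers would still have to check for it; that is why an exception or at least a warning seems safer as the default.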
- How should we break ties? The current interval scoring uses the raw intersection between intervals, and then selects the first maximal target as the match. This seems wrong / under-determined when there are overlapping intervals. I think Jaccard similarity is a better selection criterion, being a proper metric and all. This would break ties in favor of the “tightest” matching interval, which I think is sensible.
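To make the difference concrete, here is a toy comparison of the two scoring rules (`interval_intersection` and `interval_jaccard` are illustrative helpers, not existing library functions):

```python
def interval_intersection(a, b):
    """Length of the overlap between intervals a=(start, end) and b."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def interval_jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| for two intervals."""
    inter = interval_intersection(a, b)
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

query = (1.0, 3.0)
wide = (0.0, 10.0)   # a large target that fully covers the query
tight = (1.0, 3.5)   # a nearly exact target
# Raw intersection cannot distinguish the two: both score 2.0
print(interval_intersection(query, wide), interval_intersection(query, tight))
# Jaccard prefers the tighter match: 0.2 vs. 0.8
print(interval_jaccard(query, wide), interval_jaccard(query, tight))
```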
A note on candidate filtering
@danagilliann and I worked out a slightly faster matching algorithm than what we currently have. It's still O(nm) in the worst case, but in the average case it should behave more like O(m log m + n log m).
Here's some simplified pseudo-code of the current method (brute-force quadratic search):

```python
output = []
for query in from_intervals:
    best_score = 0
    best_idx = 0
    for idx, candidate in enumerate(to_intervals):
        score = similarity(query, candidate)
        if score > best_score:
            best_score, best_idx = score, idx
    output.append(best_idx)
```
The acceleration idea is to quickly filter out any intervals which we know do not intersect the query. This can be done using two sorted arrays, one for the interval starts and one for the interval ends. Any interval with start > query_end will have score == 0; likewise for any interval with end < query_start. Therefore, we only need to search over intervals with start <= query_end and end >= query_start, each of which can be found by binary search in time O(log m).
The accelerated version of the algorithm looks as follows:

```python
start_index = np.argsort(to_intervals[:, 0])  # sort index of the interval starts
end_index = np.argsort(to_intervals[:, 1])    # sort index of the interval ends
start_sorted = to_intervals[start_index, 0]   # and sorted values of starts
end_sorted = to_intervals[end_index, 1]       # and ends

output = []
for query in from_intervals:
    # Position of the first interval that starts after our query ends
    after_query = np.searchsorted(start_sorted, query[1], side='right')
    # Position of the first interval that ends at or after our query begins
    before_query = np.searchsorted(end_sorted, query[0], side='left')
    candidates = set(start_index[:after_query]) & set(end_index[before_query:])
    # Proceed as before
    best_score = 0
    best_idx = 0
    for idx in candidates:
        score = similarity(query, to_intervals[idx])
        if score > best_score:
            best_score, best_idx = score, idx
    output.append(best_idx)
```
The speedup here comes from the fact that we expect the candidates set to be small most of the time (much smaller than the entire to_intervals set).
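For concreteness, here is a self-contained version of both matchers, taking `similarity` to be the raw intersection (the current scoring rule), plus a small example where they agree. Note that the last query in the example overlaps nothing and silently maps to index 0, which is exactly the degenerate argmax behavior discussed above.

```python
import numpy as np

def similarity(a, b):
    """Raw intersection length between intervals a=[start, end] and b."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def match_brute(from_intervals, to_intervals):
    """O(n*m) brute-force matcher, as in the current implementation."""
    output = []
    for query in from_intervals:
        best_score, best_idx = 0.0, 0
        for idx, candidate in enumerate(to_intervals):
            score = similarity(query, candidate)
            if score > best_score:
                best_score, best_idx = score, idx
        output.append(best_idx)
    return np.asarray(output)

def match_fast(from_intervals, to_intervals):
    """Candidate-filtered matcher using two sorted index arrays."""
    start_index = np.argsort(to_intervals[:, 0])
    end_index = np.argsort(to_intervals[:, 1])
    start_sorted = to_intervals[start_index, 0]
    end_sorted = to_intervals[end_index, 1]
    output = []
    for query in from_intervals:
        # Intervals outside [query_start, query_end] cannot intersect it
        after_query = np.searchsorted(start_sorted, query[1], side='right')
        before_query = np.searchsorted(end_sorted, query[0], side='left')
        candidates = set(start_index[:after_query]) & set(end_index[before_query:])
        best_score, best_idx = 0.0, 0
        for idx in candidates:
            score = similarity(query, to_intervals[idx])
            if score > best_score:
                best_score, best_idx = score, idx
        output.append(best_idx)
    return np.asarray(output)

to_intervals = np.array([[0.0, 2.0], [1.5, 4.0], [5.0, 7.0], [8.0, 9.0]])
from_intervals = np.array([[0.2, 1.8], [3.0, 5.5], [8.1, 8.9], [10.0, 11.0]])
print(match_brute(from_intervals, to_intervals))  # [0 1 3 0]
print(match_fast(from_intervals, to_intervals))   # [0 1 3 0]
```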
As a side note, we can re-use these sorted arrays to quickly find the closest interval when the candidate set is empty, if that fallback mode is desired.
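A sketch of that fallback (`nearest_fallback` is a hypothetical helper, not proposed API): when the candidate set is empty, every target either starts after the query ends or ends before the query starts, so the nearest interval in the set sense is either the first start to the right of the query or the last end to its left, both already available from the searchsorted positions.

```python
import numpy as np

def nearest_fallback(query, start_index, end_index, start_sorted, end_sorted):
    """Hypothetical fallback: when no target overlaps `query`, return the
    index of the target whose nearest edge is closest (set distance)."""
    # First target starting strictly after the query ends
    right = np.searchsorted(start_sorted, query[1], side='right')
    # Last target ending strictly before the query starts
    left = np.searchsorted(end_sorted, query[0], side='left') - 1
    best_idx, best_dist = None, np.inf
    if right < len(start_sorted):
        best_idx, best_dist = start_index[right], start_sorted[right] - query[1]
    if left >= 0:
        dist = query[0] - end_sorted[left]
        if dist < best_dist:
            best_idx, best_dist = end_index[left], dist
    return best_idx

to = np.array([[0.0, 1.0], [5.0, 6.0]])
si, ei = np.argsort(to[:, 0]), np.argsort(to[:, 1])
ss, es = to[si, 0], to[ei, 1]
# (2, 2.5) sits in the gap: interval 0 ends 1.0 away, interval 1 starts 2.5 away
print(nearest_fallback(np.array([2.0, 2.5]), si, ei, ss, es))  # 0
```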
Issue Analytics
- Created 5 years ago
- Comments: 6 (6 by maintainers)
Top GitHub Comments
(one more update)
Alternatively, we could just use Hausdorff distance to support overlapping and disjoint calculations generally. This might rule out our efficiency hack above, though.
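For 1-D intervals the Hausdorff distance reduces to the larger of the two endpoint differences, which makes it easy to sketch (`interval_hausdorff` is an illustrative helper):

```python
def interval_hausdorff(a, b):
    """Hausdorff distance between closed intervals a=(start, end) and b.

    On the real line this reduces to the larger endpoint difference, so it
    is defined for overlapping and disjoint pairs alike (and is 0 iff the
    intervals are identical).
    """
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

print(interval_hausdorff((1.0, 3.0), (1.0, 3.5)))   # 0.5: tight match
print(interval_hausdorff((1.0, 3.0), (0.0, 10.0)))  # 7.0: loose cover
print(interval_hausdorff((1.0, 3.0), (5.0, 6.0)))   # 4.0: disjoint
```

Matching would then minimize distance rather than maximize similarity; since disjoint pairs get a finite distance instead of a zero score, the start/end binary-search filter above no longer prunes any candidates, which is the efficiency concern mentioned.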
Thanks for the reply. It's true that versioning comes into the equation here for a project with as large a user base as librosa. I would say that any contribution that brings us closer to 1.0 is worth having sooner rather than later, but that's a matter of strategy on which I don't have much insight.