OPTICS test_reach_dists inappropriate
I am beginning to suspect the correctness of the test: it seems that our implementation and the referenced implementation do not behave the same way. In our implementation, when there are multiple points with the same reachability distance, we choose the point closest to the current point; the referenced implementation chooses the point with the smallest index. See the example below (data taken from the referenced implementation):
import numpy as np
from sklearn.cluster import OPTICS

X = np.array([[15., 70.], [31., 87.], [45., 32.], [32., 83.],
              [26., 50.], [ 7., 31.], [43., 97.]])
###
# result from our implementation
clust = OPTICS(min_samples=5)
clust.fit(X)
clust.core_distances_
# array([38.89730068, 37.33630941, 52.63078947, 33.54101966, 33.54101966,
# 57.69748695, 49.979996 ])
clust.reachability_
# array([ inf, 33.54101966, 33.54101966, 38.89730068, 33.54101966,
# 33.54101966, 33.54101966])
clust.ordering_
# array([0, 3, 1, 6, 4, 2, 5])
###
# result from referenced implementation
RD, CD, order = optics(X, 4)
CD
# array([38.89730068, 37.33630941, 52.63078947, 33.54101966, 33.54101966,
# 57.69748695, 49.979996 ])
RD
# array([ 0. , 38.89730068, 33.54101966, 37.33630941, 33.54101966,
# 33.54101966, 33.54101966])
order
# [0, 1, 3, 4, 2, 5, 6]
We get the same CD but a different order (e.g., after processing point 0, points 1, 3, 4, and 6 have the same RD; our implementation chooses point 3 because it is closest to point 0, while the referenced implementation chooses point 1 because it has the smallest index), and therefore a different RD (the RD of the current point depends only on the previously processed points). Correct me if I'm wrong @jnothman @espg @adrinjalali
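To make the difference concrete, here is a minimal sketch of the two tie-breaking rules. The helper next_point and the rounded reachability/distance values are illustrative only, not the actual scikit-learn or referenced code:

import numpy as np

def next_point(reachability, dists_from_current, unprocessed, tie_break):
    # Pick the unprocessed point with minimal reachability.
    # tie_break="distance": among ties, take the point closest to the current
    # point (what this issue claims our implementation does).
    # tie_break="index": among ties, take the smallest index (what the
    # referenced implementation does).
    cand = np.asarray(sorted(unprocessed))
    rd = reachability[cand]
    ties = cand[rd == rd.min()]
    if tie_break == "distance":
        return ties[np.argmin(dists_from_current[ties])]
    return ties.min()

# Approximate state after processing point 0: points 1, 3, 4, 6 share the same RD.
reachability = np.array([np.inf, 38.9, 48.4, 38.9, 38.9, 39.8, 38.9])
dists_from_0 = np.array([0.0, 23.3, 48.4, 21.4, 22.8, 39.8, 38.9])
unprocessed = {1, 2, 3, 4, 5, 6}
next_point(reachability, dists_from_0, unprocessed, "distance")  # -> 3
next_point(reachability, dists_from_0, unprocessed, "index")     # -> 1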
Top GitHub Comments
A) choosing the next unprocessed point on an empty heap: See Figure 5 in the original OPTICS paper. It prefers points in index order as starting points. But there is no particular reason for this except that it is very cheap. So this actually is in the paper.
B) Sorting candidates by reachability, then index: it obviously is not necessary to do it this way; it is just an obvious approach to make results less random. In fact, I don't really like relying on every object having an integer index, once you think about working with dynamic data (and not just indexes). But in ELKI we have both a heap-based and a list-based approach. By adding this tie breaker, both produce the same result, which makes unit testing easier, in particular if you replace the heap implementation. This makes the heap return objects in a deterministic order.
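For illustration, such a deterministic (reachability, index) tie-breaker can be sketched with a plain binary heap; this is only a toy example, not ELKI's or scikit-learn's actual data structures:

import heapq

# Candidate (reachability, index) pairs; several share the same reachability.
candidates = [(38.9, 6), (38.9, 1), (48.4, 2), (38.9, 3), (38.9, 4), (39.8, 5)]

heap = list(candidates)
heapq.heapify(heap)

# Tuples compare lexicographically, so ties in reachability are broken by
# index; the pop order is the same regardless of insertion order and matches
# what a sorted list would give.
order = [heapq.heappop(heap)[1] for _ in range(len(candidates))]
print(order)  # [1, 3, 4, 6, 5, 2]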
As for direct porting: I consider it valuable to see different interpretations, in particular of the Xi approach, because it does have some ambiguity. Maybe there is a different interpretation that works better. IMHO it is more important to follow certain data structure / efficiency considerations (such as the memory issue, and the queries, which is what is actually in the paper) than to worry about the more subtle interpretation of what is not clear in the paper.
…I’m still running benchmarks on some larger datasets, but so far it looks like the Cython performance gain is pretty marginal (i.e., about 20% faster at best, and tied with NumPy in some cases). It may be worth removing quick_scan and going with the pure Python module.