bug: sklearn.manifold.trustworthiness outputs values larger than 1
The trustworthiness computation might be erroneous: in some cases it returns values larger than 1.
Here is a code snippet that produces a value larger than 1 on my machine:
import numpy as np
from sklearn.manifold import trustworthiness
# import sklearn; sklearn.show_versions()

np.random.seed(5000)
X_train = np.random.rand(7, 4)   # 7 samples in the original 4-D space
Y_train = np.random.rand(7, 2)   # the same 7 samples in a 2-D embedding
tt = trustworthiness(X_train, Y_train, n_neighbors=5)
print('Computed Trustworthiness:', tt)
The output is:
Computed Trustworthiness: 1.0857142857142856
System:
    python: 3.6.9 (default, Jul 17 2020, 12:50:27) [GCC 8.4.0]
    executable: /usr/bin/python3
    machine: Linux-5.4.0-47-generic-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
    pip: 20.2.3
    setuptools: 41.4.0
    sklearn: 0.22.2.post1
    numpy: 1.17.4
    scipy: 1.4.1
    Cython: None
    pandas: 1.0.5
    matplotlib: 3.3.0
    joblib: 0.11

Built with OpenMP: True
I think I found the problem. There is no bug in the code; the bug is in the metric itself.

The metric requires (2.0 * n_samples - 3.0 * n_neighbors - 1.0) > 0, which translates to n_neighbors < (2.0 * n_samples - 1.0) / 3.0. For the example above this requires n_neighbors < (2.0 * 7 - 1.0) / 3.0 = 4.33, so n_neighbors=5 violates the constraint.

In the paper, it is written that “for clarity, we have only included the scaling for neighborhoods of size k < N/2”. I think we should add this line to the doc, or add a check in the code to ensure that k < N/2; the reference code can be modified accordingly, as sketched below. The max function is implicit in the paper: it is defined as an intersection of two sets in the original and projected data, which translates to the max function in practice.
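A minimal sketch of such a modification (my own reimplementation from the formula, not scikit-learn's actual code; the function name trustworthiness_clamped and the k < N/2 guard are assumptions):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def trustworthiness_clamped(X, X_embedded, n_neighbors):
    n_samples = X.shape[0]
    # Guard suggested above: the paper's scaling only holds for k < N/2.
    if n_neighbors >= n_samples / 2:
        raise ValueError("n_neighbors must be smaller than n_samples / 2")

    # Rank of every sample with respect to every other sample in the input
    # space (argsort twice turns distances into ranks; self gets rank 0).
    dist_X = squareform(pdist(X))
    ranks_X = dist_X.argsort(axis=1).argsort(axis=1)

    # The k nearest neighbors of each sample in the embedded space,
    # excluding the sample itself.
    dist_embedded = squareform(pdist(X_embedded))
    np.fill_diagonal(dist_embedded, np.inf)
    ind_embedded = dist_embedded.argsort(axis=1)[:, :n_neighbors]

    # Penalty max(rank - k, 0): embedded-space neighbors that were already
    # among the k input-space neighbors contribute nothing.
    rows = np.arange(n_samples)[:, None]
    penalty = np.maximum(ranks_X[rows, ind_embedded] - n_neighbors, 0).sum()

    norm = 2.0 / (n_samples * n_neighbors
                  * (2.0 * n_samples - 3.0 * n_neighbors - 1.0))
    return 1.0 - norm * penalty

On the snippet above, this version raises a ValueError for n_neighbors=5 with only 7 samples instead of silently returning a value above 1.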
I think I found the bug, and I believe it is unrelated to the value of k being greater than N/2, which shouldn’t cause any issue. The problem actually comes from the following line (551):

inverted_index = np.zeros((n_samples, n_samples), dtype=int)

Here the array is initialized with dtype=int. Further down in the code a sum is performed using np.sum. NumPy’s docs mention that np.sum returns a result with the same dtype as the provided input, meaning that in our case the result of the sum is an int32 (on platforms where the default integer is 32-bit, such as Windows) with a maximum value of 2**31 - 1 = 2147483647, which is easily reached in some cases. If the result of the sum exceeds this value, the variable t becomes negative. As a consequence, the final result becomes > 1, which is erroneous. A simple fix would be to make inverted_index an int64 array, which should leave enough room for the sum to never overflow. An additional check could be added to verify that the value stored in t is always positive, but that’s not strictly necessary IMO. I’m willing to submit a PR if my explanation is convincing enough.
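As a standalone illustration of the overflow (my own example, not the scikit-learn code; dtype=np.int32 is passed to the sum to force the 32-bit accumulator that dtype=int implies on such platforms):

import numpy as np

# 100_000 ranks of 30_000 sum to 3_000_000_000, which exceeds the
# int32 maximum of 2**31 - 1 = 2_147_483_647.
ranks = np.full(100_000, 30_000, dtype=np.int32)

# With a 32-bit accumulator the total wraps around to a negative number,
# the same failure mode described above for t.
print(ranks.sum(dtype=np.int32))   # -1294967296

# Widening the accumulator (or the array itself) to int64 fixes it.
print(ranks.sum(dtype=np.int64))   # 3000000000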