bug: sklearn.manifold.trustworthiness outputs values larger than 1

The trustworthiness computation may be erroneous: in some cases it returns values larger than 1.

Here is a code snippet that produces a value larger than 1 on my machine:

import numpy as np
from sklearn.manifold import trustworthiness

#import sklearn; sklearn.show_versions()

np.random.seed(5000)

X_train = np.random.rand(7,4)
Y_train = np.random.rand(7,2)

tt = trustworthiness(X_train, Y_train, n_neighbors=5)

print('Computed Trustworthiness:', tt)

The output is:

Computed Trustworthiness: 1.0857142857142856

System:
    python: 3.6.9 (default, Jul 17 2020, 12:50:27) [GCC 8.4.0]
    executable: /usr/bin/python3
    machine: Linux-5.4.0-47-generic-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
    pip: 20.2.3
    setuptools: 41.4.0
    sklearn: 0.22.2.post1
    numpy: 1.17.4
    scipy: 1.4.1
    Cython: None
    pandas: 1.0.5
    matplotlib: 3.3.0
    joblib: 0.11

Built with OpenMP: True

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
tariqul-islam commented, Oct 10, 2020

I think I found the problem. There is no bug in the code. The bug is in the metric itself.

The metric requires (2.0 * n_samples - 3.0 * n_neighbors - 1.0) > 0, which translates to n_neighbors < (2.0 * n_samples - 1.0) / 3.0. For the example above it requires n_neighbors < (2.0*7-1.0)/3.0 = 4.33.
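The sign condition above can be checked with a few lines of arithmetic. The sketch below assumes the final score has the form 1 - 2 * penalty / (n_samples * n_neighbors * (2 * n_samples - 3 * n_neighbors - 1)), where the penalty is a non-negative sum of rank violations. The helper names are hypothetical, and the penalty value of 3 is an assumption chosen only because it is consistent with the reported output, not a value taken from the issue.

```python
def normalization_denominator(n_samples, n_neighbors):
    """Denominator of the scaling term; it must be positive for the score to stay <= 1."""
    return n_samples * n_neighbors * (2.0 * n_samples - 3.0 * n_neighbors - 1.0)


def score_from_penalty(penalty, n_samples, n_neighbors):
    """Hypothetical helper: apply the scaling to a given non-negative penalty sum."""
    return 1.0 - 2.0 * penalty / normalization_denominator(n_samples, n_neighbors)


# n_samples=7, n_neighbors=4: 2*7 - 3*4 - 1 = 1, so the denominator is positive.
print(normalization_denominator(7, 4))   # 28.0

# n_samples=7, n_neighbors=5: 2*7 - 3*5 - 1 = -2, so the denominator is negative.
print(normalization_denominator(7, 5))   # -70.0

# With a negative denominator, any positive penalty pushes the score above 1.
# An assumed penalty of 3 gives 1 + 6/70, roughly 1.0857.
print(score_from_penalty(3, 7, 5))
```

Under these assumptions, the score can only exceed 1 when the denominator is negative, which for 7 samples first happens at n_neighbors=5.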

In the paper, it is written that “for clarity, we have only included the scaling for neighborhoods of size k<N/2”. I think we should add this line to the doc.

Alternatively, we could add a check in the code to ensure that k<N/2. The reference code can be modified as:

import warnings

def trustworthiness(X, X_embedded, *, n_neighbors=5, metric='euclidean'):
    # Warn when k >= N/2, where the scaling from the paper no longer applies.
    if n_neighbors >= X.shape[0] / 2:
        warnings.warn("n_neighbors should be less than " + str(X.shape[0] / 2)
                      + ", given " + str(n_neighbors))
    # other parts of the code
    return t

The way it’s written in the code doesn’t exactly match the description in the paper. Specifically, the code has a max function applied where the paper doesn’t. But simply removing it didn’t fix the problem (the result became negative).

The max function is implicit in the paper. The sum there is defined over an intersection of two sets in the original and projected data, which translates to the max function in practice.
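A brute-force sketch can illustrate why the two formulations agree. It assumes the set in question contains the points that are among the k nearest neighbors in the embedding but not among the k nearest in the original space, and that there are no distance ties. All function names here are hypothetical and this is not the scikit-learn implementation.

```python
import numpy as np


def rank_matrix(X):
    """Return (ranks, order): ranks[i, j] is the 1-based neighbor rank of j seen from i."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)        # each point sorts last in its own ordering
    order = np.argsort(d, axis=1)      # order[i, r] = index of the (r+1)-th nearest point
    n = X.shape[0]
    ranks = np.empty((n, n), dtype=np.int64)
    ranks[np.arange(n)[:, None], order] = np.arange(1, n + 1)
    return ranks, order


def penalty_with_max(X, X_emb, k):
    """Sum max(0, rank_in_original - k) over each point's k embedded neighbors."""
    ranks_X, _ = rank_matrix(X)
    _, order_emb = rank_matrix(X_emb)
    return sum(max(0, ranks_X[i, j] - k)
               for i in range(X.shape[0]) for j in order_emb[i, :k])


def penalty_with_sets(X, X_emb, k):
    """Sum (rank_in_original - k) over embedded neighbors that are not original neighbors."""
    ranks_X, order_X = rank_matrix(X)
    _, order_emb = rank_matrix(X_emb)
    total = 0
    for i in range(X.shape[0]):
        intruders = set(order_emb[i, :k]) - set(order_X[i, :k])
        total += sum(ranks_X[i, j] - k for j in intruders)
    return total


rng = np.random.default_rng(0)
X, X_emb = rng.random((7, 4)), rng.random((7, 2))
print(penalty_with_max(X, X_emb, 3) == penalty_with_sets(X, X_emb, 3))  # True
```

A neighbor in the embedding whose original rank is at most k is also an original neighbor, so it contributes zero under the max and is excluded from the set. Every other embedded neighbor has rank greater than k and contributes the same amount either way.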

0 reactions
taha-yassine commented, Jan 27, 2022

I think I found the bug, and I believe it is unrelated to k being greater than N/2, which shouldn’t cause any issue. The problem actually comes from the following line (551):

inverted_index = np.zeros((n_samples, n_samples), dtype=int)

Here the array is initialized with dtype=int. Further down in the code, a sum is performed using np.sum. NumPy’s docs mention that np.sum returns a result with the same dtype as the provided input, meaning that in our case the result of the sum is an int32, whose maximum value of 2147483647 is easily reached in some cases. If the result of the sum exceeds this value, the variable t becomes negative. As a consequence, the final result becomes >1, which is erroneous.

A simple fix would be to make inverted_index an int64 array, which should leave enough room for the sum to never overflow. An additional check could be added to verify that the value stored in t is always positive, but that’s not really necessary IMO. I’m willing to submit a PR if my explanation is convincing enough.
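The wraparound described in this comment can be reproduced in isolation. The sketch below forces a 32-bit accumulator with an explicit dtype argument, because whether a plain dtype=int array is 32-bit or 64-bit depends on the platform and NumPy version. The array contents are made up for illustration and are unrelated to the actual inverted_index values.

```python
import numpy as np

# Three values whose true sum (3 * 2**30 = 3221225472) exceeds the int32 maximum.
values = np.full(3, 2**30, dtype=np.int32)

wrapped = values.sum(dtype=np.int32)   # accumulate in 32 bits: silently wraps around
exact = values.sum(dtype=np.int64)     # accumulate in 64 bits: no overflow

print(np.iinfo(np.int32).max)  # 2147483647
print(wrapped < 0)             # True: the 32-bit sum wrapped to a negative number
print(exact)                   # 3221225472
```

Requesting a 64-bit accumulator, or allocating the array as int64 in the first place as the comment suggests, avoids the wraparound for sums of this size.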
