bug: sklearn.manifold.trustworthiness outputs values larger than 1
The trustworthiness computation might be erroneous: in some cases it returns values larger than 1.
Here is a code snippet that produces a value larger than 1 on my machine:
import numpy as np
from sklearn.manifold import trustworthiness
# import sklearn; sklearn.show_versions()

np.random.seed(5000)
X_train = np.random.rand(7, 4)   # 7 samples in the original 4-D space
Y_train = np.random.rand(7, 2)   # the same 7 samples in a 2-D embedding
tt = trustworthiness(X_train, Y_train, n_neighbors=5)
print('Computed Trustworthiness:', tt)
The output is:
Computed Trustworthiness: 1.0857142857142856
System:
    python: 3.6.9 (default, Jul 17 2020, 12:50:27) [GCC 8.4.0]
    executable: /usr/bin/python3
    machine: Linux-5.4.0-47-generic-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
    pip: 20.2.3
    setuptools: 41.4.0
    sklearn: 0.22.2.post1
    numpy: 1.17.4
    scipy: 1.4.1
    Cython: None
    pandas: 1.0.5
    matplotlib: 3.3.0
    joblib: 0.11

Built with OpenMP: True
I think I found the problem. There is no bug in the code; the bug is in the metric itself.

The metric requires (2.0 * n_samples - 3.0 * n_neighbors - 1.0) > 0, which translates to n_neighbors < (2.0 * n_samples - 1.0) / 3.0. For the example above this requires n_neighbors < (2.0 * 7 - 1.0) / 3.0 = 4.33, so n_neighbors=5 violates the constraint.

In the paper, it is written that “for clarity, we have only included the scaling for neighborhoods of size k < N/2”. I think we should add this line to the doc, or add a check in the code to ensure that k < N/2; the reference code can be modified accordingly, as sketched below. The max function is implicit in the paper: it is defined as an intersection of two sets in the original and projected data, which translates to the max function in practice.
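A minimal sketch of such a modification (my own reimplementation from the formula, not scikit-learn's actual code; the function name trustworthiness_clamped and the k < N/2 guard are assumptions):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def trustworthiness_clamped(X, X_embedded, n_neighbors):
    n_samples = X.shape[0]
    # Guard suggested above: the paper's scaling only holds for k < N/2.
    if n_neighbors >= n_samples / 2:
        raise ValueError("n_neighbors must be smaller than n_samples / 2")

    # Rank of every sample with respect to every other sample in the input
    # space (argsort twice turns distances into ranks; self gets rank 0).
    dist_X = squareform(pdist(X))
    ranks_X = dist_X.argsort(axis=1).argsort(axis=1)

    # The k nearest neighbors of each sample in the embedded space,
    # excluding the sample itself.
    dist_embedded = squareform(pdist(X_embedded))
    np.fill_diagonal(dist_embedded, np.inf)
    ind_embedded = dist_embedded.argsort(axis=1)[:, :n_neighbors]

    # Penalty max(rank - k, 0): embedded-space neighbors that were already
    # among the k input-space neighbors contribute nothing.
    rows = np.arange(n_samples)[:, None]
    penalty = np.maximum(ranks_X[rows, ind_embedded] - n_neighbors, 0).sum()

    norm = 2.0 / (n_samples * n_neighbors
                  * (2.0 * n_samples - 3.0 * n_neighbors - 1.0))
    return 1.0 - norm * penalty

On the snippet above, this version raises a ValueError for n_neighbors=5 with only 7 samples instead of silently returning a value above 1.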
I think I found the bug, and I believe it is unrelated to the value of k being greater than N/2, which shouldn’t cause any issue. The problem actually comes from the following line (551):

inverted_index = np.zeros((n_samples, n_samples), dtype=int)

Here the array is initialized with dtype=int. Further down in the code a sum is performed using np.sum. NumPy’s docs mention that np.sum returns a result with the same dtype as the provided input, meaning that in our case the result of the sum is an int32 (on platforms where the default integer is 32-bit, such as Windows) with a maximum value of 2**31 - 1 = 2147483647, which is easily reached in some cases. If the result of the sum exceeds this value, the variable t becomes negative. As a consequence, the final result becomes > 1, which is erroneous. A simple fix would be to make inverted_index an int64 array, which should leave enough room for the sum to never overflow. An additional check could be added to verify that the value stored in t is always positive, but that’s not strictly necessary IMO. I’m willing to submit a PR if my explanation is convincing enough.
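As a standalone illustration of the overflow (my own example, not the scikit-learn code; dtype=np.int32 is passed to the sum to force the 32-bit accumulator that dtype=int implies on such platforms):

import numpy as np

# 100_000 ranks of 30_000 sum to 3_000_000_000, which exceeds the
# int32 maximum of 2**31 - 1 = 2_147_483_647.
ranks = np.full(100_000, 30_000, dtype=np.int32)

# With a 32-bit accumulator the total wraps around to a negative number,
# the same failure mode described above for t.
print(ranks.sum(dtype=np.int32))   # -1294967296

# Widening the accumulator (or the array itself) to int64 fixes it.
print(ranks.sum(dtype=np.int64))   # 3000000000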