Odd results on some datasets
I’ve been trying UMAP out and I have one real dataset in particular where it does not seem to recover “obvious” structure. I thought it might be worth reporting in case it is a pathological case - or in case I am doing something wrong!
In this dataset there are two distinct groups. I’ve tried a few dimensionality reduction methods: PCA, metric MDS and tSNE all split the data into these two expected groups (albeit with different layout details). UMAP puts all the points onto a roughly one-dimensional diagonal line with no real separation between groups, as shown below:
This sort of pattern seems to happen for various settings of n_neighbors (I tried turning it right down) and min_dist. I also tried init='random', which didn’t change anything. Should I try anything else?
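For anyone wanting to repeat this kind of sweep, here is a minimal sketch. It assumes the umap-learn package is installed and that X is an array-like with one sample per row. The helper names parameter_grid and embed_all, the particular values tried, and the fixed random_state are illustrative choices, not taken from the original report.

```python
from itertools import product


def parameter_grid(
    n_neighbors=(2, 3, 5, 15),
    min_dist=(0.0, 0.1, 0.5),
    init=("spectral", "random"),
):
    """Return every combination of the settings as a list of keyword dicts."""
    return [
        {"n_neighbors": k, "min_dist": d, "init": i}
        for k, d, i in product(n_neighbors, min_dist, init)
    ]


def embed_all(X, grid, random_state=42):
    """Fit one 2D UMAP embedding per setting.

    Returns a dict keyed by (n_neighbors, min_dist, init).
    """
    import umap  # provided by umap-learn; imported lazily so the grid helper works without it

    results = {}
    for params in grid:
        reducer = umap.UMAP(n_components=2, random_state=random_state, **params)
        results[tuple(params.values())] = reducer.fit_transform(X)
    return results
```

Calling embed_all(X, parameter_grid()) and plotting each result side by side makes it easier to see whether the diagonal line appears for every setting or only for some of them.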
There are quite a lot of zeroes in the data (which could be a cause of the odd behaviour?). Here’s a small slice of the data which doesn’t include all the points in the plot but does seem to show similar behaviour:
0.0 0.0 0.0 85.0 1.0 5.0 4092.0 5427.0 3.0
19.0 0.0 0.0 978.0 1.0 1.0 8739.0 16150.0 1.0
13.0 3.0 1.0 643.0 0.0 0.0 6669.0 11310.0 1.0
0.0 0.0 0.0 7107.0 3.0 11.0 6128.0 2054.0 0.0
0.0 0.0 0.0 5297.0 1.0 14.0 4769.0 1450.0 0.0
19220.0 2614.0 1666.0 12030.0 4700.0 1.0 0.0 0.0 0.0
9805.0 1548.0 1001.0 5820.0 2261.0 0.0 0.0 0.0 0.0
8859.0 1444.0 643.0 6157.0 2061.0 1.0 0.0 0.0 0.0
10740.0 1458.0 837.0 7858.0 2998.0 1.0 1.0 0.0 0.0
12030.0 1634.0 988.0 8304.0 3080.0 0.0 0.0 0.0 0.0
Each row here is a sample; it’s probably obvious to the eye that the first five of these rows are from one experimental group, and the second five are from the other.
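As a quick sanity check on the slice above, one can ask whether each row's nearest neighbours (by Euclidean distance on the raw values) fall in its own group, since a neighbour graph is what UMAP starts from. This is only a sketch: the group labels follow the description above (first five rows versus last five), and the function names and the choice of k=2 are my own assumptions.

```python
import math

ROWS = [
    [0.0, 0.0, 0.0, 85.0, 1.0, 5.0, 4092.0, 5427.0, 3.0],
    [19.0, 0.0, 0.0, 978.0, 1.0, 1.0, 8739.0, 16150.0, 1.0],
    [13.0, 3.0, 1.0, 643.0, 0.0, 0.0, 6669.0, 11310.0, 1.0],
    [0.0, 0.0, 0.0, 7107.0, 3.0, 11.0, 6128.0, 2054.0, 0.0],
    [0.0, 0.0, 0.0, 5297.0, 1.0, 14.0, 4769.0, 1450.0, 0.0],
    [19220.0, 2614.0, 1666.0, 12030.0, 4700.0, 1.0, 0.0, 0.0, 0.0],
    [9805.0, 1548.0, 1001.0, 5820.0, 2261.0, 0.0, 0.0, 0.0, 0.0],
    [8859.0, 1444.0, 643.0, 6157.0, 2061.0, 1.0, 0.0, 0.0, 0.0],
    [10740.0, 1458.0, 837.0, 7858.0, 2998.0, 1.0, 1.0, 0.0, 0.0],
    [12030.0, 1634.0, 988.0, 8304.0, 3080.0, 0.0, 0.0, 0.0, 0.0],
]
# Assumed labels: first five rows are one group, last five the other.
GROUPS = [0] * 5 + [1] * 5


def nearest_neighbors(rows, k):
    """For each row, return the indices of its k nearest other rows."""
    result = []
    for i, row in enumerate(rows):
        others = [j for j in range(len(rows)) if j != i]
        others.sort(key=lambda j: math.dist(row, rows[j]))
        result.append(others[:k])
    return result


def neighbors_stay_in_group(rows, groups, k):
    """True if every row's k nearest neighbours share its group label."""
    return all(
        groups[j] == groups[i]
        for i, nbrs in enumerate(nearest_neighbors(rows, k))
        for j in nbrs
    )


print(neighbors_stay_in_group(ROWS, GROUPS, k=2))
```

If the check holds for small k, the neighbour graph itself already reflects the two groups, which would point at the layout step rather than the data.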
@lmcinnes I’d be happy to send you the full dataset over a less public channel if that would help!
For contrast here is a tSNE plot for the same data:
PCA also creates a 2D layout and separates the blue and orange points (they are separated along the first principal component).
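To make "separated along the first principal component" concrete, here is a small dependency-free sketch that projects samples onto the first principal component using power iteration. It is a stand-in for illustration only (in practice a library PCA would be the usual choice), and the function name first_pc_scores and the iteration count are assumptions of mine.

```python
import math


def first_pc_scores(rows, iterations=500):
    """Project each row onto the first principal component of the data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(d)]
    centered = [[r[c] - means[c] for c in range(d)] for r in rows]

    # Start from the axis with the largest variance.
    variances = [sum(row[c] ** 2 for row in centered) for c in range(d)]
    v = [0.0] * d
    v[variances.index(max(variances))] = 1.0

    for _ in range(iterations):
        # One power-iteration step: v <- X^T (X v), then normalise.
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(s * row[c] for s, row in zip(scores, centered)) for c in range(d)]
        norm = math.sqrt(sum(w * w for w in v))
        if norm == 0.0:  # constant data: no direction of variance
            break
        v = [w / norm for w in v]

    return [sum(x * w for x, w in zip(row, v)) for row in centered]
```

Two groups are separated along this component when all scores from one group lie below all scores from the other. The sign of the component is arbitrary, so either group may end up on the low side.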
Is this behaviour expected or pathological? In particular the fact that the points all lie roughly along a diagonal line for the UMAP embedding seems odd.
Issue Analytics
- Created 5 years ago
- Comments:6 (4 by maintainers)
Top GitHub Comments
FWIW I ran the UMAP implementation in uwot with n_neighbors = 3 for the provided data and it did separate the data into two clusters, rather than a line. Does the problem manifest with the sample data?

Great! Thanks @claresloggett for making it easy, and thanks @jlmelville for letting me know that it was a problem in my version.
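Since the odd layout turned out to be specific to one implementation's version, it may help to note which version you have when comparing results. The thread does not say which release had the problem or the fix, so this sketch only reports what is installed. It assumes the Python package is distributed as umap-learn, and the helper name installed_version is my own.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# "umap-learn" is the distribution that provides the `umap` module.
print("umap-learn:", installed_version("umap-learn"))
```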