Odd results on some datasets

I’ve been trying UMAP out and I have one real dataset in particular where it does not seem to recover “obvious” structure. I thought it might be worth reporting in case it is a pathological case - or in case I am doing something wrong!

In this dataset there are two distinct groups. I’ve tried a few dimensionality reduction methods: PCA, metric MDS and tSNE all split the data into these two expected groups (albeit with different layout details). UMAP puts all the points onto a roughly one-dimensional diagonal line with no real separation between groups, as shown below:

[Image: pathological_plot]

This sort of pattern seems to happen for various settings of n_neighbors (I tried turning it right down) and min_dist. I also tried init='random', which didn’t change anything. Should I try anything else?

There are quite a lot of zeroes in the data (which could be a cause of the odd behaviour?). Here’s a small slice of the data which doesn’t include all the points in the plot but does seem to show similar behaviour:

0.0	0.0	0.0	85.0	1.0	5.0	4092.0	5427.0	3.0
19.0	0.0	0.0	978.0	1.0	1.0	8739.0	16150.0	1.0
13.0	3.0	1.0	643.0	0.0	0.0	6669.0	11310.0	1.0
0.0	0.0	0.0	7107.0	3.0	11.0	6128.0	2054.0	0.0
0.0	0.0	0.0	5297.0	1.0	14.0	4769.0	1450.0	0.0
19220.0	2614.0	1666.0	12030.0	4700.0	1.0	0.0	0.0	0.0
9805.0	1548.0	1001.0	5820.0	2261.0	0.0	0.0	0.0	0.0
8859.0	1444.0	643.0	6157.0	2061.0	1.0	0.0	0.0	0.0
10740.0	1458.0	837.0	7858.0	2998.0	1.0	1.0	0.0	0.0
12030.0	1634.0	988.0	8304.0	3080.0	0.0	0.0	0.0	0.0

Each row here is a sample; it’s probably obvious to the eye that the first five of these rows are from one experimental group, and the second five are from the other.

@lmcinnes I’d be happy to send you the full dataset over a less public channel if that would help!

For contrast here is a tSNE plot for the same data:

[Image: tsne_comparison]

PCA also creates a 2D layout and separates the blue and orange points (they are separated along the first principal component).
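
For anyone who wants to check the PCA claim on just the ten-row slice quoted above, here is a minimal sketch. It uses only the Python standard library and finds the first principal component by power iteration on the covariance matrix. The names `ROWS` and `first_pc_scores` are made up for illustration, and this is not the code that produced the plots in this issue. It only covers the slice, not the full dataset.

```python
import math

# The ten sample rows from the issue: rows 0-4 are one group, rows 5-9 the other.
ROWS = [
    [0.0, 0.0, 0.0, 85.0, 1.0, 5.0, 4092.0, 5427.0, 3.0],
    [19.0, 0.0, 0.0, 978.0, 1.0, 1.0, 8739.0, 16150.0, 1.0],
    [13.0, 3.0, 1.0, 643.0, 0.0, 0.0, 6669.0, 11310.0, 1.0],
    [0.0, 0.0, 0.0, 7107.0, 3.0, 11.0, 6128.0, 2054.0, 0.0],
    [0.0, 0.0, 0.0, 5297.0, 1.0, 14.0, 4769.0, 1450.0, 0.0],
    [19220.0, 2614.0, 1666.0, 12030.0, 4700.0, 1.0, 0.0, 0.0, 0.0],
    [9805.0, 1548.0, 1001.0, 5820.0, 2261.0, 0.0, 0.0, 0.0, 0.0],
    [8859.0, 1444.0, 643.0, 6157.0, 2061.0, 1.0, 0.0, 0.0, 0.0],
    [10740.0, 1458.0, 837.0, 7858.0, 2998.0, 1.0, 1.0, 0.0, 0.0],
    [12030.0, 1634.0, 988.0, 8304.0, 3080.0, 0.0, 0.0, 0.0, 0.0],
]


def first_pc_scores(rows, iterations=200):
    """Return each row's coordinate along the first principal component."""
    n, d = len(rows), len(rows[0])
    # Mean-centre every column.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration: repeatedly apply cov and renormalise to approach
    # the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project the centred rows onto that direction.
    return [sum(c[j] * v[j] for j in range(d)) for c in centred]


scores = first_pc_scores(ROWS)
for i, s in enumerate(scores):
    print(f"row {i}: PC1 = {s:.1f}")
```

The sign of a principal component is arbitrary, so which group lands on the positive side is not meaningful. The thing to look for is whether rows 0-4 and rows 5-9 fall on opposite sides.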

Is this behaviour expected or pathological? In particular the fact that the points all lie roughly along a diagonal line for the UMAP embedding seems odd.

Issue Analytics

Created 5 years ago
Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction

jlmelville commented, Nov 27, 2018

FWIW I ran the UMAP implementation in uwot with n_neighbors = 3 for the provided data and it did separate the data into two clusters, rather than a line. Does the problem manifest with the sample data?
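
UMAP starts from a k-nearest-neighbour graph, so one quick check on the sample rows is to list each row's nearest neighbours for a small k such as the n_neighbors = 3 mentioned above. The following is a rough sketch, assuming plain Euclidean distance on the raw, unscaled values and a brute-force search. The names `ROWS` and `nearest_neighbours` are hypothetical, and this is not how uwot or any other UMAP implementation is written internally.

```python
import math

# The ten sample rows from the issue: rows 0-4 are one group, rows 5-9 the other.
ROWS = [
    [0.0, 0.0, 0.0, 85.0, 1.0, 5.0, 4092.0, 5427.0, 3.0],
    [19.0, 0.0, 0.0, 978.0, 1.0, 1.0, 8739.0, 16150.0, 1.0],
    [13.0, 3.0, 1.0, 643.0, 0.0, 0.0, 6669.0, 11310.0, 1.0],
    [0.0, 0.0, 0.0, 7107.0, 3.0, 11.0, 6128.0, 2054.0, 0.0],
    [0.0, 0.0, 0.0, 5297.0, 1.0, 14.0, 4769.0, 1450.0, 0.0],
    [19220.0, 2614.0, 1666.0, 12030.0, 4700.0, 1.0, 0.0, 0.0, 0.0],
    [9805.0, 1548.0, 1001.0, 5820.0, 2261.0, 0.0, 0.0, 0.0, 0.0],
    [8859.0, 1444.0, 643.0, 6157.0, 2061.0, 1.0, 0.0, 0.0, 0.0],
    [10740.0, 1458.0, 837.0, 7858.0, 2998.0, 1.0, 1.0, 0.0, 0.0],
    [12030.0, 1634.0, 988.0, 8304.0, 3080.0, 0.0, 0.0, 0.0, 0.0],
]


def nearest_neighbours(rows, k):
    """For each row, return the indices of its k nearest other rows (Euclidean)."""
    result = []
    for i, row in enumerate(rows):
        # Distance to every other row, sorted from closest to farthest.
        dists = sorted(
            (math.dist(row, other), j)
            for j, other in enumerate(rows)
            if j != i
        )
        result.append([j for _, j in dists[:k]])
    return result


neighbours = nearest_neighbours(ROWS, 3)
for i, nbrs in enumerate(neighbours):
    print(f"row {i}: nearest rows {nbrs}")
```

If most neighbour lists stay within 0-4 or within 5-9, the neighbour graph itself already reflects the two groups, which would point to the layout step rather than the input data as the place to look.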

0 reactions

lmcinnes commented, Nov 27, 2018

Great! Thanks @claresloggett for making it easy, and thanks @jlmelville for letting me know that it was a problem in my version.
