Odd results on some datasets

I’ve been trying UMAP out and I have one real dataset in particular where it does not seem to recover “obvious” structure. I thought it might be worth reporting in case it is a pathological case - or in case I am doing something wrong!

In this dataset there are two distinct groups. I’ve tried a few dimensionality reduction methods: PCA, metric MDS and tSNE all split the data into these two expected groups (albeit with different layout details). UMAP puts all the points onto a roughly one-dimensional diagonal line with no real separation between groups, as shown below:

[Figure: UMAP embedding of the dataset (pathological_plot)]

This sort of pattern seems to happen for various settings of n_neighbors (I tried turning it right down) and min_dist. I also tried init='random', which didn’t change anything. Should I try anything else?

There are quite a lot of zeroes in the data (which could be a cause of the odd behaviour?). Here’s a small slice of the data which doesn’t include all the points in the plot but does seem to show similar behaviour:

0.0	0.0	0.0	85.0	1.0	5.0	4092.0	5427.0	3.0
19.0	0.0	0.0	978.0	1.0	1.0	8739.0	16150.0	1.0
13.0	3.0	1.0	643.0	0.0	0.0	6669.0	11310.0	1.0
0.0	0.0	0.0	7107.0	3.0	11.0	6128.0	2054.0	0.0
0.0	0.0	0.0	5297.0	1.0	14.0	4769.0	1450.0	0.0
19220.0	2614.0	1666.0	12030.0	4700.0	1.0	0.0	0.0	0.0
9805.0	1548.0	1001.0	5820.0	2261.0	0.0	0.0	0.0	0.0
8859.0	1444.0	643.0	6157.0	2061.0	1.0	0.0	0.0	0.0
10740.0	1458.0	837.0	7858.0	2998.0	1.0	1.0	0.0	0.0
12030.0	1634.0	988.0	8304.0	3080.0	0.0	0.0	0.0	0.0

Each row here is a sample; it’s probably obvious to the eye that the first five of these rows are from one experimental group, and the second five are from the other.
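
As a rough sanity check on the ten-row slice above, the sketch below counts the zero entries and tests whether each row's nearest neighbour under Euclidean distance (UMAP's default metric) falls in the same group. The group labels 0 and 1 (first five rows, last five rows) and all variable names are assumptions made for illustration. It uses only the Python standard library and says nothing about the full dataset.

```python
import math
from itertools import combinations

# The ten sample rows quoted in the issue, one sample per line.
RAW = """\
0.0 0.0 0.0 85.0 1.0 5.0 4092.0 5427.0 3.0
19.0 0.0 0.0 978.0 1.0 1.0 8739.0 16150.0 1.0
13.0 3.0 1.0 643.0 0.0 0.0 6669.0 11310.0 1.0
0.0 0.0 0.0 7107.0 3.0 11.0 6128.0 2054.0 0.0
0.0 0.0 0.0 5297.0 1.0 14.0 4769.0 1450.0 0.0
19220.0 2614.0 1666.0 12030.0 4700.0 1.0 0.0 0.0 0.0
9805.0 1548.0 1001.0 5820.0 2261.0 0.0 0.0 0.0 0.0
8859.0 1444.0 643.0 6157.0 2061.0 1.0 0.0 0.0 0.0
10740.0 1458.0 837.0 7858.0 2998.0 1.0 1.0 0.0 0.0
12030.0 1634.0 988.0 8304.0 3080.0 0.0 0.0 0.0 0.0
"""

rows = [[float(v) for v in line.split()] for line in RAW.strip().splitlines()]
# Assumed labels: first five rows are one group, last five the other.
groups = [0] * 5 + [1] * 5

# Share of entries that are exactly zero.
n_values = len(rows) * len(rows[0])
zero_fraction = sum(v == 0.0 for row in rows for v in row) / n_values


def nearest_neighbour(i):
    """Index of the row closest to row i in Euclidean distance, excluding i."""
    others = (j for j in range(len(rows)) if j != i)
    return min(others, key=lambda j: math.dist(rows[i], rows[j]))


# For each row, does its single nearest neighbour share its group label?
nn_same_group = [groups[nearest_neighbour(i)] == groups[i] for i in range(len(rows))]


def mean_distance(pairs):
    """Mean Euclidean distance over an iterable of (i, j) index pairs."""
    pairs = list(pairs)
    return sum(math.dist(rows[i], rows[j]) for i, j in pairs) / len(pairs)


all_pairs = list(combinations(range(len(rows)), 2))
within = mean_distance(p for p in all_pairs if groups[p[0]] == groups[p[1]])
between = mean_distance(p for p in all_pairs if groups[p[0]] != groups[p[1]])

print(f"fraction of zero entries: {zero_fraction:.2f}")
print(f"nearest neighbour in same group: {sum(nn_same_group)}/{len(rows)}")
print(f"mean distance within groups: {within:.0f}, between groups: {between:.0f}")
```

If the nearest neighbours respect the grouping on the raw values, that would hint that the neighbour structure itself is not where the separation is lost, although ten rows are far too few to draw a firm conclusion.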

@lmcinnes I’d be happy to send you the full dataset over a less public channel if that would help!

For contrast here is a tSNE plot for the same data:

[Figure: tSNE embedding of the same data (tsne_comparison)]

PCA also creates a 2D layout and separates the blue and orange points (they are separated along the first principal component).
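
The PCA observation can be checked on the ten-row sample from earlier in this issue with a minimal sketch like the one below. It assumes NumPy is available, computes PCA through an SVD of the mean-centred data, and tests whether the two groups occupy non-overlapping ranges along the first principal component. The labels 0 and 1 (first five rows, last five rows) and the variable names are assumptions for illustration, and the result applies to this slice only, not to the full dataset.

```python
import numpy as np

# The ten sample rows quoted in the issue; rows are samples, columns are features.
X = np.array([
    [0.0, 0.0, 0.0, 85.0, 1.0, 5.0, 4092.0, 5427.0, 3.0],
    [19.0, 0.0, 0.0, 978.0, 1.0, 1.0, 8739.0, 16150.0, 1.0],
    [13.0, 3.0, 1.0, 643.0, 0.0, 0.0, 6669.0, 11310.0, 1.0],
    [0.0, 0.0, 0.0, 7107.0, 3.0, 11.0, 6128.0, 2054.0, 0.0],
    [0.0, 0.0, 0.0, 5297.0, 1.0, 14.0, 4769.0, 1450.0, 0.0],
    [19220.0, 2614.0, 1666.0, 12030.0, 4700.0, 1.0, 0.0, 0.0, 0.0],
    [9805.0, 1548.0, 1001.0, 5820.0, 2261.0, 0.0, 0.0, 0.0, 0.0],
    [8859.0, 1444.0, 643.0, 6157.0, 2061.0, 1.0, 0.0, 0.0, 0.0],
    [10740.0, 1458.0, 837.0, 7858.0, 2998.0, 1.0, 1.0, 0.0, 0.0],
    [12030.0, 1634.0, 988.0, 8304.0, 3080.0, 0.0, 0.0, 0.0, 0.0],
])
# Assumed labels: first five rows are one group, last five the other.
groups = np.array([0] * 5 + [1] * 5)

# PCA via SVD: centre each column, then the rows of Vt are the principal axes.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Coordinates of every sample along the first principal component.
pc1 = Xc @ Vt[0]

# Fraction of total variance carried by each component.
explained = S**2 / np.sum(S**2)

# The sign of a principal axis is arbitrary, so test for non-overlap either way round.
a, b = pc1[groups == 0], pc1[groups == 1]
separated = bool(a.max() < b.min() or b.max() < a.min())

print("PC1 coordinates:", np.round(pc1))
print("variance explained by PC1:", round(float(explained[0]), 2))
print("groups separated along PC1:", separated)
```

No feature scaling is applied here, so the large-magnitude columns dominate the first component. Whether the PCA run described above scaled the data is not stated.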

Is this behaviour expected or pathological? In particular the fact that the points all lie roughly along a diagonal line for the UMAP embedding seems odd.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
jlmelville commented, Nov 27, 2018

FWIW I ran the UMAP implementation in uwot with n_neighbors = 3 for the provided data and it did separate the data into two clusters, rather than a line. Does the problem manifest with the sample data?

0 reactions
lmcinnes commented, Nov 27, 2018

Great! Thanks @claresloggett for making it easy, and thanks @jlmelville for letting me know that it was a problem in my version.
