
Supervised dimension reduction overfitting


Apologies if this is a dumb question, but what’s the best way to gauge the level of overfitting when using the supervised approach?

In the example below, the separation of points is very clear despite purely random input. I understand the reason for it, but I don’t know how to assess the extent to which the separation is driven by overfitting versus by “real” differences in the data generating process. Any suggestions? Masking some samples works somewhat, but only if enough samples are left for fitting after masking, which there aren’t in my data.

import umap
import numpy as np
from matplotlib import pyplot as plt

testSamples = 400
randomrows = np.random.randint(0, 2, size=(testSamples, 50))
testMetadata = np.random.randint(0, 2, size=testSamples)
fitter = umap.UMAP(n_neighbors=25, min_dist=0.1, metric='hamming').fit(randomrows, y=testMetadata)

# uncomment these to run semi-supervised (a label of -1 marks a point as unlabelled)
# testMetadata_masked = testMetadata.copy()  # copy, otherwise the masking also overwrites testMetadata
# testMetadata_masked[np.random.choice(len(testMetadata_masked), size=50, replace=False)] = -1
# fitter = umap.UMAP(n_neighbors=25, min_dist=0.1, metric='hamming').fit(randomrows, y=testMetadata_masked)

embedding = fitter.embedding_

plt.scatter(embedding[:, 0], embedding[:, 1], s=5, c=testMetadata)
plt.show()
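
One way to make the masking idea concrete (a sketch, not from the original issue) is to fit on a training split and then place a held-out split with fitter.transform(), which never sees the held-out labels; on purely random data the held-out points should not separate by label. Continuing from the snippet above:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    randomrows, testMetadata, test_size=0.25, random_state=0)

# fit supervised UMAP on the training split only
heldout_fitter = umap.UMAP(n_neighbors=25, min_dist=0.1, metric='hamming').fit(X_train, y=y_train)
test_embedding = heldout_fitter.transform(X_test)  # no labels are used here

plt.scatter(test_embedding[:, 0], test_embedding[:, 1], s=5, c=y_test)
plt.show()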

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
jc-healy commented, Oct 19, 2021

That is one way to think about it, but I think it’s a bit misleading. A better way to think about it is that as you drop n_neighbors you weaken all of your signal and start fitting the fiddly bits of your manifold. This definitely weakens your supervised signal and prevents it from dominating your space, but it also weakens the signal from your original space. Remember, you are experimenting in the presence of no structure here.

I think the best way to think about fully supervised UMAP is that you’ve got two embeddings and you are folding them together. One has perfect clustering and the other is random noise. I’m going to represent these embeddings via nearest-neighbour graphs (which I’ll fold together). As I reduce n_neighbors I induce fewer edges in both graphs. That leaves fewer of those very consistent supervised edges relative to the random edges, so it becomes more likely that a few of the random edges agree with our supervised edges and you get that mixing of structure.
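
To see that mixing directly, here is a small sketch (not from the original thread) that sweeps n_neighbors in the fully supervised setting on random data like that in the question; per the reasoning above, the two label blobs should begin to bleed into each other as n_neighbors shrinks:

import umap
import numpy as np
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 50))  # random binary data, as in the question
y = rng.integers(0, 2, size=400)        # random binary labels

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, k in zip(axes, (25, 10, 5)):
    # fully supervised fit at each neighbourhood size
    emb = umap.UMAP(n_neighbors=k, min_dist=0.1, metric='hamming').fit_transform(X, y=y)
    ax.scatter(emb[:, 0], emb[:, 1], s=5, c=y)
    ax.set_title(f"n_neighbors={k}")
plt.show()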

I’d recommend exploring this trade-off by examining the effect these settings have in the presence of structured data and labels. I should probably turn this into a Read the Docs page, but finding time for such things can be challenging. In the meantime, here is a slight modification of your code that might help provide some intuition.

import umap
import umap.plot
import numpy as np
from sklearn.datasets import make_swiss_roll

testSamples = 400
X, t = make_swiss_roll(testSamples)  # data points and their position along the roll
testMetadata = np.random.randint(0, 2, size=testSamples)
fitter = umap.UMAP(n_neighbors=15, min_dist=0.1, target_weight=0).fit(X, y=testMetadata)

umap.plot.points(fitter, values=t, theme='fire', width=400, height=400)
umap.plot.points(fitter, labels=testMetadata, theme='fire', width=400, height=400)

[Figures: the swiss-roll embedding coloured by position along the roll (values=t) and by the random labels (n_neighbors=15)]

And here are the same images with n_neighbors turned down to 5. You’ll see that while we are indeed weakening the supervised structure, we are also weakening and tearing apart our unsupervised structure.

[Figures: the same two plots with n_neighbors=5]
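
Those lower-n_neighbors figures come from re-running the snippet above with n_neighbors=5 (a sketch, reusing X, t, and testMetadata from above):

fitter5 = umap.UMAP(n_neighbors=5, min_dist=0.1, target_weight=0).fit(X, y=testMetadata)
umap.plot.points(fitter5, values=t, theme='fire', width=400, height=400)
umap.plot.points(fitter5, labels=testMetadata, theme='fire', width=400, height=400)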

Hopefully this is helpful.

1 reaction
jc-healy commented, Oct 19, 2021

Another way you could approach this, with less extreme data, would be to use the target_weight parameter and drive it down to 0 to de-emphasize your supervised distances. Unfortunately, there is so little structure contained within your data that the clouds still separate into two distinct blobs even when setting target_weight to its minimum value of 0. That said, it’s an easy option when you’ve got less extreme data. From the docstring:

target_weight: float (optional, default 0.5)
    Weighting factor between data topology and target topology. A value of 0.0 weights predominantly on data, a value of 1.0 places a strong emphasis on target. The default of 0.5 balances the weighting equally between data and target.
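
Concretely, that just means passing target_weight to the constructor; a minimal sketch, reusing the random data from the question:

fitter = umap.UMAP(n_neighbors=25, min_dist=0.1, metric='hamming',
                   target_weight=0.0).fit(randomrows, y=testMetadata)
# Even at target_weight=0.0 the labels are not ignored entirely (0.0 weights
# *predominantly* on data), so on purely random data the blobs can still separate.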

