How can we apply the Gower metric to UMAP?

From my rough experiments, if the custom metric is the Gower metric, the distance matrix for all points in the dataset can be computed for both numerical and categorical data. However, this seems simplest when the Gower metric is used only to precompute the distance matrix for the entire dataset, i.e. with

umap.UMAP(metric="precomputed").fit_transform(precomputed_distances)

While it is possible to compute the distance matrix for a dataset beforehand, metric="precomputed" is unsuitable when new data must later be embedded for inference, since it doesn't allow calling .transform on new data.
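For reference, here is a minimal sketch of how the precomputed matrix could be built. It uses the textbook Gower definition with equal feature weights and no missing values; the function name gower_matrix and the is_categorical flag list are illustrative only and are not part of UMAP.

```python
def gower_matrix(rows, is_categorical):
    """Pairwise Gower distances for a list of equal-length records.

    Assumes equal feature weights and no missing values.
    """
    n_features = len(is_categorical)

    # Range of each numeric feature over the data set (None for categorical).
    ranges = []
    for j in range(n_features):
        if is_categorical[j]:
            ranges.append(None)
        else:
            column = [row[j] for row in rows]
            ranges.append(max(column) - min(column))

    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            total = 0.0
            for j in range(n_features):
                if is_categorical[j]:
                    # Categorical: 0 if the values match, 1 otherwise.
                    total += 0.0 if rows[a][j] == rows[b][j] else 1.0
                elif ranges[j] > 0:
                    # Numeric: absolute difference scaled by the feature range.
                    total += abs(rows[a][j] - rows[b][j]) / ranges[j]
            dist[a][b] = dist[b][a] = total / n_features
    return dist


# Toy mixed data: one numeric column, one categorical column.
rows = [
    (1.0, "red"),
    (3.0, "red"),
    (5.0, "blue"),
]
precomputed_distances = gower_matrix(rows, is_categorical=[False, True])
```

The nested list can be converted with numpy.asarray and passed to the metric="precomputed" call shown earlier. It still covers only the rows it was computed from, which is exactly the limitation described here.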

What I want is a metric that can be plugged into umap.UMAP() and that handles both numerical and categorical features.

From the examples in the docs, the metric appears to be computed for each pair of points separately (i.e. such a metric returns distance(point1, point2)).
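A pairwise form of Gower could look roughly like the sketch below. It assumes categorical values are already integer-encoded and that numeric ranges are computed once from the training data and then frozen, so the same function can compare a new point against training points. The names make_gower_pair and distance are hypothetical, and clipping the scaled difference at 1 is my own choice to keep out-of-range points bounded. As far as I understand, UMAP expects a custom metric to be compilable by numba, so this pure-Python version shows only the logic and would need adapting before being passed as metric.

```python
def make_gower_pair(train_rows, is_categorical):
    """Build distance(x, y) with numeric scaling frozen from train_rows."""
    ranges = []
    for j, categorical in enumerate(is_categorical):
        if categorical:
            ranges.append(0.0)  # unused for categorical features
        else:
            column = [row[j] for row in train_rows]
            ranges.append(max(column) - min(column))

    def distance(x, y):
        total = 0.0
        for j, categorical in enumerate(is_categorical):
            if categorical:
                # Categorical: 0 if the encoded values match, 1 otherwise.
                total += 0.0 if x[j] == y[j] else 1.0
            elif ranges[j] > 0:
                # Numeric: range-scaled difference, clipped at 1 (assumption)
                # so points outside the training range stay bounded.
                total += min(abs(x[j] - y[j]) / ranges[j], 1.0)
        return total / len(is_categorical)

    return distance


# Column 0 is numeric; column 1 is an integer-encoded category.
train = [(1.0, 0), (3.0, 0), (5.0, 1)]
gower = make_gower_pair(train, is_categorical=[False, True])

# Distances from an unseen point to every training point.
new_point = (4.0, 1)
distances_to_train = [gower(new_point, row) for row in train]
```

Freezing the ranges at fit time is what makes the distance between a test point and a training point well defined, which any transform step would rely on.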

How could one use the Gower distance metric both for fitting on training data and for transforming test data?

Or is transform for mixed datasets currently still unsupported despite the above?

This is important for me since I’m trying to use UMAP for dimensionality reduction on complex mixed datasets for inference/classification.

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments:9 (1 by maintainers)

Top GitHub Comments

2 reactions
simeng-yang commented, Mar 31, 2020

Hey @AdamSpannbauer, because of the issues above (UMAP is not directly suited to mixed datasets and has non-negligible runtime overhead compared to some simpler methods), I did not pursue this path any further.

Instead, I ended up investigating FAMD (Factor Analysis of Mixed Data), which combines linear techniques to handle both numerical and categorical data. You might be interested in taking a look there.

However, if you do want to further explore the option of creating a custom implementation of the Gower metric for UMAP, you may wish to refer to these existing standalone Gower metric implementations and try to “refit” those implementations to work with UMAP.

You would also have to develop the proper checks to handle mixed datasets with object columns. You can see here for an example of adding a distance metric: lmcinnes/pynndescent#86 (credit to @sleighsoft).

I think this would still be a worthwhile endeavor. Mixed datasets are very prevalent in a wide variety of data analysis situations.

0 reactions
AdamSpannbauer commented, Mar 31, 2020

Hi @simeng-yang, were you able to successfully implement Gower for UMAP? I’m interested in exploring the same thing, and I’d be very interested to see your implementation before starting from scratch.

