How can we apply the Gower metric to UMAP?
From my rough work, if we let the custom metric be the Gower metric, the distance matrix for all points in the dataset can be computed for both numerical and categorical data. However, this seems simplest when the Gower metric is used only to precompute the distance matrix for the entire dataset, i.e. with
umap.UMAP(metric="precomputed").fit_transform(precomputed_distances)
While it is possible to compute the distance matrix for a dataset beforehand, metric="precomputed" is unsuitable for inference, because a model fitted on a precomputed matrix doesn't allow a .transform on the embedding for new data.
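For reference, the precomputed route described above can be sketched roughly as follows. This is a minimal numpy sketch, not a library implementation: the helper name gower_matrix, the split into X_num and X_cat, and the integer coding of categories are all assumptions made for illustration.

```python
import numpy as np


def gower_matrix(X_num, X_cat):
    """Pairwise Gower distances (hypothetical helper, not part of umap-learn).

    X_num: numeric features, shape (n, p_num)
    X_cat: integer-coded categorical features, shape (n, p_cat)
    """
    X_num = np.asarray(X_num, dtype=float)
    X_cat = np.asarray(X_cat)
    n_features = X_num.shape[1] + X_cat.shape[1]

    # Numeric part: absolute difference scaled by each column's range.
    col_range = X_num.max(axis=0) - X_num.min(axis=0)
    col_range[col_range == 0] = 1.0  # constant columns then contribute 0
    scaled = X_num / col_range
    num_part = np.abs(scaled[:, None, :] - scaled[None, :, :]).sum(axis=2)

    # Categorical part: 0 where the codes match, 1 where they differ.
    cat_part = (X_cat[:, None, :] != X_cat[None, :, :]).sum(axis=2)

    # Unweighted average over all features.
    return (num_part + cat_part) / n_features


X_num = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_cat = np.array([[0], [0], [1]])
D = gower_matrix(X_num, X_cat)

# With umap-learn installed and a realistically sized dataset,
# D would then be passed as in the call above:
# embedding = umap.UMAP(metric="precomputed").fit_transform(D)
```

The broadcasting here builds an n x n x p intermediate array, so this only suits small datasets; it is meant to show the computation, not to be efficient.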
What I think I want is a metric that can be plugged into umap.UMAP() and that handles both numerical and categorical features.
From the examples in the docs, it seems the metric is used to compute the distance between each pair of points separately (i.e. such a metric returns distance(point1, point2)).
I’m wondering how one could use the Gower distance metric for both fitting against training data and transforming on test data?
Or is transform for mixed datasets currently still unsupported despite the above?
This is important for me since I’m trying to use UMAP for dimensionality reduction on complex mixed datasets for inference/classification.
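One possible shape for such a pairwise metric is sketched below. It assumes the numeric columns come first, that categories are already integer-coded with the same coding for training and test data, and that the numeric ranges are learned from the training data only so the same scaling can be reused at transform time. The names fit_range_scaler, encode_rows and make_gower are hypothetical, and this is not an existing UMAP feature.

```python
import numpy as np


def fit_range_scaler(X_num_train):
    """Learn per-column minimum and range from the training data only."""
    X_num_train = np.asarray(X_num_train, dtype=float)
    col_min = X_num_train.min(axis=0)
    col_range = X_num_train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero
    return col_min, col_range


def encode_rows(X_num, X_cat_codes, col_min, col_range):
    """Stack range-scaled numeric columns and category codes in one float array."""
    scaled = (np.asarray(X_num, dtype=float) - col_min) / col_range
    return np.hstack([scaled, np.asarray(X_cat_codes, dtype=float)])


def make_gower(n_num):
    """Build distance(point1, point2); the first n_num columns are numeric."""
    def gower(a, b):
        total = 0.0
        for i in range(a.shape[0]):
            if i < n_num:
                total += abs(a[i] - b[i])  # already range-scaled
            elif a[i] != b[i]:
                total += 1.0  # categorical mismatch
        return total / a.shape[0]
    return gower


col_min, col_range = fit_range_scaler([[1.0, 10.0], [3.0, 30.0]])
train = encode_rows([[1.0, 10.0], [3.0, 30.0]], [[0], [1]], col_min, col_range)
new = encode_rows([[2.0, 20.0]], [[1]], col_min, col_range)

gower = make_gower(n_num=2)
d_train = gower(train[0], train[1])
d_new = gower(new[0], train[1])

# Assumption, untested here: UMAP expects custom metrics to be
# numba-compiled, so with numba and umap-learn installed one might try
# metric = numba.njit(make_gower(n_num=2))
# reducer = umap.UMAP(metric=metric).fit(train)
# reducer.transform(new)
```

Because the scaling lives in the encoding step and not in the metric, the distance function needs no access to the full dataset. Test values outside the training range can produce per-feature distances above 1, which may or may not be acceptable.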
Top GitHub Comments
Hey @AdamSpannbauer, because of the issues above (UMAP is not directly suitable for mixed datasets and has non-negligible runtime overhead compared to some simpler methods), I did not make any further progress on this path.
Notably, I ended up investigating FAMD - Factor Analysis of Mixed Data - instead, which is a union of linear techniques that can handle both numerical and categorical data. Perhaps you might be interested in taking a look there.
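To give a rough idea of what FAMD does, here is a simplified numpy sketch: numeric columns are standardized, each one-hot indicator is divided by the square root of its category proportion, and a PCA (via SVD) is run on the combined, centred matrix. The class name SimpleFAMD and its interface are made up for illustration, categories are assumed to be integer-coded, and a tested library would be preferable for real work.

```python
import numpy as np


class SimpleFAMD:
    """Simplified FAMD sketch (hypothetical class, not a library API)."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def _one_hot(self, X_cat):
        # One indicator block per categorical column, using training categories.
        blocks = [(X_cat[:, [j]] == cats).astype(float)
                  for j, cats in enumerate(self.categories_)]
        return np.hstack(blocks)

    def _scale(self, X_num, X_cat):
        z_num = (X_num - self.mean_) / self.std_
        z_cat = self._one_hot(X_cat) / np.sqrt(self.prop_)
        return np.hstack([z_num, z_cat])

    def fit(self, X_num, X_cat):
        X_num = np.asarray(X_num, dtype=float)
        X_cat = np.asarray(X_cat)
        self.mean_ = X_num.mean(axis=0)
        self.std_ = X_num.std(axis=0)
        self.std_[self.std_ == 0] = 1.0
        self.categories_ = [np.unique(X_cat[:, j]) for j in range(X_cat.shape[1])]
        self.prop_ = self._one_hot(X_cat).mean(axis=0)  # category proportions
        Z = self._scale(X_num, X_cat)
        self.center_ = Z.mean(axis=0)
        # PCA on the centred matrix: rows of Vt are the principal axes.
        _, _, Vt = np.linalg.svd(Z - self.center_, full_matrices=False)
        self.components_ = Vt[: self.n_components]
        return self

    def transform(self, X_num, X_cat):
        X_num = np.asarray(X_num, dtype=float)
        X_cat = np.asarray(X_cat)
        Z = self._scale(X_num, X_cat)
        return (Z - self.center_) @ self.components_.T


X_num = np.array([[1.0, 10.0], [2.0, 18.0], [3.0, 33.0], [4.0, 41.0]])
X_cat = np.array([[0], [0], [1], [1]])
famd = SimpleFAMD(n_components=2).fit(X_num, X_cat)
train_coords = famd.transform(X_num, X_cat)
new_coords = famd.transform([[2.5, 25.0]], [[1]])
```

Unlike the precomputed-distance approach, everything needed to project new rows is stored at fit time, which is what makes transform on test data straightforward here.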
However, if you do want to further explore the option of creating a custom implementation of the Gower metric for UMAP, you may wish to refer to these existing standalone Gower metric implementations and try to “refit” those implementations to work with UMAP.
You would also have to develop the proper checks to handle mixed datasets with object columns. You can see here for an example of adding a distance metric: lmcinnes/pynndescent#86 (credit to @sleighsoft).
I think this would still be a worthwhile endeavor. Mixed datasets are very prevalent in a wide variety of data analysis situations.
Hi @simeng-yang, were you able to successfully implement Gower for UMAP? I’m interested in exploring the same thing, and I’d be very interested to see your implementation before starting from scratch.