[Enhancement] Add support for sample_weight in the fit function

The scikit-learn KMeans algorithm supports supplying a weight for each sample in the fit function. See the docs here.

Would it be possible to add this to the algorithm? i.e. can we have the minimum and maximum bounds account for the sum of all weights instead of the count of all samples? I haven’t read into the MinCostFlow algorithm so I don’t know how feasible this would be.

Issue Analytics

Created a year ago
Comments: 8 (4 by maintainers)

Top GitHub Comments

4 reactions

joshlk commented, May 2, 2022

I’ve had a rethink…

I’ve had a look at how scikit-learn defines sample_weights:

The algorithm supports sample weights, which can be given by a parameter sample_weight. This allows to assign more weight to some samples when computing cluster centers and values of inertia. For example, assigning a weight of 2 to a sample is equivalent to adding a duplicate of that sample to the dataset.
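
The duplicate-sample equivalence in that quote can be checked for the centroid computation with a few lines of plain Python. This is only a sketch: `weighted_centroid` is a hypothetical helper written for illustration, not a function from scikit-learn or from this library, and it covers the centroid only, not inertia.

```python
def weighted_centroid(points, weights):
    """Weighted mean of a list of equal-length coordinate tuples."""
    total = sum(weights)
    dims = len(points[0])
    return tuple(
        sum(w * p[d] for p, w in zip(points, weights)) / total
        for d in range(dims)
    )


# Three samples, the middle one carrying a weight of 2.
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 6.0)]
weights = [1, 2, 1]

# The same data with the weight-2 sample written out twice and unit weights.
duplicated = [(0.0, 0.0), (4.0, 0.0), (4.0, 0.0), (4.0, 6.0)]

centroid_weighted = weighted_centroid(points, weights)
centroid_duplicated = weighted_centroid(duplicated, [1] * len(duplicated))

print(centroid_weighted, centroid_duplicated)
```

Both calls give the same centroid, which is the sense in which a weight of 2 acts like a duplicated sample.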

Which I think is different to what I said:

I think it would be equivalent to weighting the distances.

and how you described it:

can we have the minimum and maximum bounds account for the sum of all weights instead of the count of all samples?

All of the above is possible - it’s just about figuring out what to weight. Feel free to have a shot at it. I will also have a longer think about what is needed.

1 reaction

hectoradrian961030 commented, Jul 3, 2022

@joshlk I think I have a similar need. In the problem I’m trying to solve, size_max is the sum of the weights of a cluster instead of the size of a cluster. A point of X is the centroid of a polygon and the weight of that polygon (point of X) is the sum of its vertices. Do you think the algorithm can be easily modified to handle this?
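
To make the difference between counting samples and summing weights concrete, here is a rough sketch that only checks an existing assignment against the bounds. It does not touch the MinCostFlow step or perform any constrained assignment, and `cluster_loads` and `violates_bounds` are hypothetical names made up for this illustration, not part of the library.

```python
from collections import defaultdict


def cluster_loads(labels, sample_weight=None):
    """Sum of sample weights per cluster; with no weights this is the sample count."""
    if sample_weight is None:
        sample_weight = [1.0] * len(labels)
    loads = defaultdict(float)
    for label, weight in zip(labels, sample_weight):
        loads[label] += weight
    return dict(loads)


def violates_bounds(loads, size_min, size_max):
    """Return the clusters whose load falls outside [size_min, size_max]."""
    return {
        cluster: load
        for cluster, load in loads.items()
        if not (size_min <= load <= size_max)
    }


labels = [0, 0, 0, 1, 1]               # cluster assigned to each sample
weights = [5.0, 1.0, 1.0, 2.0, 2.0]    # e.g. one weight per polygon centroid

by_count = cluster_loads(labels)
by_weight = cluster_loads(labels, weights)

# With bounds of 2..4, the assignment is fine by count but not by weight.
print(violates_bounds(by_count, size_min=2, size_max=4))
print(violates_bounds(by_weight, size_min=2, size_max=4))
```

Cluster 0 holds three samples, which fits under a size_max of 4, but its total weight of 7 does not. Under the weighted reading, the assignment step would have to respect that total rather than the count.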