Documentation: Proper DataSource format and usage for K-Means Clustering

Is your feature request related to a problem? Please describe.

Still a newbie to this library, so thanks for bearing with me.

Right now, the documentation shows how to run K-Means clustering on an auto-generated data set of Gaussian clusters. This is great, as it shows K-Means is possible, but (unless I’m missing something) it does not show the steps to input real data. It mentions “You can also use any of the standard data loaders to pull in clustering data.” but I don’t see where that’s documented.

I’ve figured out how to load a CSV file of features and metadata (thanks to your new Columnar tutorial), but I can’t work out how to connect this data with KMeansTrainer, or whether that’s even the right approach.

Describe the solution you’d like

A clear and concise description/example of how to load real-world (non-autogenerated) data into the K-Means algorithm.

Describe alternatives you’ve considered

Looking through JavaDocs, but having trouble knowing what to focus on.

Issue Analytics

State:
Created 3 years ago
Comments: 32 (14 by maintainers)

Top GitHub Comments

1 reaction

lincolnthree commented, Oct 22, 2020

Hot dog!

Number of examples = 500
Number of features = 556
Label domain = []
Example = ArrayExample(numFeatures=21,output=-1,weight=1.0,metadata={name=Four-Color Omnath, id=876c6326-a40d-438b-89c0-825e647370d0},features=[(cards@1-N=0, 2.0)(cards@1-N=1, 1.0), (cards@1-N=10, 4.0), (cards@1-N=11, 4.0), (cards@1-N=12, 4.0), (cards@1-N=13, 5.0), (cards@1-N=14, 4.0), (cards@1-N=15, 3.0), (cards@1-N=16, 4.0), (cards@1-N=17, 2.0), (cards@1-N=18, 3.0), (cards@1-N=19, 2.0), (cards@1-N=2, 4.0), (cards@1-N=3, 3.0), (cards@1-N=4, 2.0), (cards@1-N=5, 1.0), (cards@1-N=6, 4.0), (cards@1-N=7, 2.0), (cards@1-N=8, 2.0), (cards@1-N=9, 4.0), (format@standard, 1.0), ])

0 reactions

Craigacp commented, Nov 30, 2020

We’ve also merged in an empty response processor implementation for use when loading clustering, anomaly detection or other datasets where you don’t expect there to be a ground truth output. I’m going to close this issue now as I think we’ve patched the usability issues you hit. Open a fresh one if you hit others, or re-open this if you think it’s not quite covered by PRs #99 and #98.
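For readers who want to see the general idea of clustering rows that carry no ground truth output, here is a minimal plain-Java sketch. It does not use the library's API: the class `KMeansSketch`, its methods `loadFeatures` and `cluster`, and the column names are all hypothetical, and the clustering is a simplified Lloyd's algorithm that takes the first k rows as starting centroids. Metadata columns such as a name are skipped, and no label column is expected.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from the library discussed above.
final class KMeansSketch {

    /** Reads CSV lines (header first) and keeps only the named numeric feature columns. */
    static double[][] loadFeatures(List<String> lines, List<String> featureColumns) {
        List<String> header = Arrays.asList(lines.get(0).split(","));
        int[] idx = new int[featureColumns.size()];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = header.indexOf(featureColumns.get(i));
            if (idx[i] < 0) {
                throw new IllegalArgumentException("Unknown column: " + featureColumns.get(i));
            }
        }
        double[][] rows = new double[lines.size() - 1][idx.length];
        for (int r = 1; r < lines.size(); r++) {
            String[] cells = lines.get(r).split(",");
            for (int c = 0; c < idx.length; c++) {
                rows[r - 1][c] = Double.parseDouble(cells[idx[c]].trim());
            }
        }
        return rows;
    }

    /** Simplified Lloyd's algorithm; returns the cluster index assigned to each row. */
    static int[] cluster(double[][] data, int k, int iterations) {
        int dims = data[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = data[c].clone(); // assumption: first k rows seed the centroids
        }
        int[] assignment = new int[data.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: attach each row to its nearest centroid (squared Euclidean distance).
            for (int r = 0; r < data.length; r++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0.0;
                    for (int d = 0; d < dims; d++) {
                        double diff = data[r][d] - centroids[c][d];
                        dist += diff * diff;
                    }
                    if (dist < best) {
                        best = dist;
                        assignment[r] = c;
                    }
                }
            }
            // Update step: move each non-empty centroid to the mean of its rows.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int r = 0; r < data.length; r++) {
                counts[assignment[r]]++;
                for (int d = 0; d < dims; d++) {
                    sums[assignment[r]][d] += data[r][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < dims; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // The "name" column plays the role of metadata; only f1 and f2 are used as features.
        List<String> csv = List.of(
                "name,f1,f2",
                "a,1.0,1.0",
                "b,9.0,9.0",
                "c,1.2,0.8",
                "d,8.8,9.1");
        double[][] features = loadFeatures(csv, List.of("f1", "f2"));
        System.out.println(Arrays.toString(cluster(features, 2, 10))); // prints [0, 1, 0, 1]
    }
}
```

The naive comma split does not handle quoted fields, and seeding from the first k rows is chosen only to keep the result deterministic; a real loader and trainer would handle both more carefully.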
