
implementation in sktime

  • create BaseClusterer
  • think about common unit tests for all clusterers
  • potentially add soft dependency
  • create sktime.clustering
  • template
from sktime.utils.validation.series_as_features import check_X

class Clusterer:

    def fit(self, X, y=None):

        X = check_X(X, enforce_univariate=True, convert_numpy=True)
        X = X.squeeze(1)  # this gives you a 2d np.array as in sklearn
        # ...

        return self

    def predict(self, X):
        X = check_X(X, enforce_univariate=True, convert_numpy=True)
        # ...

        return  # cluster labels

interfacing sklearn
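One concrete way to interface sklearn (a sketch under assumptions, not a settled design): several sklearn clusterers accept metric="precomputed", so any time series distance can be plugged in by handing sklearn a pairwise distance matrix. The helper euclidean_distance_matrix below is made up for illustration; a DTW-based matrix could be substituted.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def euclidean_distance_matrix(X):
    """Pairwise Euclidean distances between equal-length series.

    X is (n_series, n_timepoints); any time series distance (e.g. DTW)
    could be substituted for the Euclidean computation here.
    """
    diffs = X[:, None, :] - X[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
# two well-separated groups of toy univariate series
X = np.vstack([
    rng.normal(0.0, 0.05, size=(5, 20)),
    rng.normal(5.0, 0.05, size=(5, 20)),
])

# DBSCAN accepts metric="precomputed", so the distance matrix is enough
labels = DBSCAN(eps=1.0, min_samples=2, metric="precomputed").fit_predict(
    euclidean_distance_matrix(X)
)
```

Note this only works for sklearn estimators that take precomputed distances; K-means itself does not, which is the limitation discussed below.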

bigger-picture design ideas

minimal viable product

  • KNN with dtw

Related software: time series distances

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

2 reactions
chrisholder commented, Jun 4, 2021

Going to start reviving this issue, as the base clustering and the initial partitioning-based algorithms have been implemented. At the time of writing all the clustering code is on the ‘clustering’ branch and will hopefully be merged into main this weekend. This clustering branch contains the base classes, K-means and K-medoids (soon to be followed by K-shapes; just finishing the docstring for it). I’ll quickly explain the state of this branch and what is left to do, and then explain what’s on my immediate todo list.

So the current clustering branch, as mentioned, includes K-means and K-medoids (and shortly will also have K-shapes). These algorithms effectively work the same way, except for how they calculate centers and which distance they use. As we need to be able to change the distance and how the next centers are calculated, I have essentially written the ‘k-means’ clustering algorithm from scratch and made it generic in the file sktime/clustering/partitioning/_time_series_k_partition.py. I could not use the sklearn implementation of k-means, because sklearn does not let you change the distance or the center-averaging method, which is critical when using different distance measures. My implementation (_time_series_k_partition.py) allows complete control over the distance used (passed as metric in the parameters), how the centers are first initialised (passed as init_algorithm in the parameters) and, finally, how the new centers are calculated each iteration, via the abstract method calculate_new_centers, which each specific algorithm then implements in its own way.
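
The design described above can be sketched roughly as follows. This is a simplified illustration, not the code on the branch: the class and helper names here (other than calculate_new_centers, which the text names) are hypothetical.

```python
import numpy as np
from abc import ABC, abstractmethod

class TimeSeriesKPartition(ABC):
    """Sketch of a generic k-partitioning loop: the distance (metric) and
    the centre-update rule are both pluggable."""

    def __init__(self, n_clusters=2, metric=None, init_algorithm=None,
                 max_iter=50, random_state=None):
        self.n_clusters = n_clusters
        # default: squared Euclidean between equal-length 1d series
        self.metric = metric or (lambda a, b: float(np.sum((a - b) ** 2)))
        self.init_algorithm = init_algorithm
        self.max_iter = max_iter
        self.random_state = random_state

    @abstractmethod
    def calculate_new_centers(self, X, assignments):
        """Return updated centres; K-means averages, K-medoids picks medoids."""

    def _assign(self, X, centers):
        dists = np.array([[self.metric(x, c) for c in centers] for x in X])
        return dists.argmin(axis=1)

    def fit_predict(self, X):
        rng = np.random.default_rng(self.random_state)
        if self.init_algorithm is not None:
            centers = self.init_algorithm(X, self.n_clusters, rng)
        else:  # random init: pick k distinct series as starting centres
            centers = X[rng.choice(len(X), self.n_clusters, replace=False)]
        assignments = self._assign(X, centers)
        for _ in range(self.max_iter):
            centers = self.calculate_new_centers(X, assignments)
            new_assignments = self._assign(X, centers)
            if np.array_equal(new_assignments, assignments):
                break
            assignments = new_assignments
        return assignments

class ToyKMeans(TimeSeriesKPartition):
    def calculate_new_centers(self, X, assignments):
        return np.stack([X[assignments == k].mean(axis=0)
                         for k in range(self.n_clusters)])
```

Swapping in a different metric or a different calculate_new_centers is then enough to get K-medoids or DBA-based K-means out of the same loop.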

I’ll now define what is supported for partitioning clustering algorithms in terms of distances, center initialisation and updating centers:

Distances (though I’ve only tested euclidean and dtw):

  • euclidean
  • dtw
  • ddtw
  • wdtw
  • wddtw
  • lcss
  • erp
  • msm
  • twe (half of these I have no idea what they are, but I saw them in elastic and assume they are distances that could be used).
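
For reference, dtw from the list above is the classic dynamic-programming distance. A minimal plain-numpy version over the full cost matrix is sketched below; the real elastic-distance implementations are more elaborate (windowing, multivariate support), so this is only illustrative.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two 1-d series via the full accumulated cost
    matrix (O(len(a) * len(b)) time and memory)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(np.sqrt(cost[n, m]))
```

Unlike Euclidean distance, DTW can align shifted patterns: dtw_distance([0, 0, 1], [0, 1, 1]) is 0 because warping matches the steps, whereas the Euclidean distance between those series is 1.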

Center initialisation:

  • random initialisation. I am currently working on the k-means++ initialisation algorithm and hoping to get that in with the K-shapes patch
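
The k-means++ seeding mentioned above can be sketched as follows (a hypothetical helper, not the in-progress code): the first centre is chosen uniformly at random, and each subsequent centre is sampled with probability proportional to its squared distance from the nearest centre chosen so far.

```python
import numpy as np

def kmeans_plus_plus_init(X, n_clusters, distance, rng):
    """k-means++ style seeding with a pluggable time series distance."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(n_clusters - 1):
        # squared distance of every series to its nearest chosen centre
        d2 = np.array([min(distance(x, c) ** 2 for c in centers) for x in X])
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.stack(centers)
```

Because far-away series are favoured, the initial centres tend to spread across the clusters, which usually reduces the number of Lloyd iterations needed.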

New center calculation:

  • K-means - Mean (literally the arithmetic mean) or DTW Barycenter Averaging (DBA)
  • K-medoids - Medoid
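
For the K-medoids case above, the medoid update can be sketched in a few lines (illustrative helper, not the branch code): the medoid is the actual series in the cluster whose summed distance to every other member is smallest, so unlike the mean it is always a real member and works with any distance.

```python
import numpy as np

def medoid(cluster, distance):
    """Return the cluster member minimising total distance to all members."""
    totals = [sum(distance(a, b) for b in cluster) for a in cluster]
    return cluster[int(np.argmin(totals))]
```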

So overall a good start. I’m going to put below a summary of what I’m currently working on and my immediate todo list for this coming week; anyone else who wants to hop in and do some stuff is welcome:

I’m currently working on:

  • K-shapes algorithm

TODO list (feel free to choose something you’d like to do)

  • K-means++ initialisation algorithm
  • Massive optimisation of the DBA algorithm
    • To do this I think I’m going to need to reimplement DTW in numba, as we need the cost matrices associated with DTW and so can’t use the existing sktime DTW.
  • Example notebooks. Notebooks explaining how these algorithms work and how to use them
  • Unit tests - my unit tests are sorry excuses. We need to establish HOW we’re going to unit test the clustering algorithms (I’m going to have a look at how sklearn and tslearn have tested theirs).
  • Sklearn has another parameter I’ve skipped called ‘n_init’, which is the number of times the algorithm is run with different centroid seeds; the final result is the best output of the n_init runs. This is probably a good idea to implement.
  • Sklearn also has an intelligent check where you can pass a parameter called ‘tol’: the relative tolerance with regard to the Frobenius norm of the difference in the cluster centers of two consecutive iterations, used to declare convergence. Being able to smartly detect convergence could be a great optimisation.
  • Parallelism would be nice to have
  • Clustering evaluation - pretty sure we can just use the sklearn stuff, but making sure it works, potentially adding our own API on top of it, and creating plots for time-series-specific data could be a nice addition
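
The ‘tol’-based convergence check from the list above can be sketched like this (a hypothetical helper, loosely mirroring sklearn’s parameter rather than reproducing its exact normalisation): stop iterating once the Frobenius norm of the centre shift between consecutive iterations is small relative to the centre norm.

```python
import numpy as np

def has_converged(old_centers, new_centers, tol=1e-4):
    """True once the Frobenius norm of the centre shift between two
    consecutive iterations drops below tol, relative to the centre norm."""
    shift = np.linalg.norm(new_centers - old_centers)
    scale = np.linalg.norm(old_centers)
    return bool(shift <= tol * max(scale, 1.0))
```

In the main loop this replaces "assignments stopped changing" as the stopping rule, which typically saves several late iterations that barely move the centres. The n_init idea is orthogonal: run the whole loop n_init times with different seeds and keep the run with the lowest total within-cluster distance.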
1 reaction
chrisholder commented, Jun 4, 2021

Just going to follow this comment up with how I’m deciding what to implement and where I’m referencing most of my material. At the moment I’m basically just implementing the state of the art for partitioning, as those are the most immediate algorithms I need for my Masters. What I am hoping to achieve over the rest of the summer is to reach the same algorithm coverage as the ‘leading’ time series clustering package, which is probably an R library called dtwclust. I am also, of course, looking at the sklearn clustering algorithms and hope to get all of these implemented.

More information on dtwclust can be found here: https://github.com/asardaes/dtwclust. Specifically, I’m looking at https://cran.r-project.org/web/packages/dtwclust/vignettes/dtwclust.pdf for finer details on specific implementations. However, I’m also hoping we can implement further algorithms, such as DNN clustering (can’t cite the paper I’m referring to for this as it is still under review) and shapelet-based approaches (see https://ieeexplore.ieee.org/document/6413851), for example. Overall I hope we can create the most complete set of time series clustering algorithms across partitioning, density and hierarchical approaches, and am really looking forward to collaborating with others to make this happen!


