
implementation in sktime

  • create BaseClusterer
  • think about common unit tests for all clusterers
  • potentially add soft dependency
  • create sktime.clustering
  • template
from sktime.utils.validation.series_as_features import check_X

class Clusterer:

    def fit(self, X, y=None):

        X = check_X(X, enforce_univariate=True, convert_numpy=True)
        X = X.squeeze(1)  # this gives you a 2d np.array as in sklearn
        # ...

        return self

    def predict(self, X):
        X = check_X(X, enforce_univariate=True, convert_numpy=True)
        # ...

        return  # cluster labels

interfacing sklearn
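One concrete way to interface sklearn (a sketch under assumptions, not a settled design): several sklearn clusterers accept metric="precomputed", so any time series distance can be plugged in by handing sklearn a pairwise distance matrix. The helper euclidean_distance_matrix below is made up for illustration; a DTW-based matrix could be substituted.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def euclidean_distance_matrix(X):
    """Pairwise Euclidean distances between equal-length series.

    X is (n_series, n_timepoints); any time series distance (e.g. DTW)
    could be substituted for the Euclidean computation here.
    """
    diffs = X[:, None, :] - X[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
# two well-separated groups of toy univariate series
X = np.vstack([
    rng.normal(0.0, 0.05, size=(5, 20)),
    rng.normal(5.0, 0.05, size=(5, 20)),
])

# DBSCAN accepts metric="precomputed", so the distance matrix is enough
labels = DBSCAN(eps=1.0, min_samples=2, metric="precomputed").fit_predict(
    euclidean_distance_matrix(X)
)
```

Note this only works for sklearn estimators that take precomputed distances; K-means itself does not, which is the limitation discussed below.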

bigger-picture design ideas

minimal viable product

  • KNN with dtw

Related software: time series distances

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

2 reactions
chrisholder commented, Jun 4, 2021

Going to start reviving this issue, as the base clustering and the initial partitioning-based algorithms have been implemented. At the time of writing all the clustering code is on the ‘clustering’ branch and will hopefully be merged into main this weekend. This clustering branch contains the base classes, K-means and K-medoids (soon to be followed by K-shapes; just finishing the docstring for it). I’ll quickly explain the state of this branch and what is left to do, and then explain what’s on my immediate todo list.

So the current clustering branch, as mentioned, includes K-means and K-medoids (and shortly will also have K-shapes). These algorithms effectively work the same way, except for how they calculate centers and which distance they use. As we need to be able to change the distance and how the next centers are calculated, I have essentially written the ‘k-means’ clustering algorithm from scratch and made it generic in the file sktime/clustering/partitioning/_time_series_k_partition.py. I could not use the sklearn implementation of k-means, because sklearn does not let you change the distance or the center-averaging method, which is critical when using different distance measures. My implementation (_time_series_k_partition.py) allows complete control over the distance used (passed as metric in the parameters), how the centers are first initialised (passed as init_algorithm in the parameters) and, finally, how the new centers are calculated each iteration, via the abstract method calculate_new_centers, which each specific algorithm then implements in its own way.
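
The design described above can be sketched roughly as follows. This is a simplified illustration, not the code on the branch: the class and helper names here (other than calculate_new_centers, which the text names) are hypothetical.

```python
import numpy as np
from abc import ABC, abstractmethod

class TimeSeriesKPartition(ABC):
    """Sketch of a generic k-partitioning loop: the distance (metric) and
    the centre-update rule are both pluggable."""

    def __init__(self, n_clusters=2, metric=None, init_algorithm=None,
                 max_iter=50, random_state=None):
        self.n_clusters = n_clusters
        # default: squared Euclidean between equal-length 1d series
        self.metric = metric or (lambda a, b: float(np.sum((a - b) ** 2)))
        self.init_algorithm = init_algorithm
        self.max_iter = max_iter
        self.random_state = random_state

    @abstractmethod
    def calculate_new_centers(self, X, assignments):
        """Return updated centres; K-means averages, K-medoids picks medoids."""

    def _assign(self, X, centers):
        dists = np.array([[self.metric(x, c) for c in centers] for x in X])
        return dists.argmin(axis=1)

    def fit_predict(self, X):
        rng = np.random.default_rng(self.random_state)
        if self.init_algorithm is not None:
            centers = self.init_algorithm(X, self.n_clusters, rng)
        else:  # random init: pick k distinct series as starting centres
            centers = X[rng.choice(len(X), self.n_clusters, replace=False)]
        assignments = self._assign(X, centers)
        for _ in range(self.max_iter):
            centers = self.calculate_new_centers(X, assignments)
            new_assignments = self._assign(X, centers)
            if np.array_equal(new_assignments, assignments):
                break
            assignments = new_assignments
        return assignments

class ToyKMeans(TimeSeriesKPartition):
    def calculate_new_centers(self, X, assignments):
        return np.stack([X[assignments == k].mean(axis=0)
                         for k in range(self.n_clusters)])
```

Swapping in a different metric or a different calculate_new_centers is then enough to get K-medoids or DBA-based K-means out of the same loop.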

I’ll now define what is supported for partitioning clustering algorithms in terms of distances, center initialisation and updating centers:

Distances (though I’ve only tested euclidean and dtw):

  • euclidean
  • dtw
  • ddtw
  • wdtw
  • wddtw
  • lcss
  • erp
  • msm
  • twe (half of these I have no idea what they are, but I saw them in elastic and assume they are distances that could be used).
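
For reference, dtw from the list above is the classic dynamic-programming distance. A minimal plain-numpy version over the full cost matrix is sketched below; the real elastic-distance implementations are more elaborate (windowing, multivariate support), so this is only illustrative.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two 1-d series via the full accumulated cost
    matrix (O(len(a) * len(b)) time and memory)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(np.sqrt(cost[n, m]))
```

Unlike Euclidean distance, DTW can align shifted patterns: dtw_distance([0, 0, 1], [0, 1, 1]) is 0 because warping matches the steps, whereas the Euclidean distance between those series is 1.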

Center initialisation:

  • random initialisation. I am currently working on the k-means++ initialisation algorithm and hoping to get that in with the K-shapes patch
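
The k-means++ seeding mentioned above can be sketched as follows (a hypothetical helper, not the in-progress code): the first centre is chosen uniformly at random, and each subsequent centre is sampled with probability proportional to its squared distance from the nearest centre chosen so far.

```python
import numpy as np

def kmeans_plus_plus_init(X, n_clusters, distance, rng):
    """k-means++ style seeding with a pluggable time series distance."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(n_clusters - 1):
        # squared distance of every series to its nearest chosen centre
        d2 = np.array([min(distance(x, c) ** 2 for c in centers) for x in X])
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.stack(centers)
```

Because far-away series are favoured, the initial centres tend to spread across the clusters, which usually reduces the number of Lloyd iterations needed.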

New center calculation:

  • K-means - Mean (literally the arithmetic mean) or DTW Barycenter Averaging (DBA)
  • K-medoids - Medoid
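
For the K-medoids case above, the medoid update can be sketched in a few lines (illustrative helper, not the branch code): the medoid is the actual series in the cluster whose summed distance to every other member is smallest, so unlike the mean it is always a real member and works with any distance.

```python
import numpy as np

def medoid(cluster, distance):
    """Return the cluster member minimising total distance to all members."""
    totals = [sum(distance(a, b) for b in cluster) for a in cluster]
    return cluster[int(np.argmin(totals))]
```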

So overall a good start. I’m going to put below a summary of what I’m currently working on and my immediate todo list for this coming week; anyone else who wants to hop in and do some stuff is welcome:

I’m currently working on:

  • K-shapes algorithm

TODO list (feel free to choose something you’d like to do)

  • K-means++ initialisation algorithm
  • Massive optimisation of the DBA algorithm
    • To do this I think I’m going to need to reimplement DTW in numba, as we need the cost matrices associated with DTW and so can’t use the existing sktime DTW.
  • Example notebooks. Notebooks explaining how these algorithms work and how to use them
  • Unit tests - my unit tests are sorry excuses. We need to establish HOW we’re going to unit test the clustering algorithms (I’m going to have a look at how sklearn and tslearn have tested theirs).
  • Sklearn has another parameter I’ve skipped called ‘n_init’, which is the number of times the algorithm is run with different centroid seeds; the final result is the best output of the n_init runs. This is probably a good idea to implement.
  • Sklearn also has an intelligent check where you can pass a parameter called ‘tol’: the relative tolerance with regard to the Frobenius norm of the difference in the cluster centers of two consecutive iterations, used to declare convergence. Being able to smartly detect convergence could be a great optimisation.
  • Parallelism would be nice to have
  • Clustering evaluation - pretty sure we can just use the sklearn stuff, but making sure it works, potentially adding our own API on top of it, and creating plots for time-series-specific data could be a nice addition
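
The ‘tol’-based convergence check from the list above can be sketched like this (a hypothetical helper, loosely mirroring sklearn’s parameter rather than reproducing its exact normalisation): stop iterating once the Frobenius norm of the centre shift between consecutive iterations is small relative to the centre norm.

```python
import numpy as np

def has_converged(old_centers, new_centers, tol=1e-4):
    """True once the Frobenius norm of the centre shift between two
    consecutive iterations drops below tol, relative to the centre norm."""
    shift = np.linalg.norm(new_centers - old_centers)
    scale = np.linalg.norm(old_centers)
    return bool(shift <= tol * max(scale, 1.0))
```

In the main loop this replaces "assignments stopped changing" as the stopping rule, which typically saves several late iterations that barely move the centres. The n_init idea is orthogonal: run the whole loop n_init times with different seeds and keep the run with the lowest total within-cluster distance.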
1 reaction
chrisholder commented, Jun 4, 2021

Just going to follow this comment up with how I’m deciding what to implement and where I’m referencing most of my material. At the moment I’m basically just implementing the state of the art for partitioning, as those are the most immediate algorithms I need for my Masters. What I am hoping to achieve over the rest of the summer is to reach the same algorithm coverage as the ‘leading’ time series clustering package, which is probably an R library called dtwclust. I am also, of course, looking at the sklearn clustering algorithms and hope to get all of these implemented.

More information on dtwclust can be found here: https://github.com/asardaes/dtwclust. Specifically, I’m looking at https://cran.r-project.org/web/packages/dtwclust/vignettes/dtwclust.pdf for finer details on specific implementations. However, I’m also hoping we can implement further algorithms, such as DNN clustering (can’t cite the paper I’m referring to for this as it is still under review) and shapelet-based approaches (see https://ieeexplore.ieee.org/document/6413851), for example. Overall I hope we can create the most complete set of time series clustering algorithms across partitioning, density and hierarchical approaches, and am really looking forward to collaborating with others to make this happen!


