[BUG] Extracted shapelets from multivariate dataset are always of the 1st dimension
Describe the bug
Hey there, I’ve been playing around with sktime in order to detect shapelets in multivariate TS datasets. I’ve been testing with the notebook multivariate_time_series_classification.ipynb. Specifically, I’ve been trying the third method (Bespoke estimator-specific methods) to extract these shapelets from multivariate datasets.
What I’ve found is that no matter how big the dataset is (I tried the BasicMotions dataset and also a multivariate dataset of my own, growing and shrinking them to make sure that all the series are visited), and no matter how long the Python script runs (I ran tests from 5 to 30 minutes long), the shapelets that ShapeletTransformClassifier extracts always come from the first dimension of the multivariate dataset. For example, the BasicMotions dataset has 6 dimensions, and the extracted shapelets are always from the first one.
I’ve noticed that this method (Bespoke estimator-specific methods) is still under construction, but I would like to know whether this behavior is expected or whether it is a bug.
To Reproduce
from sktime.transformers.compose import ColumnConcatenator
from sktime.transformers.shapelets import ContractedShapeletTransform
from sktime.classifiers.compose import TimeSeriesForestClassifier
from sktime.classifiers.dictionary_based.boss import BOSSEnsemble
from sktime.classifiers.compose import ColumnEnsembleClassifier
from sktime.classifiers.shapelet_based import ShapeletTransformClassifier
from sktime.datasets import load_basic_motions
from sktime.pipeline import Pipeline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
###### FUNCTIONS ######
def plotEachShapelets(st, train_x):
    # for each extracted shapelet (in descending order of quality/information gain)
    for s in st.shapelets[0:5]:
        print(s)
        # plot the series that the shapelet was extracted from
        plt.plot(
            train_x.iloc[s.series_id, 0],
            'gray'
        )
        # overlay the shapelet onto the full series
        plt.plot(
            list(range(s.start_pos, (s.start_pos + s.length))),
            train_x.iloc[s.series_id, 0][s.start_pos:s.start_pos + s.length],
            'r',
            linewidth=3.0
        )
        plt.show()
def plotAllShapelets(st, train_x):
    # for each extracted shapelet (in descending order of quality/information gain)
    for i in range(0, len(st.shapelets)):
        s = st.shapelets[i]
        # summary info about the shapelet
        print("#" + str(i) + ": " + str(s))
        # overlay shapelets
        plt.plot(
            list(range(s.start_pos, (s.start_pos + s.length))),
            train_x.iloc[s.series_id, 0][s.start_pos:s.start_pos + s.length]
        )
    plt.show()
X_train, y_train = load_basic_motions(split='TRAIN', return_X_y=True)
X_test, y_test = load_basic_motions(split='TEST', return_X_y=True)
clf = ShapeletTransformClassifier(time_contract_in_mins=5)
clf.fit(X_train, y_train)
print("--> Score = " + str(clf.score(X_test, y_test)))
print("--> Shapelets detected = " + str(len(clf.classifier[0].shapelets)))
plotEachShapelets(clf.classifier[0], X_train)
plotAllShapelets(clf.classifier[0], X_train)
Expected behavior
I would expect the detected shapelets to come not only from the first dimension, but also from the other dimensions of the multivariate dataset.
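To compare against the other dimensions in the meantime, a rough sketch is to fit one classifier per dimension on a single-column slice of the data. This assumes the classifier accepts a one-column nested DataFrame, which is not confirmed here; fit_per_dimension and make_estimator are hypothetical names used only for illustration.

```python
def fit_per_dimension(X, y, make_estimator):
    """Fit one fresh estimator per column (dimension) of X.

    Assumption: X is a nested pandas DataFrame with one column per
    dimension, as returned by load_basic_motions.
    make_estimator is a hypothetical zero-argument factory, for example
    lambda: ShapeletTransformClassifier(time_contract_in_mins=5).
    """
    fitted = {}
    for dim in range(X.shape[1]):
        # The list index keeps a one-column DataFrame rather than a Series.
        X_dim = X.iloc[:, [dim]]
        estimator = make_estimator()
        estimator.fit(X_dim, y)
        fitted[dim] = estimator
    return fitted
```

With the script above, the result for dimension d could then be inspected with something like plotEachShapelets(models[d].classifier[0], X_train.iloc[:, [d]]), since the plotting functions read column 0 of whatever frame they receive.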
Additional context
None.
Versions
- Linux-4.4.0-17134-Microsoft-x86_64-with-Ubuntu-16.04-xenial
- Python 3.6.8 (default, May 7 2019, 14:58:50)
- [GCC 5.4.0 20160609]
- NumPy 1.16.4
- SciPy 1.3.0
- sktime 0.3.0
Issue Analytics
- Created 4 years ago
- Comments: 9 (2 by maintainers)
Top GitHub Comments
Hey @mloning, it’s ok, I just thought that you wanted me to try it out! I’ll keep waiting and following the issue in order to know when the multivariate Shapelets detection will be available. Thanks for everything!
Hi @DavidCorral94, sorry, I think my previous comment may have been confusing. The validate_X_y and check_X_is_univariate are helper functions used inside of estimators to check if the estimator can handle the input data, basically to avoid the original issue you described. The basic motion data set is multivariate, so check_X_is_univariate is expected to throw an error.
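As a minimal sketch of the kind of guard described here (not sktime's actual implementation), a univariate check might look like the following. The name check_is_univariate is a hypothetical stand-in, and the only assumption about X is that it exposes a pandas-style shape of (n_instances, n_dimensions).

```python
def check_is_univariate(X):
    """Raise ValueError if X has more than one column (dimension).

    Hypothetical stand-in for a helper such as check_X_is_univariate.
    X only needs a pandas-style .shape of (n_instances, n_dimensions).
    """
    n_dimensions = X.shape[1]
    if n_dimensions > 1:
        raise ValueError(
            "This estimator handles univariate input only, "
            f"but X has {n_dimensions} dimensions (columns)."
        )
    return X
```

Calling a check like this at the top of fit would make a 6-dimension input such as BasicMotions fail loudly, rather than silently using only the first column.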