Questions about items produced by dataset's `__getitem__()` for multi-target multi-series multi-horizon predictions
I have a dataset which consists of multiple time series of varying length; you can think of each time series as a recording from sensors on some device over some period. The time index is equally spaced, say every second, and has no real meaning beyond that for each device. Say I want to use the TCN model to predict sensor values on a new recording.
Each series has `nlag by nf` auto-regressive features and `nlag by nc` past covariates, and I want to predict an `npred by nf` sequence. Say `nlag=10`, `nf=4`, `nc=2`, `npred=5`, and I have `N=100` such series.
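For concreteness, a minimal sketch of this setup (synthetic, made-up data; `TCNModel` and `TimeSeries.from_values` are the darts entry points I am assuming here, and details may vary across versions):

```python
# Sketch of the setup described above, with made-up data.
import numpy as np
from darts import TimeSeries
from darts.models import TCNModel

rng = np.random.default_rng(0)
lengths = rng.integers(30, 200, size=100)  # N=100 recordings of varying length

# nf=4 target components, nc=2 past-covariate components; integer time index.
targets = [TimeSeries.from_values(rng.normal(size=(m, 4))) for m in lengths]
covariates = [TimeSeries.from_values(rng.normal(size=(m, 2))) for m in lengths]

model = TCNModel(input_chunk_length=10, output_chunk_length=5, n_epochs=1)  # nlag, npred
model.fit(targets, past_covariates=covariates)
```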
When I look at the items produced by `GenericShiftedDataset` just before the call to `self.fit_from_dataset` in `torch_forecasting_model.py`, the items from `train_dataset.__getitem__()` look strange (a minimal inspection sketch follows the list below):
- First, the items are mostly shifted by one time step, but in `GenericShiftedDataset` they should be shifted by `input_chunk_length`, which in this case is `nlag=10`.
- In the tuple produced by `__getitem__()`, the last element has dimensions `nlag by nf` instead of `npred by nf`, and its first `nlag-npred` rows overlap with the last `nlag-npred` rows of the first tuple element. I would expect the last element to simply be a time continuation of the first element for `npred` steps, thus forming an `npred by nf` array of the future values we're trying to predict.
- The number of items reported by `__len__()` doesn't make sense either: if the series is shifted by 1 and there are `M` observations, then there should be `M-npred-nlag` items for each series, but this is not the case.
- Some entries produced by `__getitem__()` are duplicates, i.e. `__getitem__(i)` and `__getitem__(i-1)` return the same item for some sequential indices `i`.
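For reference, the inspection described above can be reproduced with something like the following (class name, import path, and tuple layout are assumed from the darts version under discussion and may need adapting):

```python
# Build the same kind of training dataset a TCNModel would use internally
# and inspect the emitted items.
import numpy as np
from darts import TimeSeries
from darts.utils.data import PastCovariatesSequentialDataset

rng = np.random.default_rng(0)
targets = [TimeSeries.from_values(rng.normal(size=(m, 4))) for m in (30, 16)]
covs = [TimeSeries.from_values(rng.normal(size=(m, 2))) for m in (30, 16)]

ds = PastCovariatesSequentialDataset(
    targets,
    covariates=covs,
    input_chunk_length=10,   # nlag
    output_chunk_length=5,   # npred
)

print("number of samples:", len(ds))
for elem in ds[0]:  # typically (past_target, past_covariates, future_target)
    print(None if elem is None else elem.shape)
```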
I have sample code that illustrates these issues - how can I share it? Am I doing something wrong with the time index of each series (right now it's just an integer counter from 0 to the length of each series)?
I think there might be a misunderstanding. What you describe as "shift" sounds like a "stride": by how much you slide a certain window. In all datasets implemented here, this is always 1; that's why your second series of length 16 yields 2 training samples. In contrast, the `shift` parameter refers to "the number of time steps by which to shift the output chunks relative to the input chunks". So basically, if `shift=5` and the input chunk starts at time `t`, then the output chunk will start at time `t+5` - I hope this clarifies it.
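In code, the distinction looks roughly like this (a plain-Python toy, not the actual darts implementation):

```python
# Stride vs. shift: the stride between successive samples is 1, while `shift`
# sets where the output chunk starts relative to the input chunk.
values = list(range(1, 10))   # [1, ..., 9]
icl, ocl = 3, 2               # input/output chunk lengths
shift = icl                   # as in the sequential datasets

samples = []
start = 0
while start + shift + ocl <= len(values):
    past = values[start : start + icl]
    future = values[start + shift : start + shift + ocl]
    samples.append((past, future))
    start += 1                # stride between samples, independent of shift

print(samples)  # [([1, 2, 3], [4, 5]), ([2, 3, 4], [5, 6]), ...]
```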
If you implement your own `TrainingDataset` and you suspect it might be of interest, we would happily take a PR around this 😃

---

Hi,
I'm not sure exactly what `GenericShiftedDataset` you are inspecting, because this is mostly a helper class that is not directly instantiated by the models. Rather, if you call `fit()` on, say, a `TCNModel`, the model will instantiate a `PastCovariatesSequentialDataset` for you, which internally makes use of a `GenericShiftedDataset`. Let me try to answer your questions assuming you're using such a `PastCovariatesSequentialDataset`.
The `GenericShiftedDataset` is instantiated with `shift=input_chunk_length`, which corresponds to your `nlag`. So the "future" item in the returned tuples should come `nlag` steps after the start of the "past" items. The different samples returned by different calls to `__getitem__()`, though, are typically separated by one time step. So for instance, if you have a single training series with values `[1,2,3,4,5,6,7,8,9]`, `nlag=3` and `output_chunk_length=2`, the emitted training samples would be something like `([1,2,3], [4,5])`, `([2,3,4], [5,6])`, etc.
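This can be checked directly (same hedged assumptions about the import path as in the sketches above; the order in which samples are emitted may also differ):

```python
# Print the samples emitted for the toy series [1, ..., 9].
import numpy as np
from darts import TimeSeries
from darts.utils.data import PastCovariatesSequentialDataset

series = TimeSeries.from_values(np.arange(1.0, 10.0))

ds = PastCovariatesSequentialDataset(
    series,
    input_chunk_length=3,   # nlag
    output_chunk_length=2,
)

for i in range(len(ds)):
    sample = ds[i]
    past, future = sample[0], sample[-1]  # first and last tuple elements
    print(past.ravel(), "->", future.ravel())
```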
This should not be the case. Could you send a code snippet to reproduce this?

I hope my answer on point 1 answers this.

Your computation is more or less correct, except that `M` is taken as the maximum length of all your series. See the corresponding explanation in the dataset docstring.

I suspect this might be due to the behaviour I described just above. In case you want a different way to draw the samples, you can write your own implementation of `TrainingDataset` (a sketch follows below). Having a version that does not "force" the probability of 1/N of drawing each sample from each of the N series might also have some general interest, so if you have such a dataset implementation you can PR it 😉

This should not cause any issue.
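For what it's worth, here is a hedged sketch of what such a custom dataset could look like (module path and base-class name assumed from the darts version discussed in this thread; a sketch, not a definitive implementation):

```python
# Emit every stride-1 window of every series, instead of drawing each sample
# from one of the N series with probability 1/N.
from darts.utils.data.training_dataset import PastCovariatesTrainingDataset

class AllWindowsDataset(PastCovariatesTrainingDataset):
    """Emit every (past, future) window of every series, with stride 1."""

    def __init__(self, series_list, input_chunk_length, output_chunk_length):
        super().__init__()
        self.icl = input_chunk_length
        self.ocl = output_chunk_length
        # Precompute (series_values, window_start) pairs for all windows.
        self.index = []
        for s in series_list:
            vals = s.values(copy=False)
            n_windows = len(s) - input_chunk_length - output_chunk_length + 1
            for start in range(max(n_windows, 0)):
                self.index.append((vals, start))

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        vals, start = self.index[i]
        past = vals[start : start + self.icl]
        future = vals[start + self.icl : start + self.icl + self.ocl]
        # Expected tuple layout: (past_target, past_covariates, future_target);
        # this sketch does not handle covariates.
        return past, None, future

# Hypothetical usage: model.fit_from_dataset(AllWindowsDataset(targets, 10, 5))
```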
I hope this helps.