
Questions about items produced by dataset's __getitem__() for multi-target multi-series multi-horizon predictions

See original GitHub issue

I have a dataset consisting of multiple time series of varying length; you can think of each series as a recording from sensors on some device. The time index is equally spaced (say, one sample per second) and has no real meaning beyond that for each device. Say I want to use the TCN model to predict sensor values on a new recording.

Each series has an nlag by nf block of auto-regressive features and an nlag by nc block of past covariates, and I want to predict an npred by nf sequence. Say nlag=10, nf=4, nc=2, npred=5, and I have N=100 such series.
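For concreteness, the setup looks roughly like this (a minimal sketch with random data standing in for the real recordings; darts version ~0.19 assumed, and all names here are illustrative):

    import numpy as np
    from darts import TimeSeries
    from darts.models import TCNModel

    N, nf, nc = 100, 4, 2        # series count, target features, covariate features
    nlag, npred = 10, 5          # input and output chunk lengths

    rng = np.random.default_rng(0)
    lengths = rng.integers(30, 120, size=N)    # recordings of varying length

    # One (length, nf) target and one (length, nc) past-covariate series per
    # device, each with a plain integer time index starting at 0.
    targets = [TimeSeries.from_values(rng.normal(size=(L, nf))) for L in lengths]
    covariates = [TimeSeries.from_values(rng.normal(size=(L, nc))) for L in lengths]

    model = TCNModel(input_chunk_length=nlag, output_chunk_length=npred, n_epochs=1)
    model.fit(targets, past_covariates=covariates)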

When I look at the items produced by GenericShiftedDataset just before the call to self.fit_from_dataset in torch_forecasting_model.py, the items from train_dataset.__getitem__() look strange:

  1. First, the items are mostly shifted by one time step, but in GenericShiftedDataset they should be shifted by input_chunk_length, which in this case is nlag=10.
  2. In the tuple produced by __getitem__(), the last element has dimensions nlag by nf instead of npred by nf, and its first nlag-npred rows overlap with the last nlag-npred rows of the first tuple element. I would expect the last element to just be a time continuation of the first element for npred steps, thus forming an npred by nf array of future values we’re trying to predict.
  3. The number of items reported by __len__() doesn’t make sense either - if the series is shifted by 1 and there are M observations, then there should be M-npred-nlag items for each series, but this is not the case.
  4. Some entries produced by __getitem__() are duplicates for some sequential __getitem__(i) and __getitem__(i-1) indices i.

I have sample code which illustrates these issues - how can I share it? Am I doing something wrong with the time index of each series (right now each series just uses an integer counter from 0 to its length)?
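To inspect the same items without stepping through the model internals, the dataset can apparently be built directly (import path and constructor arguments assumed from darts ~0.19, reusing the names from the sketch above):

    from darts.utils.data import PastCovariatesSequentialDataset

    train_dataset = PastCovariatesSequentialDataset(
        targets,                      # target series from the sketch above
        covariates=covariates,        # past-covariate series
        input_chunk_length=nlag,      # 10
        output_chunk_length=npred,    # 5
    )
    print(len(train_dataset))                          # item count per __len__()
    item = train_dataset[0]
    print([x.shape for x in item if x is not None])    # shapes of the tuple elements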

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

1 reaction
hrzn commented, May 3, 2022

I think there might be a misunderstanding. What you describe as a shift sounds like a “stride”: by how much you slide a given window between consecutive samples. In all datasets implemented here, the stride is always 1, which is why your second series of length 16 yields 2 training samples. In contrast, the shift parameter refers to “the number of time steps by which to shift the output chunks relative to the input chunks”; so basically, if shift=5 and the input chunk starts at time t, then the output chunk will start at time t+5. I hope this clarifies it.
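A plain-NumPy sketch of the distinction, using the nlag=10, npred=5 numbers from this thread (this mimics the behaviour described above rather than calling darts internals):

    import numpy as np

    series = np.arange(1, 17)        # the length-16 series discussed above
    icl, ocl = 10, 5                 # nlag and npred
    shift = icl                      # shift = input_chunk_length

    samples = []
    for start in range(len(series) - shift - ocl + 1):   # stride 1 between samples
        past = series[start : start + icl]
        future = series[start + shift : start + shift + ocl]
        samples.append((past, future))

    print(len(samples))   # -> 2, the two training samples for a length-16 series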

If you implement your own TrainingDataset and you suspect it might be of interest, we would happily take a PR around this 😃

1 reaction
hrzn commented, Apr 21, 2022

Hi,

I’m not sure exactly which GenericShiftedDataset you are inspecting, because this is mostly a helper class that is not instantiated directly by the models. Rather, if you call fit() on, say, a TCNModel, the model will instantiate a PastCovariatesSequentialDataset for you, which internally makes use of a GenericShiftedDataset. Let me try to answer your questions assuming you’re using such a PastCovariatesSequentialDataset.

  1. First, the items are mostly shifted by one time step, but in GenericShiftedDataset they should be shifted by input_chunk_length, which in this case is nlag=10.

The GenericShiftedDataset is instantiated with shift=input_chunk_length, which corresponds to your nlag. So the “future” item in the returned tuples should start nlag steps after the start of the “past” items. The different samples returned by successive calls to __getitem__(), though, are typically separated by one time step. So for instance, if you have a single training series with values [1,2,3,4,5,6,7,8,9], nlag=3 and output_chunk_length=2, the emitted training samples would be something like this (a concrete check follows the list):

  • 0-th training sample: ([1,2,3], [4,5]),
  • 1st training sample: ([2,3,4], [5,6]), etc…
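To check this concretely, the samples can be printed straight from the dataset (import path and tuple layout assumed from darts ~0.19; the set of windows should match the stride-1 windows above, whatever order they are emitted in):

    import numpy as np
    from darts import TimeSeries
    from darts.utils.data import PastCovariatesSequentialDataset

    series = TimeSeries.from_values(np.arange(1.0, 10.0))   # [1, 2, ..., 9]
    ds = PastCovariatesSequentialDataset(
        series, input_chunk_length=3, output_chunk_length=2
    )

    for i in range(len(ds)):
        item = ds[i]
        past, future = item[0], item[-1]    # first and last tuple elements
        print(i, past.ravel(), future.ravel())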
  2. In the tuple produced by __getitem__(), the last element has dimensions nlag by nf instead of npred by nf

This should not be the case. Could you send a code snippet to reproduce this?

and its first nlag-npred rows overlap with the last nlag-npred rows of the first tuple element. I would expect the last element to just be a time continuation of the first element for npred steps, thus forming an npred by nf array of future values we’re trying to predict.

I hope my answer on point 1 answers this.

  3. The number of items reported by __len__() doesn’t make sense either - if the series is shifted by 1 and there are M observations, then there should be M-npred-nlag items for each series, but this is not the case.

Your computation is more or less correct, except that M is taken as the maximum length of all your series. See the corresponding explanation in the dataset docstring:

The sampling is uniform over the number of time series; i.e., the i-th sample of this dataset has
a probability 1/N of coming from any of the N time series in the sequence. If the time series have different
lengths, they will contain different numbers of slices. Therefore, some particular slices may
be sampled more often than others if they belong to shorter time series.
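Numerically, the docstring behaviour amounts to the following (the exact +1 is an assumption about this darts version; the key point is that M is the maximum length over all series):

    lengths = [30, 16, 50]            # example recording lengths
    nlag, npred = 10, 5
    # Every series contributes the same number of index slots, computed
    # from the LONGEST series:
    max_samples_per_ts = max(lengths) - nlag - npred + 1   # 50 - 15 + 1 = 36
    print(len(lengths) * max_samples_per_ts)               # 3 * 36 = 108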
  4. Some entries produced by __getitem__() are duplicates for some sequential __getitem__(i) and __getitem__(i-1) indices i.

I suspect this might be due to the behaviour I described just above. In case you want a different way to draw the samples, you can write your own implementation of TrainingDataset. Having a version that does not “force” the probability of 1/N of coming from each of the N series for each sample might also be of general interest, so if you have such a dataset implementation you can PR it 😉
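For reference, a rough skeleton of such a custom dataset (class name and tuple layout follow the PastCovariatesTrainingDataset interface as of darts ~0.19; treat the details as assumptions, not a tested implementation):

    import numpy as np
    from darts.utils.data.training_dataset import PastCovariatesTrainingDataset

    class ExhaustiveWindowDataset(PastCovariatesTrainingDataset):
        """Enumerates every stride-1 window of every series exactly once,
        so sampling is proportional to series length rather than uniform
        over series."""

        def __init__(self, target_series, covariates,
                     input_chunk_length, output_chunk_length):
            super().__init__()
            self.samples = []
            for ts, cov in zip(target_series, covariates):
                tgt, cv = ts.values(), cov.values()
                n = len(tgt) - input_chunk_length - output_chunk_length + 1
                for s in range(max(n, 0)):      # every valid window, stride 1
                    self.samples.append((
                        tgt[s : s + input_chunk_length],
                        cv[s : s + input_chunk_length],
                        tgt[s + input_chunk_length :
                            s + input_chunk_length + output_chunk_length],
                    ))

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            return self.samples[idx]

Such a dataset could then be passed to model.fit_from_dataset(), the entry point mentioned in the question.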

I have sample code which illustrates these issues - how can I share it? Am I doing something wrong with the time index of each series (right now each series just uses an integer counter from 0 to its length)?

This should not cause any issue.

I hope this helps.
