Validation Data is not generated properly for my dataset
Hi, really appreciate your work on the TFT.
I am trying to use my own dataset, but there seems to be a bug that prevents the validation data from being loaded properly. The train dataloader is fine, but the validation dataloader has only one batch, and the validation `TimeSeriesDataSet` has only one entry.
Below is my complete code:
```python
data = load_csv()
data['date'] = pd.to_datetime(data['date'])
data.reset_index(inplace=True, drop=True)
data.reset_index(inplace=True)
data.rename(columns={'index': 'time_idx'}, inplace=True)  # I use the index as time_idx since my data is of minute frequency

validation_len = int(len(data) * 0.1)
training_cutoff = int(len(data)) - validation_len
max_encode_length = 36
max_prediction_length = 6

print('Len of training data is : ', len(data[:training_cutoff]))
print('Len of val data is : ', len(data[training_cutoff:]))

training = TimeSeriesDataSet(
    data[:training_cutoff],
    time_idx="time_idx",
    target="T",
    group_ids=["Symbol"],
    max_encoder_length=max_encode_length,
    max_prediction_length=max_prediction_length,
    static_categoricals=["Symbol"],
    static_reals=[],
    time_varying_known_categoricals=["hour_of_day", "day_of_week"],
    time_varying_known_reals=["time_idx"],
    time_varying_unknown_categoricals=[],
    time_varying_unknown_reals=["V1", "V2", "V3", "T", "V4"],
    constant_fill_strategy={"T": 0},
    dropout_categoricals=[],
)

print('Max Prediction Index : ', training.index.time.max())
validation = TimeSeriesDataSet.from_dataset(training, data, min_prediction_idx=training.index.time.max() + 1)

batch_size = 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=1)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=1)

print(len(training), len(validation))
print(len(train_dataloader), len(val_dataloader))
```
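As a quick sanity check on the split arithmetic: the printed lengths imply a total of 28800 rows (an inferred figure, since 25920 + 2880 = 28800):

```python
# Sanity check of the 10% holdout split. The 28800-row total is an
# assumption inferred from the printed train/val lengths in this issue.
n_rows = 28800
validation_len = int(n_rows * 0.1)        # 10% holdout
training_cutoff = n_rows - validation_len  # everything before this is training
print(training_cutoff, validation_len)     # -> 25920 2880
```

With a zero-based `time_idx`, the last of the 25920 training rows sits at index 25919, which matches the printed maximum prediction index.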
This is what the code prints:

```
Len of training data is :  25920
Len of val data is :  2880
Max Prediction Index :  25919
25920 1
202 1
```
You can see that the training dataset is fine and its batches look okay, but the validation dataloader has only one batch and the validation dataset has length 1.
One more thing: if I use `predict=False`, the validation data is generated correctly, but another bug then arises. If I use `predict=True`, only one batch with a single sequence is produced; setting `predict_mode=True` on the training dataset likewise generates only one batch.
Here is a sample of my CSV: sample_data.csv.zip
Please help!
Issue Analytics
- Created: 3 years ago
- Comments: 11 (6 by maintainers)
The first error is due to an unfortunate default in the `from_dataset` method. The default was `predict=True`, which causes the dataset to select only the last sample per time series for prediction; as you have only one time series, there is only one sample. I changed the default to `predict=False`. You might now also want to pass `stop_randomization=True` for the validation dataset.

The second error is a genuine off-by-one error, which I just fixed. Thanks for bringing this to my attention! The fix is pushed to master, so installing from git should do the job.
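To make this concrete, here is a minimal, library-free sketch of what `predict=True` does: it keeps only the last possible decoder window per time series, so a dataset with a single series collapses to a single sample. The function and names below are illustrative, not pytorch-forecasting internals.

```python
# Illustrative sketch (not pytorch-forecasting internals) of how
# predict=True reduces each time series to its final window.
def decoder_start_indices(n_timesteps, encoder_len, pred_len, predict=False):
    """All valid decoder start positions for one time series."""
    first = encoder_len            # need a full encoder history first
    last = n_timesteps - pred_len  # the decoder must fit inside the series
    starts = list(range(first, last + 1))
    if predict:
        starts = starts[-1:]       # keep only the final window per series
    return starts

# One series of 25920 minutes, encoder=36, decoder=6 (numbers from the issue):
print(len(decoder_start_indices(25920, 36, 6)))                # many windows
print(len(decoder_start_indices(25920, 36, 6, predict=True)))  # 1
```

In the actual API, the fix described above amounts to passing `predict=False` (and, as suggested, `stop_randomization=True`) to `TimeSeriesDataSet.from_dataset` when building the validation set.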
Closing this, as there were numerous fixes to the package - thanks again for reporting this issue! Please feel encouraged to raise a new issue if you encounter any further bugs.