
Validation Data is not generated properly for my dataset

See original GitHub issue

Hi, Really appreciate your work on the TFT.

I am trying to use my own dataset in the code, but there seems to be a bug due to which the dataset is not being loaded properly for validation. The train dataloader is fine, but the validation dataloader has only one batch, and the validation TimeSeriesDataSet has only one entry.

Below is my complete code:

import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet

data = load_csv()  # my own helper that reads the CSV into a DataFrame
data['date'] = pd.to_datetime(data['date'])
data.reset_index(inplace=True, drop=True)
data.reset_index(inplace=True)
data.rename(columns={'index': 'time_idx'}, inplace=True)  # I use the row index as time_idx since my data is at minute frequency

validation_len = int(len(data) * 0.1)
training_cutoff = len(data) - validation_len

max_encode_length = 36
max_prediction_length = 6

print('Len of training data is : ',len(data[:training_cutoff]))
print('Len of val data is : ',len(data[training_cutoff:]))
training = TimeSeriesDataSet(
    data[:training_cutoff],
    time_idx="time_idx",
    target="T",
    group_ids=["Symbol"],
    max_encoder_length=max_encode_length,
    max_prediction_length=max_prediction_length,
    static_categoricals=["Symbol"],
    static_reals=[],
    time_varying_known_categoricals=[
        "hour_of_day",
        "day_of_week",
    ],
    time_varying_known_reals=[
        "time_idx",
    ],
    time_varying_unknown_categoricals=[],
    time_varying_unknown_reals=["V1", "V2","V3", "T", "V4"],
    constant_fill_strategy={"T": 0},
    dropout_categoricals=[],
)
print('Max Prediction Index : ',training.index.time.max())
validation = TimeSeriesDataSet.from_dataset(training, data, min_prediction_idx=training.index.time.max()+1)
batch_size = 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=1)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=1)

print(len(training), len(validation))
print(len(train_dataloader), len(val_dataloader))

This is what the code prints:

Len of training data is : 25920
Len of val data is : 2880
Max Prediction Index :  25919

25920 1
202 1

You can see that the training dataset is fine and its batches look okay, but the validation dataset has length 1 and its dataloader yields only a single batch.

One more thing: if I use predict=False, the validation data is generated correctly, but another bug arises from that. If I use predict=True, only one batch with one sequence is returned, and if I use predict_mode=True on the training dataset, it also generates only one batch.
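The predict=True behavior can be illustrated with a small pandas sketch (the Symbol and time_idx column names mirror the code above; the window selection is a simplified re-creation of the idea, not the library's actual implementation): in predict mode, only the last decoder window per group is kept, so the number of samples equals the number of groups, which here is one.

```python
import pandas as pd

# Toy frame mimicking the setup: a single group ("Symbol"), minute-indexed rows.
df = pd.DataFrame({
    "Symbol": ["A"] * 100,
    "time_idx": range(100),
})

# Emulate "keep only the last prediction window per group" (hypothetical
# re-creation of the predict=True selection, not the library's code):
last_windows = (
    df.groupby("Symbol")["time_idx"]
      .max()
      .rename("decoder_end")
      .reset_index()
)
print(len(last_windows))  # one sample per group -> 1
```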

Here is a sample of my CSV sample_data.csv.zip

Please help.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction
jdb78 commented, Jul 26, 2020

The first error is due to an unfortunate default in the from_dataset method. The default was predict=True, which causes the dataset to select only the last sample per time series for prediction - as you have only one time series, there is only one sample. I changed the default to predict=False. You might now want to pass stop_randomization=True for the validation dataset.
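As a rough sanity check of what predict=False should yield, here is back-of-the-envelope arithmetic with the numbers from the issue (this assumes every time step from min_prediction_idx onward that still fits a full decoder window can start a prediction; constraints such as minimum encoder length may reduce the actual count):

```python
# Numbers from the issue: 25920 training rows + 2880 validation rows.
n_rows = 28800
min_prediction_idx = 25920   # training.index.time.max() + 1
max_prediction_length = 6

# A decoder window of length 6 starting at index s occupies s .. s+5,
# so the last valid start index is (n_rows - 1) - (max_prediction_length - 1).
last_start = (n_rows - 1) - (max_prediction_length - 1)
n_samples = last_start - min_prediction_idx + 1
print(n_samples)  # 2875 candidate prediction windows, not 1
```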

The second error is a genuine off-by-one error, which I just fixed. Thanks for bringing this to my attention! The fix is pushed to master, so installing from git should do the job.
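Until the fix reaches a release, installing from master might look like this (the GitHub path is assumed from the commenter's handle; adjust if the repository has moved):

```shell
# Install pytorch-forecasting directly from the current master branch
pip install --upgrade git+https://github.com/jdb78/pytorch-forecasting.git
```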

0 reactions
jdb78 commented, Aug 25, 2020

Closing this as there were numerous fixes to the package - thanks again for reporting this issue! Please feel encouraged to raise a new issue in case you encounter any bugs.
