Serialize TensorDataset
Describe the issue: I am getting an error when trying to run NNI with a TensorDataset built from a parquet file. Is there a way to serialize this easily and properly?
Environment:
- Training service (local|remote|pai|aml|etc): local
- Client OS: Ubuntu
- Python version: 3.9
- Is conda/virtualenv/venv used?: Yes
- Is running in Docker?: No
Configuration:
- Experiment config (remember to remove secrets!):
- Search space:
Log message:
TypeError: <torch.utils.data.dataset.TensorDataset object at 0x7fb348f84d60> of type <class 'torch.utils.data.dataset.TensorDataset'> is not supported to be traced. File an issue at https://github.com/microsoft/nni/issues if you believe this is a mistake.
PayloadTooLarge: Pickle too large when trying to dump <torch.utils.data.dataset.Subset object at 0x7fdc80f7dd30>. This might be caused by classes that are not decorated by @nni.trace. Another option is to force bytes pickling and try to raise pickle_size_limit.
ValueError: Serialization failed when trying to dump the model because payload too large (larger than 64 KB). This is usually caused by pickling large objects (like datasets) by mistake. See the full error traceback for details and https://nni.readthedocs.io/en/stable/NAS/Serialization.html for how to resolve such issue.
How to reproduce it?: The parquet file can be found here: https://www.kaggle.com/pythonash/end-to-end-simple-and-powerful-dnn-with-leakyrelu/data?select=train.parquet
import nni
import pandas as pd
import torch
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, random_split
from nni.retiarii import serialize  # assuming `serialize` comes from NNI's Retiarii API

@nni.trace
def preprocess_dataframe(parquet_file="../train.parquet"):
    df = pd.read_parquet(parquet_file)
    df_x = df.astype('float16').drop(['time_id'], axis=1)
    df_y = pd.DataFrame(df['target'])
    df_x = df_x.astype('float16').drop(['target'], axis=1)
    scaler = preprocessing.StandardScaler()
    df_x[[f'f_{i}' for i in range(300)]] = scaler.fit_transform(df_x[[f'f_{i}' for i in range(300)]])
    df_x['investment_id'] = scaler.fit_transform(pd.DataFrame(df_x['investment_id']))
    return df_x, df_y

data_df, target = preprocess_dataframe("/train_low_mem.parquet")
inputs, targets = data_df.values, target.values
train, val = train_test_split(inputs, test_size=0.2)
dataset = serialize(TensorDataset(torch.tensor(inputs, dtype=torch.float32),
                                  torch.tensor(targets, dtype=torch.float32)))
train_ds, val_ds = random_split(dataset, [train.shape[0], val.shape[0]])
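For reference, a minimal sketch of how the last few lines above could be restructured so the dataset is rebuilt from a traced call instead of being serialized after construction. The build_split helper, the 0.2 validation fraction, and the fixed seed are illustrative assumptions, not code from the original issue:

import nni
import torch
from torch.utils.data import TensorDataset, random_split

@nni.trace
def build_split(parquet_file="/train_low_mem.parquet", val_fraction=0.2, split="train"):
    # Because the function is traced, NNI records this call (symbol + arguments)
    # and re-runs it in each trial instead of pickling the tensors themselves.
    data_df, target = preprocess_dataframe(parquet_file)
    full = TensorDataset(torch.tensor(data_df.values, dtype=torch.float32),
                         torch.tensor(target.values, dtype=torch.float32))
    val_len = int(val_fraction * len(full))
    # Fixed seed so the train/val split is identical across repeated calls.
    generator = torch.Generator().manual_seed(42)
    train_ds, val_ds = random_split(full, [len(full) - val_len, val_len], generator=generator)
    return train_ds if split == "train" else val_ds

train_ds = build_split(split="train")
val_ds = build_split(split="val")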
Top GitHub Comments
@scarlett2018
Hi Scarlett,
I have made some progress but have yet to fully resolve the issue. I am still having a recurring problem with running out of memory during the training and search phases of neural architecture search. A few trials complete before an OOM error crashes the program.
I hope to come up with a viable solution sooner rather than later. I will update this thread when I resolve it or have further questions.
I appreciate you checking in.
Thank you,
Eric
It might look ugly, but my point here is that train_dataset and test_dataset shouldn't be directly pickled. Instead, their constructing process should be pickled. You might be shooting for something like the sketch below. The idea here is: when the dataloader gets serialized, it resorts to serializing its init arguments, because DataLoader is wrapped by nni.trace. The first argument, get_train_dataset(), also remembers where it comes from (i.e., from the function get_train_dataset), so we don't need to pickle the dataset itself; we only need to serialize the function get_train_dataset (which is probably some path or binary).
Notice that the data file is read twice, once for training and once for testing. To save that IO, you might use a global data variable and store the dataframe there. Or, more elegantly, you could use a DataModule and put that into fit_kwargs. (I'm not sure whether DataModule is compatible with one-shot strategy, though.)
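A minimal sketch of that pattern, reusing preprocess_dataframe from the issue; the get_train_dataset helper, batch size, and file path are illustrative assumptions rather than the maintainer's actual snippet:

import nni
import torch
from torch.utils.data import DataLoader, TensorDataset

@nni.trace
def get_train_dataset(parquet_file="/train_low_mem.parquet"):
    # Rebuilds the dataset from disk; since the function is traced, only the
    # call (function + arguments) is serialized, never the tensors.
    df_x, df_y = preprocess_dataframe(parquet_file)
    return TensorDataset(torch.tensor(df_x.values, dtype=torch.float32),
                         torch.tensor(df_y.values, dtype=torch.float32))

# Wrapping DataLoader in nni.trace makes serialization fall back to its init
# arguments (the traced get_train_dataset() call) instead of pickling the data.
train_loader = nni.trace(DataLoader)(get_train_dataset(), batch_size=256, shuffle=True)

When such a loader is handed to the evaluator, NNI stores the call graph rather than the dataset bytes, which is what avoids the PayloadTooLarge error; caching the dataframe in a module-level variable inside get_train_dataset would avoid reading the parquet file twice for the train and test loaders.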