
Serialize TensorDataset

See original GitHub issue

Describe the issue: I am getting an error when trying to run NNI with a TensorDataset built from a parquet file. Is there a way to serialize it easily and properly?

Environment:

  • Training service (local|remote|pai|aml|etc): local
  • Client OS: Ubuntu
  • Python version: 3.9
  • Is conda/virtualenv/venv used?: Yes
  • Is running in Docker?: No

Configuration:

  • Experiment config (remember to remove secrets!):
  • Search space:

Log message:

TypeError: <torch.utils.data.dataset.TensorDataset object at 0x7fb348f84d60> of type <class 'torch.utils.data.dataset.TensorDataset'> is not supported to be traced. File an issue at https://github.com/microsoft/nni/issues if you believe this is a mistake.

PayloadTooLarge: Pickle too large when trying to dump <torch.utils.data.dataset.Subset object at 0x7fdc80f7dd30>. This might be caused by classes that are not decorated by @nni.trace. Another option is to force bytes pickling and try to raise pickle_size_limit.

ValueError: Serialization failed when trying to dump the model because payload too large (larger than 64 KB). This is usually caused by pickling large objects (like datasets) by mistake. See the full error traceback for details and https://nni.readthedocs.io/en/stable/NAS/Serialization.html for how to resolve such issue.

How to reproduce it?: The parquet file can be found here: https://www.kaggle.com/pythonash/end-to-end-simple-and-powerful-dnn-with-leakyrelu/data?select=train.parquet

import nni
import pandas as pd
import torch
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, random_split
from nni.retiarii import serialize  # in NNI 2.x, serialize is provided by nni.retiarii

@nni.trace
def preprocess_dataframe(parquet_file="../train.parquet"):
  # Load the parquet file and separate the features (df_x) from the target column (df_y).
  df = pd.read_parquet(parquet_file)
  df_x = df.astype('float16').drop(['time_id'], axis=1)
  df_y = pd.DataFrame(df['target'])
  df_x = df_x.astype('float16').drop(['target'], axis=1)
  # Standardize the 300 feature columns and investment_id.
  scaler = preprocessing.StandardScaler()
  df_x[[f'f_{i}' for i in range(300)]] = scaler.fit_transform(df_x[[f'f_{i}' for i in range(300)]])
  df_x['investment_id'] = scaler.fit_transform(pd.DataFrame(df_x['investment_id']))
  return df_x, df_y

data_df, target = preprocess_dataframe("/train_low_mem.parquet")
inputs, targets = data_df.values, target.values
# train and val are only used below to size the random split.
train, val = train_test_split(inputs, test_size=0.2)

# This is the failing pattern from the errors above: serialize() cannot trace the
# TensorDataset instance, and the Subsets produced by random_split are too large to pickle.
dataset = serialize(TensorDataset(torch.tensor(inputs, dtype=torch.float32),
                                  torch.tensor(targets, dtype=torch.float32)))

train_ds, val_ds = random_split(dataset, [train.shape[0], val.shape[0]])
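
For reference, a minimal sketch of the direction the maintainers suggest in the comments below: build the dataset (and the split) inside a function wrapped with @nni.trace, so that NNI only has to record the call and its arguments instead of pickling the tensors. get_split is a hypothetical helper, and the 80/20 lengths and the fixed seed are only illustrative.

@nni.trace
def get_split(parquet_file="/train_low_mem.parquet", split="train"):
  # Build the tensors and the split inside a traced function, so only this call
  # and its arguments need to be serialized, not the tensors themselves.
  data_df, target = preprocess_dataframe(parquet_file)
  dataset = TensorDataset(torch.tensor(data_df.values, dtype=torch.float32),
                          torch.tensor(target.values, dtype=torch.float32))
  n_train = int(0.8 * len(dataset))
  generator = torch.Generator().manual_seed(42)  # fixed seed so both calls agree on the split
  train_ds, val_ds = random_split(dataset, [n_train, len(dataset) - n_train], generator=generator)
  return train_ds if split == "train" else val_ds

train_ds = get_split(split="train")
val_ds = get_split(split="val")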

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 17 (6 by maintainers)

Top GitHub Comments

1 reaction
conceptofmind commented, Mar 18, 2022

@scarlett2018

Hi Scarlett,

I have made some progress but have yet to fully resolve the issue. I am still having a recurring problem with running out of memory during the training and search phase of neural architecture search. A few trial runs complete before the OOM error and program crash.

I will hopefully come up with a viable solution sooner rather than later. I will update this thread when I am able to resolve it or if I have further questions.

I appreciate you checking in.

Thank you,

Eric

0 reactions
matluster commented, Dec 12, 2022

It might look ugly, but my point here is that train_dataset and test_dataset shouldn’t be pickled directly. Instead, the process that constructs them should be pickled. You might be shooting for something like this:

import nni
import pandas as pd
from torch.utils.data import DataLoader

# MiAIRRDataset and PadSequence are the asker's own classes.
@nni.trace
def get_train_dataset():
  data = pd.read_csv("data/ERR_complete.tsv", sep='\t')
  training_data = data.sample(frac=0.90, random_state=25)
  return MiAIRRDataset(training_data, "data/M.fasta")

@nni.trace
def get_test_dataset():
  data = pd.read_csv("data/ERR_complete.tsv", sep='\t')
  # Re-derive the same training split (fixed random_state) so it can be dropped here.
  training_data = data.sample(frac=0.90, random_state=25)
  testing_data = data.drop(training_data.index)
  return MiAIRRDataset(testing_data, "data/M.fasta")

train_dataloader = nni.trace(DataLoader)(get_train_dataset(), batch_size=64, collate_fn=PadSequence())
test_dataloader = nni.trace(DataLoader)(get_test_dataset(), batch_size=64, collate_fn=PadSequence())

The idea here is: when the dataloader gets serialized, it resorts to serializing its init arguments, because DataLoader is wrapped by nni.trace. The first argument, get_train_dataset(), also remembers where it came from (the function get_train_dataset), so we never need to pickle the dataset itself; we only need to serialize a reference to get_train_dataset (which is essentially a path or a small binary).
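
A quick way to sanity-check this pattern, as a sketch assuming nni.dump and nni.load are available (as in NNI 2.x), is to dump the traced dataloader and confirm the payload stays small, since only the recorded calls end up in it:

import nni

payload = nni.dump(train_dataloader)   # should stay well under the 64 KB limit
print(type(payload), len(payload))
restored = nni.load(payload)           # re-creates the dataloader from the recorded calls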

Notice that the data file is read twice, once for training and once for testing. To save that IO, you could read the dataframe once into a global variable. Or, more elegantly, you could use a datamodule and put it into fit_kwargs. (I’m not sure whether a DataModule is compatible with the one-shot strategy, though.)
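
A rough sketch of the global-variable variant mentioned above (MiAIRRDataset is the asker's own class; _DATA and _load_data are hypothetical names introduced here):

import nni
import pandas as pd

_DATA = None  # module-level cache so the TSV is read only once per process

def _load_data():
  global _DATA
  if _DATA is None:
    _DATA = pd.read_csv("data/ERR_complete.tsv", sep='\t')
  return _DATA

@nni.trace
def get_train_dataset():
  training_data = _load_data().sample(frac=0.90, random_state=25)
  return MiAIRRDataset(training_data, "data/M.fasta")

@nni.trace
def get_test_dataset():
  data = _load_data()
  training_data = data.sample(frac=0.90, random_state=25)
  return MiAIRRDataset(data.drop(training_data.index), "data/M.fasta")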


