Support for parquet files with nested structures

I am trying to use Petastorm to pass a PySpark dataframe (read from a parquet file) to PyTorch.

In my case, the dataframe consists of 3 columns:

Samples in columns 1 and 2 are 2D arrays of integers with shape (None, 32), i.e. a variable number of rows with 32 elements each.
Samples in column 3 are scalars.

When I create the Spark converter with make_spark_converter() and then call make_torch_dataloader() on it, columns 1 and 2 are ignored and I get the following warning: UserWarning: [ARROW-1644] Ignoring unsupported structure ListType(list<element: list<element: int32>>)

As I understand it, discarding nested structures was added to Petastorm as a workaround for a pyarrow bug affecting parquet files with nested structures. However, that pyarrow bug has been fixed since Arrow version 2.0.0 (see Python notes HERE), and my parquet file can be read with pyarrow after upgrading to the latest version.

I would appreciate it if Petastorm could support such parquet files.

Best, Mossad

Issue Analytics

State:
Created 2 years ago
Comments: 19

Top GitHub Comments

1 reaction

mossadhelali commented, Jun 18, 2022

@baumanab I ended up padding the vectors to make them the same length and created a mask that is multiplied by the model outputs, i.e. similar to what is done in sequence models in NLP.
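A minimal sketch of the padding-and-mask workaround described above, in plain Python so it runs without Spark or PyTorch. The helper name pad_and_mask, the pad value 0, and the example data are assumptions for illustration; the thread does not show the actual code or say how the padded arrays were stored.

```python
from typing import List, Tuple

Row = List[int]      # one row of `width` integers
Sample = List[Row]   # a sample with a variable number of rows


def pad_and_mask(
    samples: List[Sample], width: int = 32, pad_value: int = 0
) -> Tuple[List[Sample], List[List[int]]]:
    """Pad every sample to the same number of rows and build a row mask.

    Hypothetical helper: mask[i][j] is 1 for a real row of sample i
    and 0 for a row that was added as padding.
    """
    max_rows = max((len(s) for s in samples), default=0)
    padded: List[Sample] = []
    mask: List[List[int]] = []
    for sample in samples:
        if any(len(row) != width for row in sample):
            raise ValueError(f"every row must have {width} elements")
        n_pad = max_rows - len(sample)
        # Copy the real rows, then append all-pad rows up to max_rows.
        padded.append(
            [list(row) for row in sample]
            + [[pad_value] * width for _ in range(n_pad)]
        )
        mask.append([1] * len(sample) + [0] * n_pad)
    return padded, mask


# Example data (made up): two samples with 3 rows and 1 row.
samples = [
    [[1] * 32, [2] * 32, [3] * 32],
    [[4] * 32],
]
padded, mask = pad_and_mask(samples)
```

After padding, every sample has the same shape (max_rows, 32), and the mask can be multiplied element-wise with per-row model outputs so that padded rows do not contribute, much like sequence masking in NLP models.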

@selitvin Unfortunately, I don’t have the time currently to work on a fix.

1 reaction

MrSquidward commented, Dec 31, 2021

Any updates on this issue? Will this workaround be removed in the next version?