Support for parquet files with nested structures
I am trying to use Petastorm to pass a PySpark dataframe (read from a parquet file) to PyTorch.
In my case, the dataframe consists of 3 columns:
- samples in columns 1 and 2 are 2D arrays of integers with size (None, 32)
- samples in column 3 are scalars
When making the Spark converter using make_spark_converter() and make_torch_dataloader(), columns 1 and 2 are ignored and I get the following warning:
UserWarning: [ARROW-1644] Ignoring unsupported structure ListType(list<element: list<element: int32>>)
As I understand it, discarding nested structures was added to Petastorm as a workaround for a pyarrow bug affecting parquet files with nested structures. However, that bug has been fixed since Arrow version 2.0.0 (see the Python notes for that release), and my parquet file can be read with pyarrow after upgrading to the latest version.
I would appreciate it if such parquet files could be supported by Petastorm.
Best, Mossad
Issue Analytics
- Created 2 years ago
- Comments:19
Top GitHub Comments
@baumanab I ended up padding the vectors to make them the same length and created a mask that is multiplied by the model outputs, i.e. similar to what is done in sequence models in NLP.
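For anyone landing here, the padding-and-mask idea described above might look roughly like the sketch below. It is only an illustration under assumptions: pad_and_mask is a hypothetical helper (not part of Petastorm or PyTorch), the row width of 32 comes from the (None, 32) shape in the issue, and plain Python lists are used so the snippet runs without dependencies. In a real pipeline the same logic would more likely be written with NumPy arrays or torch tensors.

```python
def pad_and_mask(sample, max_len, width=32, pad_value=0):
    """Pad a (n, width) list of rows to (max_len, width) and build a 0/1 row mask.

    Hypothetical helper for illustration only; not a Petastorm API.
    """
    n = len(sample)
    if n > max_len:
        raise ValueError(f"sample has {n} rows, more than max_len={max_len}")
    # Copy the real rows, then append all-padding rows up to max_len.
    padded = [list(row) for row in sample]
    padded += [[pad_value] * width for _ in range(max_len - n)]
    # 1 marks a real row, 0 marks a padding row.
    mask = [1] * n + [0] * (max_len - n)
    return padded, mask


# Two variable-length samples of shape (None, 32).
samples = [
    [[1] * 32, [2] * 32],  # 2 rows
    [[3] * 32],            # 1 row
]
max_len = max(len(s) for s in samples)
batch = [pad_and_mask(s, max_len) for s in samples]

# Masking per-row model outputs so that padding rows contribute nothing.
outputs = [0.5, 0.7]  # stand-in values for the second sample's per-row outputs
_, mask = batch[1]
masked = [o * m for o, m in zip(outputs, mask)]
```

After padding, every sample has the same number of rows, and multiplying the outputs by the mask zeroes out the positions that correspond to padding, much like attention or loss masks in NLP sequence models.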
@selitvin Unfortunately, I don’t have the time currently to work on a fix.
Any updates on this issue? Will this workaround be removed in the next version?