Support for parquet files with nested structures

I am trying to use Petastorm to pass a PySpark dataframe (read from a parquet file) to PyTorch.

In my case, the dataframe consists of 3 columns:

  • samples in columns 1 and 2 are 2D arrays of integers with size (None, 32)
  • samples in column 3 are scalars

When I create the Spark converter with make_spark_converter() and then call make_torch_dataloader(), columns 1 and 2 are ignored and I get the following warning: UserWarning: [ARROW-1644] Ignoring unsupported structure ListType(list<element: list<element: int32>>)
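The warning is triggered by the list-of-lists column type. One possible way around it, not proposed in this thread, is to store each (None, 32) sample as a flat 1D list together with its row count and to rebuild the 2D shape after loading. The sketch below shows only that round trip in plain Python. The helper names `flatten_sample` and `unflatten_sample` are hypothetical, and it assumes that a flat list<int32> column is not dropped by the reader, which should be verified against the Petastorm version in use.

```python
from typing import List, Tuple

INNER_DIM = 32  # fixed inner size from the issue: samples have shape (None, 32)


def flatten_sample(sample: List[List[int]]) -> Tuple[List[int], int]:
    """Flatten a (n_rows, INNER_DIM) nested list into a 1D list plus n_rows."""
    for row in sample:
        if len(row) != INNER_DIM:
            raise ValueError(f"expected rows of length {INNER_DIM}, got {len(row)}")
    flat = [value for row in sample for value in row]
    return flat, len(sample)


def unflatten_sample(flat: List[int], n_rows: int) -> List[List[int]]:
    """Rebuild the (n_rows, INNER_DIM) nested list from its flat form."""
    if len(flat) != n_rows * INNER_DIM:
        raise ValueError("flat length does not match n_rows * INNER_DIM")
    return [flat[i * INNER_DIM:(i + 1) * INNER_DIM] for i in range(n_rows)]


# Hypothetical sample with 3 rows of 32 integers each.
sample = [[r * INNER_DIM + c for c in range(INNER_DIM)] for r in range(3)]
flat, n_rows = flatten_sample(sample)
restored = unflatten_sample(flat, n_rows)
```

In a Spark pipeline the flattening would happen before the dataframe is handed to the converter, and the reshape would happen on the PyTorch side once a batch is loaded.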

As I understand it, Petastorm discards nested structures as a workaround for a pyarrow bug affecting parquet files with nested structures. However, that bug has been fixed since arrow version 2.0.0 (see Python notes HERE), and my parquet file can be read with pyarrow after upgrading to the latest version.

I would appreciate it if such parquet files could be supported by Petastorm.

Best, Mossad

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 19

Top GitHub Comments

mossadhelali commented, Jun 18, 2022 (1 reaction)

@baumanab I ended up padding the vectors to make them the same length and created a mask that is multiplied by the model outputs, i.e. similar to what is done in sequence models in NLP.
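A minimal sketch of the padding-and-mask idea described above, written in plain Python for clarity. The names `pad_with_mask` and `apply_mask` are hypothetical, and the sketch assumes the model emits one output per row of a sample; the commenter's actual code is not shown in the thread.

```python
from typing import List, Tuple


def pad_with_mask(
    sample: List[List[int]],
    max_rows: int,
    width: int = 32,
    pad_value: int = 0,
) -> Tuple[List[List[int]], List[int]]:
    """Pad a (n_rows, width) sample to max_rows rows and return a 0/1 row mask."""
    if len(sample) > max_rows:
        raise ValueError("sample has more rows than max_rows")
    n_pad = max_rows - len(sample)
    padded = [list(row) for row in sample] + [[pad_value] * width for _ in range(n_pad)]
    mask = [1] * len(sample) + [0] * n_pad  # 1 = real row, 0 = padding
    return padded, mask


def apply_mask(outputs: List[float], mask: List[int]) -> List[float]:
    """Zero out per-row model outputs that correspond to padding rows."""
    return [o * m for o, m in zip(outputs, mask)]


# Hypothetical batch: one sample with 2 rows, one with 1 row.
batch = [
    [[1] * 32, [2] * 32],
    [[3] * 32],
]
max_rows = max(len(s) for s in batch)
padded_batch, masks = zip(*(pad_with_mask(s, max_rows) for s in batch))
```

After padding, every sample has the same shape, so the columns are no longer variable-length. With PyTorch tensors the same masking step would be an elementwise multiplication of the output tensor by the mask tensor.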

@selitvin Unfortunately, I don’t have the time currently to work on a fix.

MrSquidward commented, Dec 31, 2021 (1 reaction)

Any updates on this issue? Will this workaround be removed in the next version?
