Unintuitive results working with nullable data (from parquet)

I’ve opened some trees with uproot4 and written the arrays to Parquet using ak.to_parquet (in some quick tests, reading doubly jagged arrays from Parquet files seemed much faster than reading them from TTrees). From the docs, I understand that when the Parquet file is read back, the type is not exactly the same because the fields are nullable. This appears to lead to some unintuitive results (for the same data, arrays is loaded from the Parquet file, while arrays_original is read directly via uproot):

In [1]: arrays_original["data.fJetPt"].layout
Out[1]: <NumpyArray format="f" shape="25109" data="22.3846 21.4953 20.3251 26.6239 23.0994 ... 21.6158 21.8578 22.133 26.5842 20.8287" at="0x7fafb6180000"/>

In [2]: arrays_original["data.fJetPt"] > 0
Out[2]: <Array [True, True, True, ... True, True, True] type='25109 * bool'>

In [3]: arrays["data.fJetPt"].layout
Out[3]:
<BitMaskedArray valid_when="true" length="25109" lsb_order="true">
    <mask><IndexU8 i="[255 255 255 255 255 ... 255 255 255 255 31]" offset="0" length="3139" at="0x0001114b0000"/></mask>
    <content><NumpyArray format="f" shape="25109" data="22.3846 21.4953 20.3251 26.6239 23.0994 ... 21.6158 21.8578 22.133 26.5842 20.8287" at="0x0001140be600"/></content>
</BitMaskedArray>

In [4]: arrays["data.fJetPt"] > 0
Out[4]: <Array [None, None, None, ... None, None, None] type='25109 * ?bool'>

The output from [4] is not intuitive from my perspective. My expectation was that it would evaluate to a mask and, if there were missing values (which isn’t the case here), either leave them as None or return False (because the condition couldn’t be evaluated). I think I can sort of follow what’s happening: the array is wrapped in a BitMaskedArray (hence the nullable type), so the > 0 comparison is interpreted as one that can’t be made, and it returns None. I also see that I can work with the arrays as normal after calling ak.fill_none(arrays["data.fJetPt"], 0), which makes the types non-nullable.
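
The remark that no values are actually missing can be checked against the mask printed in Out[3]. The following is a minimal pure-Python sketch, not Awkward Array's own implementation: `unpack_bitmask` is a hypothetical helper, and the mask bytes elided by "..." in the printout are assumed to be 255 like the ones that are shown.

```python
def unpack_bitmask(mask_bytes, length, lsb_order=True, valid_when=True):
    """Expand a packed bitmask into one boolean per element (True = valid)."""
    valid = []
    for i in range(length):
        byte = mask_bytes[i // 8]
        # With lsb_order, element i uses bit (i % 8) counted from the least significant bit.
        shift = i % 8 if lsb_order else 7 - (i % 8)
        bit = (byte >> shift) & 1
        valid.append(bool(bit) == valid_when)
    return valid


# Mask rebuilt from the printout: 3139 bytes for 25109 values, ending in 31.
# ASSUMPTION: every byte hidden behind "..." is 255.
length = 25109
mask_bytes = bytes([255] * 3138 + [31])

valid = unpack_bitmask(mask_bytes, length)
print(len(valid), all(valid))
```

Under that assumption every entry is valid: 3138 full bytes cover 25104 values, and the final byte 31 (binary 00011111) covers the remaining 5.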

I’m filing this as a bug report because the behavior seems confusing, but it could alternatively be treated as a documentation request: add a note to the Arrow conversion page that users will likely want to apply fill_none (not ideal if you actually mean to have None, but I’m not sure what to do in that case). Otherwise, many of the ak functions don’t seem to work on these arrays (for example, ak.to_numpy also doesn’t work), which can be rather confusing.

Thanks!

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:8 (8 by maintainers)

Top GitHub Comments

1 reaction
jpivarski commented, Aug 17, 2020

It’s merged. Enjoy!

1 reaction
jpivarski commented, Aug 17, 2020

(I find that if I don’t deal with these things right away, other things intervene and I end up never getting back to them. If it’s a bug, I don’t want it to go unfixed.)

