
Writing a non-minimal categorical/dictionary to Parquet corrupts it

See original GitHub issue

Moving @drahnreb’s comment to a new issue, because it’s unrelated to Parquet file partitioning. Original text below:

I found a rather weird behaviour that seems to be linked to partitioned parquets and my suggested “workaround”.

# current workaround with pq directly
table = pq.read_table(parquet_dir)
ak.from_arrow(table)

As this is not a supported feature, I did not want to open a bug report, but it seems worth sharing because it took me a while to figure out that the workaround above causes the following issue: slicing previously loaded partitioned data (at a low level), saving it to Parquet, and reloading it results in a ValueError: in IndexedArray32 attempting to get xxx, index[i] >= len(content).

Working Example: A bit more extensive than strictly necessary, so that it explains the principles of what I am trying to achieve; I realized that the combination of certain aspects matters, especially details about the contained dtypes and conversions. I couldn’t figure out what is happening. I first suspected wrong advanced slicing (#370) or the Awkward-Parquet-Arrow type conversion as described here or in #393. It was difficult to reproduce the results, so this might not really be a minimal example. Sorry for that!

# imports needed below
import numpy as np
import pandas as pd
import pyarrow.parquet as pq
import awkward1 as ak

# let's create some fake data that will be saved as parquet
df = pd.DataFrame([{"id": n,
                    "label": np.random.choice(["OK", "NOK"], 1)[0],
                    "year": np.random.choice([2019, 2020], 1)[0],
                    "arr": np.random.rand(np.random.randint(0, 100))}
                   for n in range(10000)])
# cast label
df.label = df.label.astype("category")
df.year = df.year.astype("uint32")
# index
df.set_index(['id'], inplace=True)
# save as partitioned parquet
df.to_parquet('test.parquet',
              partition_cols=['year'], # this seems to be crucial
              version='2.0',
              data_page_version='2.0')
# let's get started. reload
table = pq.read_table('test.parquet')
# load as awkward array
original = ak.from_arrow(table)
>>> original.layout
<RecordArray>
    <field index="0" key="label">
        <IndexedArray64>
            <index><Index64 i="[0 0 0 1 0 ... 19 19 19 18 18]" offset="0" length="20026" at="0x55d32435ff10"/></index>
            <content><ListArray64>
                <parameters>
                    <param key="__array__">"string"</param>
                </parameters>
                <starts><Index64 i="[0 3 5 8 10 ... 38 40 43 45 48]" offset="0" length="20" at="0x55d31bdf77d0"/></starts>
                <stops><Index64 i="[3 5 8 10 13 ... 40 43 45 48 50]" offset="0" length="20" at="0x55d31be3ff70"/></stops>
                <content><NumpyArray format="B" shape="50" data="78 79 75 79 75 ... 78 79 75 79 75" at="0x55d30a4851b0">
                    <parameters>
                        <param key="__array__">"char"</param>
                    </parameters>
                </NumpyArray></content>
            </ListArray64></content>
        </IndexedArray64>
    </field>
    <field index="1" key="arr">
        <ListArray64>
            <starts><Index64 i="[0 26 46 54 81 ... 995057 995093 995152 995214 995284]" offset="0" length="20026" at="0x55d3243870f0"/></starts>
            <stops><Index64 i="[26 46 54 81 150 ... 995093 995152 995214 995284 995357]" offset="0" length="20026" at="0x55d3243ae2d0"/></stops>
            <content><IndexedOptionArray64>
                <index><Index64 i="[0 1 2 3 4 ... 995352 995353 995354 995355 995356]" offset="0" length="995357" at="0x55d328748590"/></index>
                <content><NumpyArray format="f" shape="995357" data="0.726096 0.590173 0.699854 0.116417 0.743831 ... 0.0755084 0.756987 0.240388 0.158446 0.722705" at="0x55d312824ab0"/></content>
            </IndexedOptionArray64></content>
        </ListArray64>
    </field>
    <field index="2" key="id">
        <IndexedOptionArray64>
            <index><Index64 i="[0 1 2 3 4 ... 20021 20022 20023 20024 20025]" offset="0" length="20026" at="0x55d3243d54b0"/></index>
            <content><NumpyArray format="l" shape="20026" data="2 3 4 8 9 ... 9984 9987 9992 9996 9998" at="0x55d312bf0b30"/></content>
        </IndexedOptionArray64>
    </field>
    <field index="3" key="year">
        <NumpyArray format="i" shape="20026" data="2019 2019 2019 2019 2019 ... 2020 2020 2020 2020 2020" at="0x55d3233b27b0"/>
    </field>
</RecordArray>

The main focus is on the label field: a ListArray64 nested inside an IndexedArray64.

Let’s now mask the array and slice based on indices within each variable length array.

# might be irrelevant
remove = np.random.choice([False, True], size=len(original))
masked = original[~remove].copy()
# "materialize"
masked = ak.flatten(masked, axis=0)
# slice nested arrays based on #370 (unofficial slicing)
starts = np.asarray(masked.arr.layout.starts).astype("uint64")
stops = np.asarray(masked.arr.layout.stops).astype("uint64")
# gather plausible indices with intrinsic information (here randomized)
def _get_each_idx(starts, stops):
    diffs = stops-starts
    nzd = diffs[np.where(diffs > 0)[0]]
    rs = np.random.randint(np.zeros(len(nzd)).astype(int), nzd)
    diffs[np.where(diffs != 0)[0]] = rs
    return diffs.astype("uint64")
# set new starts and stops
start_idx = _get_each_idx(starts, stops)
new_starts = start_idx + starts
end_idx = _get_each_idx(new_starts, stops)
new_stops = end_idx + new_starts
# change nested arrays (based on unofficial lowlevel layout)
layout = masked.layout
fields = []
for k in layout.keys():
    if k not in ['arr']:
        fields.append(layout[k])
    else:
        fields.append(ak.layout.ListArray64(
            ak.layout.Index64(new_starts),
            ak.layout.Index64(new_stops),
            layout[k].content
            )
        )

# main result. reconstruct back to a RecordArray.   
sliced = ak.layout.RecordArray(
        fields,
        layout.keys()
    )

While this is low-level composition of arrays, it produces consistent arrays and works until the result is exported to Parquet.

>>> ak.is_valid(original), ak.validity_error(original), ak.is_valid(masked), ak.validity_error(masked), ak.is_valid(sliced), ak.validity_error(sliced)
(True, None, True, None, True, None)

The problem starts when reloading data.

>>> sliced_arr = ak.Array(sliced)
>>> sliced_arr = ak.flatten(sliced_arr, axis=0)
>>> ak.to_parquet(sliced_arr, "tmp.parquet")

>>> loaded = ak.from_parquet("tmp.parquet")
>>> loaded
<Array [{label: 'NOK', arr: [, ... year: 2020}] type='10054 * {"label": string, ...'>
>>> ak.is_valid(loaded), ak.validity_error(loaded)
(False,
 'at layout.field(0) (IndexedArray32): index[i] >= len(content) at i=4936')

Individual entries can still be accessed along the entire axis, both before saving and after reloading:

>>> sliced_arr.arr[10000]
<Array [0.982, 0.695, 0.111, ... 0.637, 0.984] type='77 * ?float32'>
>>> sliced_arr.label[10000]
'NOK'
>>> loaded.arr[10000]
<Array [0.982, 0.695, 0.111, ... 0.637, 0.984] type='77 * ?float32'>

The label field, however, throws ValueError: in IndexedArray32 attempting to get 4936, index[i] >= len(content) for some entries. The loop below prints the indices at which it fails:

for i in range(len(loaded)):
    try:
        loaded[i, 'label'].tolist()
    except ValueError:        
        print(i)
        pass

@jpivarski

_Originally posted by @drahnreb in https://github.com/scikit-hep/awkward-1.0/issues/368#issuecomment-676286664_

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
jpivarski commented, Aug 19, 2020

I’ve reported this as ARROW-9801. Meanwhile, I’ll look into fixing things on our side so that we don’t hit this, and gain “categorical” as a distinct thing from IndexedArray.
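
A sketch of what that could look like from the user’s side, assuming an ak.to_categorical function along those lines (my illustration, not part of the comment above):

import awkward1 as ak

# Assumed usage: to_categorical would deduplicate the underlying
# IndexedArray so that the content holds each category exactly once.
labels = ak.Array(["OK", "NOK", "OK", "OK", "NOK"])
categorical = ak.to_categorical(labels)
ak.to_parquet(categorical, "labels.parquet")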

1 reaction
jpivarski commented, Aug 19, 2020

Thanks for reporting this issue, though I have to complain a bit about the reproducer: the relevant part is between the sliced_arr, which is valid, and the loaded, which is not. Building arrays from handmade layouts is supported (it’s part of the public API), but it’s exactly the sort of thing that leads to validity errors.

However, in this case it really is a bug: there’s an error in the writing of non-minimal categorical/dictionary data (what Awkward calls IndexedArrays) to Parquet. Actually, I think it’s a pyarrow bug and I’ll be reporting it on their JIRA right after this.

Your label is a categorical variable. Perhaps because it was loaded from partitioned data, the labels are not minimal: "OK" and "NOK" are the unique values, so only two categories are strictly needed, but the array has about ten of them. That’s not wrong, just slightly less efficient (though if you were using the integers themselves for equality checks, which one typically does with categorical data but Awkward doesn’t, then it would be wrong).

Here’s a small example:

>>> import awkward1 as ak
>>> import numpy as np
>>> categories = ak.Array(["one", "two", "three", "one", "two", "three"])
>>> index = ak.layout.Index32(np.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], np.int32))
>>> indexedarray = ak.layout.IndexedArray32(index, categories.layout)
>>> original = ak.Array(indexedarray)
>>> original
<Array ['one', 'two', ... 'two', 'three'] type='12 * string'>
>>> original.tolist()
['one', 'two', 'three', 'one', 'two', 'three', 'one', 'two', 'three', 'one', 'two', 'three']

It’s non-minimal because we could have represented this with just three categories and an index of values in [0, 3). This converts to and from Arrow without any problems:

>>> roundtrip = ak.from_arrow(ak.to_arrow(original))
>>> roundtrip.tolist()
['one', 'two', 'three', 'one', 'two', 'three', 'one', 'two', 'three', 'one', 'two', 'three']

>>> original.layout
<IndexedArray32>
    <index><Index32 i="[0 1 2 3 4 ... 1 2 3 4 5]" offset="0" length="12" at="0x55edccd57930"/></index>
    <content><ListOffsetArray64>
        <parameters>
            <param key="__array__">"string"</param>
        </parameters>
        <offsets><Index64 i="[0 3 6 11 14 17 22]" offset="0" length="7" at="0x55edccdae190"/></offsets>
        <content><NumpyArray format="B" shape="22" data="111 110 101 116 119 ... 116 104 114 101 101" at="0x55edccd3ded0">
            <parameters>
                <param key="__array__">"char"</param>
            </parameters>
        </NumpyArray></content>
    </ListOffsetArray64></content>
</IndexedArray32>

>>> roundtrip.layout
<IndexedArray32>
    <index><Index32 i="[0 1 2 3 4 ... 1 2 3 4 5]" offset="0" length="12" at="0x55edccd57930"/></index>
    <content><ListOffsetArray32>
        <parameters>
            <param key="__array__">"string"</param>
        </parameters>
        <offsets><Index32 i="[0 3 6 11 14 17 22]" offset="0" length="7" at="0x55edccdc8e60"/></offsets>
        <content><NumpyArray format="B" shape="22" data="111 110 101 116 119 ... 116 104 114 101 101" at="0x55edccd3ded0">
            <parameters>
                <param key="__array__">"char"</param>
            </parameters>
        </NumpyArray></content>
    </ListOffsetArray32></content>
</IndexedArray32>
>>> np.asarray(original.layout.index)
array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], dtype=int32)
>>> ak.to_list(original.layout.content)
['one', 'two', 'three', 'one', 'two', 'three']

>>> np.asarray(roundtrip.layout.index)
array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], dtype=int32)
>>> ak.to_list(roundtrip.layout.content)
['one', 'two', 'three', 'one', 'two', 'three']
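
For comparison, here is a minimal form of the same data, built by hand with the same layout API (my own construction, just to illustrate what “minimal” means here):

>>> minimal_index = ak.layout.Index32(np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2], np.int32))
>>> minimal_categories = ak.Array(["one", "two", "three"])
>>> minimal = ak.Array(ak.layout.IndexedArray32(minimal_index, minimal_categories.layout))
>>> minimal.tolist()
['one', 'two', 'three', 'one', 'two', 'three', 'one', 'two', 'three', 'one', 'two', 'three']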

However, when pyarrow writes it to Parquet (using the output of ak.to_arrow), it correctly minimizes the set of categories but garbles the index (probably some interaction with Parquet’s very weird “definition levels” and “repetition levels”):

>>> ak.to_parquet(original, "tmp.parquet")
>>> loaded = ak.from_parquet("tmp.parquet")
>>> loaded.layout
<IndexedArray32>
    <index><Index32 i="[0 1 2 3 0 ... 1 2 3 0 1]" offset="0" length="12" at="0x7f25a7e00280"/></index>
    <content><ListOffsetArray32>
        <parameters>
            <param key="__array__">"string"</param>
        </parameters>
        <offsets><Index32 i="[0 3 6 11]" offset="0" length="4" at="0x7f25a7e00240"/></offsets>
        <content><NumpyArray format="B" shape="11" data="111 110 101 116 119 ... 116 104 114 101 101" at="0x7f25a7e002c0">
            <parameters>
                <param key="__array__">"char"</param>
            </parameters>
        </NumpyArray></content>
    </ListOffsetArray32></content>
</IndexedArray32>
>>> np.asarray(loaded.layout.index)
array([0, 1, 2, 3, 0, 1, 1, 1, 2, 3, 0, 1], dtype=int32)    # <---- BAD!
>>> ak.to_list(loaded.layout.content)
['one', 'two', 'three']

The index ought to be [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]. Even pyarrow, by itself, doesn’t like it:

>>> import pyarrow.parquet
>>> table = pyarrow.parquet.read_table("tmp.parquet")
>>> table
pyarrow.Table
: dictionary<values=string, indices=int32, ordered=0>
>>> table.to_pydict()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 1587, in pyarrow.lib.Table.to_pydict
  File "pyarrow/table.pxi", line 405, in pyarrow.lib.ChunkedArray.to_pylist
  File "pyarrow/array.pxi", line 1144, in pyarrow.lib.Array.to_pylist
  File "pyarrow/scalar.pxi", line 712, in pyarrow.lib.DictionaryScalar.as_py
  File "pyarrow/scalar.pxi", line 701, in pyarrow.lib.DictionaryScalar.value.__get__
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 111, in pyarrow.lib.check_status
pyarrow.lib.ArrowIndexError: tried to refer to element 3 but array is only 3 long

In fact, it’s possible to cause this error without using Awkward, just pyarrow (something I’ll have to do to report it on the Apache Arrow JIRA).

>>> import pyarrow as pa
>>> import pyarrow.parquet
>>> pa_array = pa.DictionaryArray.from_arrays(
...     pa.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]),
...     pa.array(["one", "two", "three", "one", "two", "three"])
... )
>>> pa_array
<pyarrow.lib.DictionaryArray object at 0x7f271befa4a0>

-- dictionary:
  [
    "one",
    "two",
    "three",
    "one",
    "two",
    "three"
  ]
-- indices:
  [
    0,
    1,
    2,
    3,
    4,
    5,
    0,
    1,
    2,
    3,
    4,
    5
  ]
>>> pa_table = pa.Table.from_batches(
...     [pa.RecordBatch.from_arrays([pa_array], ["column"])]
... )
>>> pa_table
pyarrow.Table
column: dictionary<values=string, indices=int64, ordered=0>
>>> pyarrow.parquet.write_table(pa_table, "tmp2.parquet")
>>> pa_loaded = pyarrow.parquet.read_table("tmp2.parquet")
>>> pa_loaded
pyarrow.Table
column: dictionary<values=string, indices=int32, ordered=0>
>>> pa_loaded.to_pydict()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 1587, in pyarrow.lib.Table.to_pydict
  File "pyarrow/table.pxi", line 405, in pyarrow.lib.ChunkedArray.to_pylist
  File "pyarrow/array.pxi", line 1144, in pyarrow.lib.Array.to_pylist
  File "pyarrow/scalar.pxi", line 712, in pyarrow.lib.DictionaryScalar.as_py
  File "pyarrow/scalar.pxi", line 701, in pyarrow.lib.DictionaryScalar.value.__get__
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 111, in pyarrow.lib.check_status
pyarrow.lib.ArrowIndexError: tried to refer to element 3 but array is only 3 long
>>> pa.__version__
'1.0.0'

So it’s a pyarrow bug. (Even if they say that non-minimal dictionaries are not valid, it shouldn’t silently write the wrong thing into a Parquet file.)
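
One possible workaround on the pyarrow side (a sketch, not something tested in this thread): decode any dictionary-encoded columns back to plain values before writing, so the Parquet writer never sees a non-minimal dictionary. This assumes pyarrow can cast a dictionary column to its value type, which recent versions can:

import pyarrow as pa
import pyarrow.parquet

def decode_dictionaries(table):
    # Cast each dictionary-encoded column back to its plain value type
    # (e.g. dictionary<string> becomes string) before writing.
    for i, field in enumerate(table.schema):
        if pa.types.is_dictionary(field.type):
            decoded = table.column(i).cast(field.type.value_type)
            table = table.set_column(i, field.name, decoded)
    return table

pyarrow.parquet.write_table(decode_dictionaries(pa_table), "tmp3.parquet")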

However, I think I’m going to do something about it on the Awkward side, anyway: we ought to have an explicit categorical type, just as we have a string type (as a high-level behavior on IndexedArray). Putting an IndexedArray into categorical form would mean minimizing it, and when it’s minimized, we don’t encounter the pyarrow bug. Conversion to Arrow and hence Parquet would turn minimized categoricals into Arrow/Parquet dictionaries and would flatten any other IndexedArray.
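
A rough sketch of what that minimization could look like at the layout level (my own illustration, not the eventual Awkward implementation): deduplicate the content and remap the index so that it only refers to the unique categories.

import numpy as np
import awkward1 as ak

def minimize_indexed(indexedarray):
    # Deduplicate the content values and remap the index accordingly.
    index = np.asarray(indexedarray.index)
    values = ak.to_list(indexedarray.content)
    seen, unique_values = {}, []
    remap = np.empty(len(values), dtype=np.int32)
    for i, value in enumerate(values):
        if value not in seen:
            seen[value] = len(unique_values)
            unique_values.append(value)
        remap[i] = seen[value]
    new_index = ak.layout.Index32(remap[index].astype(np.int32))
    return ak.layout.IndexedArray32(new_index, ak.Array(unique_values).layout)

minimal = ak.Array(minimize_indexed(original.layout))
# A minimized array like this should not trigger the pyarrow bug shown above.
ak.to_parquet(minimal, "tmp_minimal.parquet")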

I’ll report updates on the pyarrow bug and Awkward categoricals here.
