Writing a non-minimal categorical/dictionary to Parquet corrupts it
Moving @drahnreb’s comment to a new issue, because it’s unrelated to Parquet file partitioning. Original text below:
I found a rather weird behaviour that seems to be linked to partitioned parquets and my suggested “workaround”.
# current workaround with pq directly
table = pq.read_table(parquet_dir)
ak.from_arrow(table)
As this is not a supported feature, I did not want to open a bug report but figured it’s worth sharing as it took me a while to figure out that the above seems to cause the following issue.
Loading previously partitioned data and slicing it (at a low level) will result in a ValueError: in IndexedArray32 attempting to get xxx, index[i] >= len(content) when saving to Parquet and reloading.
Working example: a bit more extensive than necessary, but it explains the principles of what I am trying to achieve, as I realized that the combination of certain aspects matters, especially details about the contained dtypes and conversions. I couldn’t figure out what is happening. I first suspected wrong advanced slicing (#370) or the Awkward-Parquet-Arrow type conversion as described here or in #393. It was difficult to reproduce the results, so it might not really be a minimal example. Sorry for that!
# let's create some fake data that will be saved as parquet
import numpy as np
import pandas as pd
import pyarrow.parquet as pq
import awkward1 as ak  # Awkward Array 1.x

df = pd.DataFrame([
    {"id": n,
     "label": np.random.choice(["OK", "NOK"], 1)[0],
     "year": np.random.choice([2019, 2020], 1)[0],
     "arr": np.random.rand(np.random.randint(0, 100))}
    for n in range(10000)
])
# cast label
df.label = df.label.astype("category")
df.year = df.year.astype("uint32")
# index
df.set_index(['id'], inplace=True)
# save as partitioned parquet
df.to_parquet('test.parquet',
              partition_cols=['year'],  # this seems to be crucial
              version='2.0',
              data_page_version='2.0')
# let's get started. reload
table = pq.read_table('test.parquet')
# load as awkward array
original = ak.from_arrow(table)
>>> original.layout
<RecordArray>
<field index="0" key="label">
<IndexedArray64>
<index><Index64 i="[0 0 0 1 0 ... 19 19 19 18 18]" offset="0" length="20026" at="0x55d32435ff10"/></index>
<content><ListArray64>
<parameters>
<param key="__array__">"string"</param>
</parameters>
<starts><Index64 i="[0 3 5 8 10 ... 38 40 43 45 48]" offset="0" length="20" at="0x55d31bdf77d0"/></starts>
<stops><Index64 i="[3 5 8 10 13 ... 40 43 45 48 50]" offset="0" length="20" at="0x55d31be3ff70"/></stops>
<content><NumpyArray format="B" shape="50" data="78 79 75 79 75 ... 78 79 75 79 75" at="0x55d30a4851b0">
<parameters>
<param key="__array__">"char"</param>
</parameters>
</NumpyArray></content>
</ListArray64></content>
</IndexedArray64>
</field>
<field index="1" key="arr">
<ListArray64>
<starts><Index64 i="[0 26 46 54 81 ... 995057 995093 995152 995214 995284]" offset="0" length="20026" at="0x55d3243870f0"/></starts>
<stops><Index64 i="[26 46 54 81 150 ... 995093 995152 995214 995284 995357]" offset="0" length="20026" at="0x55d3243ae2d0"/></stops>
<content><IndexedOptionArray64>
<index><Index64 i="[0 1 2 3 4 ... 995352 995353 995354 995355 995356]" offset="0" length="995357" at="0x55d328748590"/></index>
<content><NumpyArray format="f" shape="995357" data="0.726096 0.590173 0.699854 0.116417 0.743831 ... 0.0755084 0.756987 0.240388 0.158446 0.722705" at="0x55d312824ab0"/></content>
</IndexedOptionArray64></content>
</ListArray64>
</field>
<field index="2" key="id">
<IndexedOptionArray64>
<index><Index64 i="[0 1 2 3 4 ... 20021 20022 20023 20024 20025]" offset="0" length="20026" at="0x55d3243d54b0"/></index>
<content><NumpyArray format="l" shape="20026" data="2 3 4 8 9 ... 9984 9987 9992 9996 9998" at="0x55d312bf0b30"/></content>
</IndexedOptionArray64>
</field>
<field index="3" key="year">
<NumpyArray format="i" shape="20026" data="2019 2019 2019 2019 2019 ... 2020 2020 2020 2020 2020" at="0x55d3233b27b0"/>
</field>
</RecordArray>
The main focus is on the field label, which has a ListArray64 nested inside an IndexedArray64.
Let’s now mask the array and slice based on indices within each variable length array.
# might be irrelevant
remove = np.random.choice([False, True], size=len(original))
masked = original[~remove].copy()
# "materialize"
masked = ak.flatten(masked, axis=0)
# slice nested arrays based on #370 (unofficial slicing)
starts = np.asarray(masked.arr.layout.starts).astype("uint64")
stops = np.asarray(masked.arr.layout.stops).astype("uint64")
# gather plausible indices with intrinsic information (here randomized)
def _get_each_idx(starts, stops):
    diffs = stops - starts
    nzd = diffs[np.where(diffs > 0)[0]]
    rs = np.random.randint(np.zeros(len(nzd)).astype(int), nzd)
    diffs[np.where(diffs != 0)[0]] = rs
    return diffs.astype("uint64")
# set new stops and starts
start_idx = _get_each_idx(starts, stops)
new_starts = start_idx + starts
end_idx = _get_each_idx(new_starts, stops)
new_stops = end_idx + new_starts
# change nested arrays (based on unofficial lowlevel layout)
layout = masked.layout
fields = []
for k in layout.keys():
    if k not in ['arr']:
        fields.append(layout[k])
    else:
        fields.append(ak.layout.ListArray64(
            ak.layout.Index64(new_starts.astype(np.int64)),  # Index64 expects int64
            ak.layout.Index64(new_stops.astype(np.int64)),
            layout[k].content
        ))
# main result. reconstruct back to a RecordArray.
sliced = ak.layout.RecordArray(
    fields,
    layout.keys()
)
While this is low-level composition of arrays, it produces consistent arrays and works until the result is exported as Parquet.
>>> ak.is_valid(original), ak.validity_error(original), ak.is_valid(masked), ak.validity_error(masked), ak.is_valid(sliced), ak.validity_error(sliced)
(True, None, True, None, True, None)
The problem starts when reloading data.
>>> sliced_arr = ak.Array(sliced)
>>> sliced_arr = ak.flatten(sliced_arr, axis=0)
>>> ak.to_parquet(sliced_arr, "tmp.parquet")
>>> loaded = ak.from_parquet("tmp.parquet")
>>> loaded
<Array [{label: 'NOK', arr: [, ... year: 2020}] type='10054 * {"label": string, ...'>
>>> ak.is_valid(loaded), ak.validity_error(loaded)
(False,
'at layout.field(0) (IndexedArray32): index[i] >= len(content) at i=4936')
Jagged arrays can be accessed along the entire axis.
>>> sliced_arr.arr[10000]
<Array [0.982, 0.695, 0.111, ... 0.637, 0.984] type='77 * ?float32'>
>>> sliced_arr.label[10000]
'NOK'
>>> loaded.arr[10000]
<Array [0.982, 0.695, 0.111, ... 0.637, 0.984] type='77 * ?float32'>
… but the label field will throw a ValueError: in IndexedArray32 attempting to get 4936, index[i] >= len(content)
for i in range(len(loaded)):
    try:
        loaded[i, 'label'].tolist()
    except ValueError:
        print(i)
        pass
_Originally posted by @drahnreb in https://github.com/scikit-hep/awkward-1.0/issues/368#issuecomment-676286664_
I’ve reported this as ARROW-9801. Meanwhile, I’ll look into fixing things on our side so that we don’t hit this, and gain “categorical” as a distinct thing from IndexedArray.
Thanks for reporting this issue, though I have to complain a bit about the reproducer: the relevant part is between the sliced_arr, which is valid, and the loaded, which is not. Building arrays from handmade layouts is supported (it’s part of the public API), but it’s exactly the sort of thing that leads to validity errors.
However, in this case it really is a bug: there’s an error in the writing of non-minimal categorical/dictionary data (what Awkward calls IndexedArrays) to Parquet. Actually, I think it’s a pyarrow bug and I’ll be reporting it on their JIRA right after this.
Your label is a categorical variable. Perhaps because it was loaded from partitioned data, the labels are not minimal: "OK" and "NOK" are the unique values, so only two are strictly needed, but the array has about ten of them. That’s not wrong, just less efficient, and not even much less efficient at that (though if you’re using the integers for equality checks, which one typically does with categorical data, but Awkward doesn’t, then it would be wrong). Here’s a small example:
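A sketch of such a non-minimal categorical array (assuming Awkward Array 1.x and its ak.layout API; the names index, content, and categorical are only illustrative):
import numpy as np
import awkward1 as ak  # Awkward Array 1.x
# only three unique strings, but the content stores four copies of each,
# and the index points at every copy in turn
content = ak.from_iter(["one", "two", "three"] * 4, highlevel=False)
index = ak.layout.Index64(np.arange(12, dtype=np.int64))
categorical = ak.Array(ak.layout.IndexedArray64(index, content))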
It’s non-minimal because we could have represented this with just three categories and an index of values in [0, 3). This converts to and from Arrow without any problems:
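Continuing the sketch above, the round trip can be checked like this (an illustration, not the original snippet):
arrow_array = ak.to_arrow(categorical)
roundtrip = ak.from_arrow(arrow_array)
# values and dictionary encoding survive the round trip:
# roundtrip.tolist() == categorical.tolist()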
However, when pyarrow writes it to Parquet (using the output of ak.to_arrow), it correctly minimizes the set of categories but garbles the index (probably some interaction with Parquet’s very weird "definition levels" and "repetition levels"):
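A sketch of that step, continuing from the arrow_array above (the file name is illustrative):
import pyarrow as pa
import pyarrow.parquet as pq
# write the dictionary-encoded column to Parquet and read it back
pq.write_table(pa.table({"categorical": arrow_array}), "categorical.parquet")
reloaded = pq.read_table("categorical.parquet")
# per the report (pyarrow circa 1.0), the dictionary is correctly reduced to
# three unique values, but the stored indices no longer match the data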
The index ought to be [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]. Even pyarrow, by itself, doesn’t like the result. In fact, it’s possible to cause this error without using Awkward, just pyarrow (something I’ll have to do to report it on the Apache Arrow JIRA).
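For reference, a pyarrow-only sketch would build a DictionaryArray whose dictionary contains duplicate ("non-minimal") values; whether this exact snippet triggers the bug in a given pyarrow version is an assumption, but it is the shape of input the report describes:
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
indices = pa.array(np.arange(12, dtype=np.int32))
dictionary = pa.array(["one", "two", "three"] * 4)  # non-minimal: duplicate entries
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
pq.write_table(pa.table({"x": dict_array}), "nonminimal.parquet")
# reading "nonminimal.parquet" back would show the same mismatch between the
# minimized dictionary and the written indices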
So it’s a pyarrow bug. (Even if they say that non-minimal dictionaries are not valid, it shouldn’t silently write the wrong thing into a Parquet file.)
However, I think I’m going to do something about it on the Awkward side, anyway: we ought to have an explicit categorical type, just as we have a string type (as a high-level behavior on IndexedArray). Putting an IndexedArray into categorical form would mean minimizing it, and when it’s minimized, we don’t encounter the pyarrow bug. Conversion to Arrow and hence Parquet would turn minimized categoricals into Arrow/Parquet dictionaries and would flatten any other IndexedArray.
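As an illustration of what "minimizing" means (a NumPy-level sketch on the example array above, not the eventual Awkward API):
import numpy as np
old_index = np.asarray(categorical.layout.index)
old_content = ak.to_list(ak.Array(categorical.layout.content))
# keep one copy of each category and remap the index onto the unique set
unique, inverse = np.unique(old_content, return_inverse=True)
minimal = ak.Array(ak.layout.IndexedArray64(
    ak.layout.Index64(inverse[old_index].astype(np.int64)),
    ak.from_iter(unique.tolist(), highlevel=False),
))
# minimal has the same values as categorical, but only three categories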
I’ll report updates on the pyarrow bug and Awkward categoricals here.