ArrayBuilder.append behavior can be confusing: what to do about union of T and T in snapshots?
Thanks for adding the to_parquet writing functionality.
I’ve been trying it out and have run into a couple of issues. Here is a test case:
import numpy as np
import pytest


@pytest.mark.parametrize("dtype", [np.float32, np.complex64])
def test_awkward_arrays_pandas(tmp_path, dtype):
    ak = pytest.importorskip("awkward1")
    pa = pytest.importorskip("pyarrow")
    fastparquet = pytest.importorskip("fastparquet")

    builder = ak.ArrayBuilder()
    A = ak.from_numpy(np.array([0, 1, 2], dtype=dtype))
    B = ak.from_numpy(np.array([0, 1], dtype=dtype))

    # Four outer lists: two holding A, one empty, one holding B
    with builder.list():
        builder.append(A)

    with builder.list():
        builder.append(A)

    with builder.list():
        pass

    with builder.list():
        builder.append(B)

    ak.to_parquet(builder.snapshot(), tmp_path / "data.parquet")
With the float32 parametrization the following error occurs:
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:2116: in to_parquet
writer = pyarrow.parquet.ParquetWriter(**options)
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/pyarrow/parquet.py:551: in __init__
self.writer = _parquet.ParquetWriter(
pyarrow/_parquet.pyx:1280: in pyarrow._parquet.ParquetWriter.__cinit__
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E pyarrow.lib.ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: dense_union<0: float=0, 1: float=1>
With the complex64 parametrization, the following error occurs:
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:2111: in to_parquet
first = next(iterator)
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:2107: in batch_iterator
yield pyarrow.RecordBatch.from_arrays([to_arrow(layout)], [""])
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:1693: in to_arrow
return recurse(layout)
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:1441: in recurse
return recurse(small_layout, mask=mask)
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:1412: in recurse
content_buffer = recurse(layout.content[: offsets[-1]])
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:1552: in recurse
values = [recurse(x) for x in layout.contents]
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:1552: in <listcomp>
values = [recurse(x) for x in layout.contents]
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:1617: in recurse
return recurse(layout.content[index], mask)
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:1336: in recurse
arrow_type = pyarrow.from_numpy_dtype(numpy_arr.dtype)
pyarrow/types.pxi:2591: in pyarrow.lib.from_numpy_dtype
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E pyarrow.lib.ArrowNotImplementedError: Unsupported numpy type 14
pyarrow/error.pxi:105: ArrowNotImplementedError
The failure on complex numbers is understandable given that Arrow doesn’t seem to natively support complex numbers, but I thought I’d post it here. I’d imagine that a reasonable solution would be to establish a 2D view of floats on the 1D array of complex numbers and output that to Arrow. Should this conversion happen in awkward or in pyarrow? I’m starting to learn the whole Arrow/Parquet ecosystem, so I’m feeling things out. xref #392.
However, my gut feel is that the float32 case should work. What do you think?
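For reference, the "2D view of floats" idea mentioned above can be sketched on the NumPy side alone. This is only an illustration of the reinterpretation, assuming a contiguous 1D complex64 array; it does not show what awkward or pyarrow actually do, and where such a conversion should live is still the open question.

```python
import numpy as np

# Three complex64 values; in memory each is an interleaved (real, imag) float32 pair.
z = np.array([1 + 2j, 3 - 4j, 0.5j], dtype=np.complex64)

# Reinterpret the same buffer as float32 and expose it as an (n, 2) array.
pairs = z.view(np.float32).reshape(-1, 2)

# No data is copied: both arrays share the same memory.
assert np.shares_memory(z, pairs)

# Going back: reinterpret each pair of float32 values as one complex64.
roundtrip = pairs.reshape(-1).view(np.complex64)
```

Because this is a view, the conversion itself costs nothing; the question of how to record that the column "really" holds complex numbers (so it can be restored on read) is separate and not addressed here.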
Aha! What’s different here is that the first case produced a union (of floats and floats) and the second case did not.
That’s because when you ArrayBuilder.append an Awkward Array, it uses that to make references into the existing array so that complex structures can be added quickly (as a view). Applying ArrayBuilder.append to a non-Awkward sequence, such as a NumPy array, makes it fall back to iterating over the data (as a copy). Since A and B are different structures, they can’t be referenced in the same array without being a union. A “union of type T and type T” looks a little silly, but the work it is doing is to allow discontiguous arrays to be interleaved in the same logical array without copying their contents.
We can flatten a “union of type T and type T” into just “type T” by invoking a simplify operation on the union (an internal step applied to some operations; there isn’t an ak.simplify). However, that undermines the goal of having ArrayBuilder.snapshot be an O(1) operation. There’s a trade-off between wanting the structure of the snapshot to be usable in more ways (some Arrow conversions can’t deal with unions, and Awkward arrays in Numba can’t deal with unions) and having the operation be fast. I’m not sure whether I should make the simplify happen automatically in an ArrayBuilder.snapshot involving unions or make that a user choice. (It would be hard to discover, since the symptom would be something like “Arrow can’t convert” or “Numba can’t compile,” which is difficult to trace back unless you see that it’s a union and that raises the appropriate red flags.)
In addition, the structure is wrong. I think they should both be 4 * var * var *. In the second case, it’s float64 because the iteration went through Python, which doesn’t have float32 as a type.
Thanks for addressing this and for the clear explanations!
This is useful to know. I’m still getting to grips with the Awkward/Arrow/Parquet ecosystem so these pointers are helpful.
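The trade-off described in the maintainer's reply (a union that references existing buffers versus a simplified, copied array) can be illustrated with a toy model in plain Python. ToyUnion and simplify below are hypothetical names invented for this sketch; it is not Awkward Array's implementation, only a picture of why a "union of T and T" avoids copying while flattening it costs O(n).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyUnion:
    """Hypothetical stand-in for a union: element i is contents[tags[i]][index[i]]."""
    tags: List[int]               # which buffer each element lives in
    index: List[int]              # position of the element inside that buffer
    contents: List[List[float]]   # the buffers themselves, held by reference

    def __len__(self):
        return len(self.tags)

    def __getitem__(self, i):
        return self.contents[self.tags[i]][self.index[i]]


def simplify(union):
    """Flatten a union of T and T into one buffer of T by copying every element."""
    return [union[i] for i in range(len(union))]


A = [0.0, 1.0, 2.0]
B = [0.0, 1.0]

# One logical array made of A followed by B, without copying either buffer.
union = ToyUnion(tags=[0, 0, 0, 1, 1], index=[0, 1, 2, 0, 1], contents=[A, B])

# Flattening gives a single plain buffer, at the cost of touching every element.
flat = simplify(union)
```

In this picture, building the union only records references, which is what keeps a snapshot cheap, while simplify walks all the data; that is the cost the reply weighs against having a structure that more consumers can handle.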