ArrayBuilder.append behavior can be confusing: what to do about union of T and T in snapshots?


Thanks for adding the to_parquet writing functionality.

I’ve been trying it out and run into a couple of issues. Here is a test case:

import numpy as np
import pytest


@pytest.mark.parametrize("dtype", [np.float32, np.complex64])
def test_awkward_arrays_pandas(tmp_path, dtype):
    ak = pytest.importorskip("awkward1")
    pa = pytest.importorskip("pyarrow")
    fastparquet = pytest.importorskip("fastparquet")

    builder = ak.ArrayBuilder()
    A = ak.from_numpy(np.array([0, 1, 2], dtype=dtype))
    B = ak.from_numpy(np.array([0, 1], dtype=dtype))

    with builder.list():
        builder.append(A)

    with builder.list():
        builder.append(A)

    with builder.list():
        pass

    with builder.list():
        builder.append(B)

    ak.to_parquet(builder.snapshot(), tmp_path / "data.parquet")

With the float32 parametrization the following error occurs:

../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:2116: in to_parquet
    writer = pyarrow.parquet.ParquetWriter(**options)
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/pyarrow/parquet.py:551: in __init__
    self.writer = _parquet.ParquetWriter(
pyarrow/_parquet.pyx:1280: in pyarrow._parquet.ParquetWriter.__cinit__
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   pyarrow.lib.ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: dense_union<0: float=0, 1: float=1>

With the complex64 parametrization, the following error occurs:

../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:2111: in to_parquet
    first = next(iterator)
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:2107: in batch_iterator
    yield pyarrow.RecordBatch.from_arrays([to_arrow(layout)], [""])
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:1693: in to_arrow
    return recurse(layout)
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:1441: in recurse
    return recurse(small_layout, mask=mask)
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:1412: in recurse
    content_buffer = recurse(layout.content[: offsets[-1]])
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:1552: in recurse
    values = [recurse(x) for x in layout.contents]
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:1552: in <listcomp>
    values = [recurse(x) for x in layout.contents]
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:1617: in recurse
    return recurse(layout.content[index], mask)
../../../../.cache/pypoetry/virtualenvs/ms-backends-8aHy9NJ9-py3.8/lib/python3.8/site-packages/awkward1/operations/convert.py:1336: in recurse
    arrow_type = pyarrow.from_numpy_dtype(numpy_arr.dtype)
pyarrow/types.pxi:2591: in pyarrow.lib.from_numpy_dtype
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   pyarrow.lib.ArrowNotImplementedError: Unsupported numpy type 14

pyarrow/error.pxi:105: ArrowNotImplementedError

The failure on complex numbers is understandable given that Arrow doesn’t seem to support complex numbers natively, but I thought I’d post it here. I’d imagine that a reasonable solution would be to establish a 2D view of floats on the 1D array of complex numbers and output that to Arrow. Should this conversion happen in awkward or in pyarrow? I’m just starting to learn the Arrow/Parquet ecosystem, so I’m feeling things out. xref #392.
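The "2D view of floats" idea can be sketched in plain NumPy. This is only an illustration of the reinterpretation step, assuming a contiguous 1D complex64 array; it says nothing about where awkward or pyarrow would actually perform such a conversion, and the variable names are made up for the example.

```python
import numpy as np

# 1D array of complex64 values, as in the complex64 parametrization above.
z = np.array([0 + 1j, 1 - 2j, 2.5 + 0j], dtype=np.complex64)

# Reinterpret each complex64 (8 bytes) as two float32 values: (real, imag).
# .view() shares memory with z, so no data is copied.
pairs = z.view(np.float32).reshape(-1, 2)

# Round trip: reinterpret the float pairs as complex64 again.
restored = pairs.reshape(-1).view(np.complex64)

print(pairs.shape, pairs.dtype)
print(np.shares_memory(z, pairs))
```

Because the result is an ordinary float32 array, it avoids the "Unsupported numpy type" path, at the cost of the reader having to know that the last axis means (real, imaginary).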

However, my gut feeling is that the float32 case should work. What do you think?

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
jpivarski commented, Aug 25, 2020

Aha! What’s different here is that the first case produced a union (of floats and floats):

>>> builder.snapshot()
<Array [[0, 1, 2], [0, 1, 2], [], [0, 1]] type='4 * var * union[float32, float32]'>

and the second case did not:

>>> builder.snapshot()
<Array [[[0, 1, 2]], [[0, 1, ... [], [[0, 1]]] type='4 * var * var * float64'>

That’s because when you ArrayBuilder.append an Awkward Array, the builder stores references into the existing array so that complex structures can be added quickly (as a view). Applying ArrayBuilder.append to a non-Awkward sequence, such as a NumPy array, makes it fall back to iterating over the data (as a copy). Since A and B are different structures, they can’t be referenced in the same array without being a union. A “union of type T and type T” looks a little silly, but the work it is doing is to allow discontiguous arrays to be interleaved in the same logical array without copying their contents.
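To make the "interleaving without copying" idea concrete, here is a toy model in NumPy. It is not Awkward's internal layout or API: the names `contents`, `tags`, `index`, and `union_getitem` are assumptions chosen for the sketch. Each logical element records which buffer it lives in and where, so the two buffers stay untouched.

```python
import numpy as np

# Two separately allocated buffers, standing in for A and B.
A = np.array([0, 1, 2], dtype=np.float32)
B = np.array([0, 1], dtype=np.float32)

# Toy "union" bookkeeping (hypothetical names, not Awkward's internals):
# tags[i] says which buffer element i comes from, index[i] says where in it.
contents = [A, B]
tags = np.array([0, 0, 0, 0, 0, 0, 1, 1], dtype=np.int8)
index = np.array([0, 1, 2, 0, 1, 2, 0, 1], dtype=np.int64)


def union_getitem(i):
    """Look up logical element i by following its tag and index."""
    return contents[tags[i]][index[i]]


# The logical sequence is A, A, B, but A and B were never copied.
logical = [float(union_getitem(i)) for i in range(len(tags))]
print(logical)
print(contents[0] is A, contents[1] is B)
```

Both buffers hold the same element type, which is why the resulting type reads as a union of T and T even though nothing about the values differs.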

We can flatten a “union of type T and type T” into just “type T” by invoking a simplify operation on the union (an internal step applied by some operations; there isn’t an ak.simplify). However, that undermines the goal of having ArrayBuilder.snapshot be an O(1) operation. There’s a trade-off between wanting the structure of the snapshot to be usable in more ways (some Arrow conversions can’t deal with unions, and Awkward Arrays in Numba can’t deal with unions) and having the operation be fast. I’m not sure whether I should make the simplify happen automatically in an ArrayBuilder.snapshot involving unions or make that a user choice. (As a user choice it would be hard to discover, since the symptom would be something like “Arrow can’t convert” or “Numba can’t compile,” which is difficult to trace back unless you see that it’s a union and that raises the appropriate red flags.)
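The cost side of that trade-off can be seen in a small sketch. The function below is a hypothetical stand-in for a simplify step on a toy tags/index/contents union; it is not Awkward's implementation, and `simplify_same_type` is a name invented here. The point is only that merging the pieces into one buffer has to visit every element, so it is O(n) rather than O(1).

```python
import numpy as np


def simplify_same_type(tags, index, contents):
    """Merge a toy union whose contents all share one dtype into one buffer.

    Hypothetical helper for illustration; it is not Awkward's simplify.
    It copies every element once, so the cost grows with the array length.
    """
    dtypes = {c.dtype for c in contents}
    if len(dtypes) != 1:
        raise TypeError("contents have different dtypes; the union must stay")
    out = np.empty(len(tags), dtype=dtypes.pop())
    for which, content in enumerate(contents):
        mask = tags == which                 # elements that live in this buffer
        out[mask] = content[index[mask]]     # gather them into the flat output
    return out


contents = [
    np.array([0, 1, 2], dtype=np.float32),
    np.array([10, 11], dtype=np.float32),
]
tags = np.array([0, 1, 0, 1, 0], dtype=np.int8)
index = np.array([0, 0, 1, 1, 2], dtype=np.int64)

flat = simplify_same_type(tags, index, contents)
print(flat, flat.dtype)
```

The output is a single contiguous float32 array, which is the shape of data that converters without union support can handle.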

In addition, the structure is wrong. I think they should both be 4 * var * var *. In the second case, it’s float64 because the iteration went through Python, which doesn’t have float32 as a type.
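The float64 widening can be reproduced with NumPy alone. This sketch only mimics the effect of passing values through Python objects; it does not call ArrayBuilder, and the variable names are illustrative.

```python
import numpy as np

B = np.array([0, 1], dtype=np.float32)

# Converting each element to a Python object yields built-in floats,
# which are double precision; the float32 width is lost at this step.
python_values = [x.item() for x in B]

# Rebuilding an array from those Python floats infers float64.
rebuilt = np.array(python_values)

print(B.dtype, [type(v).__name__ for v in python_values], rebuilt.dtype)
```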

0 reactions
sjperkins commented, Aug 26, 2020

Thanks for addressing this and for the clear explanations!

By the way, “fastparquet” is not needed/not used in the serialization to Parquet. That’s a Parquet reader/writer implemented in Numba; pyarrow uses one implemented in C++.

This is useful to know. I’m still getting to grips with the Awkward/Arrow/Parquet ecosystem so these pointers are helpful.
