Pyarrow schema registration in Glue

This issue is similar to #29, but for Pyarrow. Pyarrow supports richer types than pandas; in our case a ListArray, which translates to array<int> in Glue. The current implementation requires going through pandas, which stores the list in an object column that is then added to the schema as string. Looking at the code, the Pyarrow schema appears to be reconstructed anyway, so it might be as simple as exposing this entry point alongside the pandas one.

Issue Analytics

State:
Created 4 years ago
Reactions: 2
Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction

igorborgest commented, Sep 27, 2019

PR #33 was updated to allow casting of data types as arguments… Unfortunately, Pyarrow has no type aliases for nested types.

Ref: https://github.com/apache/arrow/blob/apache-arrow-0.14.1/python/pyarrow/types.pxi#L1684

Maybe in the future we could prioritize a change to accept the data type objects themselves instead of the aliases. But for now, I think this is enough to close the issue.

Thanks!

1 reaction

nicolasdaviaud commented, Sep 25, 2019

To support ListType in Pyarrow schemas, it might be as simple as replacing lines 286 to 293 in awswrangler.glue with the following snippet.

By the way, it looks like DataType(null) is not supported either (this can happen when the whole column is empty). We decided to cast it to string, but that choice is arbitrary.

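from pyarrow import ListType

def _pyarrow2athena(glue, ftype):
    # A column that is entirely empty has type null; string is an arbitrary choice.
    if str(ftype) == 'null':
        return 'string'
    # Recurse into the element type so that nested lists map to nested arrays.
    if isinstance(ftype, ListType):
        return f'array<{_pyarrow2athena(glue, ftype.value_type)}>'
    return glue.type_pyarrow2athena(str(ftype))

# Replacement for the schema loop. glue, pyarrow_schema and partition_cols
# are defined by the surrounding code in awswrangler.glue.
schema = []
partition_cols_schema = []
for field in pyarrow_schema:
    name = field.name
    # List types are not supported by glue.type_pyarrow2athena, so go through the helper.
    athena_type = _pyarrow2athena(glue, field.type)
    if partition_cols is None or name not in partition_cols:
        schema.append((name, athena_type))
    else:
        partition_cols_schema.append((name, athena_type))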