Pyarrow schema registration in Glue

This issue is similar to #29, but for Pyarrow. Pyarrow supports richer types than pandas, in our case ListArray, which translates to array<int> in Glue. The current implementation requires going through pandas, which stores the data in an object column that then gets added to the schema as string. Looking at the code, it looks like we reconstruct the Pyarrow schema anyway, so it might be as simple as exposing this entry point alongside the pandas one.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 2
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
igorborgest commented, Sep 27, 2019

PR #33 was updated to allow casting of data types as arguments… Unfortunately, Pyarrow has no type alias for nested types.

Ref: https://github.com/apache/arrow/blob/apache-arrow-0.14.1/python/pyarrow/types.pxi#L1684

Maybe in the future we could prioritize a change to accept the data type objects themselves instead of the alias. But for now, I think this is enough to close the issue.

Thanks!

1 reaction
nicolasdaviaud commented, Sep 25, 2019

To support ListType in Pyarrow schemas, it might be as simple as replacing lines 286 to 293 in awswrangler.glue with the following snippet.

By the way, it looks like it does not support DataType(null) either (which can happen if the whole column is empty). We decided to cast it to string, but that choice is arbitrary.

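from pyarrow import ListType


def _pyarrow2athena(glue, ftype):
    # A null-typed column (e.g. an entirely empty one) is cast to string;
    # this choice is arbitrary.
    if str(ftype) == 'null':
        return 'string'
    # Recurse into the element type so nested lists become nested arrays.
    if isinstance(ftype, ListType):
        return f'array<{_pyarrow2athena(glue, ftype.value_type)}>'
    return glue.type_pyarrow2athena(str(ftype))

# The indented lines below replace lines 286 to 293 inside awswrangler.glue;
# pyarrow_schema, partition_cols and glue come from that surrounding code.
    schema = []
    partition_cols_schema = []
    for field in pyarrow_schema:
        name = field.name
        # field.type list is not supported by glue.type_pyarrow2athena
        athena_type = _pyarrow2athena(glue, field.type)
        if partition_cols is None or name not in partition_cols:
            schema.append((name, athena_type))
        else:
            partition_cols_schema.append((name, athena_type))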