[SUPPORT] Hudi: how to upsert non-null array data into an existing column that holds only empty arrays (optional binary). java.lang.ClassCastException: optional binary element (UTF8) is not a group

Describe the problem you faced

We are trying to update an existing column, col1, whose values so far have all been empty arrays, so its schema was inferred as array<string> by default. The incoming records carry non-null, non-empty arrays in col1, and upserting them fails with the error "optional binary element (UTF8) is not a group". We do not have a predefined schema for these records; everything is inferred. As a result, col1 becomes array<string> at insert time, and the upsert fails as soon as new records bring non-empty array values with a different element type for this column.

In short, the issue occurs whenever we try to change the schema of a column from array<string> to array<struct<>> or array<array<>>. Please let me know if there is a workaround or a solution.

To Reproduce

Steps to reproduce the behavior:

  1. Insert records in which one column contains only empty arrays.
  2. Upsert records in which at least one entry has a non-empty array in that same column.
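The two steps above can be sketched in PySpark roughly as follows. This is only an illustration under stated assumptions: col1 is the column from the issue, while the id, ts and part fields, the table name repro_table, and the helper names are hypothetical. It has not been verified against Hudi 0.10.1 on Glue 3.0, so the exact failure may differ.

```python
import json

# Hypothetical sample data: only col1 comes from the issue; id, ts and part are assumed.
FIRST_BATCH = [
    json.dumps({"id": 1, "ts": 1, "part": "p0", "col1": []}),
    json.dumps({"id": 2, "ts": 1, "part": "p0", "col1": []}),
]
SECOND_BATCH = [
    # col1 now holds a non-empty array of structs instead of an empty array.
    json.dumps({"id": 1, "ts": 2, "part": "p0", "col1": [{"k": "v"}]}),
]


def hudi_options(operation):
    """Hudi write options; the table name and key fields are assumptions."""
    return {
        "hoodie.table.name": "repro_table",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.datasource.write.partitionpath.field": "part",
        "hoodie.datasource.write.operation": operation,
    }


def write_batch(spark, json_rows, base_path, operation, mode):
    """Infer the schema from JSON strings (no explicit schema) and write to Hudi."""
    df = spark.read.json(spark.sparkContext.parallelize(json_rows))
    df.write.format("hudi").options(**hudi_options(operation)).mode(mode).save(base_path)
    return df.schema


def reproduce(spark, base_path):
    # Step 1: col1 contains only empty arrays, so it is inferred as array<string>.
    write_batch(spark, FIRST_BATCH, base_path, "insert", "overwrite")
    # Step 2: col1 is inferred as array<struct<...>>; per the issue, this upsert fails.
    write_batch(spark, SECOND_BATCH, base_path, "upsert", "append")
```

Calling reproduce(spark, "/tmp/hudi_repro") requires a SparkSession with the Hudi bundle on the classpath; the path is a placeholder.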

Expected behavior

The expected behavior is that a column which received the default schema for an empty array (i.e. array<string>) is upgraded to the schema of the newly received non-empty array values. In other words, an array column should be upgraded from the default array<string> to the more complex schema of the data the non-empty array holds.

Environment Description

  • AWS Glue 3.0

  • Hudi version : 0.10.1

  • Spark version : 3.1.2

  • Running on Docker? (yes/no) : no, we are running Glue jobs using PySpark

Stacktrace

java.lang.ClassCastException: optional binary element (UTF8) is not a group

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
xushiyan commented, May 30, 2022

Are you able to try Spark 3.2, which includes a major Parquet upgrade to 1.12?

0 reactions
codope commented, Sep 7, 2022

We need to upgrade parquet-avro once the above issues are fixed. Closing this as it is not related to Hudi. Created HUDI-4798 to track parquet upgrade.

Top Results From Across the Web

[SUPPORT]hudi how to upsert a non null array data to a ...
Scenario: 1. drop the column with all empty array during upsert and ... doWriteOperation(DataSourceUtils.java:217) at org.apache.hudi.
Handling empty arrays in pySpark (optional binary element ...
The solution I've found is setting the option dropFieldIfAllNull to True when reading the json file. This causes field with empty array to ......
optional binary <some-field> (UTF8) is not a group - Apache
parquet schema conflict: optional binary <some-field> (UTF8) is not a group ... (BaseCommitActionExecutor.java:264) - Error upserting bucketType UPDATE for ...
Query failed: repeated binary array (UTF8) is not a group
If my data is in parquet format(UNCOMPRESSED), it will fail, presto:default> describe my_table_avro_parquet; Column | Type | Null ...
Binary array set - Project Nayuki
Introduction. The binary array set is a very space-efficient data structure that supports adding elements and testing membership reasonably ...
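One of the results above mentions setting dropFieldIfAllNull when reading the JSON input, so that a field holding only nulls or empty arrays is left out of the inferred schema instead of defaulting to array<string>. A minimal sketch of that read, assuming the input is JSON and that spark is an existing SparkSession; the helper name is hypothetical, and whether the later upsert then adds the column cleanly in Hudi 0.10.1 is not confirmed by the issue:

```python
# dropFieldIfAllNull is a Spark JSON reader option; the helper below is illustrative.
JSON_READ_OPTIONS = {"dropFieldIfAllNull": "true"}


def read_json_dropping_empty_fields(spark, path):
    """Read JSON while skipping all-null / all-empty-array fields during inference."""
    return spark.read.options(**JSON_READ_OPTIONS).json(path)
```

With this option, a first batch where col1 is always an empty array would be written without col1 at all, so no array<string> type is fixed for it in the table at that point.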
