Spark: Cannot read or write UUID columns


Because of the String -> fixed-length binary conversion, both the readers and the writers are incorrect.

The vectorized reader initializes a fixed-binary reader on a column that we report as a String, causing an unsupported-type exception (the read path would need the inverse conversion sketched after the trace):

java.lang.UnsupportedOperationException: Unsupported type: UTF8String
	at org.apache.iceberg.arrow.vectorized.ArrowVectorAccessor.getUTF8String(ArrowVectorAccessor.java:82)
	at org.apache.iceberg.spark.data.vectorized.IcebergArrowColumnVector.getUTF8String(IcebergArrowColumnVector.java:140)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter_0$(Unknown Sour
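To fix the read path, the 16-byte value stored in Parquet's fixed_len_byte_array(16) has to be converted back into the UTF8String Spark expects for a column we report as String. Below is a minimal, self-contained sketch of that conversion, not the actual Iceberg accessor; the class and method names are hypothetical, and it only assumes Spark's UTF8String and the big-endian UUID byte layout used by the writer fix further down.

import java.nio.ByteBuffer;
import java.util.UUID;
import org.apache.spark.unsafe.types.UTF8String;

// Hypothetical helper, not part of the Iceberg API: turns the 16 raw bytes of a
// Parquet fixed_len_byte_array(16) UUID back into the UTF8String Spark expects.
public class UuidBytesToUtf8String {

  static UTF8String toUtf8String(byte[] uuidBytes) {
    ByteBuffer buffer = ByteBuffer.wrap(uuidBytes);  // big-endian by default
    UUID uuid = new UUID(buffer.getLong(), buffer.getLong());
    return UTF8String.fromString(uuid.toString());
  }

  public static void main(String[] args) {
    // Round-trip check: encode a UUID the same way the writer sketch below does,
    // then decode it back to its string form.
    UUID original = UUID.randomUUID();
    byte[] encoded = ByteBuffer.allocate(16)
        .putLong(original.getMostSignificantBits())
        .putLong(original.getLeastSignificantBits())
        .array();
    System.out.println(toUtf8String(encoded));  // prints original.toString()
  }
}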

The writer is broken because it gets String columns from Spark but needs to write fixed-length binary.

Something like this is needed as a fix:

  private static PrimitiveWriter<UTF8String> uuids(ColumnDescriptor desc) {
    return new UUIDWriter(desc);
  }

  private static class UUIDWriter extends PrimitiveWriter<UTF8String> {
    // Reused scratch buffer: Parquet stores UUIDs as a 16-byte fixed-length binary.
    private final ByteBuffer buffer = ByteBuffer.allocate(16);

    private UUIDWriter(ColumnDescriptor desc) {
      super(desc);
    }

    @Override
    public void write(int repetitionLevel, UTF8String string) {
      // Spark hands the value over as a string; parse it and emit the 16 raw bytes.
      UUID uuid = UUID.fromString(string.toString());
      buffer.rewind();
      buffer.putLong(uuid.getMostSignificantBits());
      buffer.putLong(uuid.getLeastSignificantBits());
      buffer.rewind();
      column.writeBinary(repetitionLevel, Binary.fromReusedByteBuffer(buffer));
    }
  }
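Two details of this sketch are worth noting: ByteBuffer.allocate uses big-endian byte order by default, so the most significant bits are written first and the layout matches the read-side sketch above; and Binary.fromReusedByteBuffer avoids a copy only because the buffer is rewritten on every call, which assumes the column writer copies or consumes the bytes before the next write().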

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 4
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
pvary commented, Apr 26, 2022

@RussellSpitzer: I am not sure whether this is still an actual issue or whether it has been fixed in the current code, but I found a year ago that Parquet and ORC/Avro expect UUIDs differently for writes. See: #1881

And this is even before the Spark code 😄

0 reactions
RussellSpitzer commented, Jul 15, 2022

Yep, currently the Spark code cannot read or write UUID correctly.
