Spark: Cannot read or write UUID columns
Because of the String -> Fixed Binary conversion, the readers and writers are both incorrect.
The vectorized reader initializes a FixedBinary reader on a column we report as a String, causing an unsupported-reader exception:
java.lang.UnsupportedOperationException: Unsupported type: UTF8String
at org.apache.iceberg.arrow.vectorized.ArrowVectorAccessor.getUTF8String(ArrowVectorAccessor.java:82)
at org.apache.iceberg.spark.data.vectorized.IcebergArrowColumnVector.getUTF8String(IcebergArrowColumnVector.java:140)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter_0$(Unknown Sour
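The size mismatch behind this exception is easy to see outside Iceberg. The following standalone sketch (class and method names are mine, not Iceberg's) renders the same UUID both ways: as the 36-byte UTF-8 string the Spark accessor expects, and as the 16-byte fixed binary Parquet actually stores.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Standalone illustration (not Iceberg code): the same UUID rendered as
// the UTF-8 string Spark's accessor expects and as the 16-byte fixed
// binary stored in Parquet. The reader fails because the column holds
// the latter while the reported Spark type implies the former.
public class UuidRepresentations {

  // UUID as its canonical 36-character string, UTF-8 encoded.
  static byte[] utf8Bytes(UUID uuid) {
    return uuid.toString().getBytes(StandardCharsets.UTF_8);
  }

  // UUID as big-endian fixed-length binary: msb then lsb, 8 bytes each.
  static byte[] fixedBytes(UUID uuid) {
    ByteBuffer buffer = ByteBuffer.allocate(16);
    buffer.putLong(uuid.getMostSignificantBits());
    buffer.putLong(uuid.getLeastSignificantBits());
    return buffer.array();
  }

  public static void main(String[] args) {
    UUID uuid = UUID.fromString("f79c3e09-677c-4bbd-a479-3f349cb785e7");
    System.out.println(utf8Bytes(uuid).length);  // 36
    System.out.println(fixedBytes(uuid).length); // 16
  }
}
```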
The writer is broken because it receives String columns from Spark but needs to write fixed-length binary.
Something like the following is needed as a fix:
import java.nio.ByteBuffer;
import java.util.UUID;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.io.api.Binary;
import org.apache.spark.unsafe.types.UTF8String;

private static PrimitiveWriter<UTF8String> uuids(ColumnDescriptor desc) {
  return new UUIDWriter(desc);
}

private static class UUIDWriter extends PrimitiveWriter<UTF8String> {
  // Scratch buffer reused across writes; fromReusedByteBuffer tells
  // Parquet it must copy the bytes if it retains the value.
  private final ByteBuffer buffer = ByteBuffer.allocate(16);

  private UUIDWriter(ColumnDescriptor desc) {
    super(desc);
  }

  @Override
  public void write(int repetitionLevel, UTF8String string) {
    // Parse Spark's string representation, then emit the UUID as
    // big-endian 16-byte fixed binary: msb first, then lsb.
    UUID uuid = UUID.fromString(string.toString());
    buffer.rewind();
    buffer.putLong(uuid.getMostSignificantBits());
    buffer.putLong(uuid.getLeastSignificantBits());
    buffer.rewind();
    column.writeBinary(repetitionLevel, Binary.fromReusedByteBuffer(buffer));
  }
}
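For completeness, the inverse of this write path, which is what a fixed-binary-aware read path would have to perform before handing Spark a string, is just the mirror of the buffer encoding. A minimal JDK-only round-trip sketch (class and method names are mine, not Iceberg's):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Standalone round-trip check (not Iceberg code): encode a UUID the way
// the UUIDWriter sketch does, decode it back the way a fixed-binary-aware
// reader would, and confirm the value survives.
public class UuidRoundTrip {

  // Mirrors the writer: big-endian msb then lsb into 16 bytes.
  static byte[] encode(UUID uuid) {
    ByteBuffer buffer = ByteBuffer.allocate(16);
    buffer.putLong(uuid.getMostSignificantBits());
    buffer.putLong(uuid.getLeastSignificantBits());
    return buffer.array();
  }

  // What a corrected reader must do with the 16 stored bytes.
  static UUID decode(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    return new UUID(buffer.getLong(), buffer.getLong());
  }

  public static void main(String[] args) {
    UUID original = UUID.fromString("f79c3e09-677c-4bbd-a479-3f349cb785e7");
    UUID decoded = decode(encode(original));
    System.out.println(original.equals(decoded)); // true
  }
}
```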
Issue Analytics
- State:
- Created: a year ago
- Reactions: 4
- Comments: 6 (5 by maintainers)
Top Results From Across the Web

Cant find uuid in org.apache.spark.sql.types.DataTypes
We have a PostgreSQL table which has UUID as one of the column. How do we send UUID field in Spark dataset(using Java)...

Is there a way to store the UUID type in the Spark
It looks like Spark doesn't know how to handle the UUID type, and as you can see, the UUID type existed in both...

Use the BigQuery connector with Spark - Google Cloud
Reading and writing data from BigQuery. This example reads data from BigQuery into a Spark DataFrame to perform a word count using the...

Dataframe write to SQL Server table containing Always ...
I am using Apache Spark Connector for SQL Server and Azure SQL. When autogenerate field are not included in dataframe, I encountered -...

Spark Writes - Apache Iceberg
Note that this mode cannot replace hourly partitions like the dynamic example query because the PARTITION clause can only reference table columns, not...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@RussellSpitzer: I am not sure that it is still an actual issue - or it was fixed in the current code, but I have found a year ago that Parquet and ORC/Avro expect UUID differently for writes. See: #1881
And this is even before the Spark code 😄
Yep, currently the Spark code cannot read or write UUID correctly.