Spark: Cannot read or write UUID columns
Because of the String -> Fixed Binary conversion, the readers and writers are both incorrect.
The vectorized reader initializes a FixedBinary reader on a column we report as a String, causing an unsupported-reader exception:
java.lang.UnsupportedOperationException: Unsupported type: UTF8String
at org.apache.iceberg.arrow.vectorized.ArrowVectorAccessor.getUTF8String(ArrowVectorAccessor.java:82)
at org.apache.iceberg.spark.data.vectorized.IcebergArrowColumnVector.getUTF8String(IcebergArrowColumnVector.java:140)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter_0$(Unknown Sour
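The size mismatch behind this exception is easy to see outside Iceberg. The following standalone sketch (class and method names are mine, not Iceberg's) renders the same UUID both ways: as the 36-byte UTF-8 string the Spark accessor expects, and as the 16-byte fixed binary Parquet actually stores.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Standalone illustration (not Iceberg code): the same UUID rendered as
// the UTF-8 string Spark's accessor expects and as the 16-byte fixed
// binary stored in Parquet. The reader fails because the column holds
// the latter while the reported Spark type implies the former.
public class UuidRepresentations {

  // UUID as its canonical 36-character string, UTF-8 encoded.
  static byte[] utf8Bytes(UUID uuid) {
    return uuid.toString().getBytes(StandardCharsets.UTF_8);
  }

  // UUID as big-endian fixed-length binary: msb then lsb, 8 bytes each.
  static byte[] fixedBytes(UUID uuid) {
    ByteBuffer buffer = ByteBuffer.allocate(16);
    buffer.putLong(uuid.getMostSignificantBits());
    buffer.putLong(uuid.getLeastSignificantBits());
    return buffer.array();
  }

  public static void main(String[] args) {
    UUID uuid = UUID.fromString("f79c3e09-677c-4bbd-a479-3f349cb785e7");
    System.out.println(utf8Bytes(uuid).length);  // 36
    System.out.println(fixedBytes(uuid).length); // 16
  }
}
```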
The writer is broken because it receives String columns from Spark but needs to write fixed-length binary.
Something like the following is needed as a fix:
import java.nio.ByteBuffer;
import java.util.UUID;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.io.api.Binary;
import org.apache.spark.unsafe.types.UTF8String;

private static PrimitiveWriter<UTF8String> uuids(ColumnDescriptor desc) {
  return new UUIDWriter(desc);
}

private static class UUIDWriter extends PrimitiveWriter<UTF8String> {
  // Scratch buffer reused across writes; fromReusedByteBuffer tells
  // Parquet it must copy the bytes if it retains the value.
  private final ByteBuffer buffer = ByteBuffer.allocate(16);

  private UUIDWriter(ColumnDescriptor desc) {
    super(desc);
  }

  @Override
  public void write(int repetitionLevel, UTF8String string) {
    // Parse Spark's string representation, then emit the UUID as
    // big-endian 16-byte fixed binary: msb first, then lsb.
    UUID uuid = UUID.fromString(string.toString());
    buffer.rewind();
    buffer.putLong(uuid.getMostSignificantBits());
    buffer.putLong(uuid.getLeastSignificantBits());
    buffer.rewind();
    column.writeBinary(repetitionLevel, Binary.fromReusedByteBuffer(buffer));
  }
}
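For completeness, the inverse of this write path, which is what a fixed-binary-aware read path would have to perform before handing Spark a string, is just the mirror of the buffer encoding. A minimal JDK-only round-trip sketch (class and method names are mine, not Iceberg's):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Standalone round-trip check (not Iceberg code): encode a UUID the way
// the UUIDWriter sketch does, decode it back the way a fixed-binary-aware
// reader would, and confirm the value survives.
public class UuidRoundTrip {

  // Mirrors the writer: big-endian msb then lsb into 16 bytes.
  static byte[] encode(UUID uuid) {
    ByteBuffer buffer = ByteBuffer.allocate(16);
    buffer.putLong(uuid.getMostSignificantBits());
    buffer.putLong(uuid.getLeastSignificantBits());
    return buffer.array();
  }

  // What a corrected reader must do with the 16 stored bytes.
  static UUID decode(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    return new UUID(buffer.getLong(), buffer.getLong());
  }

  public static void main(String[] args) {
    UUID original = UUID.fromString("f79c3e09-677c-4bbd-a479-3f349cb785e7");
    UUID decoded = decode(encode(original));
    System.out.println(original.equals(decoded)); // true
  }
}
```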
Issue Analytics
- State:
- Created: a year ago
- Reactions: 4
- Comments: 6 (5 by maintainers)
Top Results From Across the Web

Cant find uuid in org.apache.spark.sql.types.DataTypes
We have a PostgreSQL table which has UUID as one of the column. How do we send UUID field in Spark dataset(using Java)...

Is there a way to store the UUID type in the Spark
It looks like Spark doesn't know how to handle the UUID type, and as you can see, the UUID type existed in both...

Use the BigQuery connector with Spark - Google Cloud
Reading and writing data from BigQuery. This example reads data from BigQuery into a Spark DataFrame to perform a word count using the...

Dataframe write to SQL Server table containing Always ...
I am using Apache Spark Connector for SQL Server and Azure SQL. When autogenerate field are not included in dataframe, I encountered -...

Spark Writes - Apache Iceberg
Note that this mode cannot replace hourly partitions like the dynamic example query because the PARTITION clause can only reference table columns, not...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@RussellSpitzer: I am not sure that it is still an actual issue - or it was fixed in the current code, but I have found a year ago that Parquet and ORC/Avro expect UUID differently for writes. See: #1881
And this is even before the Spark code 😄
Yep, currently the Spark code cannot read or write UUID correctly.