Interpreting the upper/lower bounds column returned from querying the .files metadata
Iceberg has functionality to introspect tables. This is very useful, for example, to check that a column is properly sorted by inspecting its lower/upper bounds:
https://iceberg.apache.org/docs/latest/spark-queries/#files
The query SELECT * FROM prod.db.table.files returns upper_bounds and lower_bounds columns, each a map from column ID to a binary value, serialized as described here:
https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
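As a quick sanity check of what that spec describes, the numeric types can be decoded with nothing more than the standard library. This is a sketch based on the serialization rules linked above (ints and longs as little-endian two's complement, floats and doubles as little-endian IEEE-754); the helper names are my own:

import struct

def decode_int(data: bytes) -> int:
    # 4-byte little-endian, two's complement
    return int.from_bytes(data, byteorder='little', signed=True)

def decode_long(data: bytes) -> int:
    # 8-byte little-endian, two's complement
    return int.from_bytes(data, byteorder='little', signed=True)

def decode_float(data: bytes) -> float:
    # 4-byte little-endian IEEE-754
    return struct.unpack('<f', data)[0]

def decode_double(data: bytes) -> float:
    # 8-byte little-endian IEEE-754
    return struct.unpack('<d', data)[0]

Round-tripping a value (e.g. decode_int((42).to_bytes(4, 'little', signed=True))) is an easy way to convince yourself the byte order is right before wiring this into Spark.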
To interpret the binary values, we register custom UDFs like this one to convert the little-endian bytes:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def _to_int(data):
    return int.from_bytes(data, byteorder='little', signed=True)

# register as a PySpark UDF for the DataFrame API
to_int = F.udf(_to_int, IntegerType())

# register as a SQL UDF
spark.udf.register("to_int", _to_int, IntegerType())
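Before registering, the underlying helper can be exercised on raw bytes without a Spark session. The sample byte strings below are illustrative, not taken from a real manifest:

def _to_int(data):
    return int.from_bytes(data, byteorder='little', signed=True)

# little-endian: least significant byte first
assert _to_int(b'\x01\x00\x00\x00') == 1
# signed two's complement: all-ones is -1
assert _to_int(b'\xff\xff\xff\xff') == -1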
Then we can use this function to interpret the bounds and display them correctly.
-- Stored as 4-byte little-endian
SELECT
min(to_int(lower_bounds[1])) min_a,
max(to_int(upper_bounds[1])) max_a,
min(to_int(lower_bounds[2])) min_b,
max(to_int(upper_bounds[2])) max_b,
min(to_int(lower_bounds[3])) min_c,
max(to_int(upper_bounds[3])) max_c
FROM
prod.db.table.files
Does Iceberg come with utility functions like these? Is there an easier way to interpret the binary data than writing a custom UDF?
Issue Analytics
- Created: a year ago
- Comments: 11 (4 by maintainers)
Top GitHub Comments
Cool, please don't hesitate to reach out for some of our lessons learned, etc.
Yep, although I haven't looked at their implementation, I'm betting it's pretty similar since the math is pretty old.