Interpreting the upper/lower bounds column returned from querying the .files metadata
Iceberg has functionality to introspect tables. This is very useful, for example, to check that a column is properly sorted by inspecting its lower/upper bounds:
https://iceberg.apache.org/docs/latest/spark-queries/#files
The query SELECT * FROM prod.db.table.files returns upper_bounds and lower_bounds columns, each a map from column ID to a binary value, serialized as described here:
https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
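As a quick sanity check of what that spec describes, the numeric types can be decoded with nothing more than the standard library. This is a sketch based on the serialization rules linked above (ints and longs as little-endian two's complement, floats and doubles as little-endian IEEE-754); the helper names are my own:

import struct

def decode_int(data: bytes) -> int:
    # 4-byte little-endian, two's complement
    return int.from_bytes(data, byteorder='little', signed=True)

def decode_long(data: bytes) -> int:
    # 8-byte little-endian, two's complement
    return int.from_bytes(data, byteorder='little', signed=True)

def decode_float(data: bytes) -> float:
    # 4-byte little-endian IEEE-754
    return struct.unpack('<f', data)[0]

def decode_double(data: bytes) -> float:
    # 8-byte little-endian IEEE-754
    return struct.unpack('<d', data)[0]

Round-tripping a value (e.g. decode_int((42).to_bytes(4, 'little', signed=True))) is an easy way to convince yourself the byte order is right before wiring this into Spark.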
To interpret the binary values, we register custom UDFs like this one to convert the little-endian bytes:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def _to_int(data):
    return int.from_bytes(data, byteorder='little', signed=True)

# register as a PySpark UDF for the DataFrame API
to_int = F.udf(_to_int, IntegerType())

# register as a SQL UDF
spark.udf.register("to_int", _to_int, IntegerType())
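Before registering, the underlying helper can be exercised on raw bytes without a Spark session. The sample byte strings below are illustrative, not taken from a real manifest:

def _to_int(data):
    return int.from_bytes(data, byteorder='little', signed=True)

# little-endian: least significant byte first
assert _to_int(b'\x01\x00\x00\x00') == 1
# signed two's complement: all-ones is -1
assert _to_int(b'\xff\xff\xff\xff') == -1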
Then we can use this function to interpret the bounds and display them correctly.
-- Stored as 4-byte little-endian
SELECT
min(to_int(lower_bounds[1])) min_a,
max(to_int(upper_bounds[1])) max_a,
min(to_int(lower_bounds[2])) min_b,
max(to_int(upper_bounds[2])) max_b,
min(to_int(lower_bounds[3])) min_c,
max(to_int(upper_bounds[3])) max_c
FROM
prod.db.table.files
Does Iceberg come with utility functions like these? Is there an easier way to interpret the binary data than writing a custom UDF?
Issue Analytics
- Created: a year ago
- Comments: 11 (4 by maintainers)
Top GitHub Comments
Cool, please don't hesitate to reach out for some of our lessons learned, etc.
Yep, although I haven't looked at their implementation, I'm betting it's pretty similar since the math is pretty old.