
Interpreting the upper/lower bounds column returned from querying the .files metadata

See original GitHub issue

Iceberg has functionality to introspect tables. This is very useful, for example, to check that a column is properly sorted by inspecting its lower/upper bounds:

https://iceberg.apache.org/docs/latest/spark-queries/#files

The query SELECT * FROM prod.db.table.files returns an upper_bounds and a lower_bounds column, each a map of column ID to the bound value serialized as binary. The serialization format is described in the spec:

https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
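
For example, the raw bound maps can be inspected before any conversion to confirm which field IDs have bounds recorded (a minimal sketch; prod.db.table is a placeholder table name):

# Peek at the still-binary bound maps recorded for each data file
files_df = spark.sql("SELECT file_path, lower_bounds, upper_bounds FROM prod.db.table.files")
files_df.show(truncate=False)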

To interpret the binary values we register custom UDFs like this one, which converts the bytes from little-endian order:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def _to_int(data):
    # int bounds are stored as 4-byte little-endian (spec, Appendix D)
    return int.from_bytes(data, byteorder='little', signed=True)

# register PySpark (DataFrame API) UDF
to_int = F.udf(_to_int, IntegerType())
# register SQL UDF
spark.udf.register("to_int", _to_int, IntegerType())
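
The same approach extends to other primitive types. For example (a sketch, assuming the 8-byte little-endian layout the spec describes for longs), a to_long UDF can be registered for bigint columns:

from pyspark.sql.types import LongType

def _to_long(data):
    # long bounds are stored as 8-byte little-endian
    return int.from_bytes(data, byteorder='little', signed=True)

to_long = F.udf(_to_long, LongType())
spark.udf.register("to_long", _to_long, LongType())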

We can then use these functions to interpret the bound values and display them correctly:

-- Stored as 4-byte little-endian; the map keys are Iceberg field IDs (here assumed to be 1, 2, 3 for columns a, b, c)
SELECT
    min(to_int(lower_bounds[1])) min_a,
    max(to_int(upper_bounds[1])) max_a,
    min(to_int(lower_bounds[2])) min_b,
    max(to_int(upper_bounds[2])) max_b,
    min(to_int(lower_bounds[3])) min_c,
    max(to_int(upper_bounds[3])) max_c
FROM
    prod.db.table.files
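
The same aggregation can also be written with the DataFrame API using the to_int UDF registered above (a sketch against the same placeholder table and field IDs):

files = spark.table("prod.db.table.files")
files.select(
    to_int(F.col("lower_bounds")[1]).alias("lower_a"),
    to_int(F.col("upper_bounds")[1]).alias("upper_a"),
).agg(
    F.min("lower_a").alias("min_a"),
    F.max("upper_a").alias("max_a"),
).show()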

Does Iceberg come with utility functions like these? Is there an easier way to interpret the binary data than writing a custom UDF?

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

1 reaction
cccs-jc commented, Aug 2, 2022

Cool, please don’t hesitate to reach out for some of our lessons learned, etc.

1 reaction
RussellSpitzer commented, Aug 2, 2022

@kbendick I’m rather interested in that zorder function. Is this zorder like the Databricks zorder? https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html

We’ve implemented such a method to sort IP 5 tuples.

Yep, although I haven’t looked at their implementation, I’m betting it’s pretty similar since the math is pretty old.
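
For readers unfamiliar with the technique being discussed: Z-ordering interleaves the bits of several sort keys (a Morton code) so that rows that are close in the multi-dimensional key space stay close in the linear sort order. A minimal illustrative sketch, not Iceberg’s or Databricks’ actual implementation:

def z_order_key(values, bits=32):
    # Interleave the bits of each value: bit i of value j becomes
    # bit (i * len(values) + j) of the combined sort key.
    key = 0
    for i in range(bits):
        for j, v in enumerate(values):
            key |= ((v >> i) & 1) << (i * len(values) + j)
    return key

print(z_order_key([3, 5]))  # interleaves 0b011 and 0b101 -> 0b100111 (39)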

Read more comments on GitHub >
