Expose human-readable metrics in metadata tables
To improve Iceberg UX, we could expose Iceberg metrics as human-readable fields in metadata tables.
Problem
Currently, the files metadata table has the following fields:
Types.NestedField COLUMN_SIZES = optional(108, "column_sizes", MapType.ofRequired(117, 118,
    IntegerType.get(), LongType.get()), "Map of column id to total size on disk");
Types.NestedField VALUE_COUNTS = optional(109, "value_counts", MapType.ofRequired(119, 120,
    IntegerType.get(), LongType.get()), "Map of column id to total count, including null and NaN");
Types.NestedField NULL_VALUE_COUNTS = optional(110, "null_value_counts", MapType.ofRequired(121, 122,
    IntegerType.get(), LongType.get()), "Map of column id to null value count");
Types.NestedField NAN_VALUE_COUNTS = optional(137, "nan_value_counts", MapType.ofRequired(138, 139,
    IntegerType.get(), LongType.get()), "Map of column id to number of NaN values in the column");
Types.NestedField LOWER_BOUNDS = optional(125, "lower_bounds", MapType.ofRequired(126, 127,
    IntegerType.get(), BinaryType.get()), "Map of column id to lower bound");
Types.NestedField UPPER_BOUNDS = optional(128, "upper_bounds", MapType.ofRequired(129, 130,
    IntegerType.get(), BinaryType.get()), "Map of column id to upper bound");
but they are hard to use:
- these are all maps keyed by field id, so callers must know the numeric field ids.
- the bound values are not readable unless decoded with Conversions.fromByteBuffer(type, value), as sketched below.
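For reference, a minimal decoding sketch of what callers have to do today; the dataFile variable and the field id 1 are assumptions for illustration:

import java.nio.ByteBuffer;
import java.util.Map;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Types;

// Decode the upper bound of a string column from the raw binary map
// exposed today (assumes an org.apache.iceberg.DataFile named dataFile).
Map<Integer, ByteBuffer> upperBounds = dataFile.upperBounds();
ByteBuffer rawBound = upperBounds.get(1); // caller must already know the field id
CharSequence upperBound = Conversions.fromByteBuffer(Types.StringType.get(), rawBound);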
Options for Metadata Tables
- Change the existing fields on the files table (is it acceptable to break backward compatibility?). If not:
- Add new fields on the files table under a new struct, e.g. files.resolved_metrics.upper_bounds
Options for Field Types
- A struct mirroring the original table schema. The leaf type of the metrics struct would be either optional(long) for counts or optional(column type) for bounds; a sketch of the corresponding struct type follows the example below.
e.g., for a table with (col1 string, col2 struct<col3 string>), we could do:
select * from file.resolved_metrics.column_sizes.col1 // long type
select * from file.resolved_metrics.null_value_counts.col2.col3 // long type
select * from file.resolved_metrics.upper_bounds.col1 // string type
select * from file.resolved_metrics.lower_bounds.col2.col3 // string type
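To make the shape concrete, here is a rough sketch of what the resolved_metrics struct could look like for that example table; the field ids are illustrative, not a committed design:

import static org.apache.iceberg.types.Types.NestedField.optional;

import org.apache.iceberg.types.Types;

// Illustrative shape only: counts resolve to long, bounds to the column's own type.
Types.StructType resolvedMetrics = Types.StructType.of(
    optional(1, "column_sizes", Types.StructType.of(
        optional(2, "col1", Types.LongType.get()),
        optional(3, "col2", Types.StructType.of(
            optional(4, "col3", Types.LongType.get()))))),
    optional(5, "upper_bounds", Types.StructType.of(
        optional(6, "col1", Types.StringType.get()),
        optional(7, "col2", Types.StructType.of(
            optional(8, "col3", Types.StringType.get()))))));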
Schema Resolution
We need to choose which schema to use when resolving this information, given that tables can change schemas over time. I think the easiest option is to use the current table schema. If a column in the current schema has no metrics in a given data file, we can return null.
If users want to resolve file metrics for fields written with an older schema, we can implement time travel for the chosen metadata table, which would use the schema from that point in time to resolve the files written then. A hypothetical sketch of such a read follows.
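A sketch under stated assumptions: metadata-table time travel is the proposed (not yet existing) feature, spark is an assumed SparkSession, and the snapshot id is a placeholder; the snapshot-id read option is Spark's existing time-travel mechanism for data reads:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Hypothetical: read the files metadata table as of an older snapshot so
// metrics resolve against the schema in effect when those files were written.
Dataset<Row> oldFiles = spark.read()
    .format("iceberg")
    .option("snapshot-id", 1234567890L) // placeholder snapshot id
    .load("db.tbl.files");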
Top GitHub Comments
This is a real problem in practice, so we’ve even built a small tool that reads these values and converts them into a human-readable format. +1 for this issue.
What we did for the metadata tables in Trino was to produce structs, like @szehon-ho suggested. The idea is that you’d get a struct for each column name that contains value_count, null_count, upper_bound, lower_bound, etc. And each of those would have the correct data type.
I think that’s generally the right direction. The main challenge is that in Spark we originally didn’t have a good way to transform the maps into rows. But now we can do it using DataTask, which we use for similar transforms elsewhere.
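For context, a minimal sketch of how rows from a DataTask could be consumed; printMetricRows is a hypothetical helper, but DataTask and its rows() iterable are part of Iceberg's API:

import java.io.IOException;
import org.apache.iceberg.DataTask;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.io.CloseableIterable;

// A DataTask carries pre-computed rows, so a metadata table can emit
// transformed (human-readable) metric values without reading data files.
static void printMetricRows(DataTask task) throws IOException {
  try (CloseableIterable<StructLike> rows = task.rows()) {
    for (StructLike row : rows) {
      System.out.println(row.get(0, Object.class)); // first field of each row
    }
  }
}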