Expose human-readable metrics in metadata tables
To improve Iceberg UX, we could expose Iceberg metrics as human-readable fields in metadata tables.
Problem
Currently, the files metadata table has the following fields:
Types.NestedField COLUMN_SIZES = optional(108, "column_sizes", MapType.ofRequired(117, 118,
    IntegerType.get(), LongType.get()), "Map of column id to total size on disk");
Types.NestedField VALUE_COUNTS = optional(109, "value_counts", MapType.ofRequired(119, 120,
    IntegerType.get(), LongType.get()), "Map of column id to total count, including null and NaN");
Types.NestedField NULL_VALUE_COUNTS = optional(110, "null_value_counts", MapType.ofRequired(121, 122,
    IntegerType.get(), LongType.get()), "Map of column id to null value count");
Types.NestedField NAN_VALUE_COUNTS = optional(137, "nan_value_counts", MapType.ofRequired(138, 139,
    IntegerType.get(), LongType.get()), "Map of column id to number of NaN values in the column");
Types.NestedField LOWER_BOUNDS = optional(125, "lower_bounds", MapType.ofRequired(126, 127,
    IntegerType.get(), BinaryType.get()), "Map of column id to lower bound");
Types.NestedField UPPER_BOUNDS = optional(128, "upper_bounds", MapType.ofRequired(129, 130,
    IntegerType.get(), BinaryType.get()), "Map of column id to upper bound");
but they are hard to use:
- these are all maps keyed by field id, so callers must know the numeric field ids.
- the bound values are not readable unless decoded with Conversions.fromByteBuffer(type, value), as sketched below.
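For reference, a minimal decoding sketch of what callers have to do today; the dataFile variable and the field id 1 are assumptions for illustration:

import java.nio.ByteBuffer;
import java.util.Map;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Types;

// Decode the upper bound of a string column from the raw binary map
// exposed today (assumes an org.apache.iceberg.DataFile named dataFile).
Map<Integer, ByteBuffer> upperBounds = dataFile.upperBounds();
ByteBuffer rawBound = upperBounds.get(1); // caller must already know the field id
CharSequence upperBound = Conversions.fromByteBuffer(Types.StringType.get(), rawBound);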
Options for Metadata Tables
- Change the existing fields on the files table (is it acceptable to break backward compatibility?). If not:
- Add new fields on the files table under a new struct, e.g. files.resolved_metrics.upper_bounds
Options for Field Types
- A struct mirroring the original table schema. The leaf type of the metrics struct would be either optional(long) for counts or optional(column type) for bounds; a sketch of the corresponding struct type follows the example below.
e.g., for a table with (col1 string, col2 struct<col3 string>), we could do:
select * from file.resolved_metrics.column_sizes.col1 // long type
select * from file.resolved_metrics.null_value_counts.col2.col3 // long type
select * from file.resolved_metrics.upper_bounds.col1 // string type
select * from file.resolved_metrics.lower_bounds.col2.col3 // string type
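To make the shape concrete, here is a rough sketch of what the resolved_metrics struct could look like for that example table; the field ids are illustrative, not a committed design:

import static org.apache.iceberg.types.Types.NestedField.optional;

import org.apache.iceberg.types.Types;

// Illustrative shape only: counts resolve to long, bounds to the column's own type.
Types.StructType resolvedMetrics = Types.StructType.of(
    optional(1, "column_sizes", Types.StructType.of(
        optional(2, "col1", Types.LongType.get()),
        optional(3, "col2", Types.StructType.of(
            optional(4, "col3", Types.LongType.get()))))),
    optional(5, "upper_bounds", Types.StructType.of(
        optional(6, "col1", Types.StringType.get()),
        optional(7, "col2", Types.StructType.of(
            optional(8, "col3", Types.StringType.get()))))));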
Schema Resolution
We need to choose which schema to use when resolving this information, given that tables can change schemas over time. I think the easiest option is to use the current table schema. If a column in the current schema has no metrics in a given data file, we can return null.
If users want to resolve file metrics for fields written with an older schema, we can implement time travel for the chosen metadata table, which would use the schema from that point in time to resolve the files written then. A hypothetical sketch of such a read follows.
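A sketch under stated assumptions: metadata-table time travel is the proposed (not yet existing) feature, spark is an assumed SparkSession, and the snapshot id is a placeholder; the snapshot-id read option is Spark's existing time-travel mechanism for data reads:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Hypothetical: read the files metadata table as of an older snapshot so
// metrics resolve against the schema in effect when those files were written.
Dataset<Row> oldFiles = spark.read()
    .format("iceberg")
    .option("snapshot-id", 1234567890L) // placeholder snapshot id
    .load("db.tbl.files");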
Top GitHub Comments
This is a real problem in practice, so we’ve even built a small tool that reads these values and converts them into a human-readable format. +1 for this issue.
What we did for the metadata tables in Trino was to produce structs, like @szehon-ho suggested. The idea is that you’d get a struct for each column name that contains value_count, null_count, upper_bound, lower_bound, etc. And each of those would have the correct data type.
I think that’s generally the right direction. The main challenge is that in Spark we originally didn’t have a good way to transform the maps into rows. But now we can do it using DataTask, which we use for similar transforms elsewhere.
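For context, a minimal sketch of how rows from a DataTask could be consumed; printMetricRows is a hypothetical helper, but DataTask and its rows() iterable are part of Iceberg's API:

import java.io.IOException;
import org.apache.iceberg.DataTask;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.io.CloseableIterable;

// A DataTask carries pre-computed rows, so a metadata table can emit
// transformed (human-readable) metric values without reading data files.
static void printMetricRows(DataTask task) throws IOException {
  try (CloseableIterable<StructLike> rows = task.rows()) {
    for (StructLike row : rows) {
      System.out.println(row.get(0, Object.class)); // first field of each row
    }
  }
}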