
Expose human-readable metrics in metadata tables

See original GitHub issue

To improve Iceberg UX, we could expose Iceberg metrics as human-readable fields in metadata tables.

Problem

Currently the files metadata table exposes the following metrics fields:

  Types.NestedField COLUMN_SIZES = optional(108, "column_sizes", MapType.ofRequired(117, 118,
      IntegerType.get(), LongType.get()), "Map of column id to total size on disk");
  Types.NestedField VALUE_COUNTS = optional(109, "value_counts", MapType.ofRequired(119, 120,
      IntegerType.get(), LongType.get()), "Map of column id to total count, including null and NaN");
  Types.NestedField NULL_VALUE_COUNTS = optional(110, "null_value_counts", MapType.ofRequired(121, 122,
      IntegerType.get(), LongType.get()), "Map of column id to null value count");
  Types.NestedField NAN_VALUE_COUNTS = optional(137, "nan_value_counts", MapType.ofRequired(138, 139,
      IntegerType.get(), LongType.get()), "Map of column id to number of NaN values in the column");
  Types.NestedField LOWER_BOUNDS = optional(125, "lower_bounds", MapType.ofRequired(126, 127,
      IntegerType.get(), BinaryType.get()), "Map of column id to lower bound");
  Types.NestedField UPPER_BOUNDS = optional(128, "upper_bounds", MapType.ofRequired(129, 130,
      IntegerType.get(), BinaryType.get()), "Map of column id to upper bound");

but they are hard to use:

  • these fields are all maps keyed by field id, so reading them requires knowing the numeric field id of each column.
  • the bound values are not human-readable unless they are converted with Conversions.fromByteBuffer(type, value) (see the sketch after this list).
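
For context, here is a minimal sketch of what a user has to do today to read an upper bound for a named column. It assumes an Iceberg Table handle named table and only the public Conversions/DataFile APIs; the helper itself is hypothetical, but the field-id lookup and byte-buffer conversion are exactly the friction described above:

  import java.io.IOException;
  import java.nio.ByteBuffer;
  import java.util.Map;

  import org.apache.iceberg.DataFile;
  import org.apache.iceberg.FileScanTask;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.io.CloseableIterable;
  import org.apache.iceberg.types.Conversions;
  import org.apache.iceberg.types.Types;

  class ReadBoundsSketch {
    // Sketch only: print the upper bound of `columnName` for every data file in the table.
    static void printUpperBound(Table table, String columnName) throws IOException {
      Types.NestedField field = table.schema().findField(columnName);
      try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
        for (FileScanTask task : tasks) {
          DataFile file = task.file();
          Map<Integer, ByteBuffer> upperBounds = file.upperBounds();
          // the map is keyed by field id, not by column name
          ByteBuffer raw = upperBounds == null ? null : upperBounds.get(field.fieldId());
          if (raw != null) {
            // the raw bytes are unreadable until converted with the column's type
            Object bound = Conversions.fromByteBuffer(field.type(), raw);
            System.out.println(file.path() + " upper_bound(" + columnName + ") = " + bound);
          }
        }
      }
    }
  }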

Options for Metadata Tables

  1. Change the existing fields on the files table (can we afford to break backward compatibility?). If not:
  2. Add new fields on the files table under a new struct, e.g. files.resolved_metrics.upper_bounds

Options for Field Types

  • A struct mirroring the original table schema. The leaf type in the metrics struct would be either optional(long) for counts or optional(<column type>) for bounds.

e.g., for a table with (col1 string, col2 struct<col3 string>), we could run (a sketch of how such a mirrored struct could be derived follows these queries):

select * from files.resolved_metrics.column_sizes.col1            -- long type
select * from files.resolved_metrics.null_value_counts.col2.col3  -- long type
select * from files.resolved_metrics.upper_bounds.col1            -- string type
select * from files.resolved_metrics.lower_bounds.col2.col3       -- string type
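
As a rough illustration of option 2, one way to derive such a mirrored struct from the table schema is sketched below. This helper is hypothetical, reuses the source field ids only for brevity (a real implementation would assign fresh ids), and omits list/map columns; a bounds variant would keep the column's own type at the leaves instead of long:

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.iceberg.types.Type;
  import org.apache.iceberg.types.Types;

  class ResolvedMetricsSketch {
    // Build a struct that mirrors `type` but replaces every leaf with a long,
    // which is the shape proposed for the count metrics (value_counts, null_value_counts, ...).
    static Type countsMirror(Type type) {
      if (type.isStructType()) {
        List<Types.NestedField> fields = new ArrayList<>();
        for (Types.NestedField f : type.asStructType().fields()) {
          // field ids are reused here only to keep the sketch short
          fields.add(Types.NestedField.optional(f.fieldId(), f.name(), countsMirror(f.type())));
        }
        return Types.StructType.of(fields);
      }
      return Types.LongType.get(); // counts are longs; bounds would keep the column's own type
    }
  }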

Schema Resolution

We need to choose which schema to use to resolve this information, since a table's schema can change over time. I think the simplest option is to use the current table schema. If a column in the current schema has no metric recorded in a given data file, we can return null.

If a user wants to resolve file metrics for fields written with an older schema, we could implement time travel for this metadata table, which would use the schema from that point in time to resolve the files written then.
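
A minimal sketch of that resolution rule, assuming only the public Schema and DataFile APIs (the helper name is made up): columns that exist in the current schema but have no metric in a given file simply resolve to null.

  import org.apache.iceberg.DataFile;
  import org.apache.iceberg.Schema;
  import org.apache.iceberg.types.Types;

  class CurrentSchemaResolutionSketch {
    // Resolve value_counts for a column by name against the current schema;
    // returns null if the column is unknown or the file recorded no metric for it.
    static Long valueCountFor(Schema currentSchema, DataFile file, String columnName) {
      Types.NestedField field = currentSchema.findField(columnName);
      if (field == null || file.valueCounts() == null) {
        return null; // e.g. the column was added after this file was written
      }
      return file.valueCounts().get(field.fieldId());
    }
  }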

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 3
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

2 reactions
hililiwei commented, Mar 19, 2022

This is a real problem in practice, so we have even built a gadget that reads these values and converts them into a human-readable format. +1 for this issue.

1 reaction
rdblue commented, Mar 28, 2022

What we did for the metadata tables in Trino was to produce structs, like @szehon-ho suggested. The idea is that you’d get a struct for each column name that contains value_count, null_count, upper_bound, lower_bound, etc. And each of those would have the correct data type.

I think that’s generally the right direction. The main challenge is that in Spark we originally didn’t have a good way to transform the maps into rows. But now we can do it using DataTask, which we use for similar transforms elsewhere.
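
For illustration, the per-column struct described above might look roughly like this for the string column col1 from the earlier example. The field ids and names here are illustrative only, not the actual Trino or Iceberg definitions:

  import org.apache.iceberg.types.Types;

  class PerColumnMetricsSketch {
    // Hypothetical per-column metrics struct: one struct per column name,
    // with bounds carried in the column's own type.
    static final Types.StructType COL1_METRICS = Types.StructType.of(
        Types.NestedField.optional(1, "value_count", Types.LongType.get()),
        Types.NestedField.optional(2, "null_value_count", Types.LongType.get()),
        Types.NestedField.optional(3, "nan_value_count", Types.LongType.get()),
        Types.NestedField.optional(4, "lower_bound", Types.StringType.get()), // col1 is a string column
        Types.NestedField.optional(5, "upper_bound", Types.StringType.get()));
  }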
