[PROPOSAL] Add support for `debug` or `verbose` mode in clients

Problem

OpenLineage integrations emit core facets, but also developer-specific facets that fall into the following categories:

debug: used for improving our integrations (ex: the spark_unknown facet is used to capture spark plan metadata for nodes not yet supported)
verbose: used for diagnostic information (ex: the spark.logicalPlan facet is used to analyze the logical plan of your spark job)

Developer-specific facets are useful, but, as we’ve seen in production, such facets can be very large, change infrequently from run to run, and degrade performance for the system where the integration is configured. For example, complex spark logical plans can exceed 1MB in size; the spark.logicalPlan facet is captured along with every request, which adds storage costs as well as network overhead. In this proposal, we outline the introduction and usage of two modes: debug and verbose. We also discuss expanding openlineage.yml to support these modes.

Add `debugMode` and `verboseMode` to `openlineage.yml`

OpenLineage clients are configured using openlineage.yml (see the Configuration section for openlineage-java). Within the configuration file, users can define the transport used to emit OL events (ex: HttpTransport). Below, we extend openlineage.yml to support the configuration of debugMode and verboseMode:

debugMode:
  enabled: <bool> # Enables debug facets to be emitted along with OL events (ex: 'spark_unknown')
  facets: [array] # Provides a way to specify one or more debug facets to capture (default: 'All')
verboseMode:
  enabled: <bool> # Enables diagnostic information to be emitted along with OL events (ex: 'spark.logicalPlan')
  facets: [array] # Provides a way to specify one or more verbose facets to capture (default: 'All')
transport:
  type: <type>
  # ... transport specific configuration
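
One way a client might model either block once the YAML has been parsed is sketched below. ModeConfig and its captures method are hypothetical names, not part of openlineage-java; the sketch only illustrates the proposed semantics that a disabled mode emits nothing and an omitted facets list means 'All'.

```java
import java.util.List;

// Hypothetical model of a 'debugMode' or 'verboseMode' block; not an existing openlineage-java class.
final class ModeConfig {
  private final boolean enabled;
  private final List<String> facets; // null, empty, or containing "All" is treated as 'All'

  ModeConfig(boolean enabled, List<String> facets) {
    this.enabled = enabled;
    this.facets = facets;
  }

  // Returns true when the named facet should be emitted under this mode.
  boolean captures(String facetName) {
    if (!enabled) {
      return false; // mode is off: emit none of its facets
    }
    if (facets == null || facets.isEmpty() || facets.contains("All")) {
      return true; // 'facets' omitted: defaults to 'All'
    }
    return facets.contains(facetName);
  }
}
```

With this shape, an integration would ask the mode before building an expensive facet, so a large logical plan is never serialized when verbose mode is off.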

Example usage of `debugMode` and `verboseMode` in `ol-spark`

# Enable 'debugMode' with specific facets to capture
debugMode:
  enabled: true
  facets: ['spark_unknown'] # Only capture the 'spark_unknown' facet
# Enable 'verboseMode' with facets not defined (defaults to 'All')
verboseMode:
  enabled: true
  # When 'facets' is not provided, all facets are captured by the integration
transport:
  type: HTTP
  url: http://localhost:5000
  auth:
    type: api_key
    api_key: f38d2189-c603-4b46-bdea-e573a3b5a7d5

Then, we can define class OpenLineageOptions in our clients and use it in class InternalEventHandlerFactory to determine which facets are enabled (treat the code below only as example usage; it may not compile 😅):

  @Override
  public Collection<CustomFacetBuilder<?, ? extends RunFacet>> createRunFacetBuilders(
      OpenLineageContext context) {
    ImmutableList.Builder<CustomFacetBuilder<?, ? extends RunFacet>> listBuilder =
        ImmutableList.<CustomFacetBuilder<?, ? extends RunFacet>>builder()
            .addAll(
                generate(
                    eventHandlerFactories, factory -> factory.createRunFacetBuilders(context)))
            // Proposed: OpenLineageOptions does not exist yet; facets() is assumed to
            // return the collection of facet builders enabled for that mode
            .addAll(OpenLineageOptions.debugMode().facets())
            .addAll(OpenLineageOptions.verboseMode().facets())
            .add(new SparkVersionFacetBuilder(context));
    if (DatabricksEnvironmentFacetBuilder.isDatabricksRuntime()) {
      listBuilder.add(new DatabricksEnvironmentFacetBuilder());
    }
    return listBuilder.build();
  }
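
Since OpenLineageOptions does not exist yet, the snippet above leaves open how the mode settings would select builders. A minimal, self-contained sketch of that selection step follows; NamedFacetBuilder, FacetSelector, and the set of enabled facet names are stand-ins invented for illustration, not OpenLineage types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Stand-in for a facet builder that knows which facet it produces (hypothetical).
interface NamedFacetBuilder {
  String facetName();
}

final class FacetSelector {
  // Keeps every always-on builder and adds an optional builder only when its facet name is enabled.
  static List<NamedFacetBuilder> select(
      List<NamedFacetBuilder> core,
      List<NamedFacetBuilder> optional,
      Set<String> enabledNames) {
    List<NamedFacetBuilder> result = new ArrayList<>(core);
    for (NamedFacetBuilder builder : optional) {
      if (enabledNames.contains(builder.facetName())) {
        result.add(builder);
      }
    }
    return result;
  }
}
```

Filtering at builder-creation time, rather than after the facet has been built, means a disabled facet costs nothing per event.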

Issue Analytics

State:
Created a year ago
Reactions: 2
Comments: 6 (4 by maintainers)

Top GitHub Comments

2 reactions

collado-mike commented, Nov 7, 2022

I think facets like the spark_unknown are fundamentally different from the spark_logicalPlan facet because of who they serve. The former is really only useful for OpenLineage developers (and those who want to fill their own gaps without working in the OpenLineage codebase). The latter is really there to serve the developers who own the pipelines. Although Marquez doesn’t do anything with that facet currently, the intention is to give people visibility into how their logical plans change from one run to the next or one job version to the next.

Thus, I can see the usefulness of wanting to turn off one set of facets without turning off the other set. But I don’t think debug and verbose communicate that. Personally, I’d vote for more explicit configuration that would allow users to turn on/off specific facets one by one. I think there is enough precedent for enabled and disabled sets that users can either explicitly allow certain facets or selectively disable undesirable ones.
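
The per-facet alternative described in this comment could look roughly like the sketch below. FacetFilter and shouldEmit are hypothetical names, and the rule that an explicit disable wins over an enable is an assumption made for illustration; the comment does not specify precedence.

```java
import java.util.Set;

// Hypothetical per-facet filter built from 'enabled' and 'disabled' sets.
final class FacetFilter {
  private final Set<String> enabled;  // empty means no explicit allow-list: everything is allowed
  private final Set<String> disabled; // facets that are always suppressed

  FacetFilter(Set<String> enabled, Set<String> disabled) {
    this.enabled = enabled;
    this.disabled = disabled;
  }

  // Returns true when the named facet should be emitted.
  boolean shouldEmit(String facetName) {
    if (disabled.contains(facetName)) {
      return false; // assumption: an explicit disable takes precedence
    }
    return enabled.isEmpty() || enabled.contains(facetName);
  }
}
```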

1 reaction

pawel-big-lebowski commented, Nov 7, 2022

I am super happy to see such an improvement proposal 🚀 Wouldn’t a single debug or verbose mode be enough to achieve the same goal? By default, they include all facets and have the same behaviour.