[PROPOSAL] Add support for `debug` or `verbose` mode in clients
Problem
OpenLineage integrations emit core specific facets, but also developer specific facets that fall into the following categories:
- `debug`: used for improving our integrations (ex: the `spark_unknown` facet is used to capture spark plan metadata for nodes not yet supported)
- `verbose`: used for diagnostic information (ex: the `spark.logicalPlan` facet is used to analyze the logical plan of your spark job)
Developer specific facets are useful, but, as we’ve seen in production, such facets can be very large, change infrequently from run to run, and degrade the performance of the system where the integration is configured. For example, complex spark logical plans can exceed 1MB in size; the `spark.logicalPlan` facet is captured along with every request, resulting in storage costs as well as increased network overhead. In this proposal, we outline the introduction and usage of two modes: `debug` and `verbose`. We also discuss expanding `openlineage.yml` to support these modes.
Add `debugMode` and `verboseMode` to `openlineage.yml`
OpenLineage clients are configured using `openlineage.yml` (see Configuration section for `openlineage-java`). Within the configuration file, users can define the `transport` used to emit OL events (ex: Http `transport`). Below, we extend `openlineage.yml` to support the configuration of `debugMode` and `verboseMode`:
debugMode:
  enabled: <bool> # Enables debug facets to be emitted along with OL events (ex: 'spark_unknown')
  facets: [array] # Specifies one or more debug facets to capture (default: 'All')
verboseMode:
  enabled: <bool> # Enables diagnostic information to be emitted along with OL events (ex: 'spark.logicalPlan')
  facets: [array] # Specifies one or more verbose facets to capture (default: 'All')
transport:
  type: <type>
  # ... transport specific configuration
Example usage of `debugMode` and `verboseMode` in `ol-spark`
# Enable 'debugMode' with specific facets to capture
debugMode:
  enabled: true
  facets: ['spark_unknown'] # Only capture the 'spark_unknown' facet
# Enable 'verboseMode' with facets not defined (defaults to 'All')
verboseMode:
  enabled: true
  # When 'facets' is not provided, all facets are captured by the integration
transport:
  type: HTTP
  url: http://localhost:5000
  auth:
    type: api_key
    api_key: f38d2189-c603-4b46-bdea-e573a3b5a7d5
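The semantics described for each mode (it must be enabled, and a missing `facets` list defaults to 'All') could be modelled roughly as follows. This is only a minimal sketch: `FacetMode` and `shouldEmit` are hypothetical names that do not exist in any OpenLineage client, and treating an empty list the same as 'All' is an assumption.

```java
import java.util.List;
import java.util.Set;

// Hypothetical model of one mode section ('debugMode' or 'verboseMode') in openlineage.yml.
final class FacetMode {
  private final boolean enabled;
  private final Set<String> facets; // an empty set stands for the default 'All'

  FacetMode(boolean enabled, List<String> facets) {
    this.enabled = enabled;
    // Assumption: a missing 'facets' key (null) behaves like the default 'All'.
    this.facets = facets == null ? Set.of() : Set.copyOf(facets);
  }

  // A facet is emitted only when the mode is enabled and the facet is
  // either listed explicitly or covered by the 'All' default.
  boolean shouldEmit(String facetName) {
    if (!enabled) {
      return false;
    }
    return facets.isEmpty() || facets.contains("All") || facets.contains(facetName);
  }
}

final class FacetModeDemo {
  public static void main(String[] args) {
    // Mirrors the example configuration: debugMode lists one facet, verboseMode lists none.
    FacetMode debugMode = new FacetMode(true, List.of("spark_unknown"));
    FacetMode verboseMode = new FacetMode(true, null);
    System.out.println(debugMode.shouldEmit("spark_unknown"));
    System.out.println(verboseMode.shouldEmit("spark.logicalPlan"));
  }
}
```

With a model like this, an integration could ask the mode object about each developer specific facet before building it, so large facets are never computed when their mode is off.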
Then, in class `InternalEventHandlerFactory`, we can use a class `OpenLineageOptions` defined in our clients to determine which facets are enabled (treat the code below only as an example usage; it may not compile 😅):
@Override
public Collection<CustomFacetBuilder<?, ? extends RunFacet>> createRunFacetBuilders(
    OpenLineageContext context) {
  Builder<CustomFacetBuilder<?, ? extends RunFacet>> listBuilder;
  listBuilder =
      ImmutableList.<CustomFacetBuilder<?, ? extends RunFacet>>builder()
          .addAll(
              generate(
                  eventHandlerFactories, factory -> factory.createRunFacetBuilders(context)))
          .add(OpenLineageOptions.debugMode().facets(),
              OpenLineageOptions.verboseMode().facets(),
              new SparkVersionFacetBuilder(context));
  if (DatabricksEnvironmentFacetBuilder.isDatabricksRuntime()) {
    listBuilder.add(new DatabricksEnvironmentFacetBuilder());
  }
  return listBuilder.build();
}
I think facets like the `spark_unknown` are fundamentally different from the `spark_logicalPlan` facet because of who they serve. The former is really only useful for OpenLineage developers (and those who want to fill their own gaps without working in the OpenLineage codebase). The latter is really there to serve the developers who own the pipelines. Although Marquez doesn’t do anything with that facet currently, the intention is to give people visibility into how their logical plans change from one run to the next or one job version to the next.

Thus, I can see the usefulness of wanting to turn off one set of facets without turning off the other set. But I don’t think `debug` and `verbose` communicate that. Personally, I’d vote for more explicit configuration that would allow users to turn on/off specific facets one by one. I think there is enough precedent for `enabled` and `disabled` sets that users can either explicitly allow certain facets or selectively disable undesirable ones.

I am super happy to see such improvement proposal 🚀 Wouldn’t a single `debug` or `verbose` mode be enough to achieve the same goal? By default, they include `all` facets and have the same behaviour.
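
For comparison, the per-facet `enabled` and `disabled` sets suggested in the comments might look something like the sketch below. `FacetFilter` is a hypothetical name, and letting the `disabled` set win over the `enabled` set is an assumption; the thread does not settle precedence.

```java
import java.util.Set;

// Hypothetical per-facet filter built from an allow-list ('enabled') and a deny-list ('disabled').
final class FacetFilter {
  private final Set<String> enabled;  // empty means no allow-list, so every facet is allowed
  private final Set<String> disabled; // facets that are never emitted

  FacetFilter(Set<String> enabled, Set<String> disabled) {
    this.enabled = Set.copyOf(enabled);
    this.disabled = Set.copyOf(disabled);
  }

  // Assumption: the deny-list takes precedence over the allow-list.
  boolean shouldEmit(String facetName) {
    if (disabled.contains(facetName)) {
      return false;
    }
    return enabled.isEmpty() || enabled.contains(facetName);
  }
}

final class FacetFilterDemo {
  public static void main(String[] args) {
    // Keep every facet except the large logical plan facet.
    FacetFilter filter = new FacetFilter(Set.of(), Set.of("spark.logicalPlan"));
    System.out.println(filter.shouldEmit("spark_unknown"));
    System.out.println(filter.shouldEmit("spark.logicalPlan"));
  }
}
```

This shape lets a pipeline owner drop a single expensive facet without having to decide whether it counts as `debug` or `verbose`.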