Design/discuss integration of Great Expectations with Prefect (potentially in Results feature)
Discussion issue to figure out a plan to make Great Expectations integrate seamlessly with Prefect.
In general, we assume that people want to configure Great Expectations data validators to attach to pieces of data passed through a flow, and that they want the pipeline to manage actually calling the GE API for them and to surface errors/alerts/failures in a way that feels fully integrated into Prefect.
Some user experience questions to consider:
a) What is an example of the Python API a core user would use to attach a single Great Expectations assertion to a task? Multiple assertions? The same assertion to many tasks? (A hypothetical sketch follows this list.)
a1) Do they even attach them to tasks, or do they attach them to something else (i.e., Results)?
a2) What about a Core server/Cloud user: is there ever a world where GE validators are configured directly through the UI?
b) How are the validation results from Great Expectations surfaced in the Prefect logs? Are they visualized somehow in the UI?
c) Are there assumptions or conventions Prefect should make/support to autodetect GE assertions on disk? How does this relate to the “Expectations on rails” framework in beta in GE?
c1) How does Prefect (and --dun dun dun-- Dask) play with the fact that Great Expectations is mainly configured through file-based configuration?
d) Where/how in the pipeline do we run validation checks for people? Where/how should users be able to turn these checks on or off, besides removing the validator from the code (for example, via a global configuration toggle)?
e) If people call Great Expectations validators themselves in a task via a Prefect API such as Result.validate(), do we do anything special for them with the output?
f) Can we provide better, Prefect-based semantics for what to do on failure of a GE validation, since the pipeline can control the execution flow in reaction to the validation failure (e.g., potentially allow flow configuration for retrying up a task tree when a downstream validator fails)?
g) Can we integrate the GE Data Docs metadata into our UI somehow? (The simplest case is to link out, though this can get infinitely more fancy.)
**Curveball question:** Is there a need (either in addition to or in replacement of integrated pipeline checks) for an abstract GE task in the task library that can be easily used as a terminal/reference task?
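To make questions (a) and (e) concrete, here is one purely hypothetical shape such an API could take. None of this is settled design: the `validators=` hook on the Result, the validator callable’s signature, and the `ge_not_null_validator` helper are all assumptions for illustration.

```python
# Hypothetical sketch for questions (a)/(e) -- not an existing API.
# Assumes a `validators=` hook on a Result that receives the Result object
# and returns a bool; the validator name and signature are invented here.
import great_expectations as ge
from prefect import task
from prefect.engine.results import LocalResult


def ge_not_null_validator(result) -> bool:
    """Wrap the task's output in a GE dataset and check one expectation."""
    ge_df = ge.from_pandas(result.value)
    return ge_df.expect_column_values_to_not_be_null("id")["success"]


# The output would be validated when the Result is handled; a False
# return would fail (or otherwise flag) the task run.
@task(result=LocalResult(validators=[ge_not_null_validator]))
def transform(df):
    ...  # produce a DataFrame with an "id" column
```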
Based on conversations last week, tl;dr: IMHO we should move forward immediately only with a task library task that exposes ad-hoc validation, configurable with user-configured data sources and validators. There is some good advice on integrating Great Expectations as a node in a pipeline framework in this way in their docs here. Along with that we should provide a tutorial/docs here and in the GE docs that explain how to use the Prefect task to add GE validation to a pipeline; a sketch of what such a task could look like follows below. UPDATE: Issue for this is at https://github.com/PrefectHQ/prefect/issues/2489
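A minimal sketch of what that task library task could look like, assuming Prefect 1.x/Core semantics and the pandas entry point to Great Expectations (`ge.from_pandas`); the task names, the specific expectation, and the flow wiring are illustrative, not a final API:

```python
import great_expectations as ge
import pandas as pd
from prefect import task, Flow
from prefect.engine.signals import FAIL


@task
def extract() -> pd.DataFrame:
    # Stand-in for a real extraction step.
    return pd.DataFrame({"id": [1, 2, 3]})


@task
def validate_df(df: pd.DataFrame) -> pd.DataFrame:
    """Ad-hoc GE validation as its own task node: wrap the incoming
    DataFrame, attach expectations, and fail the task if any are unmet."""
    ge_df = ge.from_pandas(df)
    ge_df.expect_column_values_to_not_be_null("id")  # illustrative check
    result = ge_df.validate()
    if not result["success"]:
        raise FAIL("Great Expectations validation failed")
    return df


with Flow("etl-with-ge") as flow:
    df = extract()
    validate_df(df)  # used here as a terminal validation node
```

Used this way, the same task also speaks to the curveball question above: it can sit at the end of a flow as a terminal/reference task, or be dropped in after any complex data intermediate.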
This is motivated mostly by conversations in Slack, where a few users mentioned that they use Great Expectations in their Prefect pipelines, either in production or in test, but mostly for ad-hoc/final testing of data at the end of an ETL, not necessarily regularly along the way (though there was recognition that it could be useful for complex data intermediates). That pattern motivates the task library use case more than the work to embed validation throughout the pipeline, since validations can be integrated as needed at the end, or as dedicated task nodes for the few complex data intermediates.
I personally don’t think we have enough information or motivation yet to pursue the heavier integrations, but I have edited the OP with questions related to the design of a heavier integration based on conversations to date, and I am leaving the issue open as I collect more information. Discussion welcome!
Some inspiration: https://greatexpectations.io/blog/dagster-integration-announcement/