Design/discuss integration of Great Expectations with Prefect (potentially in Results feature)
Discussion issue to figure out a plan to make Great Expectations integrate seamlessly with Prefect.
In general, we assume that people want to configure Great Expectations data validators to attach to pieces of data passed through a flow, and that they want the pipeline to manage actually calling the GE API for them and to surface errors/alerts/failures in a way that feels fully integrated into Prefect.
Some user experience questions to consider:
a) What is an example of the Python API a core user would use to attach a single Great Expectations assertion to a task? Multiple assertions? The same assertion to many tasks? (A hypothetical sketch follows this list.)
a1) Do they even attach them to tasks, or do they attach them to something else (i.e., Results)?
a2) What about a Core server/Cloud user: is there ever a world where GE validators are configured directly through the UI?
b) How are the validation results from Great Expectations surfaced in the Prefect logs? Are they visualized somehow in the UI?
c) Are there assumptions or conventions Prefect should make/support to autodetect GE assertions on disk? How does this relate to the “Expectations on rails” framework in beta in GE?
c1) How does Prefect (and --dun dun dun-- Dask) play with the fact that Great Expectations is mainly configured through file-based configuration?
d) Where/how in the pipeline do we run validation checks for people? Where/how should users be able to turn these checks on or off, besides removing the validator from the code (for example, via a global configuration toggle)?
e) If people call Great Expectations validators themselves in a task via a Prefect API such as Result.validate(), do we do anything special for them with the output?
f) Can we provide better, Prefect-based semantics for what to do on failure of a GE validation, since the pipeline can control the execution flow in reaction to the validation failure (e.g., potentially allow flow configuration for retrying up a task tree when a downstream validator fails)?
g) Can we integrate the GE Data Docs metadata into our UI somehow? (The simplest case is to link out, though this can get infinitely more fancy.)
**Curveball question:** Is there a need (either in addition to or in replacement of integrated pipeline checks) for an abstract GE task in the task library that can be easily used as a terminal/reference task?
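To make questions (a) and (e) concrete, here is one purely hypothetical shape such an API could take. None of this is settled design: the `validators=` hook on the Result, the validator callable’s signature, and the `ge_not_null_validator` helper are all assumptions for illustration.

```python
# Hypothetical sketch for questions (a)/(e) -- not an existing API.
# Assumes a `validators=` hook on a Result that receives the Result object
# and returns a bool; the validator name and signature are invented here.
import great_expectations as ge
from prefect import task
from prefect.engine.results import LocalResult


def ge_not_null_validator(result) -> bool:
    """Wrap the task's output in a GE dataset and check one expectation."""
    ge_df = ge.from_pandas(result.value)
    return ge_df.expect_column_values_to_not_be_null("id")["success"]


# The output would be validated when the Result is handled; a False
# return would fail (or otherwise flag) the task run.
@task(result=LocalResult(validators=[ge_not_null_validator]))
def transform(df):
    ...  # produce a DataFrame with an "id" column
```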
Based on conversations last week, tl;dr: IMHO we should move forward immediately only with a task library task that exposes ad-hoc validation, configurable with user-configured data sources and validators. There is some good advice on integrating Great Expectations as a node in a pipeline framework in this way in their docs here. Along with that we should provide a tutorial/docs here and in the GE docs that explain how to use the Prefect task to add GE validation to a pipeline; a sketch of what such a task could look like follows below. UPDATE: Issue for this is at https://github.com/PrefectHQ/prefect/issues/2489
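A minimal sketch of what that task library task could look like, assuming Prefect 1.x/Core semantics and the pandas entry point to Great Expectations (`ge.from_pandas`); the task names, the specific expectation, and the flow wiring are illustrative, not a final API:

```python
import great_expectations as ge
import pandas as pd
from prefect import task, Flow
from prefect.engine.signals import FAIL


@task
def extract() -> pd.DataFrame:
    # Stand-in for a real extraction step.
    return pd.DataFrame({"id": [1, 2, 3]})


@task
def validate_df(df: pd.DataFrame) -> pd.DataFrame:
    """Ad-hoc GE validation as its own task node: wrap the incoming
    DataFrame, attach expectations, and fail the task if any are unmet."""
    ge_df = ge.from_pandas(df)
    ge_df.expect_column_values_to_not_be_null("id")  # illustrative check
    result = ge_df.validate()
    if not result["success"]:
        raise FAIL("Great Expectations validation failed")
    return df


with Flow("etl-with-ge") as flow:
    df = extract()
    validate_df(df)  # used here as a terminal validation node
```

Used this way, the same task also speaks to the curveball question above: it can sit at the end of a flow as a terminal/reference task, or be dropped in after any complex data intermediate.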
This is motivated mostly by conversations in Slack, where a few users mentioned that they use Great Expectations in their Prefect pipelines, either in production or in test, but mostly for ad-hoc/final testing of data at the end of an ETL, not necessarily regularly along the way (though there was recognition that it could be useful for complex data intermediates). That pattern motivates the task library use case more than the work to embed validation throughout the pipeline, since validations can be integrated as needed at the end, or as dedicated task nodes for the few complex data intermediates.
I personally don’t think we have enough information or motivation yet to pursue the heavier integrations, but I have edited the OP with questions related to the design of a heavier integration based on conversations to date, and I am leaving the issue open as I collect more information. Discussion welcome!
Some inspiration: https://greatexpectations.io/blog/dagster-integration-announcement/