[KED-1367] Integration with great-expectations
Description
I have been using kedro for a little while now for data engineering/cleaning. A standard step in these processes is testing the data at different steps of the pipeline. To do this, I’ve been using great-expectations to write expectations pipelines that are essentially slotted in between different cleaning/engineering steps. It would be great to have a way to point a kedro.io dataset type towards a suite of expectations, as defined in great_expectations.
Context
Testing data is a pretty essential step of data pipelines. great-expectations offers a really nice suite of tools for communicating and testing what is expected out of a dataset/pipeline.
Possible Implementation
(Optional) The first method that jumps to mind is extending the dataset types in kedro.io to use expectation suites. In particular, this could be done by extending the _save() method to run a set of expectations on a dataset every time it is saved, and to store the results of the run so they can be used by the great_expectations visualization features. The locations of the expectation suites would be another attribute added when defining the dataset in the Data Catalog, following the same idea as filepath: data/...
i.e. expectation_suites: -.../...
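To make the idea above a bit more concrete, here is a rough, self-contained sketch of "validate on save". It deliberately does not use the real kedro.io or great_expectations APIs: ValidatedDataSet, ValidationError, save_func and the expectation names are all hypothetical, and an "expectation" here is just a plain callable returning True or False. A real implementation would presumably load a suite from the configured location instead.

```python
from typing import Any, Callable, Dict, List

# Assumption: an "expectation" is any callable that takes the data and
# returns True (passed) or False (failed). This is NOT the great_expectations API.
Expectation = Callable[[Any], bool]


class ValidationError(Exception):
    """Raised when data fails one or more expectations on save."""


class ValidatedDataSet:
    """Hypothetical wrapper that runs expectations before delegating to a save function."""

    def __init__(self, save_func: Callable[[Any], None], expectations: Dict[str, Expectation]):
        self._save_func = save_func
        self._expectations = expectations
        # Results of the most recent run, kept so they could be reported or visualised later.
        self.last_results: Dict[str, bool] = {}

    def save(self, data: Any) -> None:
        # Run every expectation and record its outcome by name.
        self.last_results = {
            name: bool(check(data)) for name, check in self._expectations.items()
        }
        failed = [name for name, passed in self.last_results.items() if not passed]
        if failed:
            # Refuse to persist data that violates any expectation.
            raise ValidationError(f"Failed expectations: {failed}")
        self._save_func(data)


# Usage: "saving" just appends to an in-memory list for illustration.
saved: List[Any] = []
dataset = ValidatedDataSet(
    save_func=saved.append,
    expectations={
        "not_empty": lambda rows: len(rows) > 0,
        "ids_not_null": lambda rows: all(row.get("id") is not None for row in rows),
    },
)
dataset.save([{"id": 1}, {"id": 2}])
```

Whether a failed expectation should block the save (as in this sketch) or only be logged alongside the results is a design choice that would probably need to be configurable per dataset.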
Possible Alternatives
(Optional) No idea where to start here, but an alternative path may be a plugin.
Issue Analytics
- State:
- Created 4 years ago
- Reactions: 7
- Comments: 25 (13 by maintainers)
Top GitHub Comments
@EigenJT @JasonLeungQB @deepyaman You’ll be excited to note that we have built a kedro-ge plugin internally that we will be open-sourcing. We’re actually talking to the Great Expectations team about this work, and it will be a collaborative effort when we release the plugin. @ZainPatelQB @tsanikgr are behind this amazing work.
Quick update on the work we’re doing here. We did an internal release a week or two ago with support for validating datasets as part of your pipeline, with the actions you want taken declared in a config file.
We’re dogfooding this intensely at the moment and getting lots of internal feedback. We’re planning on doing quite a lot more work, but we’re inching closer to open sourcing!
GE is also releasing a pretty big breaking change in 0.11 soon, which we’ll have to spend some time catching up with. Glad to see the enthusiasm in this thread!