[KED-1367] Integration with great-expectations

Description

I have been using kedro for a little while now for data engineering and cleaning. A standard part of this work is testing the data at different stages of the pipeline. To do this, I’ve been using great-expectations to write expectation pipelines that are slotted in between the cleaning/engineering steps. It would be great to have a way to point a kedro.io dataset type at a suite of expectations, as defined in great_expectations.

Context

Testing data is a pretty essential step of data pipelines. great-expectations offers a really nice suite of tools for communicating and testing what is expected out of a dataset/pipeline.

Possible Implementation

The first method that jumps to mind is extending the dataset types in kedro.io to use expectation suites. In particular, the _save() method could be extended to run a set of expectations on a dataset every time it is saved, and to save the results of the run for use in the great_expectations visualization features. The locations of the expectation suites would be another attribute set when defining the dataset in the Data Catalog, in the same way as filepath: data/..., i.e. expectation_suites: -.../...
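
To make the idea concrete, here is a rough sketch of the save-time validation described above. It deliberately uses only the Python standard library: ValidatedCSVDataSet, its expectation_suite and results_path arguments, and the "expectation as a plain function" convention are all hypothetical stand-ins for illustration, not real kedro.io or great_expectations APIs. A real integration would subclass a kedro dataset and call into great_expectations instead.

```python
import csv
import json
from pathlib import Path
from typing import Callable, Dict, List

Row = Dict[str, str]
# Hypothetical stand-in for an expectation: takes all rows, returns True on success.
Expectation = Callable[[List[Row]], bool]


class ValidatedCSVDataSet:
    """Illustrative only: NOT a real kedro.io class."""

    def __init__(
        self,
        filepath: str,
        expectation_suite: Dict[str, Expectation],
        results_path: str,
    ) -> None:
        self._filepath = Path(filepath)
        self._suite = expectation_suite
        self._results_path = Path(results_path)

    def _validate(self, rows: List[Row]) -> Dict[str, bool]:
        # Run every named expectation and collect a pass/fail flag for each.
        return {name: bool(check(rows)) for name, check in self._suite.items()}

    def _save(self, rows: List[Row]) -> None:
        results = self._validate(rows)

        # Persist the validation results on every save so they can be
        # inspected or rendered later, whether or not the checks passed.
        self._results_path.parent.mkdir(parents=True, exist_ok=True)
        self._results_path.write_text(json.dumps(results, indent=2))

        failed = [name for name, ok in results.items() if not ok]
        if failed:
            # Assumption: a failed expectation blocks the write entirely.
            raise ValueError(f"Expectations failed: {failed}")

        # All expectations passed, so write the data itself.
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        fieldnames = list(rows[0].keys()) if rows else []
        with self._filepath.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
```

Whether a failed expectation should block the save, as it does here, or only be recorded is a design choice; the sketch picks the stricter option so that bad data never reaches the next step of the pipeline.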

Possible Alternatives

No idea where to start here, but an alternative path may be a plugin.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 7
  • Comments: 25 (13 by maintainers)

Top GitHub Comments

9 reactions
yetudada commented, Apr 15, 2020

@EigenJT @JasonLeungQB @deepyaman You’ll be excited to note that we have built a kedro-ge plugin internally that we will be open-sourcing. We’re actually talking to the Great Expectations team about this work, and it will be a collaborative effort when we release the plugin.

@ZainPatelQB @tsanikgr are behind this amazing work.

7 reactions
mzjp2 commented, May 11, 2020

Quick update on the work we’re doing here. We did an internal release a week or two ago with support for validating datasets as part of your pipeline, where you declare the actions you want taken in a config file.

We’re dogfooding this intensely at the moment, with lots of internal feedback and are planning on doing quite a lot more work, but inching closer to open sourcing!

GE is also releasing a pretty big breaking change in 0.11 soon, which we’ll have to spend some time catching up with. Glad to see the enthusiasm in this thread!
