Introduction

The Dolt team is interested in exposing DoltDB as a Kedro DataSet type. We are also excited about the idea of exposing diffing and other SQL features for change capture if useful to the Kedro team.

I briefly filled out the bullet points below, but the write-up in my draft PR is more to the point.

Draft PR -> https://github.com/dolthub/kedro/pull/1

The PR also includes brief tutorial notes and comments.

The starter integration is not heavily tested, and we don’t intend these additions to make it into a final PR; we are most interested in design feedback.

Background

Dolt is a SQL database with Git versioning. Standalone, it can be a data source for workflow managers; without custom code it does just what MySQL or SQLite does. Varying levels of Git functionality can be included in integrations to provide versioning, diffing, merging, and reproducibility for tabular datasets in a way that is unique to our storage layer (we have quite a few blogs on this). We have gotten a lot of positive feedback in this space so far and hope we can help solve thorny versioning problems!
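
To make that concrete, here is a minimal sketch of the Git-style workflow from Python, assuming doltpy 2.x’s CLI wrapper (method names may vary across versions):

import os

from doltpy.cli import Dolt  # import path assumes doltpy 2.x

os.makedirs("demo-db", exist_ok=True)
repo = Dolt.init("demo-db")              # like `git init`, plus a SQL engine
repo.sql("CREATE TABLE users (id INT PRIMARY KEY, name TEXT)")
repo.sql("INSERT INTO users VALUES (1, 'ada')")
repo.add(".")
repo.commit("add users table")           # a real commit: diffable, mergeable

repo.sql("UPDATE users SET name = 'grace' WHERE id = 1")
# Each table has a dolt_diff_<table> system table exposing row-level deltas.
print(repo.sql("SELECT * FROM dolt_diff_users", result_format="csv"))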

Problem

What’s in scope

  1. Generic database integration
  2. Commits in the database that end-users manage themselves (sketched after this list)
  3. Metadata that helps users and/or Kedro track lineage
  4. Application database that extends workflow change-capture
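
For instance, item 2 could be as small as end-users issuing Dolt’s commit functions over the same connection the pipeline already uses. A hedged sketch (the connection string is illustrative; DOLT_COMMIT is a Dolt SQL function, not standard MySQL):

import sqlalchemy as sa

# Illustrative connection string; any MySQL-compatible client works.
engine = sa.create_engine("mysql://root@localhost:3306/kedro_dolt_demo")

with engine.connect() as conn:
    # Stage and commit from inside the SQL session, Git-style.
    conn.execute(sa.text("SELECT DOLT_COMMIT('-a', '-m', 'checkpoint after pipeline node')"))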

What’s not in scope

Design

The parts of Kedro’s remote-object interface that I’ve focused on:

  • pre-configured data catalogs
  • tabular datasets
  • save and load methods (and others)
  • data journaling (at the catalog layer)

I made an example Dolt integration that behaves similarly to pandas DataFrames for end-users, but uses Dolt behind the scenes to capture lineage and deltas of those tables.
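
As a rough illustration of the shape this takes (a sketch, not the code in the draft PR; DoltTableDataSet is a hypothetical name), a Dolt-backed dataset can subclass Kedro’s AbstractDataSet and round-trip DataFrames through the SQL interface:

import pandas as pd
import sqlalchemy as sa
from kedro.io import AbstractDataSet


class DoltTableDataSet(AbstractDataSet):
    """Hypothetical sketch of a Dolt-backed Kedro dataset."""

    def __init__(self, table_name: str, con: str):
        self._table_name = table_name
        self._engine = sa.create_engine(con)

    def _load(self) -> pd.DataFrame:
        # Loads look exactly like any other SQL-backed DataFrame read.
        return pd.read_sql_table(self._table_name, self._engine)

    def _save(self, data: pd.DataFrame) -> None:
        data.to_sql(self._table_name, self._engine, if_exists="replace", index=False)
        # Lineage capture (Dolt commits, diff metadata) would hook in here.

    def _describe(self):
        return dict(table_name=self._table_name)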

Metadata storage, remotes, and advanced branching logic are all optional extensions beyond an otherwise pd.DataFrame experience.

Two friction points that I haven’t addressed in my sample code: journaling is scoped to data catalogs, and “versioning” means something different in Dolt than it does in Kedro.

Alternatives considered

Two other integration patterns:

  1. Expose the Dolt database itself, which users can interact with natively
  2. Context manager that can “squash” the metadata log by wrapping an execution runtime.

Neither of these struck me as particularly suited to Kedro’s existing UX.
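
For concreteness, the second pattern might look roughly like this. It’s a sketch only: it assumes the dolt CLI is on PATH, uses Dolt’s HASHOF() SQL function to record the pre-run commit, and “squashes” by soft-resetting back to it:

import subprocess
from contextlib import contextmanager


@contextmanager
def squashed_dolt_run(repo_path: str, message: str):
    """Collapse all Dolt commits made inside the block into a single commit."""
    # Record where HEAD points before the run (HASHOF is a Dolt SQL function).
    out = subprocess.check_output(
        ["dolt", "sql", "-q", "SELECT HASHOF('HEAD')", "-r", "csv"],
        cwd=repo_path, text=True,
    )
    start = out.strip().splitlines()[-1]
    try:
        yield  # run the pipeline; nodes may create many intermediate commits
    finally:
        # Fold everything since `start` into one commit on top of it.
        subprocess.run(["dolt", "reset", "--soft", start], cwd=repo_path, check=True)
        subprocess.run(["dolt", "add", "."], cwd=repo_path, check=True)
        subprocess.run(["dolt", "commit", "-m", message], cwd=repo_path, check=True)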

edit: In our intro call, SQL-server integration was mentioned as more appealing than an FS-based approach. The two are interchangeable; FS is just easier to demo and test currently.

Testing

Explain the testing strategies to verify your design correctness (if possible). TODO

Rollout strategy

Is the change backward compatible? If not, what is the migration strategy? TODO (short answer: yes)

Future iterations

Will there be future iterations of this design?

Hopefully! We are excited for feedback!

Top GitHub Comments

limdauto commented, Apr 23, 2021

Hi @max-hoffman, thank you very much for taking the time to write the issue and for making the demo. Apologies for the delay in response, partially because I have been in some training all week and partially because I really want to wrap my head around what exactly we are trying to accomplish here. First things first: such an awesome piece of technology you and the Dolt team have built there. I can’t express enough how excited I am about Dolt. It feels like having a superpower I don’t yet know what to do with.

Regarding an integration with Kedro, you have touched on many great ideas in your issue and in the demo. However, please allow me to take a step back and look at this from a Kedro user perspective first. As a Kedro user, I believe I can already use Dolt right now as a data source in Kedro without any extra dataset, thanks to your SQL interface. I would use it wherever I want to track different versions of my tabular datasets. It would be an alternative option to Kedro’s path-based VersionedDataSet for different tabular formats, e.g. csv.

The workflow is:

  1. Since Kedro allows users to inject extra behaviours into its execution timeline through a mechanism called Hooks, I’d write a before_pipeline_run hook to start a Dolt SQL server and an after_pipeline_run hook to commit the data and stop the SQL server:
from pathlib import Path
from typing import Any, Dict

# Import paths assume doltpy 2.x; they may differ in other versions.
from doltpy.cli import Dolt, DoltException
from doltpy.sql import DoltSQLServerContext, ServerConfig
from kedro.framework.hooks import hook_impl


class ProjectHooks:
    def __init__(self):
        # Assumes the Dolt repo lives at the Kedro project root.
        project_path = Path(__file__).parent.parent.parent
        self.dolt = Dolt(project_path)
        self.dolt_sql_server = DoltSQLServerContext(self.dolt, ServerConfig())

    @hook_impl
    def before_pipeline_run(self):
        # Start a Dolt SQL server so datasets can connect over the MySQL protocol.
        self.dolt_sql_server.start_server()

    @hook_impl
    def after_pipeline_run(self, run_params: Dict[str, Any]):
        # Stage and commit whatever the pipeline wrote during this run.
        self.dolt.add(".")
        try:
            self.dolt.commit(
                message=f"Update data from Kedro run {run_params['run_id']} with params {run_params['extra_params']}"
            )
        except DoltException as e:
            # A run that wrote no new data is not an error.
            if "no changes added to commit" not in str(e):
                raise
        finally:
            self.dolt_sql_server.stop_server()
  2. Then whenever I want to write data to Dolt, I’d just use the SQL interface through the built-in pandas.SQLTableDataSet. For example:
example_test_x:
  type: pandas.SQLTableDataSet
  table_name: example_test_x
  credentials:
    con: mysql://root@localhost:3306/kedro_dolt_demo
  save_args:
    if_exists: replace

And voila! If your data changes between kedro runs, it’d show up as Dolt commits in dolt log. For example, I have set up an example project here to demonstrate this. It’s exactly the same as a default project created with our pandas-iris starter:

kedro new --starter=pandas-iris

with a modified hooks.py and catalog.yml to integrate with Dolt as explained above*. The pipeline contains a node that splits data for training and testing purposes based on some parameters. When I run the pipeline with different train/test split ratios:

kedro run --params example_test_data_ratio:0.1
kedro run --params example_test_data_ratio:0.2

there are corresponding commits in dolt:

D:\kedro-dolt-demo (main -> origin) 
(kedro-38) λ dolt log
commit m3112s3uuird3rtjt28cdeitp5prp6td
Author: Lim Hoang <limdauto@gmail.com>
Date:   Fri Apr 23 23:46:04 +0100 2021

        Update data from Kedro run 2021-04-23T22.45.45.157Z with params {'example_test_data_ratio': 0.2}

commit jc77hh54t97na1hs8i6k8b5pfrh7tiej
Author: Lim Hoang <limdauto@gmail.com>
Date:   Fri Apr 23 23:45:01 +0100 2021

        Update data from Kedro run 2021-04-23T22.44.41.926Z with params {'example_test_data_ratio': 0.1}

We can now use all of Dolt’s tools to interact with the data, e.g. dolt diff:

(screenshot: dolt diff output)

I believe this workflow is more familiar and idiomatic to Kedro users while still showcasing the value that Dolt brings. If you are happy with this approach, we could definitely write it up in our documentation in the Tools integration section, next to Spark. Some further ideas to improve upon this: allow users to check out different data branches by passing an extra param from the CLI, e.g. kedro run --params dolt_branch:yesterday_data, and use dolt.checkout programmatically in the before_pipeline_run hook. The dream here would be to incorporate this concept of data branches into data scientists’ experiment-tracking tools, which we also do through Hooks. Writing this up takes a bit more time, so I will leave it till another day.
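
To sketch that branch-checkout idea (illustrative only: dolt_branch is a made-up parameter name, and it assumes doltpy’s Dolt.checkout), the before_pipeline_run hook above could grow into:

# Inside ProjectHooks from the snippet above:
@hook_impl
def before_pipeline_run(self, run_params: Dict[str, Any]):
    # `dolt_branch` is a hypothetical param: kedro run --params dolt_branch:yesterday_data
    branch = (run_params.get("extra_params") or {}).get("dolt_branch")
    if branch:
        self.dolt.checkout(branch)  # switch data branches before the run
    self.dolt_sql_server.start_server()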


(*) I lied a little bit here. Even though I recommend starting and stopping the Dolt SQL server programmatically, I actually had to do it manually in my demo project with dolt sql-server --max-connections=10 from another terminal. When I start the server from another terminal, I get the nice diff of my data as presented above. However, when I start it programmatically, the diff simply says table deleted/table added. Do you have any idea why? Our SQLTableDataSet uses pandas read_sql_table and to_sql underneath. Also, thanks for fixing --max-connections yesterday haha… otherwise it was hanging for me before.

max-hoffman commented, May 19, 2021

I made and released a plugin here – https://github.com/dolthub/kedro-dolt.
