Dolt // Kedro
Introduction
The Dolt team is interested in exposing DoltDB as a Kedro DataSet type. We are also excited about the idea of exposing diffing and other SQL features for change capture if useful to the Kedro team.
I briefly filled out the bullet points below, but the write-up in my draft PR is more straight-to-the-point.
Draft PR -> https://github.com/dolthub/kedro/pull/1
Included in the PR – brief tutorial notes/comments.
The starter integration is not heavily tested, and we don’t intend these additions to make it into a final PR; we are most interested in design feedback.
Background
Dolt is a SQL database with Git-style versioning. Standalone, it can be a data source for workflow managers. Without custom code it just does what MySQL or SQLite does. Varying levels of Git functionality can be included in integrations to provide versioning, diffing, merging and reproducibility for tabular datasets, capabilities that are unique to our storage layer (we have quite a few blogs on this). We have gotten a lot of positive feedback so far in this space and hope we can help solve thorny versioning problems!
Problem
What’s in scope
- Generic database integration
- Commits in database that end-user manages themselves
- Metadata that helps users and/or Kedro track lineage
- Application database that extends workflow change-capture
What’s not in scope
Design
Kedro remote-object interface that I’ve focused on:
- pre-configured data catalogues
- tabular datasets
- save and load methods (and others)
- data journaling (at the catalog layer)
I made an example Dolt integration that behaves similarly to Pandas DataFrames for end-users, but uses Dolt to capture lineage and deltas of those tables.
Metadata storage, remotes, and advanced branching logic are all optional extensions beyond an otherwise pd.DataFrame experience.
There are two friction points that I haven’t addressed in my sample code: journaling is scoped to data catalogs, and versioning has a different meaning in Dolt.
Alternatives considered
Two other integration patterns:
- Expose the Dolt database itself, which users can interact with natively
- A context manager that can “squash” the metadata log by wrapping an execution runtime
Neither of these struck me as particularly suited to Kedro’s existing UX.
edit: SQL-server integration was mentioned as more appealing than an FS-based approach in our intro call. The two are interchangeable; FS is just easier to demo and test currently.
Testing
Explain the testing strategies to verify your design correctness (if possible). TODO
Rollout strategy
Is the change backward compatible? If not, what is the migration strategy? TODO (short answer yes)
Future iterations
Will there be future iterations of this design?
Hopefully! We are excited for feedback!

Hi @max-hoffman, thank you very much for taking the time to write the issue and for making the demo. Apologies for the delay in response, partly because I have been in training all week and partly because I really wanted to wrap my head around what exactly we are trying to accomplish here. First things first: such an awesome piece of technology you and the Dolt team have built there. I can’t express enough how excited I am about Dolt. It feels like having a superpower I don’t yet know what to do with.
Regarding an integration with Kedro, you have touched on many great ideas in your issue and in the demo. However, please allow me to take a step back and look at this from a Kedro user perspective first. As a Kedro user, I believe I can already use Dolt right now as a data source in Kedro without any extra dataset, thanks to your SQL interface. I would use it wherever I want to track different versions of my tabular datasets. It would be an alternative option to Kedro’s path-based VersionedDataSet for different tabular formats, e.g. csv.
The workflow is:
- A before_pipeline_run hook to start a Dolt SQL server and an after_pipeline_run hook to commit the data and stop the SQL server.
- Datasets defined in the catalog with pandas.SQLTableDataSet.
And voila! If your data changes between kedro run invocations, it’d show up as Dolt commits in dolt log. For example, I have set up an example project here to demonstrate this. It’s exactly the same as a default project created with our pandas-iris starter, with a modified hooks.py and catalog.yml to integrate with Dolt as explained above (*). The pipeline contains a node that splits data for training and testing purposes based on some parameters. When I run the pipeline with different train/test split ratios, there are corresponding commits in Dolt.
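The original hooks.py and catalog.yml snippets are not reproduced here, so the following is only a rough sketch of the workflow described above, not the code from the demo project. It assumes the dolt CLI is on the PATH and that repo_dir points at an already-initialised Dolt repository. The class name DoltServerHooks, the helper dolt_catalog_entry, the commit message, the credentials key "dolt", and the choice to stop the server before committing are all illustrative assumptions. The catalog entry is shown as the Python dictionary that an equivalent catalog.yml entry would parse to.

```python
import subprocess

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so the sketch can be read and exercised without Kedro installed.
    def hook_impl(func):
        return func


class DoltServerHooks:
    """Start `dolt sql-server` before a run; stop it and commit afterwards."""

    def __init__(self, repo_dir, popen=subprocess.Popen, run=subprocess.run):
        self.repo_dir = repo_dir  # assumed: an existing Dolt repository
        self._popen = popen       # injectable so the hooks can be tested
        self._run = run
        self._server = None

    @hook_impl
    def before_pipeline_run(self, run_params):
        # Same command the reply ran by hand from another terminal.
        self._server = self._popen(
            ["dolt", "sql-server", "--max-connections=10"], cwd=self.repo_dir
        )

    @hook_impl
    def after_pipeline_run(self, run_params):
        # Assumed ordering: stop the server, then commit through the CLI.
        self._server.terminate()
        self._server.wait()
        self._run(["dolt", "add", "."], cwd=self.repo_dir)
        # `dolt commit` exits non-zero when nothing changed, so the
        # return code is deliberately not checked here.
        self._run(
            ["dolt", "commit", "-m", "Update data from kedro run"],
            cwd=self.repo_dir,
        )


def dolt_catalog_entry(table_name):
    """Dict form of a catalog.yml entry backed by a Dolt table."""
    return {
        "type": "pandas.SQLTableDataSet",
        "table_name": table_name,
        # Name of a credentials entry holding a SQLAlchemy "con" string
        # for the running Dolt server (hypothetical key name).
        "credentials": "dolt",
        # Passed through to pandas `to_sql`.
        "save_args": {"if_exists": "replace"},
    }
```

Passing popen and run in through the constructor keeps the subprocess calls swappable, which makes it possible to check the command sequence without a Dolt installation.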
We can now use all of Dolt’s tools to interact with the data, e.g. dolt diff.
I believe this workflow is more familiar and idiomatic to Kedro users while still showcasing the value that Dolt would bring. If you are happy with this approach, we could definitely write it up in our documentation in the section for Tools integration next to Spark. Some further ideas to improve upon this would be to allow users to check out different data branches by passing in an extra param from the CLI, e.g. kedro run --params dolt_branch:yesterday_data, and use dolt.checkout programmatically in the before_pipeline_run hook. The dream here would be to be able to incorporate this concept of data branches with data scientists’ experimentation tracking tools, which we also do through Hooks. Writing this up takes a bit more time so I will leave it till another day.
(*) I lie a little bit here. Even though I recommend we start and stop the Dolt SQL server programmatically, I actually had to do it manually in my demo project with dolt sql-server --max-connections=10 from another terminal. When I start the server from another terminal, I get the nice diff of my data as presented above. However, when I start it programmatically, the diff simply says table deleted/table added. Do you have any idea why? Our SQLTableDataSet uses pandas read_sql_table and to_sql underneath. Also thanks for fixing the --max-connections yesterday haha… Otherwise it was hanging for me before.
I made and released a plugin here – https://github.com/dolthub/kedro-dolt.
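The data-branch idea from the reply above (kedro run --params dolt_branch:yesterday_data) could be sketched roughly as below. This is an illustration under assumptions, not the kedro-dolt plugin's implementation: it shells out to the dolt checkout CLI command rather than the dolt.checkout call mentioned in the reply, it assumes the CLI parameter arrives under the "extra_params" key of run_params, and the names DoltBranchHooks and branch_from_run_params are hypothetical. How the checkout should be ordered relative to starting the SQL server is not addressed here.

```python
import subprocess

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so the sketch can be read and exercised without Kedro installed.
    def hook_impl(func):
        return func


def branch_from_run_params(run_params, default=None):
    """Return the dolt_branch passed via `--params`, or `default`."""
    extra = run_params.get("extra_params") or {}
    return extra.get("dolt_branch", default)


class DoltBranchHooks:
    """Check out a Dolt branch before the pipeline runs, if one was requested."""

    def __init__(self, repo_dir, run=subprocess.run):
        self.repo_dir = repo_dir  # assumed: an existing Dolt repository
        self._run = run           # injectable so the hook can be tested

    @hook_impl
    def before_pipeline_run(self, run_params):
        branch = branch_from_run_params(run_params)
        if branch is not None:
            # Fail the run early if the branch does not exist.
            self._run(
                ["dolt", "checkout", branch], cwd=self.repo_dir, check=True
            )
```

When no dolt_branch parameter is supplied, the hook does nothing and the run uses whichever branch is already checked out.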