Run pipeline without reading from intermediate datasets
Description
I’m always frustrated when I/O dominates compute.
For example, my pipeline takes 10 minutes to run, of which 9 minutes are spent writing to and reading back from S3.
Context
At QuantumBlack, it’s most common to write intermediate datasets to disk; in fact, the Kedro data catalog very much facilitates this workflow. This approach also presents several advantages:
1. The ability to resume execution from any stage of the pipeline where all inputs were persisted.
2. Ease of debugging intermediate steps of the pipeline.
3. Transcoding, a unique behavior wherein the data deliberately changes representation during the write-read process (e.g. saved with Spark and loaded back with pandas).
However, it’s also extremely inefficient, especially when writing large datasets using slow mechanisms. On top of that, we most often expect reloaded data to be exactly equal to what was saved, except in the case of transcoding and some terminal output formats (e.g. Excel, CSV).
Possible Implementation
https://github.com/deepyaman/hookshot/blob/develop/src/hookshot/hooks.py
Feel free to clone the repo and run the example. 😃
At a high level, the plugin aims to provide Unix tee-like behavior to runners: each node output is handed to downstream nodes in memory while the write to its persisted location happens in the background (see the sketch after the lists below).
Goals of this implementation:
- Retain the benefits of saving to intermediate datasets (1 and 2 above).
- Be transparent to the user. Nobody wants to modify all their nodes to have double the outputs, nor do they want a crazy-looking Kedro-Viz.
Limitations:
- Doesn’t support transcoding. I think this is reasonable, as you need to write to/read from disk if your pipeline depends on transcoding. The user should likely be notified/prevented from using this if they’re transcoding. Alternatively, you could special-case those nodes and block on write-read for them.
- Doesn’t detect the default dataset, so it doesn’t use `SharedMemoryDataSet` for `ParallelRunner`. I would be happy to get some input from the experts here. 😃
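To make the tee idea concrete, here is a minimal sketch of the mechanism, assuming Kedro-style hooks that wrap catalog entries. It is illustrative only, not the hookshot implementation linked above; `TeeHooks` and `TeeDataSet` are hypothetical names, and reaching into the private `catalog._get_dataset` is a shortcut for this issue, not something a polished plugin would do.

```python
# A minimal sketch of the tee idea; NOT the hookshot implementation.
from concurrent.futures import ThreadPoolExecutor

from kedro.framework.hooks import hook_impl
from kedro.io import AbstractDataSet


class TeeDataSet(AbstractDataSet):
    """Wraps a persisted dataset: ``save`` keeps the object in memory and
    schedules the real (slow) save on a background thread, so downstream
    nodes never block on a write-read round trip."""

    def __init__(self, wrapped, executor, futures):
        self._wrapped = wrapped
        self._executor = executor
        self._futures = futures
        self._data = None

    def _save(self, data):
        self._data = data  # pass downstream in memory ...
        # ... while persisting in the background, tee-style.
        self._futures.append(self._executor.submit(self._wrapped.save, data))

    def _load(self):
        if self._data is not None:
            return self._data  # skip the disk read entirely
        return self._wrapped.load()  # e.g. raw inputs written by earlier runs

    def _describe(self):
        return {"wrapped": str(self._wrapped)}


class TeeHooks:
    def __init__(self, max_workers=4):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = []

    @hook_impl
    def after_catalog_created(self, catalog):
        # Wrap every declared dataset. A real plugin would skip transcoded
        # (`@`) entries and parameters, per the limitations above; datasets
        # not declared in the catalog (default MemoryDataSets) are untouched.
        for name in catalog.list():
            catalog.add(
                name,
                TeeDataSet(catalog._get_dataset(name), self._executor, self._futures),
                replace=True,
            )

    @hook_impl
    def after_pipeline_run(self, run_params, pipeline, catalog):
        # Single synchronization point: wait for all outstanding saves at
        # the very end of the run, so everything still lands on disk.
        for future in self._futures:
            future.result()
```

The single `future.result()` loop at the end is what retains benefits 1 and 2 above: every dataset is still persisted, just without the pipeline blocking once per dataset.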
I’m most interested in understanding the best way to contribute this. I think it makes sense as part of a new `kedro.extras.hooks` subpackage; as part of Kedro, this functionality would continue to be supported through backend redesigns.

Sorry for the delay! I’ve put together something in my spare time; it’s not feature complete, but I figured I’d share.
Let’s assume a slow filesystem with load and save delays of 10 seconds for intermediate datasets. I haven’t yet added delays in nodes (to simulate nontrivial data processing); an example of where this would make a better case for `TeePlugin` is that the last node would be executing while we wait 10 seconds at the end of the run for everything to save. Here are executions under each strategy:

- `TeePlugin`
- `CachePlugin` (i.e. `CachedDataSet`) with `is_async=True`, caching the `split_data`, `train_model`, and `predict` node outputs

(Note that times include the initial minute of delays before the pipeline begins, because the way I added delays somehow triggers them on initialization.)
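The delays are injected at the dataset level; a latency-injecting dataset along these lines is enough to mimic a high-latency store (the class name and keyword arguments here are illustrative, not necessarily what the repo uses):

```python
# One way to simulate a slow filesystem for benchmarking; the actual
# mechanism in the hookshot repo may differ.
import time

from kedro.extras.datasets.pandas import CSVDataSet


class DelayedCSVDataSet(CSVDataSet):
    """CSVDataSet whose loads/saves sleep to mimic high-latency storage."""

    def __init__(self, filepath, load_delay=10, save_delay=10, **kwargs):
        super().__init__(filepath=filepath, **kwargs)
        self._load_delay = load_delay
        self._save_delay = save_delay

    def _load(self):
        time.sleep(self._load_delay)  # pretend this is a slow S3 read
        return super()._load()

    def _save(self, data):
        time.sleep(self._save_delay)  # pretend this is a slow S3 write
        super()._save(data)
```

Because Kedro resolves the `type` field of a catalog entry by dotted path, the delays can then be tuned per dataset from the catalog.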
The code to run these examples is in https://github.com/deepyaman/hookshot/. You can also change the load/save delays in `conf/base/catalog.yml` to simulate different latencies. Next steps:

- `CachePlugin` as an alternative way to implement what you proposed
- `--hooks` and `--async` CLI options?

Yes, to some extent. My implementation (https://github.com/deepyaman/hookshot/blob/develop/src/hookshot/hooks.py) is based on the code that handles the async functionality, but extended across the pipeline rather than applied per node (hence the “unrolled” `ThreadPoolExecutor` instead of a nice little context manager).

My sense is that existing mechanisms move in the direction of what I want, but they don’t push it far enough. I will try to find some time to benchmark these different approaches under parametrizable conditions (`READ_LATENCY`, `WRITE_LATENCY`, `READ_TIME`, `WRITE_TIME`, etc.), in addition to creating this as a plugin.
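To spell out the “unrolled” distinction, here is a rough sketch (not Kedro’s actual runner code) of per-node async saves versus a pipeline-wide executor:

```python
# Sketch only: per-node vs. pipeline-wide background saves.
from concurrent.futures import ThreadPoolExecutor

# Per-node async (roughly what is_async gives you): the context manager
# joins its workers when the block exits, so the runner still stalls on
# slow saves once per node.
def save_outputs_per_node(outputs, catalog):
    with ThreadPoolExecutor() as pool:
        for name, data in outputs.items():
            pool.submit(catalog.save, name, data)
    # implicit pool.shutdown(wait=True) here, before the next node runs

# Pipeline-wide ("unrolled") executor: one pool outlives all nodes and is
# joined exactly once, after the last node has finished.
pool = ThreadPoolExecutor()
futures = []

def save_outputs_teed(outputs, catalog):
    for name, data in outputs.items():
        futures.append(pool.submit(catalog.save, name, data))

def finish_run():
    for future in futures:
        future.result()  # surface any save errors at the end of the run
    pool.shutdown()
```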