
Run pipeline without reading from intermediate datasets


Description

I’m always frustrated when I/O dominates compute.

For example, my pipeline takes 10 minutes to run, of which 9 minutes are spent writing to and reading back from S3.

Context

At QuantumBlack, it’s most common to write intermediate datasets to disk. In fact, the Kedro data catalog very much facilitates this workflow. This also presents numerous advantages:

  1. The ability to resume execution from any stage of the pipeline where all inputs were persisted.
  2. Ease of debugging intermediate steps of the pipeline.
  3. Transcoding, a unique behavior wherein the data changes during the write-read process.

However, it’s also extremely inefficient, especially when writing large datasets using slow mechanisms. On top of that, we most often expect reloaded data to be exactly equal to what was saved, except in the case of transcoding and some terminal output formats (e.g. Excel, CSV).

Possible Implementation

https://github.com/deepyaman/hookshot/blob/develop/src/hookshot/hooks.py

Feel free to clone the repo and run the example. 😃

At a high level, the plugin aims to provide Unix tee-like behavior to runners.
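To make the tee idea concrete, here is a minimal sketch of what such a hook might look like. It is illustrative, not the hookshot code itself: it assumes the `MemoryDataSet` name used by 0.16-era Kedro (renamed `MemoryDataset` in 0.19) and reaches into the catalog through a private accessor whose name has varied between releases.

```python
# Illustrative sketch only; not the hookshot implementation.
from concurrent.futures import ThreadPoolExecutor

from kedro.framework.hooks import hook_impl
from kedro.io import MemoryDataSet


class TeeHooks:
    """Persist node outputs in the background; serve downstream loads from memory."""

    def __init__(self):
        self._executor = ThreadPoolExecutor()

    @hook_impl
    def after_node_run(self, node, catalog, outputs):
        for name, data in outputs.items():
            if name in catalog.list():
                # Grab the persistent dataset and save to it asynchronously.
                # NOTE: _get_dataset is private and is spelled _get_data_set
                # in some older Kedro releases.
                persistent = catalog._get_dataset(name)
                self._executor.submit(persistent.save, data)
            # Shadow the catalog entry so downstream nodes load the object
            # from memory instead of waiting on the write-read round trip.
            catalog.add(name, MemoryDataSet(data), replace=True)

    @hook_impl
    def after_pipeline_run(self, run_params):
        # Block once, at the very end of the run, for outstanding saves.
        self._executor.shutdown(wait=True)
```

Registered like any other hook, this keeps benefits 1 and 2 above: everything is still persisted, just off the critical path.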

Goals of this implementation:

  1. Retain the benefits of saving to intermediate datasets (1 and 2 above).
  2. Be transparent to the user. Nobody wants to modify all their nodes to have double the outputs, nor do they want a crazy-looking Kedro-Viz.

Limitations:

  1. Doesn’t support transcoding. I think this is reasonable, as you need to write to/read from disk if your pipeline depends on transcoding. The user should likely be notified/prevented from using this if they’re transcoding. Alternatively, you could special-case those nodes and block on write-read for them.
  2. Doesn’t detect the default dataset, so it doesn’t use SharedMemoryDataSet for ParallelRunner. I would be happy to get some input from the experts here. 😃

I’m most interested in understanding the best way to contribute this. I think it makes sense as part of a new kedro.extras.hooks subpackage. As part of Kedro, this functionality would continue to be supported through backend redesigns.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 5
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

5 reactions
deepyaman commented, Jul 14, 2020

Sorry for the delay! I’ve put together something in my spare time, not feature complete but figured I’d share.

Let’s assume a slow filesystem with a load and save delay of 10 seconds for intermediate datasets. I haven’t yet added delays in the nodes themselves (to simulate nontrivial data processing); one example of where that would make an even better case for TeePlugin is that the last node would be executing while we wait 10 seconds at the end of the run for everything to save.
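The slow filesystem can be imagined as a dataset that sleeps before delegating to real I/O. The following is a hypothetical sketch of such a dataset (the actual delay mechanism in hookshot may differ), with the delays exposed as constructor arguments so they could be set from catalog.yml:

```python
# Hypothetical sketch of a dataset that simulates slow storage.
# AbstractDataSet was renamed AbstractDataset in Kedro 0.19.
import pickle
import time

from kedro.io import AbstractDataSet


class SlowPickleDataSet(AbstractDataSet):
    def __init__(self, filepath, load_delay=10, save_delay=10):
        self._filepath = filepath
        self._load_delay = load_delay
        self._save_delay = save_delay

    def _load(self):
        time.sleep(self._load_delay)  # simulate a slow read
        with open(self._filepath, "rb") as f:
            return pickle.load(f)

    def _save(self, data):
        time.sleep(self._save_delay)  # simulate a slow write
        with open(self._filepath, "wb") as f:
            pickle.dump(data, f)

    def _describe(self):
        return {
            "filepath": self._filepath,
            "load_delay": self._load_delay,
            "save_delay": self._save_delay,
        }
```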

Here are executions under each strategy:

| Strategy | Total time | Log |
| --- | --- | --- |
| Baseline (i.e. no caching/plugins) | 2 minutes | Log |
| TeePlugin | 10 seconds (saving all outputs) | Log |
| CachePlugin (i.e. CachedDataSet) with is_async=True | 30 seconds (saving split_data, train_model, and predict node outputs) | Log |

(Note that the times include the initial minute of delays before the pipeline begins, because the way I added the delays somehow triggers them on initialization.)

The code to run these examples is in https://github.com/deepyaman/hookshot/. You can also change the load/save delays in conf/base/catalog.yml to simulate different latencies. Next steps:

  • Visualize timings
  • Parametrize node times
  • Properly package plugin/hooks (@tsanikgr you might be interested in CachePlugin as an alternative way to implement what you proposed)
  • Suggest/contrib --hooks and --async CLI options?
2 reactions
deepyaman commented, Jul 2, 2020

> I see. If I understand correctly, you might want to try a combination of CachedDataSet and asynchronous saving (a new feature of Kedro 0.16.0), explained at:
>
> https://kedro.readthedocs.io/en/stable/04_user_guide/06_pipelines.html#asynchronous-loading-and-saving

Yes, to some extent. My implementation (https://github.com/deepyaman/hookshot/blob/develop/src/hookshot/hooks.py) is based on the code that handles the async functionality, but extended across the pipeline rather than on a per-node basis (hence the “unrolled” ThreadPoolExecutor instead of a nice little context manager).

The sense I’m getting is that there are existing mechanisms in the direction of what I want, but they don’t push it far enough. I will try to find some time to benchmark these different approaches under parametrizable conditions (READ_LATENCY, WRITE_LATENCY, READ_TIME, WRITE_TIME, etc.), in addition to creating this as a plugin.
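For reference, the combination quoted above might be wired up along these lines. This is a hedged sketch using 0.16-era import paths; the dataset names, file paths, and node functions are made up for illustration:

```python
# Sketch of CachedDataSet + asynchronous saving; names/paths are illustrative.
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import CachedDataSet, DataCatalog
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner

catalog = DataCatalog(
    {
        "raw": CSVDataSet(filepath="data/01_raw/example.csv"),
        # CachedDataSet keeps the just-saved object in memory, so the
        # downstream node does not re-read the CSV from disk.
        "intermediate": CachedDataSet(
            CSVDataSet(filepath="data/02_intermediate/example.csv")
        ),
        "model_input": CSVDataSet(filepath="data/03_primary/example.csv"),
    }
)

pipeline = Pipeline(
    [
        node(lambda df: df.dropna(), "raw", "intermediate", name="clean"),
        node(lambda df: df.head(100), "intermediate", "model_input", name="sample"),
    ]
)

# is_async=True (Kedro 0.16.0+) loads and saves each node's datasets in
# worker threads, overlapping I/O within a single node's execution.
SequentialRunner(is_async=True).run(pipeline, catalog)
```

As the benchmark numbers above suggest, this helps, but it still blocks per node, whereas the tee approach defers all blocking to the end of the run.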
