
Run pipeline without reading from intermediate datasets


Description

I’m always frustrated when I/O dominates compute.

For example, my pipeline takes 10 minutes to run, of which 9 minutes are spent writing to and reading back from S3.

Context

At QuantumBlack, it’s most common to write intermediate datasets to disk. In fact, the Kedro data catalog very much facilitates this workflow. This also presents numerous advantages:

  1. The ability to resume execution from any stage of the pipeline where all inputs were persisted.
  2. Ease of debugging intermediate steps of the pipeline.
  3. Transcoding, a unique behavior wherein the data changes during the write-read process.

However, it’s also extremely inefficient, especially when writing large datasets using slow mechanisms. On top of that, we most often expect reloaded data to be exactly equal to what was saved, except in the case of transcoding and some terminal output formats (e.g. Excel, CSV).

Possible Implementation

https://github.com/deepyaman/hookshot/blob/develop/src/hookshot/hooks.py

Feel free to clone the repo and run the example. 😃

At a high level, the plugin aims to provide Unix tee-like behavior to runners.
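To make the tee idea concrete, here is a minimal sketch of what such a hook might look like. It is illustrative, not the hookshot code itself: it assumes the `MemoryDataSet` name used by 0.16-era Kedro (renamed `MemoryDataset` in 0.19) and reaches into the catalog through a private accessor whose name has varied between releases.

```python
# Illustrative sketch only; not the hookshot implementation.
from concurrent.futures import ThreadPoolExecutor

from kedro.framework.hooks import hook_impl
from kedro.io import MemoryDataSet


class TeeHooks:
    """Persist node outputs in the background; serve downstream loads from memory."""

    def __init__(self):
        self._executor = ThreadPoolExecutor()

    @hook_impl
    def after_node_run(self, node, catalog, outputs):
        for name, data in outputs.items():
            if name in catalog.list():
                # Grab the persistent dataset and save to it asynchronously.
                # NOTE: _get_dataset is private and is spelled _get_data_set
                # in some older Kedro releases.
                persistent = catalog._get_dataset(name)
                self._executor.submit(persistent.save, data)
            # Shadow the catalog entry so downstream nodes load the object
            # from memory instead of waiting on the write-read round trip.
            catalog.add(name, MemoryDataSet(data), replace=True)

    @hook_impl
    def after_pipeline_run(self, run_params):
        # Block once, at the very end of the run, for outstanding saves.
        self._executor.shutdown(wait=True)
```

Registered like any other hook, this keeps benefits 1 and 2 above: everything is still persisted, just off the critical path.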

Goals of this implementation:

  1. Retain the benefits of saving to intermediate datasets (1 and 2 above).
  2. Be transparent to the user. Nobody wants to modify all their nodes to have double the outputs, nor do they want a crazy-looking Kedro-Viz.

Limitations:

  1. Doesn’t support transcoding. I think this is reasonable, as you need to write to/read from disk if your pipeline depends on transcoding. The user should likely be notified/prevented from using this if they’re transcoding. Alternatively, you could special-case those nodes and block on write-read for them.
  2. Doesn’t detect the default dataset, so it doesn’t use SharedMemoryDataSet for ParallelRunner. I would be happy to get some input from the experts here. 😃

I’m most interested in understanding the best way to contribute this. I think it makes sense as part of a new kedro.extras.hooks subpackage. As part of Kedro, this functionality would continue to be supported through backend redesigns.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 5
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

5 reactions
deepyaman commented, Jul 14, 2020

Sorry for the delay! I’ve put together something in my spare time, not feature complete but figured I’d share.

Let’s assume a slow filesystem with a load and save delay of 10 seconds for intermediate datasets. I haven’t yet added delays in the nodes themselves (to simulate nontrivial data processing); one example of where that would make an even better case for TeePlugin is that the last node would be executing while we wait 10 seconds at the end of the run for everything to save.
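The slow filesystem can be imagined as a dataset that sleeps before delegating to real I/O. The following is a hypothetical sketch of such a dataset (the actual delay mechanism in hookshot may differ), with the delays exposed as constructor arguments so they could be set from catalog.yml:

```python
# Hypothetical sketch of a dataset that simulates slow storage.
# AbstractDataSet was renamed AbstractDataset in Kedro 0.19.
import pickle
import time

from kedro.io import AbstractDataSet


class SlowPickleDataSet(AbstractDataSet):
    def __init__(self, filepath, load_delay=10, save_delay=10):
        self._filepath = filepath
        self._load_delay = load_delay
        self._save_delay = save_delay

    def _load(self):
        time.sleep(self._load_delay)  # simulate a slow read
        with open(self._filepath, "rb") as f:
            return pickle.load(f)

    def _save(self, data):
        time.sleep(self._save_delay)  # simulate a slow write
        with open(self._filepath, "wb") as f:
            pickle.dump(data, f)

    def _describe(self):
        return {
            "filepath": self._filepath,
            "load_delay": self._load_delay,
            "save_delay": self._save_delay,
        }
```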

Here are executions under each strategy:

| Strategy | Total time | Log |
| --- | --- | --- |
| Baseline (i.e. no caching/plugins) | 2 minutes | Log |
| TeePlugin | 10 seconds (saving all outputs) | Log |
| CachePlugin (i.e. CachedDataSet) with is_async=True | 30 seconds (saving split_data, train_model, and predict node outputs) | Log |

(Note that the times include the initial minute of delays before the pipeline begins, because the way I added the delays somehow triggers them on initialization.)

The code to run these examples is in https://github.com/deepyaman/hookshot/. You can also change the load/save delays in conf/base/catalog.yml to simulate different latencies. Next steps:

  • Visualize timings
  • Parametrize node times
  • Properly package plugin/hooks (@tsanikgr you might be interested in CachePlugin as an alternative way to implement what you proposed)
  • Suggest/contrib --hooks and --async CLI options?
2 reactions
deepyaman commented, Jul 2, 2020

> I see. If I understand correctly, you might want to try a combination of CachedDataSet and asynchronous saving (a new feature of Kedro 0.16.0), explained at:
>
> https://kedro.readthedocs.io/en/stable/04_user_guide/06_pipelines.html#asynchronous-loading-and-saving

Yes, to some extent. My implementation (https://github.com/deepyaman/hookshot/blob/develop/src/hookshot/hooks.py) is based on the code that handles the async functionality, but extended across the pipeline rather than on a per-node basis (hence the “unrolled” ThreadPoolExecutor instead of a nice little context manager).

The sense I’m getting is that there are existing mechanisms in the direction of what I want, but they don’t push it far enough. I will try to find some time to benchmark these different approaches under parametrizable conditions (READ_LATENCY, WRITE_LATENCY, READ_TIME, WRITE_TIME, etc.), in addition to creating this as a plugin.
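For reference, the combination quoted above might be wired up along these lines. This is a hedged sketch using 0.16-era import paths; the dataset names, file paths, and node functions are made up for illustration:

```python
# Sketch of CachedDataSet + asynchronous saving; names/paths are illustrative.
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import CachedDataSet, DataCatalog
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner

catalog = DataCatalog(
    {
        "raw": CSVDataSet(filepath="data/01_raw/example.csv"),
        # CachedDataSet keeps the just-saved object in memory, so the
        # downstream node does not re-read the CSV from disk.
        "intermediate": CachedDataSet(
            CSVDataSet(filepath="data/02_intermediate/example.csv")
        ),
        "model_input": CSVDataSet(filepath="data/03_primary/example.csv"),
    }
)

pipeline = Pipeline(
    [
        node(lambda df: df.dropna(), "raw", "intermediate", name="clean"),
        node(lambda df: df.head(100), "intermediate", "model_input", name="sample"),
    ]
)

# is_async=True (Kedro 0.16.0+) loads and saves each node's datasets in
# worker threads, overlapping I/O within a single node's execution.
SequentialRunner(is_async=True).run(pipeline, catalog)
```

As the benchmark numbers above suggest, this helps, but it still blocks per node, whereas the tee approach defers all blocking to the end of the run.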
