Slow startup because of catalog processing
Description
When starting a Kedro pipeline with a big catalog, it can take multiple minutes before the pipeline actually starts. This time is lost parsing the catalog files: for a total catalog size of 13000 entries, the code will call the function `_sub_nonword_chars` around 100 million times.
Here is how it happens:
- When starting, Kedro calls `add_feed_dict()` in `_get_catalog()`. This happens 1 time.
- `add_feed_dict()` calls `add()` for each dataset. This creates a `FrozenDataset` with all the existing keys plus the newly added key. In our case this happens 13000 times: first with 1 key, then 2 keys, then 3 keys, … up to 13000 keys.
- The issue is that every time you call `add()` and create a `FrozenDataset`, all the keys are re-processed, even the keys that were already added to the previous `FrozenDataset`. Since we are adding 13000 datasets, this amounts to 100 million calls to `_sub_nonword_chars` (see the sketch after this list).
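To illustrate the pattern, here is a minimal, self-contained sketch of the behaviour described above. It is not Kedro's actual code; the class and function names only mirror the ones mentioned in this issue, and the regex body is a stand-in.

```python
import re

CALL_COUNT = 0

def sub_nonword_chars(name: str) -> str:
    """Toy stand-in for Kedro's _sub_nonword_chars regex sanitisation."""
    global CALL_COUNT
    CALL_COUNT += 1
    return re.sub(r"\W+", "__", name)


class FrozenDatasets:
    """Toy model: constructing it re-sanitises *all* keys it is given."""

    def __init__(self, datasets: dict):
        self.datasets = {sub_nonword_chars(k): v for k, v in datasets.items()}


class Catalog:
    """Toy model of the add_feed_dict()/add() pattern described above."""

    def __init__(self):
        self._datasets = {}
        self.datasets = FrozenDatasets(self._datasets)

    def add(self, name, dataset):
        self._datasets[name] = dataset
        # Rebuilding from scratch re-processes every previously added key.
        self.datasets = FrozenDatasets(self._datasets)

    def add_feed_dict(self, feed_dict):
        for name, dataset in feed_dict.items():
            self.add(name, dataset)


catalog = Catalog()
catalog.add_feed_dict({f"dataset_{i}": object() for i in range(1000)})
print(CALL_COUNT)  # 500500 == 1000 * 1001 / 2: quadratic in the number of entries
```

The i-th `add()` re-sanitises all i keys added so far, which is why the total call count grows as n(n+1)/2 rather than n.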
Steps to Reproduce
- `kedro run` with a large catalog (around 13000 entries)
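One quick way to build such a catalog for reproduction (a hypothetical helper, not part of the original report; the dataset type and file paths are placeholders):

```python
# Generate a conf/base/catalog.yml with ~13000 entries inside a Kedro project.
N_ENTRIES = 13_000

with open("conf/base/catalog.yml", "w") as f:
    for i in range(N_ENTRIES):
        f.write(
            f"dataset_{i}:\n"
            f"  type: pandas.CSVDataSet\n"
            f"  filepath: data/01_raw/file_{i}.csv\n"
        )
```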
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used: 0.17.5
- Python version used: 3.7.11
- Operating system and version: Any
I do think there is also an underlying issue: the fact that we re-process the same keys over and over again. Maybe I failed to express this in the opening post.
To make it even more explicit, adding this dictionary one entry at a time
{key1: val1, key2: val2, key3: val3, key4: val4}
creates, in turn:
FrozenDataset({key1: val1})
FrozenDataset({key1: val1, key2: val2})
FrozenDataset({key1: val1, key2: val2, key3: val3})
FrozenDataset({key1: val1, key2: val2, key3: val3, key4: val4})
And every time we make such a FrozenDataset, we process all the keys with the regex. So here we would process 1 + 2 + 3 + 4 = 10 keys. Notice how this is quadratic in the number of keys in the catalog. This is why our 13k catalog entries lead to millions of calls.
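As a rough sanity check (an approximation that assumes exactly one regex call per key per rebuild):

```python
n = 13_000
print(n * (n + 1) // 2)  # 84506500 -> on the order of 1e8 regex calls
```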
I don’t think the regex itself is the problem. If each key were processed only once, we would only need to process 13k strings instead of 200 million.
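One possible mitigation, shown below as a minimal sketch, is to memoise the sanitisation so each distinct key goes through the regex only once. This is not necessarily the patch referred to later in this thread, and the regex body is an assumption standing in for Kedro's actual `_sub_nonword_chars`:

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def _sub_nonword_chars(dataset_name: str) -> str:
    """Replace non-word characters so the name is usable as an attribute.

    The regex body is an assumption; the point is that lru_cache makes
    repeated calls with an already-seen key essentially free.
    """
    return re.sub(r"\W+", "__", dataset_name)
```

Even with caching, `add()` would still rebuild the FrozenDataset on every call, so a more complete fix would also reuse the already-processed keys instead of rebuilding from scratch; the cache only removes the repeated regex cost.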
Regarding how to profile the issue: running a profiler on a simple `kedro run --pipeline pipeline_name` highlighted the following functions as taking most of the time, sorted by own_time descending. As you can see, `_sub_nonword_chars` gets called 106849876 times. @limdauto I did patch it on my local setup; I observed a substantial speedup, and the profiling then looked like this: