
Slow startup because of catalog processing

See original GitHub issue

Description

When starting a Kedro pipeline with a big catalog, it can take multiple minutes before the pipeline actually starts. This time is lost parsing the catalog files: for a total catalog size of 13000 entries, the code ends up calling the function _sub_nonword_chars roughly 100 million times.

Here is how it happens:

  1. When starting, Kedro calls add_feed_dict() in _get_catalog() (here). This happens 1 time.
  2. add_feed_dict() calls add() for each dataset (here). This creates a FrozenDataset with all the existing keys plus the newly added key (here). In our case this happens 13000 times: first with 1 key, then 2 keys, then 3 keys, … up to 13000 keys.
  3. The issue is that every time add() is called and a FrozenDataset is created, all the keys are re-processed (here), even the keys that were already part of the previous FrozenDataset. Since we are adding 13000 datasets, this amounts to roughly 100 million calls to _sub_nonword_chars (see the sketch below).
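
To make the pattern concrete, here is a minimal, hypothetical sketch of the behaviour described above. The names mirror Kedro's (FrozenDatasets, _sub_nonword_chars, add, add_feed_dict), but this is simplified illustration code, not Kedro's actual implementation:

```python
import re

_NONWORD = re.compile(r"\W+")

def _sub_nonword_chars(name: str) -> str:
    # Stand-in for Kedro's helper: make a dataset name attribute-safe,
    # e.g. "raw.cars@csv" -> "raw__cars__csv".
    return _NONWORD.sub("__", name)

class FrozenDatasetsSketch:
    """Simplified model of the frozen view that is rebuilt on every add()."""

    def __init__(self, datasets: dict):
        # Every construction re-runs the regex over *all* keys,
        # including keys that were already processed previously.
        self.datasets = {_sub_nonword_chars(k): v for k, v in datasets.items()}

class CatalogSketch:
    """Simplified model of the add_feed_dict() -> add() call chain."""

    def __init__(self):
        self._datasets: dict = {}
        self.frozen = FrozenDatasetsSketch(self._datasets)

    def add(self, name: str, dataset: object) -> None:
        self._datasets[name] = dataset
        # A brand-new frozen view is built on every single add(), so after
        # N adds the regex has run 1 + 2 + ... + N = N*(N+1)/2 times.
        self.frozen = FrozenDatasetsSketch(self._datasets)

    def add_feed_dict(self, feed: dict) -> None:
        for name, dataset in feed.items():
            self.add(name, dataset)
```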

Steps to Reproduce

  1. kedro run

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used: 0.17.5
  • Python version used: 3.7.11
  • Operating system and version: Any

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 4
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

2 reactions
Rodolphe-cambier commented, Oct 12, 2021

I do think there is also an underlying issue: the fact that we re-process the same keys over and over again. Maybe I failed to express this in the opening post.

To make it even more explicit:

  • If you have a catalog with 4 keys: {key1: val1, key2: val2, key3: val3, key4: val4}
  • Then the code will create 4 FrozenDatasets.
    • first FrozenDataset({key1: val1})
    • then FrozenDataset({key1: val1, key2: val2})
    • then FrozenDataset({key1: val1, key2: val2, key3: val3})
    • then FrozenDataset({key1: val1, key2: val2, key3: val3, key4: val4})

And every time we make such a FrozenDataset, we process all the keys with the regex. So here we would process 10 keys in total (1 + 2 + 3 + 4). Notice how this is quadratic in the number of keys in the catalog. This is why our 13k catalog entries lead to millions of calls.

I don’t think the regex itself is the problem. If each key were processed only once, we would only need to process 13k strings instead of 200 million (see the quick check below).
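
A quick back-of-the-envelope check of that claim, using the 1 + 2 + … + N formula rather than a real Kedro run:

```python
def regex_calls(n_entries: int) -> int:
    # One full re-processing of all existing keys per add():
    # 1 + 2 + ... + n = n * (n + 1) / 2
    return n_entries * (n_entries + 1) // 2

print(regex_calls(4))       # 10 -> the 4-key example above
print(regex_calls(13_000))  # 84_506_500 -> same order of magnitude as the profile below
```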

2 reactions
Rodolphe-cambier commented, Oct 12, 2021

Regarding how to profile the issue: running a profiler on a simple kedro run --pipeline pipeline_name highlighted the following functions as taking most of the time (one possible recipe is sketched below).
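
The comment does not say which profiler produced the tables below; as a hedged example, a comparable breakdown can be obtained with Python's built-in cProfile. The entry-point import is an assumption based on Kedro 0.17.x, where the kedro console script points at kedro.framework.cli:main:

```python
import cProfile
import pstats
import sys

from kedro.framework.cli import main  # assumed CLI entry point (Kedro 0.17.x)

# Pretend we ran `kedro run --pipeline pipeline_name` on the command line;
# run this script from inside the Kedro project directory.
sys.argv = ["kedro", "run", "--pipeline", "pipeline_name"]

cProfile.run("main()", "kedro_run.prof")

# Sort by own time; note that pstats reports tottime in seconds,
# whereas the tables below are in milliseconds.
pstats.Stats("kedro_run.prof").sort_stats("tottime").print_stats(15)
```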

Sorted by own_time descending.

function_name                                      call_count   time (ms)   own_time (ms)
<method 'sub' of 're.Pattern' objects>              106849876      302915          302898
_compile                                            106829650       55367           39719
get_data                                                 6806       38226           38029
sub                                                 106804688      391190           33344
<dictcomp>                                              12992      455294           32677
_sub_nonword_chars                                  106800726      422711           31564
<built-in method builtins.isinstance>               110328963       15883           15810
<built-in method time.sleep>                              111       11336           11336
<method 'recv_into' of '_socket.socket' objects>         1191        9296            9296
<built-in method _imp.create_dynamic>                     361        7686            7671
<method 'update' of 'dict' objects>                     69905        4050            4050
<built-in method io.open>                                2408        4114            3957
add                                                     12990      462070            2695

As you can see, _sub_nonword_chars gets called 106849876 times. @limdauto I did patch it on my local setup; I observed a substantial speedup, and the profiling then looked like this:

function_name                                      call_count   time (ms)   own_time (ms)
<dictcomp>                                              12992       15782           15659
<method 'recv_into' of '_socket.socket' objects>         1192        7851            7851
get_data                                                 6805        4198            4078
<method 'update' of 'dict' objects>                     69906        2481            2481
<built-in method _imp.create_dynamic>                     361        1997            1994
<built-in method time.sleep>                               15        1542            1542
<built-in method io.open>                                2408        1572            1427
add                                                     12990       19459            1030
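
The local patch itself is not shown in the thread. As a hypothetical sketch of the general idea, the frozen view can copy keys that were already normalised and re-run the regex only on the new ones (the _sub_nonword_chars stand-in is repeated here so the snippet is self-contained):

```python
import re

_NONWORD = re.compile(r"\W+")

def _sub_nonword_chars(name: str) -> str:
    # Same illustrative stand-in as in the sketch in the description.
    return _NONWORD.sub("__", name)

class IncrementalFrozenDatasetsSketch:
    """Hypothetical fix: copy already-normalised keys, regex only the new ones."""

    def __init__(self, *collections):
        self.datasets = {}
        for collection in collections:
            if isinstance(collection, IncrementalFrozenDatasetsSketch):
                # These keys were normalised once already; just copy them over.
                self.datasets.update(collection.datasets)
            else:
                # Only genuinely new keys pay the regex cost.
                self.datasets.update(
                    {_sub_nonword_chars(k): v for k, v in collection.items()}
                )

class CatalogSketch:
    """add() now passes the previous frozen view plus only the new entry."""

    def __init__(self):
        self.frozen = IncrementalFrozenDatasetsSketch({})

    def add(self, name: str, dataset: object) -> None:
        # Total regex work across N adds becomes N calls instead of N*(N+1)/2.
        self.frozen = IncrementalFrozenDatasetsSketch(self.frozen, {name: dataset})
```

That shape is consistent with the patched profile above, where _sub_nonword_chars no longer shows up among the top entries.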
