Speed up CI test suite duration

Background:

The CI test suite has more than doubled in duration in the past couple of months; this issue is a follow-up to #10636.

Running pytest-profiling resulted in the following insights:

Contrary to expectations, the fixtures in tests/conftest.py (most of which have session scope) don’t actually seem to take that long. Some of the longest I could find while running tests/core include:

_train returned by trained_async fixture (233s cumulative time)
e2e_bot_agent (305s cumulative time)

The cumulative time for some internal rasa packages when running tests/core:

rasa/core/policies/ted_policy.py::load -> 1520s
rasa/core/policies/ted_policy.py::batch_loss -> 1400s
rasa/utils/tensorflow/models::train_step -> 1844s
rasa/utils/tensorflow/temp_keras_modules::fit -> 2360s

The pstats for external dependencies (such as dask, tensorflow, keras) when running tests/core:

cumtime	percall	filename:lineno(function)
741.354	0.244	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/core.py:84(_execute_task)
741.362	0.244	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:210(execute_task)
741.365	0.244	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:234(<listcomp>)
741.367	0.244	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:230(batch_execute_tasks)
741.495	0.244	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:448(fire_tasks)
741.722	1.418	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:346(get_async)
741.42	0.244	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:535(submit)
741.759	1.418	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:547(get_sync)
673.971	0	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/framework/ops.py:3502(_create_op_internal)
1138.053	0.001	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py:320(_apply_op_helper)
1858.289	4.167	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py:649(wrapped_fn)
1016.765	4.44	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py:726(_initialize)
2337.399	2.708	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py:858(call)
2337.335	2.708	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py:904(_call)
1948.468	2.788	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py:815(func_graph_from_py_func)
1861.862	2.664	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py:684(wrapper)
1846.304	4.254	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/keras/engine/training.py:834(run_step)
1847.578	4.257	/Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/keras/engine/training.py:831(step_function)
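
For anyone who wants to reproduce a table like the one above, here is a minimal sketch using only the standard library `pstats` module. The function name `slowest_by_cumtime` is made up for illustration, and the profile path is an assumption: pytest-profiling is expected to write a combined profile under `prof/`, but check where your run actually puts it.

```python
import io
import pstats


def slowest_by_cumtime(profile_path: str, limit: int = 20) -> str:
    """Return a text report of the `limit` entries with the highest cumulative time.

    `profile_path` is assumed to point at a profile dump in cProfile format
    (for example the file produced by a pytest-profiling run).
    """
    buffer = io.StringIO()
    stats = pstats.Stats(profile_path, stream=buffer)
    # Sort by cumulative time, the same column shown as `cumtime` in the table.
    stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(limit)
    return buffer.getvalue()
```

Calling `print(slowest_by_cumtime("prof/combined.prof"))` would then print the top entries, assuming that path exists after the profiled test run.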

Additionally, test_other_unit_tests includes many subdirectories from tests/core that could be split into more workflows, e.g. core/actions, core/channels, core/evaluation, core/nlg, core/training.

Other Notes:

@HotThoughts is working on a DataDog CI dashboard that will show duration at test level, filterable by the branch you’re working on. It will be ready soon, along with a Notion page on how to use it.
Dependency mirroring: a solution for using HuggingFace (LanguageModelFeaturizer) files is discussed in #10528.
Using a dependency proxy such as Nexus can be explored for transient issues; this requires infra help.
Implementing pytest profiling in DataDog for better examination of performance requires more investigation from Infra; @HotThoughts will create an issue in Jira to have this on their radar.
investigate the usage of rasa/utils/tensorflow/temp_keras_modules::fit method in the test suite and propose a solution to mitigate such long cumulative time?
profile individual tests in test_policies to determine which function calls take longest in running this workflow.
propose solution to mitigate the duration of tensorflow function calls?
investigate whether code coupling in model training should be fixed to improve training times?
determine whether it’s possible / desirable to split test_other_unit_tests into more workflows, keeping in mind the GH concurrency limitation?
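
On the question of splitting test_other_unit_tests: one way to reason about it is to balance test directories across a fixed number of workflows by measured duration. The sketch below uses a simple greedy heuristic (longest first, always into the currently lightest workflow). The function name and the idea of feeding it per-directory durations are assumptions for illustration, not something the CI currently does, and the number of workflows would still be bounded by the GH concurrency limitation.

```python
import heapq
from typing import Dict, List


def split_into_workflows(durations: Dict[str, float], n_workflows: int) -> List[List[str]]:
    """Greedily assign test paths to workflows so total durations stay balanced.

    `durations` maps a test directory (or file) to its measured run time in seconds.
    """
    if n_workflows < 1:
        raise ValueError("n_workflows must be at least 1")

    # Min-heap of (accumulated seconds, workflow index).
    heap = [(0.0, index) for index in range(n_workflows)]
    groups: List[List[str]] = [[] for _ in range(n_workflows)]

    # Place the longest-running paths first; each goes to the lightest workflow so far.
    for path, seconds in sorted(durations.items(), key=lambda item: item[1], reverse=True):
        total, index = heapq.heappop(heap)
        groups[index].append(path)
        heapq.heappush(heap, (total + seconds, index))
    return groups
```

The durations would have to come from real measurements (for example the per-test timings from the dashboard mentioned above); the heuristic does not guarantee an optimal split, only a reasonably even one.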

Definition of Done:

investigate the longest running unit tests and figure out if we can make them train models quicker (e.g. decrease epochs) or avoid training altogether
exercise good judgement: implement improvement or create issues
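
As an illustration of the "decrease epochs" idea, here is a minimal sketch of a helper that caps the `epochs` setting in a model configuration before a test trains on it. It assumes the config has already been loaded into a dict with `pipeline` and `policies` lists, as in a Rasa config.yml; the helper name `cap_epochs` is hypothetical and not part of the rasa codebase.

```python
import copy
from typing import Any, Dict


def cap_epochs(config: Dict[str, Any], max_epochs: int = 1) -> Dict[str, Any]:
    """Return a copy of `config` where every explicit `epochs` value is at most `max_epochs`."""
    capped = copy.deepcopy(config)  # leave the caller's config untouched
    for section in ("pipeline", "policies"):
        # A section may be missing or set to None in the config.
        for component in capped.get(section) or []:
            if isinstance(component, dict) and "epochs" in component:
                component["epochs"] = min(component["epochs"], max_epochs)
    return capped
```

Note that this only touches components that set `epochs` explicitly; a component relying on its default number of epochs would be left as-is, so tests using such configs would need the value added explicitly.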