Speed up CI test suite duration
Background
The CI test suite has more than doubled in duration in the past couple of months; this issue is a follow-up to #10636.
Running pytest-profiling resulted in the following insights:
- contrary to expectations, the fixtures in `tests/conftest.py` (which mainly have session scope) don’t actually seem to take that long; some of the longest I could find while running `tests/core` include:
  - `_train` returned by the `trained_async` fixture (233s cumulative time)
  - `e2e_bot_agent` (305s cumulative time)
- the cumulative time for some internal rasa packages when running `tests/core`:
  - `rasa/core/policies/ted_policy.py::load` -> 1520s
  - `rasa/core/policies/ted_policy.py::batch_loss` -> 1400s
  - `rasa/utils/tensorflow/models::train_step` -> 1844s
  - `rasa/utils/tensorflow/temp_keras_modules::fit` -> 2360s
- the `pstats` for external dependencies (such as dask, tensorflow, keras) when running `tests/core`:

cumtime | percall | filename:lineno(function) |
---|---|---|
741.354 | 0.244 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/core.py:84(_execute_task) |
741.362 | 0.244 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:210(execute_task) |
741.365 | 0.244 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:234(<listcomp>) |
741.367 | 0.244 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:230(batch_execute_tasks) |
741.495 | 0.244 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:448(fire_tasks) |
741.722 | 1.418 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:346(get_async) |
741.42 | 0.244 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:535(submit) |
741.759 | 1.418 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:547(get_sync) |
673.971 | 0 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/framework/ops.py:3502(_create_op_internal) |
1138.053 | 0.001 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py:320(_apply_op_helper) |
1858.289 | 4.167 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py:649(wrapped_fn) |
1016.765 | 4.44 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py:726(_initialize) |
2337.399 | 2.708 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py:858(call) |
2337.335 | 2.708 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py:904(_call) |
1948.468 | 2.788 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py:815(func_graph_from_py_func) |
1861.862 | 2.664 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py:684(wrapper) |
1846.304 | 4.254 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/keras/engine/training.py:834(run_step) |
1847.578 | 4.257 | /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/keras/engine/training.py:831(step_function) |
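
A table like the one above can be pulled out of a saved profile with the standard-library `pstats` module. The following is a minimal sketch, not the exact procedure used for this issue: the helper name `top_cumulative` is hypothetical, and the path `prof/combined.prof` in the usage comment is an assumption about where pytest-profiling wrote its output.

```python
import pstats


def top_cumulative(prof_path, pattern, limit=10):
    """Return (cumtime, percall, location) rows for functions whose
    file path contains `pattern`, sorted by cumulative time (descending)."""
    stats = pstats.Stats(prof_path)
    rows = []
    # stats.stats maps (filename, lineno, funcname) ->
    # (primitive_calls, total_calls, tottime, cumtime, callers)
    for (filename, lineno, funcname), (cc, nc, tt, ct, callers) in stats.stats.items():
        if pattern in filename:
            # Same definition pstats uses for the cumtime "percall" column.
            percall = ct / cc if cc else 0.0
            rows.append((ct, percall, f"{filename}:{lineno}({funcname})"))
    rows.sort(reverse=True)
    return rows[:limit]


# Hypothetical usage (path is an assumption, adjust to your run):
# for cumtime, percall, location in top_cumulative("prof/combined.prof", "tensorflow"):
#     print(f"{cumtime:.3f} | {percall:.3f} | {location}")
```

Filtering on a substring of the file path (for example `dask`, `tensorflow`, or `keras`) keeps the output focused on one dependency at a time.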
Additionally, `test_other_unit_tests` includes many subdirectories from `tests/core` that could be split into more workflows, e.g. core/actions, core/channels, core/evaluation, core/nlg, core/training.
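
One way to reason about such a split is to balance the subdirectories across a fixed number of workflows by measured duration. The sketch below is only an illustration of a greedy longest-first partition; the function name `split_into_workflows` is hypothetical, and the durations in the usage comment are placeholders, not measurements from this issue.

```python
import heapq


def split_into_workflows(durations, max_workflows):
    """Greedily assign test directories to at most `max_workflows` groups,
    always placing the next-longest directory into the lightest group."""
    # Min-heap of (total_seconds, group_index): the lightest group is on top.
    heap = [(0.0, i) for i in range(max_workflows)]
    groups = [[] for _ in range(max_workflows)]
    ordered = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    for path, seconds in ordered:
        total, idx = heapq.heappop(heap)
        groups[idx].append(path)
        heapq.heappush(heap, (total + seconds, idx))
    # Drop groups that stayed empty (fewer directories than workflows).
    return [group for group in groups if group]


# Hypothetical usage with placeholder durations in seconds:
# split_into_workflows(
#     {"tests/core/actions": 300, "tests/core/channels": 200, "tests/core/nlg": 50},
#     max_workflows=2,
# )
```

Capping `max_workflows` is how the GH concurrency limitation would enter this kind of calculation: more groups shorten each job but occupy more runners at once.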
Other Notes:
- @HotThoughts is working on a DataDog CI dashboard that will show duration at test level, filterable by the branch you’re working on; this will be ready soon, with a Notion page on how to use it.
- Dependency mirroring: solution for using HuggingFace (LanguageModelFeaturizer) files discussed in #10528
- Usage of a dependency proxy such as `nexus` can be explored for transient issues; requires infra help
- Implementing pytest profiling in DataDog for better examination of performance requires more investigation from Infra; @HotThoughts will create an issue in Jira to have this on their radar
- investigate the usage of the `rasa/utils/tensorflow/temp_keras_modules::fit` method in the test suite and propose a solution to mitigate such long cumulative time?
- profile individual tests in `test_policies` to determine which function calls take longest in running this workflow.
- propose a solution to mitigate the duration of `tensorflow` function calls?
- investigate whether code coupling in model training should be fixed to improve training times?
- determine whether it’s possible / desirable to split `test_other_unit_tests` into more workflows, keeping in mind the GH concurrency limitation?
Definition of Done:
- investigate the longest running unit tests and figure out if we can make them train models quicker (e.g. decrease epochs) or avoid training altogether
- exercise good judgement: implement improvement or create issues
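
As an illustration of the "decrease epochs" idea, the sketch below lowers the `epochs` setting on every component of a config dictionary before a test trains with it. The helper `with_reduced_epochs` is hypothetical and not part of Rasa; it assumes the config has already been loaded into a dict with `pipeline` and `policies` lists whose entries may carry an `epochs` key.

```python
import copy


def with_reduced_epochs(config, epochs=1):
    """Return a copy of a config dict in which every component's
    `epochs` value is capped at `epochs`; the input is left untouched."""
    reduced = copy.deepcopy(config)
    for section in ("pipeline", "policies"):
        # A section may be missing or explicitly set to None.
        for component in reduced.get(section) or []:
            if "epochs" in component:
                component["epochs"] = min(component["epochs"], epochs)
    return reduced


# Hypothetical usage inside a test:
# fast_config = with_reduced_epochs(
#     {"policies": [{"name": "TEDPolicy", "epochs": 100}]}
# )
```

Whether a single epoch is enough depends on what each test asserts; tests that check prediction quality rather than plumbing may still need a real training run.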
Issue Analytics
- State:
- Created 2 years ago
- Comments:6 (1 by maintainers)
Notion page is ready.
We haven’t migrated to JIRA. Created an issue for now: https://github.com/RasaHQ/rasa/issues/10794 🙃
Maxime Verger commented: also, I can’t find the link to the GH issue 🤔