Speed up CI test suite duration


Background: The CI test suite has more than doubled in duration in the past couple of months. This issue is a follow-up to #10636.

Running pytest-profiling resulted in the following insights:

  • contrary to expectations, the fixtures in tests/conftest.py (which mainly have session scope) don’t seem to take that long. Some of the longest I could find while running tests/core include:
  1. _train returned by trained_async fixture (233s cumulative time)
  2. e2e_bot_agent (305s cumulative time)
  • the cumulative time for some internal rasa packages when running tests/core:
  1. rasa/core/policies/ted_policy.py::load -> 1520s
  2. rasa/core/policies/ted_policy.py::batch_loss -> 1400s
  3. rasa/utils/tensorflow/models::train_step -> 1844s
  4. rasa/utils/tensorflow/temp_keras_modules::fit -> 2360s
  • the pstats for external dependencies (such as dask, tensorflow, keras) when running tests/core:
cumtime percall filename:lineno(function)
741.354 0.244 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/core.py:84(_execute_task)
741.362 0.244 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:210(execute_task)
741.365 0.244 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:234(<listcomp>)
741.367 0.244 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:230(batch_execute_tasks)
741.495 0.244 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:448(fire_tasks)
741.722 1.418 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:346(get_async)
741.42 0.244 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:535(submit)
741.759 1.418 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/dask/local.py:547(get_sync)
673.971 0 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/framework/ops.py:3502(_create_op_internal)
1138.053 0.001 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py:320(_apply_op_helper)
1858.289 4.167 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py:649(wrapped_fn)
1016.765 4.44 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py:726(_initialize)
2337.399 2.708 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py:858(call)
2337.335 2.708 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py:904(_call)
1948.468 2.788 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py:815(func_graph_from_py_func)
1861.862 2.664 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py:684(wrapper)
1846.304 4.254 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/keras/engine/training.py:834(run_step)
1847.578 4.257 /Users/ancalita/rasa-projects/oss-0122/lib/python3.8/site-packages/keras/engine/training.py:831(step_function)
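
For readers who want to produce a table like the one above, here is a minimal sketch using only the standard library's cProfile and pstats modules. The `slow_sum` workload and the `profile_top` helper are hypothetical names used for illustration; they are not part of the rasa code base or of pytest-profiling.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Stand-in workload; replace with the code you want to profile."""
    return sum(i * i for i in range(n))


def profile_top(func, *args, limit: int = 10) -> str:
    """Profile one call and return the top entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "cumulative" sorts by the cumtime column shown in the table above.
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


report = profile_top(slow_sum, 100_000)
print(report)
```

A profile that was saved to disk can be inspected the same way by passing its file path to `pstats.Stats` instead of a profiler object.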

Additionally, test_other_unit_tests includes many subdirectories from tests/core that could be split into more workflows, e.g. core/actions, core/channels, core/evaluation, core/nlg, core/training.

Other Notes:

  • @HotThoughts is working on a DataDog CI dashboard that will show duration at test level, filterable by the branch you’re working on. It will be ready soon, along with a Notion page on how to use it.
  • Dependency mirroring: solution for using HuggingFace (LanguageModelFeaturizer) files discussed in #10528
  • Using a dependency proxy such as Nexus could be explored for transient issues; this requires infra help
  • Implementing pytest profiling in DataDog for better examination of performance requires more investigation from Infra; @HotThoughts will create an issue in Jira to have this on their radar
  • investigate the usage of rasa/utils/tensorflow/temp_keras_modules::fit method in the test suite and propose a solution to mitigate such long cumulative time?
  • profile individual tests in test_policies to determine which function calls take longest in running this workflow.
  • propose solution to mitigate the duration of tensorflow function calls?
  • investigate whether code coupling in model training should be fixed to improve training times?
  • determine whether it’s possible / desirable to split test_other_unit_tests into more workflows, keeping in mind the GH concurrency limitation?
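
One way to reason about splitting test_other_unit_tests under a fixed concurrency budget is to balance directories across a set number of workflows by measured duration. The sketch below uses a greedy "longest first, lightest workflow next" assignment. The `split_into_workflows` helper and all durations are hypothetical; real numbers would have to come from CI measurements.

```python
import heapq
from typing import Dict, List, Tuple


def split_into_workflows(
    durations: Dict[str, float], n_workflows: int
) -> List[List[str]]:
    """Assign each test directory to the currently lightest workflow, longest first."""
    # Heap of (total seconds assigned so far, workflow index).
    heap: List[Tuple[float, int]] = [(0.0, i) for i in range(n_workflows)]
    heapq.heapify(heap)
    groups: List[List[str]] = [[] for _ in range(n_workflows)]

    by_duration = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    for name, seconds in by_duration:
        total, index = heapq.heappop(heap)
        groups[index].append(name)
        heapq.heappush(heap, (total + seconds, index))
    return groups


# Hypothetical durations in seconds, for illustration only.
example_durations = {
    "core/actions": 300.0,
    "core/channels": 250.0,
    "core/evaluation": 120.0,
    "core/nlg": 60.0,
    "core/training": 400.0,
}
groups = split_into_workflows(example_durations, 2)
print(groups)
```

Keeping `n_workflows` as an explicit parameter makes it easy to check how the balance changes for different concurrency limits before touching the CI configuration.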

Definition of Done:

  • investigate the longest running unit tests and figure out if we can make them train models quicker (e.g. decrease epochs) or avoid training altogether
  • exercise good judgement: implement improvement or create issues
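
As a rough illustration of "train quicker or avoid training altogether", the sketch below memoizes a training call so that repeated requests with the same configuration reuse one result, and defaults to a single epoch. `train_model`, `trained_model`, and the config name are hypothetical placeholders, not rasa APIs. In a pytest suite the same idea could be expressed with a session-scoped fixture.

```python
import functools
from typing import Dict

# Counts how often the expensive step really runs (for illustration only).
TRAIN_CALLS = {"count": 0}


def train_model(config_name: str, epochs: int) -> Dict[str, object]:
    """Hypothetical placeholder for an expensive training call."""
    TRAIN_CALLS["count"] += 1
    return {"config": config_name, "epochs": epochs}


@functools.lru_cache(maxsize=None)
def trained_model(config_name: str, epochs: int = 1) -> Dict[str, object]:
    """Train once per (config, epochs) pair and reuse the result afterwards."""
    return train_model(config_name, epochs)


# Two tests asking for the same model trigger only one training run.
first = trained_model("ted_policy_config", epochs=1)
second = trained_model("ted_policy_config", epochs=1)
print(TRAIN_CALLS["count"])
```

Because the cached object is shared, tests that mutate the model would still need their own copy.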

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

1 reaction
HotThoughts commented, Feb 1, 2022

Notion page is ready.

We haven’t migrated to JIRA. Created an issue for now: https://github.com/RasaHQ/rasa/issues/10794 🙃

0 reactions
z-bodikova commented, Feb 2, 2022

Maxime Verger commented: also, I can’t find the link to the GH issue 🤔


