[CI] solving the pytest crashing and hanging CI job
Recently the `run_tests_torch` CI job has been randomly and frequently failing.
We couldn't find any fault with any of the tests because there is never a traceback, just a hanging pytest that produces no output.
This is usually a symptom that the process used more resources than it was allowed and got killed by the kernel - of course the Python interpreter doesn't get a chance to make a peep, so no traceback. Processes on Colab get killed in the same way, for example.
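For illustration only (this snippet isn't from the issue): when the kernel OOM-killer takes a process down, the parent just sees a SIGKILL exit status and an empty stderr - no MemoryError, no traceback. A minimal sketch, assuming a Linux environment with a memory limit tight enough for the kernel to step in:

```python
# Sketch: how an OOM-killed child looks from the parent's side.
# Depending on overcommit settings the child may instead die with a clean
# MemoryError; when the OOM-killer strikes there is no traceback at all.
# (Only run this inside a memory-constrained container.)
import signal
import subprocess

child_code = "a = []\nwhile True: a.append(' ' * 10**7)  # grow until killed"
proc = subprocess.run(["python", "-c", child_code], capture_output=True)

if proc.returncode == -signal.SIGKILL:
    # a negative returncode means "terminated by that signal"; -9 is the OOM-killer's signature
    print("child was SIGKILLed - no Python traceback, exactly what we see in CI")
else:
    print("child exited with", proc.returncode)
    print("stderr:", proc.stderr.decode() or "<empty>")
```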
Diagnostics
- Go to the CI report and choose "Rerun job with SSH".
It then enables SSH and gives you the command to access the CI instance. Use the instructions it shows you under "Enable SSH" to ssh to the instance.
When done, remember to exit the ssh shells and hit "Cancel Job", since otherwise the instance will continue running at $$$.
- CI doesn't run docker with the `--privileged` flag, so most normal system tools are disabled and it's almost impossible to debug anything. Things like `dmesg` or the `/var/log` system logs are not there; you can `sudo`, but you can't do much of anything with it.
Ideally in such situations it'd be a good idea to switch from the `docker` executor back to `machine`, where we would have full root access.
- Resource limit
`resource_class: xlarge` as of this writing gives you 16GB of RAM.
This is very confusing, since when you log into the instance `top` reports 70GB of memory, and if you try to monitor `%MEM` you get a very misleading low usage: it is reported against the host's 70GB, not against the cgroup memory limit of 16GB.
How do we know the real limit:
$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes | perl -ne 'print $_ / 2**30'
16
Yup, 16GB
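To double-check numbers like this from Python rather than the shell, here is a small sketch (not from the issue; the cgroup v2 path is included only as an assumption in case the image uses it) that compares the cgroup limit with the total RAM the host advertises:

```python
# Sketch: cgroup memory limit vs. the host RAM that top reports.
from pathlib import Path

GiB = 2**30

def cgroup_limit_bytes():
    for p in ("/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
              "/sys/fs/cgroup/memory.max"):                   # cgroup v2
        f = Path(p)
        if f.exists():
            raw = f.read_text().strip()
            return None if raw == "max" else int(raw)
    return None

def host_total_bytes():
    # MemTotal in /proc/meminfo is reported in kB
    for line in Path("/proc/meminfo").read_text().splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024

limit, total = cgroup_limit_bytes(), host_total_bytes()
print("cgroup limit:", f"{limit / GiB:.1f} GiB" if limit else "unlimited")  # ~16 on xlarge
print("host reports:", f"{total / GiB:.1f} GiB")                            # ~70 in top
```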
- Now, it's very difficult to measure how much memory several forked processes use together; you can't use `top` for that.
I had 2 consoles open, one with `top` and another running a `pytest -n 8` that I started manually.
I did notice that once all 8 workers were at around 2-2.5GB RSS each, after a while one of the workers crashed.
Then I found this handy tool thanks to https://unix.stackexchange.com/a/169129/291728
apt install smem
circleci@fc02c746bf66:~$ smem -t
PID User Command Swap USS PSS RSS
6 circleci /bin/sh 0 88 88 92
1 circleci /sbin/docker-init -- /bin/s 0 48 123 740
17567 circleci /usr/bin/time -v python -m 0 96 145 1216
17568 circleci tee tests_output.txt 0 140 225 1828
495 circleci /bin/bash -eo pipefail -c w 0 292 526 1692
1511 circleci -bash 0 608 1066 3140
476 circleci -bash 0 620 1079 3148
18170 circleci /usr/bin/python /usr/bin/sm 0 13160 13286 15424
7 circleci /bin/circleci-agent --confi 0 29424 29424 29428
17569 circleci python -m pytest -n 8 --dis 0 151172 163118 254684
17588 circleci /usr/local/bin/python -u -c 0 348860 371932 526452
17594 circleci /usr/local/bin/python -u -c 0 1863416 1887735 2048128
17579 circleci /usr/local/bin/python -u -c 0 2028784 2052674 2210400
17591 circleci /usr/local/bin/python -u -c 0 2031872 2056217 2214712
17574 circleci /usr/local/bin/python -u -c 0 2098124 2122054 2282392
17585 circleci /usr/local/bin/python -u -c 0 2226080 2247464 2401880
17582 circleci /usr/local/bin/python -u -c 0 2226864 2249367 2404832
17597 circleci /usr/local/bin/python -u -c 0 2643552 2665199 2818968
-------------------------------------------------------------------------------
18 1 0 15663200 15861722 17219156
The PSS column seems to be the one that is correct to do totals on, so I did:
watch -n 1 'smem -t | tail -1'
and indeed, once the total PSS hit ~16GB pytest crashed.
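For reference, what that PSS total measures can be approximated without smem by summing /proc/&lt;pid&gt;/smaps_rollup across processes - a rough sketch, assuming a kernel new enough to expose smaps_rollup and that we only need the processes we are allowed to read:

```python
# Sketch: total PSS of all readable processes, roughly what
# `watch -n 1 'smem -t | tail -1'` was monitoring.
import time
from pathlib import Path

def total_pss_kib() -> int:
    total = 0
    for proc in Path("/proc").iterdir():
        if not proc.name.isdigit():
            continue
        try:
            for line in (proc / "smaps_rollup").read_text().splitlines():
                if line.startswith("Pss:"):
                    total += int(line.split()[1])  # reported in kB
                    break
        except OSError:
            continue  # process exited or isn't ours to read
    return total

while True:
    print(f"total PSS: {total_pss_kib() / 2**20:.2f} GiB")  # workers died near ~16 GiB
    time.sleep(1)
```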
The failure we get is intermittent because the tests run in a random order and sometimes we end up with 4 "fatter" tests running concurrently; at all other times, when it succeeds, we are simply lucky not to hit that bad combination.
I tried to switch to:
resource_class: 2xlarge
which would give us 32GB, but apparently we aren't allowed to do so and need to ask CircleCI admins for special permission.
- What happens to the hanging processes? Clearly `pytest` doesn't recover from the crash. I think it can recover from other failures of its workers, but not when the kernel nukes one of them.
When the resource limit gets hit, all but one of the workers end up hanging in some strange place:
Thread 0x00007f65d91bb700 (most recent call first):
File "/home/circleci/.local/lib/python3.7/site-packages/execnet/gateway_base.py", line 400 in read
File "/home/circleci/.local/lib/python3.7/site-packages/execnet/gateway_base.py", line 432 in from_io
File "/home/circleci/.local/lib/python3.7/site-packages/execnet/gateway_base.py", line 967 in _thread_receiver
File "/home/circleci/.local/lib/python3.7/site-packages/execnet/gateway_base.py", line 220 in run
File "/home/circleci/.local/lib/python3.7/site-packages/execnet/gateway_base.py", line 285 in _perform_spawn
If I look in `top`, 7 of the 8 pytest workers stop working, blocked on the above.
I figured that out by adding the following to `tests/conftest.py`:
import faulthandler
# dump every thread's traceback to stderr every 20 seconds
faulthandler.dump_traceback_later(20, repeat=True)
So now every 20 secs I was getting traceback reports on where things were hanging…
But I'm not 100% sure that this is why they are hanging; I would have to spend more time with it if we really want to understand why the other workers stop processing. So please don't take it as truth, it's just one of the possibilities to check. But since understanding why they can't recover doesn't help our situation, I'm not going to waste time on it.
Summary
- we probably have a very small leak that builds up over hundreds of tests, as the memory usage slowly but consistently goes up
- 16GB is just enough for our `pytest -n 4` - probably 75% of the time, until we add more tests - so we either need to ask for the 2xlarge instance, or use `-n 3`
- ~probably it'd be a good idea to add~ (see next comment) `apt install time` and run `pytest` with:
/usr/bin/time -v python -m pytest ...
which will give us an in-depth resource usage report - so over time we should see whether our test suite consumes more and more resources.
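If we do adopt that, the headline number could also be pulled out of the report automatically so it can be compared across runs. A hypothetical helper (the script name and the idea of saving the report to a file are my assumptions; the "Maximum resident set size (kbytes)" line is part of GNU time's `-v` output):

```python
# Hypothetical helper: extract peak RSS from a saved `/usr/bin/time -v` report,
# so the number can be tracked from one CI run to the next.
import re
import sys

def max_rss_gib(report_text: str) -> float:
    match = re.search(r"Maximum resident set size \(kbytes\): (\d+)", report_text)
    if match is None:
        raise ValueError("no max-RSS line - was the command run under /usr/bin/time -v?")
    return int(match.group(1)) / 2**20  # kbytes -> GiB

if __name__ == "__main__":
    # usage: python check_rss.py time_report.txt
    print(f"peak RSS: {max_rss_gib(open(sys.argv[1]).read()):.2f} GiB")
```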
Top GitHub Comments
A little bit at a time I've been trying to work on this issue. At the moment I'm trying to find a reliable way to take the measurements.
I have thought of at least 3 main ways the leak could be occurring. One of them is pytest itself growing just because it loaded something new - which is a variation on (2) - but how could a test unload the libraries it loaded? It'd be very inefficient in practice.
Detection:
(2) and (3) are difficult to make sense of and thus much harder to catch, because by just looking at the numbers one doesn't know whether it was just a new library being loaded, or some object not being cleaned up after the test.
Thank you for this very in-depth analysis of the situation. It would probably be helpful to have a visualization of each test and how much memory it takes; it could help in singling out memory outliers, and it could also help detect whether we actually have a memory leak.
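One possible way to get per-test numbers (just a sketch of the idea, not something that exists in the repo; the log file name is made up) would be a `tests/conftest.py` hook that records the worker's RSS after every test, which could then be sorted or plotted to spot the outliers:

```python
# Sketch for tests/conftest.py: append each test's node id and the worker's RSS
# afterwards to a per-worker log file, using /proc/self/status (no extra deps).
import os
import re

import pytest

def _rss_mib() -> float:
    with open("/proc/self/status") as f:
        match = re.search(r"VmRSS:\s+(\d+) kB", f.read())
    return int(match.group(1)) / 1024 if match else 0.0

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_teardown(item):
    yield  # let the normal teardown run first
    with open(f"mem_usage_{os.getpid()}.log", "a") as log:  # one file per xdist worker
        log.write(f"{_rss_mib():8.1f} MiB  {item.nodeid}\n")
```

A steadily growing RSS across consecutive lines in those files would point at a leak, and the biggest per-test jumps would be the "fat" tests worth isolating.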