
[CI] solving the pytest crashing and hanging CI job


Recently the run_tests_torch CI job has started failing randomly and frequently.

We couldn't find any fault with any of the tests because there is never a traceback, just a hanging pytest that produces no output.

This is usually a symptom of the process using more resources than it was allowed and getting killed - of course the Python interpreter doesn't get a chance to make a peep, so there is no traceback. Processes on Colab get killed in the same way, for example.
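To illustrate why there is no traceback (this sketch is my own illustration, not part of the original diagnostics): a process killed by the kernel never runs any Python-level error handling, and the parent only sees a negative exit status corresponding to the signal:

import signal
import subprocess

# run some child command; if the OOM killer SIGKILLs it, the child prints
# nothing and the parent only sees returncode == -9
proc = subprocess.run(["python", "-c", "print('hello')"])
if proc.returncode < 0:
    sig = signal.Signals(-proc.returncode).name
    print(f"child was killed by {sig} - no traceback was ever produced")
else:
    print(f"child exited with code {proc.returncode}")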

Diagnostics

  1. Go to the CI report and “rerun job with SSH”

This enables SSH and gives you the command to access the CI instance. Use the instructions shown under "Enable SSH" to ssh into the instance.

When done, remember to exit the ssh shells and Cancel Job, since otherwise the instance will keep running and costing $$$.

  2. CI doesn't run docker with the --privileged flag, so most normal system tools are disabled and it's almost impossible to debug anything. Things like dmesg or /var/log/syslog are not there; you can sudo, but you can't do much with it.

Ideally, in such situations it would be a good idea to switch from the docker executor back to the machine executor, where we would have full root access.

  3. Resource limit
resource_class: xlarge

which, as of this writing, gives you 16GB of RAM.

This is very confusing, since when you log into the instance top reports 70GB of memory, and if you try to monitor %MEM you get misleadingly low usage - it reports usage out of 70GB, not out of the cgroups memory limit of 16GB.

Here is how to check the real limit:

$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes | perl -ne 'print $_ / 2**30'
16

Yup, 16GB
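The same check in Python, in case it needs to go into a test or a helper script (the cgroup v2 path is my assumption for newer hosts; the CI image above exposes the v1 file):

from pathlib import Path

def cgroup_memory_limit_gb():
    candidates = [
        Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
        Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
    ]
    for path in candidates:
        if path.exists():
            raw = path.read_text().strip()
            if raw == "max":  # cgroup v2 reports "max" when there is no limit
                return None
            return int(raw) / 2**30
    return None

print(cgroup_memory_limit_gb())  # -> 16.0 on the xlarge instance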

  4. Now, it's very difficult to measure how much memory several forked processes use together; you can't use top for that.

I had 2 consoles open, one with top and the other running pytest -n 8, which I started manually.

I did notice that once all 8 processes were around 2-2.5GB RSS, after a while one of the workers crashed.

Then, thanks to https://unix.stackexchange.com/a/169129/291728, I found this handy tool:

apt install smem
circleci@fc02c746bf66:~$ smem -t
  PID User     Command                         Swap      USS      PSS      RSS
    6 circleci /bin/sh                            0       88       88       92
    1 circleci /sbin/docker-init -- /bin/s        0       48      123      740
17567 circleci /usr/bin/time -v python -m         0       96      145     1216
17568 circleci tee tests_output.txt               0      140      225     1828
  495 circleci /bin/bash -eo pipefail -c w        0      292      526     1692
 1511 circleci -bash                              0      608     1066     3140
  476 circleci -bash                              0      620     1079     3148
18170 circleci /usr/bin/python /usr/bin/sm        0    13160    13286    15424
    7 circleci /bin/circleci-agent --confi        0    29424    29424    29428
17569 circleci python -m pytest -n 8 --dis        0   151172   163118   254684
17588 circleci /usr/local/bin/python -u -c        0   348860   371932   526452
17594 circleci /usr/local/bin/python -u -c        0  1863416  1887735  2048128
17579 circleci /usr/local/bin/python -u -c        0  2028784  2052674  2210400
17591 circleci /usr/local/bin/python -u -c        0  2031872  2056217  2214712
17574 circleci /usr/local/bin/python -u -c        0  2098124  2122054  2282392
17585 circleci /usr/local/bin/python -u -c        0  2226080  2247464  2401880
17582 circleci /usr/local/bin/python -u -c        0  2226864  2249367  2404832
17597 circleci /usr/local/bin/python -u -c        0  2643552  2665199  2818968
-------------------------------------------------------------------------------
   18 1                                           0 15663200 15861722 17219156

The PSS column appears to be the one you can do correct totals on, so I did:

watch -n 1 'smem -t | tail -1'

and indeed, once the total PSS hit ~16GB pytest crashed.
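If smem isn't installable, a rough equivalent of that one-liner can be put together by summing Pss from /proc/<pid>/smaps_rollup (my own sketch, assuming a kernel that provides smaps_rollup, i.e. 4.14+):

import time
from pathlib import Path

def total_pss_gb():
    # sum Pss over every process we are allowed to read
    total_kb = 0
    for rollup in Path("/proc").glob("[0-9]*/smaps_rollup"):
        try:
            for line in rollup.read_text().splitlines():
                if line.startswith("Pss:"):
                    total_kb += int(line.split()[1])  # value is in kB
                    break
        except (PermissionError, FileNotFoundError):
            continue  # process exited or belongs to another user
    return total_kb / 2**20

while True:
    print(f"total PSS: {total_pss_gb():.2f} GB")
    time.sleep(1)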

The failure is intermittent because the tests are run in a random order: sometimes we get 4 "fatter" tests running concurrently, and at all other times, when the job succeeds, we were simply lucky not to hit a bad combination.

I tried to switch to:

resource_class: 2xlarge

which would give us 32GB, but apparently we aren't allowed to do that and need to ask CircleCI admins for special permission.

  5. What happens to the hanging processes? Clearly pytest doesn't recover from this crash. I think it can recover from other failures of its workers, but not when the kernel nukes one of them.

When the resource limit gets hit, all but one of the workers hang in some strange place:

Thread 0x00007f65d91bb700 (most recent call first):
  File "/home/circleci/.local/lib/python3.7/site-packages/execnet/gateway_base.py", line 400 in read
  File "/home/circleci/.local/lib/python3.7/site-packages/execnet/gateway_base.py", line 432 in from_io
  File "/home/circleci/.local/lib/python3.7/site-packages/execnet/gateway_base.py", line 967 in _thread_receiver
  File "/home/circleci/.local/lib/python3.7/site-packages/execnet/gateway_base.py", line 220 in run
  File "/home/circleci/.local/lib/python3.7/site-packages/execnet/gateway_base.py", line 285 in _perform_spawn

If I look in top, all but one of the pytest workers stop working, blocked on the above.

I figured that out by adding this to tests/conftest.py:

import faulthandler
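# dump the traceback of every thread every 20 seconds until the process exits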
faulthandler.dump_traceback_later(20, repeat=True)

So now every 20 seconds I was getting traceback reports of where things were hanging…

But I'm not 100% sure that this is why they hang; I would have to spend more time on it if we really wanted to understand why the other workers stop processing. So please don't take it as the truth - it's just one possibility to check. And since understanding why they can't recover doesn't help our situation, I'm not going to spend more time on it.
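For future runs, a variant of the conftest.py snippet above that could stay in the repo permanently (the environment variable and file names here are hypothetical, just to sketch the idea): gate the watchdog on an env var and write to a per-process file so the dumps don't interleave with the pytest output:

import faulthandler
import os

if os.environ.get("PYTEST_HANG_WATCHDOG"):
    # the file object must stay alive for as long as the watchdog runs
    _watchdog_log = open(f"faulthandler-{os.getpid()}.log", "w")
    faulthandler.dump_traceback_later(20, repeat=True, file=_watchdog_log)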

Summary

  1. we probably have a very small leak that grows over hundreds of tests, as the memory usage slowly but consistently goes up
  2. 16GB is just enough for our pytest -n 4 - probably 75% of the time, until we add more tests
  3. so we either need to ask for the 2xlarge instance, or use -n 3
  4. ~probably it’d be a good idea to add~ (see next comment)
apt install time

and run pytest with:

/usr/bin/time -v python -m pytest ...

which will give us an in-depth resource usage report - so over time we should see whether our test suite consumes more and more resources.
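GNU time -v reports the peak as "Maximum resident set size (kbytes): N", so tracking it over time only needs a tiny parser (a sketch, assuming the report was redirected to a file; the file name time_report.txt is hypothetical):

import re

def max_rss_gb(report_path="time_report.txt"):
    # extract the peak RSS line that `/usr/bin/time -v` prints
    text = open(report_path).read()
    match = re.search(r"Maximum resident set size \(kbytes\): (\d+)", text)
    if match is None:
        return None
    return int(match.group(1)) / 2**20  # kB -> GB

print(max_rss_gb())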

@LysandreJik, @sgugger


Top GitHub Comments

1 reaction
stas00 commented, May 3, 2021

A little bit at a time I've been trying to work on this issue. At the moment I'm trying to find a reliable way to take the measurements.

I have thought of at least 3 main ways the leak could be occurring:

  1. leak in some API
  2. leak in a badly written test that doesn't clean up after itself - some object is created in the test and somehow doesn't get destroyed (see also 3)
  3. "functional leak" as a side-effect of loading extra libraries - say we have 10 tests, each loading 10 different libraries; each such test makes pytest grow just because it loaded something new. This is a variation on (2), but how could a test unload the libraries it loaded? It would be very inefficient in practice.

Detection:

  1. should be easy to detect by re-running the same test and noticing the memory grow. My current approach is to run the test once and ignore the memory usage, because it could be loading a new module/library, then run it a second time and look for any difference.

(2) and (3) are difficult to make sense of and thus much harder to catch, because just by looking at the numbers one doesn't know whether a new library was loaded or some object wasn't cleaned up after the test.
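For (1), one way to automate the per-test measurement (purely illustrative - the fixture name and threshold are made up) is an autouse fixture that reports how much the worker's peak RSS grew during each test:

import resource

import pytest

@pytest.fixture(autouse=True)
def report_memory_growth(request):
    # ru_maxrss is the peak RSS of this worker process, in kB on Linux
    before_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    yield
    after_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    grew_mb = (after_kb - before_kb) / 1024
    if grew_mb > 50:  # arbitrary threshold for this sketch
        print(f"\n{request.node.nodeid} grew peak RSS by {grew_mb:.0f} MB")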

1 reaction
LysandreJik commented, Apr 30, 2021

Thank you for this very in-depth analysis of the situation. It would probably be helpful to have a visualization of each test and how much memory it takes; it could help single out memory outliers, and it could also help to detect whether we actually have a memory leak.
