Memory leak when utilizing Store SCP behind a load balancer with health checks
Describe the bug
We are currently running 8 Store SCP processes across 8 servers (1 process per server), all behind a loadbalancer.org load balancer. We noticed that, over time, the Store SCP processes' memory usage would balloon to the point where it started affecting other processes on the machine and we had to restart them. After testing and investigating, we traced the cause to the load balancer's health checks, which were set to "Connect to port - Attempt to make a connection to the specified port".
With the health checking enabled, this was the resulting memory snapshot of one of the Store SCP processes after approximately 65 hours of idle time:
Aug 23 09:53:48 joints-io-1 (2418585) stord [INFO] Top 10 memory offenders for stord process:
Aug 23 09:53:51 joints-io-1 (2418585) stord [INFO] /usr/local/lib/python3.7/site-packages/pydicom/uid.py:69: size=88.2 MiB, count=1229712, average=75 B
Aug 23 09:53:51 joints-io-1 (2418585) stord [INFO] /usr/local/lib/python3.7/threading.py:909: size=16.0 MiB, count=1, average=16.0 MiB
Aug 23 09:53:51 joints-io-1 (2418585) stord [INFO] /usr/local/lib/python3.7/_weakrefset.py:84: size=8247 KiB, count=645, average=12.8 KiB
Aug 23 09:53:51 joints-io-1 (2418585) stord [INFO] /usr/local/lib/python3.7/copy.py:290: size=6507 KiB, count=83291, average=80 B
Aug 23 09:53:51 joints-io-1 (2418585) stord [INFO] /usr/local/lib/python3.7/copy.py:216: size=5846 KiB, count=44114, average=136 B
Aug 23 09:53:51 joints-io-1 (2418585) stord [INFO] /usr/local/lib/python3.7/copy.py:180: size=3271 KiB, count=44538, average=75 B
Aug 23 09:53:51 joints-io-1 (2418585) stord [INFO] /usr/local/lib/python3.7/threading.py:238: size=2888 KiB, count=9661, average=306 B
Aug 23 09:53:51 joints-io-1 (2418585) stord [INFO] /usr/local/lib/python3.7/copyreg.py:88: size=2744 KiB, count=43893, average=64 B
Aug 23 09:53:51 joints-io-1 (2418585) stord [INFO] /usr/local/lib/python3.7/copy.py:238: size=1582 KiB, count=19891, average=81 B
Aug 23 09:53:51 joints-io-1 (2418585) stord [INFO] /usr/local/lib/python3.7/copy.py:275: size=698 KiB, count=10147, average=70 B
top output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2418586 joints 20 0 3637100 1.6g 8796 S 7.6 13.5 302:44.39 /usr/local/bin/stord
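For reference, a per-line "top memory offenders" snapshot like the one above can be produced with Python's built-in tracemalloc module. This is a generic sketch, not the exact instrumentation used in the stord process:

```python
import tracemalloc

# Begin tracing Python memory allocations.
tracemalloc.start()

# Stand-in allocation so there is something to measure.
payload = ["x" * 100 for _ in range(1000)]

# Group traced allocations by source line, largest first, and print the
# top 10 -- the same shape as the log output above.
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
for stat in top_stats[:10]:
    print(stat)
```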
The memory ballooning is alleviated if health checks from our load balancer are disabled.
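For anyone trying to reproduce this, the load balancer's "Connect to port" check can be approximated with a plain TCP connect followed by an immediate disconnect, with no DICOM data sent. The host, port, and timeout below are illustrative placeholders:

```python
import socket


def simulate_health_check(host="127.0.0.1", port=11112):
    """Approximate a "Connect to port" health check: open a TCP connection
    to the SCP port, then close it immediately without sending any data.
    Returns True if the connection succeeded, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=2):
            pass  # connected successfully; close straight away
        return True
    except OSError:
        return False
```

Repeating this in a loop against an idle Store SCP should mimic the load balancer traffic that triggered the memory growth.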
Expected behavior
Health check port connections should be handled properly so that they do not balloon the process's memory to harmful levels.
Steps To Reproduce
Example Python code being run:
import logging

from pynetdicom import AE, AllStoragePresentationContexts, evt

logger = logging.getLogger('stord')


class CSTOREHandler:
    """Minimal stand-in for the real C-STORE handler (storage logic omitted)."""
    def handleEvent(self, event):
        return 0x0000  # Success
# Transfer Syntaxes we support
IMPLICIT_VR_LE = '1.2.840.10008.1.2'
EXPLICIT_VR_LE = '1.2.840.10008.1.2.1'
EXPLICIT_VR_BE = '1.2.840.10008.1.2.2'
IMPLICIT_VR_LE_JPEG_BASELINE = '1.2.840.10008.1.2.4.50'
IMPLICIT_VR_LE_JPEG_EXTENDED = '1.2.840.10008.1.2.4.51'
IMPLICIT_VR_LE_JPEG_PROG_10_12 = '1.2.840.10008.1.2.4.55'
IMPLICIT_VR_LE_JPEG_LOSSLESS = '1.2.840.10008.1.2.4.57'
IMPLICIT_VR_LE_JPEG_LOSSLESS_FO = '1.2.840.10008.1.2.4.70'
IMPLICIT_VR_LE_JPEGLS_LOSSLESS = '1.2.840.10008.1.2.4.80'
IMPLICIT_VR_LE_JPEGLS_LOSSY = '1.2.840.10008.1.2.4.81'
IMPLICIT_VR_LE_JPEG2K_LOSSLESS = '1.2.840.10008.1.2.4.90'
IMPLICIT_VR_LE_JPEG2K_LOSSY = '1.2.840.10008.1.2.4.91'
IMPLICIT_VR_LE_RLE_LOSSLESS = '1.2.840.10008.1.2.5'
ACCEPTED_TRANSFER_SYNTAXES = (
IMPLICIT_VR_LE,
EXPLICIT_VR_LE,
EXPLICIT_VR_BE,
IMPLICIT_VR_LE_JPEG_BASELINE,
IMPLICIT_VR_LE_JPEG_EXTENDED,
IMPLICIT_VR_LE_JPEG_PROG_10_12,
IMPLICIT_VR_LE_JPEG_LOSSLESS,
IMPLICIT_VR_LE_JPEG_LOSSLESS_FO,
IMPLICIT_VR_LE_JPEGLS_LOSSLESS,
IMPLICIT_VR_LE_JPEGLS_LOSSY,
IMPLICIT_VR_LE_JPEG2K_LOSSLESS,
IMPLICIT_VR_LE_JPEG2K_LOSSY,
IMPLICIT_VR_LE_RLE_LOSSLESS,
)
ae = AE('MEDSTRAT')
port = 11112
contexts = AllStoragePresentationContexts
transferSyntaxes = list(ACCEPTED_TRANSFER_SYNTAXES)
storeHandler = CSTOREHandler()
evtHandlers = (
(evt.EVT_C_STORE, storeHandler.handleEvent),
)
for context in contexts:
context.transfer_syntax = transferSyntaxes
ae.supported_contexts = contexts
ae.maximum_pdu_size = 0
ae.maximum_associations = 40
logger.info(f'Starting Storage service on port {port} (max threads: {ae.maximum_associations})')
scp = ae.start_server(('', port), block=False, evt_handlers=evtHandlers)
Your environment
[root@joints-io-1 ~]$ python -c "import platform; print(platform.platform())"
Linux-5.4.106-1-pve-x86_64-with-centos-8.3.2011
[root@joints-io-1 ~]$ python -c "import sys; print('Python ', sys.version)"
Python 3.7.9 (default, May 30 2021, 14:22:29)
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)]
[root@joints-io-1 ~]$ python -c "import pydicom; print('pydicom ', pydicom.__version__)"
pydicom 1.4.2
[root@joints-io-1 ~]$ python -c "import pynetdicom; print('pynetdicom ', pynetdicom.__version__)"
pynetdicom 1.5.6
We're using loadbalancer.org's Virtual ADC (https://www.loadbalancer.org/products/virtual/enterprise-va-1g/). The version we're currently running is v8.5.2, with Layer 4 load balancing set to "Direct Routing" as the Forwarding Method and the Health Check Type set to "Connect to Port".
Issue Analytics
- Created 2 years ago
- Comments: 6 (4 by maintainers)

Top GitHub Comments
OK, should be fixed now. As far as I can tell the problem was due to dead Association threads not being garbage collected automatically, so I've added a call to gc.collect() that should run every 30 s or so.

Here is the packet dump from Wireshark:
tcpdump104.txt
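The fix described in the comment above can be sketched roughly as follows. This is an illustrative approximation using a daemon Timer; the actual implementation inside pynetdicom may schedule the collection differently:

```python
import gc
import threading


def collect_garbage():
    """Run one garbage-collection pass so dead Association thread objects
    are reclaimed; returns the number of unreachable objects found."""
    return gc.collect()


def start_periodic_gc(interval=30.0):
    """Re-run collect_garbage() roughly every `interval` seconds using a
    daemon timer, so the schedule does not block interpreter shutdown."""
    collect_garbage()
    timer = threading.Timer(interval, start_periodic_gc, args=(interval,))
    timer.daemon = True
    timer.start()
```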