[BUG] High CPU / Latency when adding/removing large number of routes
Hi Everyone,
We are working on a reported behavior where a Linux VM is being used as a gateway and routes are being added/removed dynamically, in some cases in large numbers at a time.
It was brought to our attention that when this happens, the waagent spikes in CPU usage and introduces some network latency while it processes all the routes being added. This sometimes lasts for minutes before things return to a more stable state.
Using ping and mtr, we noticed that the network latency between 2 VMs on different peered vnets, which usually averaged 68ms, went up to 117ms.
Currently we are testing by using a script which allows us to add/remove large numbers of routes to make it easier:
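    import os
    import sys

    from netaddr import IPNetwork

    action = sys.argv[1]    # 'add' or 'del'
    num = int(sys.argv[2])  # number of routes to add or remove

    network = IPNetwork('10.1.8.0/16')
    for i, cidr in enumerate(network):
        # Stop once exactly `num` routes have been processed.
        if i >= num:
            print('Num of routes {}'.format(num))
            break
        # Each address in the network becomes one route on eth0.
        cmd = 'ip route {} {} dev eth0 src 10.0.2.4 metric 100'.format(action, cidr)
        os.system(cmd)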
Where you can use:
python routes.py add 1000
or
python routes.py del 1000
That would either add or remove the number of routes specified in the command.
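The script forks one `ip` process per route, which adds its own overhead to whatever is being measured. A minimal sketch of an alternative, assuming the iproute2 `ip -batch` option is available on the test VM: build all the commands first and hand them to a single `ip` invocation. The helper names `build_batch` and `write_batch` and the file name `routes.batch` are made up for illustration; the device, source address and metric are copied from the script.

```python
import ipaddress
from itertools import islice


def build_batch(action, num, network='10.1.8.0/16'):
    """Return `num` command lines for `ip -batch` (hypothetical helper)."""
    # strict=False accepts a prefix with host bits set, such as 10.1.8.0/16.
    net = ipaddress.ip_network(network, strict=False)
    # Batch lines omit the leading 'ip'; one route per address in the network.
    return [
        'route {} {} dev eth0 src 10.0.2.4 metric 100'.format(action, addr)
        for addr in islice(net, num)
    ]


def write_batch(path, action, num):
    """Write the batch lines to `path`, one command per line."""
    with open(path, 'w') as fh:
        fh.write('\n'.join(build_batch(action, num)) + '\n')
```

After calling `write_batch('routes.batch', 'add', 1000)`, the file could be applied with `ip -batch routes.batch`. Whether this changes the waagent behavior described here has not been tested; it only removes the per-route process startup from the reproduction.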
This can be reproduced easily on any distribution. In our case, it was tested with Ubuntu 18.04 LTS and the latest waagent and kernel:
WALinuxAgent-2.2.45 running on ubuntu 18.04
Python: 3.6.9
Goal state agent: 2.2.53
Both VMs are at kernel version:
5.4.0-1031-azure
It does look like this is being caused by the waagent having to process the routes being added. The main ask here is to see if there is any way to make that less aggressive or to optimize the logic so it doesn’t introduce the latency observed. As for CPU usage, we can see it dipping in and out every time we restart the waagent, since it processes the routes again. For example, this is a snapshot of mpstat while restarting the waagent:
This is the output from MTR after running it for a while. The high latency was observed for a while, and that is the actual problem in this report, since it can affect connectivity in cases where routes are being added/removed dynamically. The issue is not present with the waagent stopped (since nothing is then processing the routes).
Another point that might be interesting to explore would be to pin our process to one CPU only. I am not sure what that would buy, but at least the other cores would still be free. This behavior is also noticed on larger VMs; in my test I used Standard DS2_V3 VMs.
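As a rough illustration of the CPU pinning idea, here is a minimal sketch using Python's `os.sched_setaffinity`, which is available on Linux. The helper name `pin_to_cpu` is hypothetical, and this is not something the waagent is known to do itself; it only shows what restricting a process to one core could look like.

```python
import os


def pin_to_cpu(pid, cpu):
    """Restrict `pid` to a single CPU and return the resulting affinity set.

    A pid of 0 means the calling process. Hypothetical helper for illustration.
    """
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)
```

For example, `pin_to_cpu(agent_pid, 0)` would confine the process with that PID to CPU 0, where `agent_pid` is assumed to have been looked up separately. On Linux the call applies to the single task whose ID is passed, so threads that already exist are not moved; covering them would need something like `taskset -a -p` or a systemd `CPUAffinity=` setting on the service.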
Thank you, -Marco
Top GitHub Comments
@mabicca, I will follow up the history and get back.
Fixed by #2156