[BUG] High CPU / Latency when adding/removing large number of routes

Hi Everyone,

We are investigating a reported behavior where a Linux VM is used as a gateway and routes are added and removed dynamically, in some cases in large numbers at a time.

It was brought to our attention that when this happens, waagent spikes in CPU usage and also introduces some network latency while it processes all the routes being added. This can last several minutes before the system returns to a stable state.

Using ping and mtr, we saw that network latency between 2 VMs on different peered vnets usually averaged 68ms and rose to 117ms.

We are currently testing with a script that makes it easy to add or remove large numbers of routes:

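import os
import sys
from itertools import islice

from netaddr import IPNetwork

action = sys.argv[1]    # 'add' or 'del'
num = int(sys.argv[2])  # number of routes to add or remove

# Iterating over an IPNetwork yields individual addresses, so each
# command below adds or removes one host route.
network = IPNetwork('10.1.8.0/16')
for addr in islice(network, num):  # stop after exactly `num` routes
    cmd = 'ip route {} {} dev eth0 src 10.0.2.4 metric 100'.format(action, addr)
    os.system(cmd)
print('Num of routes {}'.format(num))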

Run it with either:

python routes.py add 1000
or
python routes.py del 1000

This adds or removes the number of routes given on the command line.
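The script above starts one ip process per route. To push the same routes in a single burst, the commands could instead be written to a file and applied with ip -batch. The following is only a sketch under assumptions: build_batch is a hypothetical helper, it uses the standard-library ipaddress module in place of netaddr, and the network is written as 10.1.0.0/16 because ipaddress rejects host bits in a network address.

```python
import ipaddress
from itertools import islice


def build_batch(action, num, network="10.1.0.0/16",
                dev="eth0", src="10.0.2.4", metric=100):
    """Return the text of an `ip -batch` file holding `num` host-route commands.

    Hypothetical helper: the default values mirror the test script above.
    Each line omits the leading `ip`, as the batch format expects.
    """
    net = ipaddress.ip_network(network)
    return "".join(
        "route {} {} dev {} src {} metric {}\n".format(action, addr, dev, src, metric)
        for addr in islice(net, num)  # first `num` addresses of the network
    )
```

Saving the returned text to a file (for example routes.batch, a name chosen here for illustration) and running ip -batch routes.batch as root would apply all routes in one invocation.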

This can be reproduced easily on any distribution. In our case, it was tested with Ubuntu 18.04 LTS and the latest waagent and kernel:

WALinuxAgent-2.2.45 running on ubuntu 18.04
Python: 3.6.9
Goal state agent: 2.2.53

Both VMs are running kernel version:
5.4.0-1031-azure

This does appear to be caused by waagent having to process the routes being added. The main ask is whether that processing can be made less aggressive, or the logic optimized, so that it does not introduce the observed latency. As for CPU usage, we can see it spike and drop every time we restart waagent, since it processes the routes again. For example, this is a snapshot of mpstat taken while restarting waagent:

image

This is the output from MTR after running it for a while. The high latency persisted for some time, and that is the core problem in this report, since it can affect connectivity wherever routes are added or removed dynamically. The issue does not occur with waagent stopped (since nothing is then processing the routes).

image

Another option that might be worth exploring is pinning the waagent process to a single CPU. I am not sure how much that would gain, but at least the other cores would stay free. This behavior is also seen on larger VMs; in my test I used Standard DS2_V3 VMs.
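As a rough illustration of the pinning idea, Python on Linux exposes CPU affinity through os.sched_setaffinity. The sketch below is an assumption about how one might experiment with it, not something waagent is known to do; pin_to_single_cpu is a hypothetical helper, and the PID of the agent would have to be supplied by whoever runs it (with sufficient privileges).

```python
import os


def pin_to_single_cpu(pid=0, cpu=None):
    """Restrict `pid` (0 means the caller) to one CPU; return the new affinity set.

    Hypothetical helper, Linux only. If `cpu` is not given, the
    lowest-numbered CPU the target is currently allowed to use is chosen.
    """
    allowed = os.sched_getaffinity(pid)
    if cpu is None:
        cpu = min(allowed)
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)
```

Note that on Linux the underlying call applies to the thread identified by pid, so a process that is already multi-threaded may need each of its threads pinned separately.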

Thank you, -Marco

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11 (2 by maintainers)

Top GitHub Comments

ZhidongPeng commented, Jan 14, 2021 (1 reaction)

@mabicca, I will follow up on the history and get back to you.

narrieta commented, Mar 11, 2021 (0 reactions)

Fixed by #2156
