OS.EnableFirewall=y breaks load balanced sets probing...

We have several Ubuntu 14.04 LTS (classic) VMs in the Azure cloud running HTTPS web services on port 443. These web services are exposed to the Internet through load balanced sets whose probe port is also set to 443. Yesterday we applied security updates to these VMs, including an update of walinuxagent from v2.0.14 to v2.0.16, after which the web services were no longer accessible.

After much troubleshooting we discovered that the probes sent from the Azure fabric IP, 168.63.129.16, were never getting a reply from our servers, as this tcpdump output shows:

01:25:06.517671 IP 168.63.129.16.55780 > 10.0.0.6.https: Flags [SEW], seq 2458085120, win 8192, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0
01:25:09.532881 IP 168.63.129.16.55780 > 10.0.0.6.https: Flags [SEW], seq 2458085120, win 8192, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0
01:25:15.532769 IP 168.63.129.16.55780 > 10.0.0.6.https: Flags [S], seq 2458085120, win 8192, options [mss 1440,nop,nop,sackOK], length 0
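
For anyone repeating this diagnosis, here is a minimal sketch that scans tcpdump text output for SYNs that never received a SYN-ACK. It assumes the default one-line-per-packet text format shown above; the function name `unanswered_syns` and the regular expression are illustrative and not part of tcpdump or walinuxagent.

```python
import re
from typing import List

# Matches the tcpdump text format shown above:
# "<time> IP <src>.<sport> > <dst>.<dport>: Flags [<flags>], ..."
LINE_RE = re.compile(
    r"^(?P<time>\S+) IP (?P<src>\S+) > (?P<dst>\S+): Flags \[(?P<flags>[^\]]+)\]"
)


def unanswered_syns(capture: str) -> List[str]:
    """Return 'src > dst' flows that sent a SYN but never got a SYN-ACK back."""
    syn_flows = []    # (src, dst) pairs that sent a pure SYN, in first-seen order
    answered = set()  # (src, dst) pairs whose SYN was answered with a SYN-ACK
    for raw in capture.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip anything that is not a TCP packet line
        src, dst, flags = match["src"], match["dst"], match["flags"]
        if "S" in flags and "." not in flags:
            # Pure SYN (the E/W ECN bits may also be set, as in [SEW]).
            if (src, dst) not in syn_flows:
                syn_flows.append((src, dst))
        elif "S" in flags and "." in flags:
            # SYN-ACK: it answers the flow going in the opposite direction.
            answered.add((dst, src))
    return [f"{s} > {d}" for s, d in syn_flows if (s, d) not in answered]


# The failing capture from this report: three SYN retries, no reply.
FAILING = """\
01:25:06.517671 IP 168.63.129.16.55780 > 10.0.0.6.https: Flags [SEW], seq 2458085120, win 8192, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0
01:25:09.532881 IP 168.63.129.16.55780 > 10.0.0.6.https: Flags [SEW], seq 2458085120, win 8192, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0
01:25:15.532769 IP 168.63.129.16.55780 > 10.0.0.6.https: Flags [S], seq 2458085120, win 8192, options [mss 1440,nop,nop,sackOK], length 0
"""

if __name__ == "__main__":
    for flow in unanswered_syns(FAILING):
        print("no SYN-ACK seen for", flow)
```

Feeding it a saved capture (for example the text printed by `tcpdump -n`-style output redirected to a file) lists every probe flow the server never answered.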

We then reverted the updated packages one by one and eventually found that the updated walinuxagent package was the cause of the failure. Reviewing /etc/waagent.conf, we found a new config option, OS.EnableFirewall, and that it was enabled. Once we disabled that option and rebooted the server (one that had not been downgraded), the web services were accessible again, as the probe requests were now getting responses:

20:57:50.482060 IP 168.63.129.16.60021 > 10.0.0.6.https: Flags [SEW], seq 2427470624, win 8192, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0
20:57:50.482113 IP 10.0.0.6.https > 168.63.129.16.60021: Flags [S.], seq 2514945281, ack 2427470625, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
20:57:50.482157 IP 168.63.129.16.59962 > 10.0.0.6.https: Flags [.], ack 2, win 513, length 0
20:57:50.482276 IP 168.63.129.16.60021 > 10.0.0.6.https: Flags [.], ack 1, win 513, length 0
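
The config check described above can be scripted. This is a rough sketch that reads the value of OS.EnableFirewall from the text of a waagent.conf-style file. It assumes the file is plain `Key=value` lines with `#` comments and that the last occurrence of a key wins; the helper name `read_option` and the sample text are hypothetical, not taken from the agent.

```python
from typing import Optional


def read_option(conf_text: str, key: str) -> Optional[str]:
    """Return the value of `key` in Key=value config text, or None if absent.

    Assumption: '#' starts a comment and the last occurrence of a key wins.
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        name, _, val = line.partition("=")
        if name.strip() == key:
            value = val.strip()
    return value


# Hypothetical excerpt, not copied from the real file.
SAMPLE = """\
# Sample comment line
OS.EnableFirewall=y
"""

if __name__ == "__main__":
    setting = read_option(SAMPLE, "OS.EnableFirewall")
    if setting == "y":
        print("OS.EnableFirewall is enabled")
```

On a real VM you would pass in the contents of /etc/waagent.conf instead of `SAMPLE`.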

We reviewed the commits to the waagent.conf file on GitHub and found that a recent commit, e247e7b2f23cdf2fc754f8c95161c74853334a45, had added this option along with firewall rules blocking any non-root process from communicating with the fabric server 168.63.129.16. Of course our web services on port 443 do not run as root (each is a custom Twisted Python service running as a service user) and hence are not allowed to answer the probes from the fabric.
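
To make the failure mode concrete, here is a toy model of a first-match rule list that only lets root-owned processes send traffic to the fabric IP. It is a simplified illustration of the behaviour described above, not the agent's actual rules; the `Rule` class, `RULES` list, and `verdict_for` function are all hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional

FABRIC_IP = "168.63.129.16"


@dataclass(frozen=True)
class Rule:
    dst: str             # destination IP the rule applies to
    uid: Optional[int]   # owning user id to match; None matches any owner
    verdict: str         # "ACCEPT" or "DROP"


# Assumed rule set: root may talk to the fabric, everyone else is dropped.
RULES: List[Rule] = [
    Rule(FABRIC_IP, 0, "ACCEPT"),
    Rule(FABRIC_IP, None, "DROP"),
]


def verdict_for(dst: str, uid: int, rules: List[Rule] = RULES) -> str:
    """Return the verdict of the first rule matching an outbound packet."""
    for rule in rules:
        if rule.dst == dst and (rule.uid is None or rule.uid == uid):
            return rule.verdict
    return "ACCEPT"  # assumed default policy when no rule matches


if __name__ == "__main__":
    # A reply from a service user (uid 1001 here, chosen arbitrarily) is dropped.
    print(verdict_for(FABRIC_IP, 1001))
    # The same reply from a root-owned process would pass.
    print(verdict_for(FABRIC_IP, 0))
```

Under this model the probe's SYN still arrives, but the reply from a non-root service is an outbound packet to 168.63.129.16 and gets dropped, which would match the unanswered SYNs in the first capture.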

There was no warning about this change in any release notes, and the option was enabled by default, in conflict with the comment directly above it in the config file, which says it is disabled by default. This issue cost us quite a bit of engineering time to diagnose before we could restore our web services. I would recommend that this option be disabled by default, or at least that users be warned when it is enabled!

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

hglkrijger commented, Jul 16, 2018

@Suvitruf - the initial firewall rule was disabled because it was too restrictive, and hence this issue was closed. Since then we have started rolling out essentially the same functional change but with a less restrictive rule, which should not affect load balancer probes. Thanks for pointing out that the comment in the config needs to be updated; I have opened #1260 for that.

Suvitruf commented, Jul 15, 2018

Not sure why this was closed. I’ve just deployed a VM, and in /etc/waagent.conf OS.EnableFirewall=y is set.

And still: https://github.com/Azure/WALinuxAgent/blob/master/config/ubuntu/waagent.conf#L107

The comment says that by default it should be false, but in fact it is true.
