OS.EnableFirewall=y breaks load balanced sets probing...
We have several Ubuntu 14.04 LTS (classic) VMs in the Azure cloud running HTTPS web services on port 443. These web services are exposed to the Internet using load balanced sets with the probe port also set to 443. Yesterday we upgraded these VMs with security updates, including an update of walinuxagent from v2.0.14 to v2.0.16, after which these web services were no longer accessible.
After much troubleshooting we discovered that the probes sent from the Azure fabric IP, 168.63.129.16, were never getting a reply from our servers, as shown in this tcpdump output:
01:25:06.517671 IP 168.63.129.16.55780 > 10.0.0.6.https: Flags [SEW], seq 2458085120, win 8192, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0
01:25:09.532881 IP 168.63.129.16.55780 > 10.0.0.6.https: Flags [SEW], seq 2458085120, win 8192, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0
01:25:15.532769 IP 168.63.129.16.55780 > 10.0.0.6.https: Flags [S], seq 2458085120, win 8192, options [mss 1440,nop,nop,sackOK], length 0
We then proceeded to revert the updated packages one by one and eventually found that the updated walinuxagent package was the cause of the failure. Reviewing /etc/waagent.conf, we found a new config option, OS.EnableFirewall, and that it was enabled. Once we disabled that option and rebooted the server (one that had not been downgraded), the web services were accessible again, as the probe requests were now getting responses:
20:57:50.482060 IP 168.63.129.16.60021 > 10.0.0.6.https: Flags [SEW], seq 2427470624, win 8192, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0
20:57:50.482113 IP 10.0.0.6.https > 168.63.129.16.60021: Flags [S.], seq 2514945281, ack 2427470625, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
20:57:50.482157 IP 168.63.129.16.59962 > 10.0.0.6.https: Flags [.], ack 2, win 513, length 0
20:57:50.482276 IP 168.63.129.16.60021 > 10.0.0.6.https: Flags [.], ack 1, win 513, length 0
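The difference between the two captures is whether any SYN from 168.63.129.16 is followed by a SYN-ACK (`[S.]`) going back. A minimal sketch of that check, assuming tcpdump lines in exactly the shape shown above; the helper name `unanswered_probes` and the regular expression are illustrative, not part of any Azure tooling:

```python
import re

# Matches the tcpdump line shape shown in the captures above:
# "<time> IP <src>.<sport> > <dst>.<dport>: Flags [<flags>], ..."
LINE_RE = re.compile(
    r"^\S+ IP (?P<src>\S+)\.(?P<sport>[^.\s]+) > "
    r"(?P<dst>\S+)\.(?P<dport>[^.\s:]+): Flags \[(?P<flags>[^\]]+)\]"
)


def unanswered_probes(lines, probe_ip="168.63.129.16"):
    """Return (probe_port, local_port) pairs whose SYN never got a SYN-ACK."""
    pending = []      # connections with a SYN seen, in first-seen order
    answered = set()  # connections for which a SYN-ACK was sent back
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a line in the expected format
        flags = m["flags"]
        if m["src"] == probe_ip and "S" in flags and "." not in flags:
            # SYN from the probe source (retransmits share the same key)
            key = (m["sport"], m["dport"])
            if key not in pending:
                pending.append(key)
        elif m["dst"] == probe_ip and "S" in flags and "." in flags:
            # SYN-ACK back to the probe source
            answered.add((m["dport"], m["sport"]))
    return [key for key in pending if key not in answered]


failing_capture = [
    "01:25:06.517671 IP 168.63.129.16.55780 > 10.0.0.6.https: Flags [SEW], seq 2458085120, win 8192, length 0",
    "01:25:09.532881 IP 168.63.129.16.55780 > 10.0.0.6.https: Flags [SEW], seq 2458085120, win 8192, length 0",
    "01:25:15.532769 IP 168.63.129.16.55780 > 10.0.0.6.https: Flags [S], seq 2458085120, win 8192, length 0",
]
print(unanswered_probes(failing_capture))  # -> [('55780', 'https')]
```

Retransmitted SYNs reuse the same source port, so the three lines of the failing capture collapse into one unanswered connection.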
We reviewed the commits to the waagent.conf file on GitHub and found that a recent commit, e247e7b2f23cdf2fc754f8c95161c74853334a45, had added this option along with firewall rules blocking any non-root process from communicating with the fabric server 168.63.129.16. Of course our web services on port 443 are not running as root (it is a custom Twisted Python service running as a service user) and hence are not allowed to communicate with the fabric, so its probes go unanswered.
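To make the effect concrete, here is a toy model of the behaviour described above: traffic to the fabric IP is allowed only from processes running as root (uid 0). This is a sketch of the description in this issue, not the agent's actual rule set; the function name `outbound_verdict` and the example uid 1001 are hypothetical.

```python
FABRIC_IP = "168.63.129.16"


def outbound_verdict(dst_ip, uid):
    """Toy model: packets to the fabric IP pass only if sent by root (uid 0)."""
    if dst_ip != FABRIC_IP:
        return "ACCEPT"  # the rule only concerns traffic to the fabric IP
    return "ACCEPT" if uid == 0 else "DROP"


# The agent itself runs as root, so its own traffic still passes.
print(outbound_verdict(FABRIC_IP, uid=0))     # -> ACCEPT
# A web service running as a service user (hypothetical uid 1001) cannot
# send its reply to the probe source.
print(outbound_verdict(FABRIC_IP, uid=1001))  # -> DROP
```

Under this model the probe's SYN still reaches the VM, which matches the first capture: the packets arrive but no reply ever leaves.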
There was no warning about this change in any release notes, and the option was enabled by default (in conflict with the comment directly above it in the config file, which says it is disabled by default). This issue cost us quite a bit of engineering time to find the solution and restore our web services. I would recommend that this option be disabled by default, or at least that the user be warned when it is enabled!
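Since the shipped value and the comment above it can disagree, it may be worth checking the effective setting directly. A minimal sketch, assuming waagent.conf uses `Key=value` lines with `#` comments (as the `OS.EnableFirewall=y` line suggests) and that a later occurrence overrides an earlier one; the helper names and the sample comment text are illustrative:

```python
def get_option(conf_text, key):
    """Return the value of `key` in waagent.conf-style text, or None if unset."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        name, _, val = line.partition("=")
        if name.strip() == key:
            value = val.strip()  # assumption: the last occurrence wins
    return value


def firewall_enabled(conf_text):
    """True if OS.EnableFirewall is explicitly set to y in the given text."""
    return (get_option(conf_text, "OS.EnableFirewall") or "").lower() == "y"


sample = """\
# Illustrative comment, not the wording in the real file
OS.EnableFirewall=y
"""
print(firewall_enabled(sample))  # -> True
```

On a VM, the text would come from reading /etc/waagent.conf. As described above, after changing the value to `n` we rebooted the server for it to take effect.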
Issue Analytics
- Created 6 years ago
- Comments:6 (3 by maintainers)
Top GitHub Comments
@Suvitruf - the initial firewall rule was disabled because it was too restrictive, and hence this issue was closed. Since then we have started rolling out essentially the same functional change but with a less restrictive rule, which should not affect load balancer probes. Thanks for pointing out that the comment in the config needs to be updated; I have opened #1260 for that.
Not sure why it was closed. I've just deployed a VM, and in /etc/waagent.conf OS.EnableFirewall=y was enabled. And still: https://github.com/Azure/WALinuxAgent/blob/master/config/ubuntu/waagent.conf#L107
The comment says that by default it should be false, but in fact it is true.