So I’ve been trying to set up the Ingress Firewall for Talos as described in the documentation. To play it safe, I started with the firewall for the worker nodes, since it would be easier to recover from any problem if I screwed up the configuration. Initially, everything looked just fine and I rolled out the changes to all the nodes. Again, I was wrong…
Turns out that it’s especially important to read the documentation carefully when working with firewalls. While it had been pretty clear about where to use CLUSTER_SUBNET, I had misinterpreted it as CLUSTER_CIDR, which obviously did not yield the same result. Once that was fixed (which required resetting all worker nodes), I was ready to also apply the firewall changes to the control plane nodes.
Today, luckily just a day after my firewall work, I’ve been working on my ArgoCD configuration, which resulted in an update to my Calico configuration. When that happens, the Tigera operator will update its DaemonSets, everything will restart, and that did not end up in a happy place.
The first problem I ran into was calico-node pods not coming up properly. Looking at the logs, I noticed the following:
/etc/calico/confd/config/bird.cfg: No such file or directory
/etc/calico/confd/config/bird6.cfg: No such file or directory
Those lines were repeated over and over. After some googling, everything was pointing to the Typha instances. While they seemed to be running properly, I noticed log output that indicated a communication problem. Why? Because calico-node would try to contact Typha via its CLUSTER_SUBNET address on port 5473, which was blocked by the firewall rules. Adding another rule for both worker and control plane nodes fixed that, and I thought I’d be done. Not so fast…
The second thing that happened was calico-node pods not becoming ready on some of my cluster nodes. Since I am a heavy user of k9s, that was easily spotted because those items were just stuck being shown in red. When I looked at the Readiness Probe configuration, I noticed that the following command would be executed to determine the state:
$ calico-node -felix-ready -bird-ready
Shelling into one of the non-ready pods, I was able to get some output that helped move things in the right direction: BGP. The Bird process uses BGP between the different Calico nodes, and that requires yet another port to be opened up: TCP 179, the BGP port.
While the documentation for the Talos IngressFirewall points out that UDP 4789 needs to be opened up in order for Calico’s VXLAN to work, unfortunately it does not mention TCP 179 for BGP, nor TCP 5473 for Typha.
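For reference, here is a rough sketch of what the two extra rules could look like, modeled on the VXLAN rule from the Talos documentation. The rule names calico-typha and calico-bgp are just names I made up for illustration, and $CLUSTER_SUBNET is the same placeholder the Talos docs use, so substitute your own node subnet. Treat this as a starting point under those assumptions, not as the definitive configuration.

```yaml
# Sketch only: rule names are hypothetical, $CLUSTER_SUBNET must be replaced
# with the subnet your nodes live in (not the pod CIDR).

# Allow calico-node to reach Typha on TCP 5473.
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: calico-typha
portSelector:
  ports:
    - 5473
  protocol: tcp
ingress:
  - subnet: $CLUSTER_SUBNET
---
# Allow BGP peering between Calico nodes on TCP 179.
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: calico-bgp
portSelector:
  ports:
    - 179
  protocol: tcp
ingress:
  - subnet: $CLUSTER_SUBNET
```

In my case both rules were needed on worker and control plane nodes alike, since Typha and calico-node can end up running on either.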