More Networking Woes...

More Problems

After fixing the issues introduced by enabling the Talos ingress firewall, I quickly realized that I wasn’t completely back to normal yet: some of the deployed services had problems connecting to resources on the local network. The first thing I noticed was that this seemed to be limited to pods with an address from the Pod CIDR; pods configured with host networking seemed to work properly.
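
That distinction pointed at traffic sourced from Pod CIDR addresses specifically. For illustration, a pod running with host networking - the kind that kept working - simply uses the node’s own address instead of a Pod CIDR one; the manifest below is a hypothetical example, not one of my actual workloads.

```yaml
# Hypothetical pod, only to illustrate the distinction: with hostNetwork
# enabled, the pod shares the node's network namespace and therefore
# does not get an address from the Pod CIDR.
apiVersion: v1
kind: Pod
metadata:
  name: hostnet-test
spec:
  hostNetwork: true
  containers:
    - name: probe
      image: busybox
      command: ["sleep", "3600"]
```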

Assuming that this was somehow related to the recent changes, I revisited all the changes I made as part of the ingress firewall configuration and the unfortunate re-deployment of Calico:

  • The talhelper configuration, including switching the firewall off again
  • The Tigera and Felix configurations, disabling and re-enabling BPF (see the sketch after this list)
  • Any Global Network Policy that might be causing the problem
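
The BPF toggle mentioned above lives in Calico’s FelixConfiguration resource. The snippet below is a minimal sketch of what was flipped off and on; the resource name and value shown are the usual defaults, not necessarily my exact configuration.

```yaml
# Minimal sketch of the Felix setting that was toggled while testing.
# "default" is the conventional FelixConfiguration name; only the BPF
# flag is shown here.
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  bpfEnabled: true   # disabled and re-enabled during troubleshooting
```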

In addition to that, I rebooted the whole cluster - luckily I am still pretty much in the setup phase - as well as my internet gateway. Nothing.

Finding the Issue

After some googling, I was about to open an issue on Calico’s GitHub, for which I wanted to provide as much information as possible. While collecting that information, I realized that the problem was limited to one subnet in particular; all the other local networks were reachable just fine. After spending some time thinking about what was special about this particular network, I decided to look at all the network-related configuration in the cluster, and found a MetalLB ipaddresspool that was clearly handing out IP addresses from that problematic network CIDR - I had set that up months ago when I wanted to expose services on that network. Luckily, none of those addresses were in use, so I could simply delete that configuration.
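
For context, a MetalLB pool of that kind looks roughly like the snippet below. The name and exact address range are assumptions for illustration; the real pool drew from the 192.168.1.0/24 network mentioned in the next paragraph.

```yaml
# Illustrative IPAddressPool of the kind that was deleted.
# Name and range are assumptions; the actual pool allocated
# addresses out of the problematic 192.168.1.0/24 network.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: local-network
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250
```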

Because I configured MetalLB in such a way that it relies on Calico to do the BGP announcements, I knew that I also had to modify the bgpconfiguration for Calico. Looking at that, I noticed that it announced 192.168.1.0/24 - the problematic network - via the serviceLoadBalancerIPs property. I removed that as well, and everything started working again almost instantaneously.
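
To make that concrete, the relevant part of the Calico BGPConfiguration looked roughly like this before the fix. Everything except the offending entry is omitted, and the resource name shown is the usual default rather than something I can confirm from memory.

```yaml
# Sketch of the BGPConfiguration entry that was removed.
# serviceLoadBalancerIPs tells Calico to announce these CIDRs via BGP;
# 192.168.1.0/24 is the problematic local network.
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  serviceLoadBalancerIPs:
    - cidr: 192.168.1.0/24
```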

What’s Bothering Me

As always, I am glad I could resolve the issue by myself, as this is the best way to learn. However, in this case I don’t feel as if the problem is completely understood. The configuration I had to remove had been in place for months. It didn’t cause any issues through Calico updates, cluster reboots, or even complete restarts of network routers and switches. So it is still a mystery to me why this suddenly started causing issues.

Last modified: 2 September 2024