Okay, I am trying again. Maybe for the last time, maybe not. What am I talking about? Blogging, d’uh.
I’ve had quite a few attempts over the years, but I often struggled to find topics I wanted to write about. And that is kind of weird, since I am doing lots of things that are of interest to other people. Not all people, but at least those with similar interests.
This time I will be taking a different approach to blogging though - I’ll treat my blog as a form of personal knowledge base: just recently I encountered a few problems I could swear I had run into before - I just could not remember the solution.
So let’s get started…
For a long time - probably two or three years - I’ve been running a Kubernetes cluster for my homelab. It’s a combination of Raspberry Pis and a few VMs deployed on Proxmox. I started the whole setup with k3s, initially deployed via k3os, later switching to openSUSE MicroOS. When I finally received the TuringPi RK1 modules for the TuringPi v2 I got on Kickstarter, I decided to give Talos Linux a chance. And things turned out to work significantly better than anticipated.
Now, because I already had one running Kubernetes cluster, I wanted some consistency between the two setups. Even though the majority of folks on the TuringPi Discord were using either Flannel or Cilium as their CNI, I decided to stick with Calico and also tried enabling eBPF this time… (so much for consistency…)
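I won’t reproduce my whole Tigera setup here, but with the operator the switch to the eBPF dataplane essentially boils down to patching the Installation resource (named default by the operator), roughly like this:
$ kubectl patch installation.operator.tigera.io default --type merge -p '{"spec":{"calicoNetwork":{"linuxDataplane":"BPF"}}}'
Depending on the environment, Calico also wants a kubernetes-services-endpoint ConfigMap in the tigera-operator namespace, so that calico-node can reach the API server without going through kube-proxy.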
It turned out that there are some interoperability issues between Calico eBPF and Talos, currently tracked in GitHub Issue #7892. But since everything works fine directly after rebooting the nodes, I kept eBPF enabled in my setup.
Fast forward several months…
This week I decided to look at a few of the more basic services deployed in my Talos cluster and to upgrade the Tigera operator. The version change was minor, so I didn’t expect any major issues. Of course, the operator deployed slightly newer versions of the various Calico images, and that’s when things went down the drain.
When the first calico-node pod came up, it showed the exact same symptoms as described in the GitHub issue mentioned above. But I had forgotten that this only happened whenever calico-node was restarted. Instead, I thought this was a Calico regression and tried disabling eBPF. Because of that, the operator attempted to restart another calico-node - which did not come up properly either. On top of that, there were already some errors on the console of the two Talos nodes.
At that point I decided to reboot the two affected nodes, and after doing so, there were a few noteworthy things in the logs:
[ 28.387980] [talos] hello failed {"component": "controller-runtime", "controller": "cluster.DiscoveryServiceController", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 1970-01-01T00:00:27Z is before 2024-06-09T09:3:00Z\"", "endpoint": "discovery.talos.dev:443"}
[ 120.162505] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: Get \"https://127.0.0.1:10250/pods/?timeout=30s\": remote error: tls: internal error"}
[ 135.617749] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: Get \"https://127.0.0.1:10250/pods/?timeout=30s\": remote error: tls: internal error"}
[ 150.918145] [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: Get \"https://127.0.0.1:10250/pods/?timeout=30s\": remote error: tls: internal error"}

Seeing those certificate errors, I was looking for possible reasons in the logs and behold:
[ 26.390416] [talos] failed looking up "time.cloudflare.com", ignored {"component": "controller-runtime", "controller": "time.SyncController", "error": "lookup time.cloudflare.com on 8.8.8.8:53: dial udp 8.8.8.8:53: connect: network is unreachable"}
[ 26.709146] [talos] kubelet client certificate does not match any accepted CAs, removing {"component": "controller-runtime", "controller": "k8s.KubeletServiceController", "verify_error": "x509: certificate has expired or is not yet valid: current time 1970-01-01T00:00:26Z is before 2024-07-24T07:05:09Z"}

So obviously, some certificate was being removed because the node’s time was not properly synchronized yet - probably because the network stack had not fully come up at that point.
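For what it’s worth, whether a node’s clock has actually synced can be checked directly via talosctl - at least on my Talos version the time subcommand does the trick (replace <node-ip> accordingly):
$ talosctl -n <node-ip> time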
Man was I wrong…
Ignoring for a moment that I tried all kinds of other stuff to fix the problem - e.g., replacing Calico with Flannel - the right thing to do at that point would have been to use Google to search for the last messages I was seeing in the logs. But I was so certain about the root cause that I was only looking for answers related to that. Had I done the right thing, I would probably have come across the Talos Troubleshooting item that specifically talks about this problem.
In all fairness, the document describes the solution in rather abstract terms, and luckily I found the answer on Stack Overflow before I later came across the Troubleshooting document.
So what is the actual root cause? The kubelet’s serving certificates are issued via the kube-apiserver, and I believe it’s those certificates that get removed on reboot while the time is not yet synchronized. As a result, new Certificate Signing Requests (CSRs) are generated for the booting node, and normally those are automatically approved by the kubelet-serving-cert-approver running inside the cluster. Unfortunately, with two of the four nodes down, I had decided to cordon the other two nodes. So when the kubelet-serving-cert-approver that had been running on one of the two problematic nodes was restarted, it couldn’t be scheduled on any of the remaining nodes, and thus the CSRs stayed Pending.
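A quick way to check that part of the theory is to see whether the approver is actually running anywhere - a lazy grep across all namespaces, since the namespace may differ between installations:
$ kubectl get pods -A | grep kubelet-serving-cert-approver
If it cannot be scheduled anywhere - as in my case, with every eligible node either down or cordoned - the CSRs simply pile up in Pending.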
The solution is rather simple. First of all, we need to find the pending CSRs:
$ kubectl get csr --sort-by=.metadata.creationTimestamp

This will produce a list of CSRs, sorted by their creation date. The list will show the status of the CSRs and in my case there were several in Pending state.
To approve the pending certificates, the following has to be done for each of them:
$ kubectl certificate approve <csr-id>

The <csr-id> should be the first column in the output produced by the first command.
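If there is a whole bunch of pending requests, a bit of shell glue can speed this up - just be aware that it blindly approves everything currently in Pending state, so only use it if you trust all of those requests:
$ kubectl get csr | awk '/Pending/ {print $1}' | xargs kubectl certificate approve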
That’s it. At that point, I had four nodes with Pending CSRs, and upon approving them, the cluster went back into happy mode. On the Talos Discord, I was asked by Talos Support why I was using kubelet-serving-cert-approver and not trustd, which is supposed to handle this. Now, I have no idea whether that actually works, because I never installed kubelet-serving-cert-approver myself - apparently that was done by some Talos install. And disabling the approver definitely results in a system where I have to manually approve every node after a reboot, so I am going to stick with kubelet-serving-cert-approver for a while. Maybe I’ll have another look at this when upgrading to Talos 1.8.
And that concludes the restart of my blogging for today…