How to force a certificate rotation with RKE, on a k8s cluster built with Rancher, where the certificates have expired and the Rancher cluster used to build it has disappeared
The title is quite a mouthful, but I couldn’t figure out a shorter one that would describe what this article is about.
But let’s start at the beginning. One of my clients, who is running k8s with Rancher across 5 active clusters, had an issue where Rancher lost the connection to first one k8s cluster, and later on a second. The error message was “TLS handshake timeout”, and we didn’t immediately pinpoint that it was the certificates that had expired. So I had several sessions with the Rancher support guys, where they got the certs rotated and the clusters up and running again, sessions which taught me a bit more about RKE.
Rotating the certificates wasn’t as easy as it seemed. The k8s clusters were created by a Rancher cluster that was no longer in existence, and imported into a new one. This meant we couldn’t use the built-in certificate rotation mechanism in Rancher — not directly anyway. But because the k8s clusters were originally created with Rancher, we could actually use RKE to do the rotation.
Prepare RKE on the Rancher nodes
The first step was to log in to one of the Rancher nodes and create a minimal cluster.yaml file for RKE to use.
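A sketch of what such a file can look like; the node addresses, the pinned Kubernetes version and all the vSphere details are placeholders, so substitute your own values:

```yaml
# cluster.yaml -- a sketch, not the real file: every address,
# credential and version string below is a placeholder.
nodes:
  - address: 10.0.0.11
    user: rancher
    role: [controlplane, etcd, worker]
    ssh_key_path: ~/.ssh/id_rsa
  - address: 10.0.0.12
    user: rancher
    role: [worker]
    ssh_key_path: ~/.ssh/id_rsa

# RancherOS 1.4.0 ships a Docker version that isn't on Rancher's
# supported list, so tell RKE to skip the check.
ignore_docker_version: true

# Pin the version so the run doesn't double as a Kubernetes upgrade.
kubernetes_version: "v1.11.3-rancher1-1"

# Without this, the cluster loses access to its VSAN-backed PVCs.
cloud_provider:
  name: vsphere
  vsphereCloudProvider:
    global:
      insecure-flag: true
    virtual_center:
      vcenter.example.com:
        user: "vc-user"
        password: "vc-password"
        datacenters: "dc-1"
    workspace:
      server: vcenter.example.com
      datacenter: dc-1
      folder: kubernetes
      default-datastore: vsanDatastore
```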
Several options were mandatory for us:
- Ignore the Docker version. For some reason, the default Docker version in RancherOS 1.4.0 is _not_ on the list of supported Docker versions for Rancher.
- Pin the Kubernetes version. We did not want to upgrade to the latest version of Kubernetes at the same time as we fixed the expired certificates.
- cloud_provider configuration: The nodes were launched through Rancher with the VMWare cloud provider. This configuration is, among other things, used when you’re using VMWare VSAN as a backend for your Storage Classes. Without this configuration, your cluster can’t reach its PVCs.
Then we had to create an SSH key and distribute the public part to all the servers. We didn’t have the keys that the clusters were originally created with, so the only way we could get access to the nodes was through the VMWare management console. But RKE needs SSH access, so we needed to fix that.
First we created the key.
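A standard RSA keypair does the job; the path is the default one, and the empty passphrase keeps the later RKE runs non-interactive:

```sh
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
```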
Then we distributed the key to the nodes, by letting them pull it via HTTP from an NGINX container on the Rancher node.
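One way to set that up; the `~/www` directory, the container name and the port are arbitrary choices of ours:

```sh
# Stage the pubkey in a directory NGINX can serve
mkdir ~/www
cp ~/.ssh/id_rsa.pub ~/www/

# Serve it on port 8080 -- mount only the staging dir,
# never the .ssh directory with the private key in it
docker run -d --name pubkey -p 8080:80 \
  -v /home/rancher/www:/usr/share/nginx/html:ro \
  nginx
```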
Now we can wget the pubkey from the cluster nodes, via the VMWare console. The easiest way is to put the steps in a script and serve it from the same NGINX container.
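For instance something like this, saved as `~/www/addkey.sh`; the Rancher node IP (10.0.0.10 here) is a placeholder:

```sh
#!/bin/sh
# addkey.sh -- fetch the pubkey from the Rancher node and authorize it
mkdir -p /home/rancher/.ssh
wget -qO- http://10.0.0.10:8080/id_rsa.pub >> /home/rancher/.ssh/authorized_keys
chmod 700 /home/rancher/.ssh
chmod 600 /home/rancher/.ssh/authorized_keys
```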
Fetch and run that script on all the nodes in the k8s cluster. This has to be done by typing the following command manually on each node, through the VMWare console.
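With the same placeholder address as before, that’s:

```sh
wget -qO- http://10.0.0.10:8080/addkey.sh | sh
```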
Repeat this on all the nodes.
Now that we’ve got the SSH pub key distributed to the nodes, we need to log in once to each of them, to register the nodes’ host keys.
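On the Rancher node, a small loop over the node IPs (placeholders again) takes care of it:

```sh
for ip in 10.0.0.11 10.0.0.12; do
  ssh rancher@$ip true
done
```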
You need to accept the host key for every node.
Running RKE to rotate the certificates
Now we’re ready to run the `rke up` command.
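Assuming the RKE binary was downloaded to the current directory:

```sh
./rke up --config cluster.yaml
```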
When we ran the command above, we encountered another problem: RKE could not find the old certificates, and therefore couldn’t rotate them. The Rancher support engineer fortunately knew what to do in that situation:
Ensure that RKE can find the old certificates
RKE apparently creates a directory, `/opt/rke/etc/kubernetes/.tmp`, that contains a copy of the certificates deployed to the cluster. For some reason that directory was missing on some of the hosts. The solution was to copy the certificates from `/opt/rke/etc/kubernetes/ssl` to `/opt/rke/etc/kubernetes/.tmp`, except that some certs were missing there too. As RKE only checks for the existence of the files, not their content, any missing cert files can be generated by copying one of the existing cert and key files.
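In sketch form; `kube-admin` is used here because that was the missing pair in our case, and the `sudo` calls assume RancherOS’s default non-root user:

```sh
# Recreate the .tmp copy from the certs deployed on the node
sudo mkdir -p /opt/rke/etc/kubernetes/.tmp
sudo cp /opt/rke/etc/kubernetes/ssl/* /opt/rke/etc/kubernetes/.tmp/

# Fill in a missing cert/key pair by copying an existing one --
# RKE only checks that the files exist, not what they contain
cd /opt/rke/etc/kubernetes/.tmp
sudo cp kube-apiserver.pem kube-admin.pem
sudo cp kube-apiserver-key.pem kube-admin-key.pem
```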
In the above example it was the kube-admin component that was missing its cert and key.
Re-running the `rke up --config cluster.yaml` command should then rotate the certificates as expected. If you need more info than what RKE outputs by default, you can use the `--debug` flag.