diff --git a/content/blog/speed-up-jenkins-k8s-startup.md b/content/blog/speed-up-jenkins-k8s-startup.md
new file mode 100644
index 0000000..838bfbb
--- /dev/null
+++ b/content/blog/speed-up-jenkins-k8s-startup.md
@@ -0,0 +1,96 @@
++++
+title = 'Speed Up Jenkins Startup Time in Kubernetes'
+date = 2022-12-01T21:40:17-06:00
++++
+
+I recently migrated my Jenkins server at home to run inside my Kubernetes
+cluster. I am very happy with it overall; upgrades are a lot simpler, and
+Longhorn volume snapshots make rolling back bad plugin updates a breeze. One
+issue that troubled me for a while, though, was that it took a *really* long
+time for the Jenkins server container to start. Kubernetes would list the pod
+in `ContainerCreating` state for several minutes, and then in
+`CreateContainerError` for a while, before finally starting the process. It
+turns out this was because of the huge number of files in the Jenkins home
+directory. When the container starts up, every file in the persistent volume
+has to be visited so that its ownership and SELinux label can be fixed. My
+Jenkins instance has over 1.5 million files, so scanning and modifying them all
+takes a very long time.
+
+I was finally able to fix this issue today, after messing with it for a week or
+so. There are two things that have to be changed on every file in the
+persistent volume:
+
+1. The group ownership/GID
+2. The SELinux label
+
+Fixing the first problem is straightforward: set
+`securityContext.fsGroupChangePolicy` on the pod (it is not a container-level
+field) to `OnRootMismatch`. The kubelet will check the ownership of the root
+directory of the persistent volume, and if it is correct, skip checking any of
+the rest of the files and directories.
+
+The second problem was quite a bit trickier, but still fixable. It took me a
+bit longer to get the solution right, but with the help of a [CRI-O GitHub
+issue][0], I finally managed. The key is to configure the container to have a
+static SELinux context; by default, the container runtime will assign a random
+category when the container starts. Naturally, this means the context labels
+of all the files in the persistent volume have to be changed every time, to
+match the new category. Fortunately, the
+`securityContext.seLinuxOptions.level` setting on the pod or container can pin
+it. I looked at the category of the current Jenkins process and set
+`level` to that:
+
+```sh
+ps Z -p $(pgrep -f 'jenkins\.war')
+```
+
+```
+LABEL                                          PID TTY      STAT   TIME COMMAND
+system_u:system_r:container_t:s0:c525,c600  196790 ?        Sl     0:50 java -Duser.home=/var/jenkins_home -Djenkins.model.Jenkins.slaveAgentPort=50000 -Dhudson.lifecycle=hudson.lifecycle.ExitLifecycle -jar /usr/share/jenkins/jenkins.war
+```
+
+The *level* field is the last part of the process's label, after the type
+(`container_t`): the sensitivity and the categories, `s0:c525,c600` here.
+
+```yaml
+spec:
+  containers:
+  - securityContext:
+      seLinuxOptions:
+        level: s0:c525,c600
+```
+
+With this setting in place, the container will start with the same SELinux
+context every time, so if the files are already labelled correctly, they do not
+have to be changed. Unfortunately, by default, CRI-O still walks the whole
+directory tree to make sure. It can be configured to skip that step, though,
+similar to the `fsGroupChangePolicy`. The pod needs a special annotation:
+
+```yaml
+metadata:
+  annotations:
+    io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: 'true'
+```
+
+CRI-O itself also has to be configured to respect that annotation. CRI-O's
+configuration is not well documented, but I was able to determine that these
+two lines need to be added to `/etc/crio/crio.conf`:
+
+```toml
+[crio.runtime.runtimes.runc]
+allowed_annotations = ["io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel"]
+```
+
+In summary, there were four steps to keep Kubernetes and CRI-O from scanning
+and touching every file in the persistent volume when starting the Jenkins
+container:
+
+1. Set `securityContext.fsGroupChangePolicy` to `OnRootMismatch`
+2. Set `securityContext.seLinuxOptions.level` to a static value
+3. Add the `io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel` annotation
+4. Configure CRI-O to respect said annotation
+
+After completing all four steps, the Jenkins container starts up in seconds
+instead of minutes.
+
+[0]: https://github.com/cri-o/cri-o/issues/6185
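+
+As a rough sketch, the three pod-side settings might be combined in a single
+manifest like the one below. The pod name, image, `fsGroup` value, and claim
+name are assumptions for illustration and do not come from the setup described
+above; the `level` is the one read from the `ps Z` output. The fourth step,
+the CRI-O configuration, lives on the node and is not part of the manifest.
+
+```yaml
+# Minimal sketch; names, image, fsGroup, and claimName are assumed values.
+apiVersion: v1
+kind: Pod
+metadata:
+  name: jenkins                    # hypothetical name
+  annotations:
+    # Step 3: only honored if CRI-O lists it in allowed_annotations (step 4)
+    io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: 'true'
+spec:
+  securityContext:
+    fsGroup: 1000                  # assumed GID; the policy below needs fsGroup set
+    fsGroupChangePolicy: OnRootMismatch   # Step 1 (pod-level field)
+  containers:
+  - name: jenkins
+    image: jenkins/jenkins:lts     # assumed image
+    securityContext:
+      seLinuxOptions:
+        level: s0:c525,c600        # Step 2: static level instead of a random one
+    volumeMounts:
+    - name: jenkins-home
+      mountPath: /var/jenkins_home
+  volumes:
+  - name: jenkins-home
+    persistentVolumeClaim:
+      claimName: jenkins-home      # hypothetical PVC name
+```
+
+When the pod is managed by a Deployment or StatefulSet, the annotation and both
+`securityContext` blocks would go under `spec.template` rather than at the top
+level.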