If a Jenkins job runs for a while, Kubernetes may eventually schedule
other Pods on the node it is running on. If a long-running Pod gets
assigned to the ephemeral node, the Cluster Autoscaler won't be able to
scale down the ASG. To prevent this, we apply a taint to the node so
normal Pods will not be scheduled on it, and apply the corresponding
toleration to the Pods for Jenkins jobs.
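Roughly, that pairing looks like the following; the taint key and value
here are illustrative assumptions, not the names actually used:

```yaml
# Illustrative taint on the ephemeral node, applied with e.g.:
#   kubectl taint nodes <node> jenkins-only=true:NoSchedule
# (key and value are assumptions).
#
# Matching toleration on the Pod template for Jenkins agent Pods:
apiVersion: v1
kind: Pod
metadata:
  name: jenkins-agent
spec:
  tolerations:
    - key: jenkins-only
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: jnlp
      image: jenkins/inbound-agent
```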
Fedora AMIs have the default locale set to en_US.UTF-8, whose collation
rules sort `100-crio-bridge.conflist` before `10-calico.conflist`.
Because the container runtime uses the first CNI configuration file in
sort order, Pods end up with the default bridge network instead of
Calico, and cannot be reached from other Pods on the container network.
Since we do not need the default configuration, the easiest way to
resolve this is to simply delete it.
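A sketch of that cleanup as a *cloud-init* directive, assuming the file
sits in the standard CNI configuration directory:

```yaml
# Remove CRI-O's default CNI config so 10-calico.conflist is the only
# file left in /etc/cni/net.d, regardless of collation order.
runcmd:
  - rm -f /etc/cni/net.d/100-crio-bridge.conflist
```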
Jenkins jobs that build container images in user namespaces need access
to `/dev/fuse`, which is provided by the [fuse-device-plugin][0]. This
plugin runs as a DaemonSet; when it starts, it updates the status of the
node it is running on to indicate that the FUSE device is available.
When scaling up from zero nodes, the Cluster Autoscaler has no way to
know that this will occur, and therefore cannot determine that scaling
up the ASG will create a node with the required resources. Thus, the ASG
needs a tag informing CA that the nodes it creates will indeed have
those resources, so that scaling it up will allow the Pod to be
scheduled (the tag format is sketched below).
Although this feature of CA was added in 1.14, it apparently got broken
at some point and no longer works in 1.22. It works again in 1.26,
though.
[0]: https://github.com/kuberenetes-learning-group/fuse-device-plugin/tree/master
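Concretely, the tag follows CA's node-template convention; the resource
name below matches what the fuse-device-plugin advertises, and the
quantity is an illustrative assumption:

```yaml
# ASG tag (key: value) telling CA that nodes from this group will
# advertise the FUSE extended resource even before the DaemonSet runs.
# The quantity is illustrative.
k8s.io/cluster-autoscaler/node-template/resources/github.com/fuse: "1"
```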
The *cri-o* package has moved from its own module into the base Fedora
repository, as Fedora is [eliminating modules][0]. The last modular
version was 1.25, which is too old to run pods with user namespaces.
Version 1.26, which does support user namespaces, is available in the
base repository.
[0]: https://fedoraproject.org/wiki/Changes/RetireModularity
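For reference, a minimal sketch of a Pod that requests a user namespace
via the `hostUsers` field; whether this exact mechanism is what the
Jenkins jobs use is an assumption, and the image is a placeholder:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-build
spec:
  hostUsers: false  # run this Pod in a user namespace (KEP-127)
  containers:
    - name: build
      image: quay.io/buildah/stable  # placeholder image
```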
Lately, cloud nodes seem to be failing to come up more frequently. I
tracked this down to the fact that `/etc/resolv.conf` in the
`kube-proxy` container contains both the AWS-provided DNS server and the
on-premises server set by WireGuard. This evidently "works" sometimes,
but not always. When it doesn't, `kube-proxy` cannot resolve the
Kubernetes API server address, and thus cannot create the netfilter
rules needed to forward traffic correctly, leaving Pods unable to
communicate.
I am not entirely sure what the "correct" solution to this problem would
be, since there are various issues in play here. Fortunately, cloud
nodes are only ever around for a short time, and never need to be
rebooted. As such, we can use a "quick fix" and simply remove the
AWS-provided DNS configuration.
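A sketch of that quick fix as a *cloud-init* directive; the
NetworkManager connection name is an assumption and may differ per
image:

```yaml
# Ignore the DHCP-provided (AWS) DNS server so only the on-premises
# resolver configured by WireGuard ends up in /etc/resolv.conf.
# The connection name 'cloud-init eth0' is an assumption.
runcmd:
  - nmcli connection modify 'cloud-init eth0' ipv4.ignore-auto-dns yes
  - nmcli device reapply eth0
```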
The Cluster Autoscaler uses EC2 Auto-Scaling Groups to configure the
instances it launches when it determines additional worker nodes are
necessary. Auto-Scaling Groups have an associated Launch Template,
which describes the properties of the instances, such as AMI ID,
instance type, security groups, etc.
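For context, CA typically finds the ASGs it manages through
auto-discovery tags like these (the cluster name is illustrative, and
whether this deployment relies on auto-discovery is an assumption):

```yaml
# Standard CA auto-discovery tags on the Auto-Scaling Group:
k8s.io/cluster-autoscaler/enabled: "true"
k8s.io/cluster-autoscaler/dynk8s: "owned"  # cluster name is illustrative
```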
When instances are first launched, they need to be configured to join
the on-premises Kubernetes cluster. This is handled by *cloud-init*
using the configuration in the instance user data. The configuration
supplied here specifies the Fedora packages that need to be installed on
a Kubernetes worker node, plus some additional configuration required by
`kubeadm`, `kubelet`, and/or `cri-o`. It also includes a script that
fetches the WireGuard client configuration and connects to the VPN,
finalizes the setup process, and joins the cluster.
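A condensed sketch of what such user data might look like; the package
list reflects the text above, while the URLs, token, and hash are
placeholders:

```yaml
#cloud-config
packages:
  - cri-o
  - kubeadm
  - kubelet
  - wireguard-tools
runcmd:
  # Fetch the WireGuard client configuration and bring up the VPN
  # (endpoint is a placeholder).
  - curl -fsSL https://provisioner.example.com/wg/client.conf -o /etc/wireguard/wg0.conf
  - systemctl enable --now wg-quick@wg0
  # Finalize setup and join the on-premises cluster (placeholders).
  - kubeadm join kube.example.com:6443 --token REDACTED --discovery-token-ca-cert-hash sha256:REDACTED
```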
The `terraform` directory contains the resource descriptions for all AWS
services that need to be configured in order for the dynamic K8s
provisioner to work. Specifically, it defines the EventBridge rule and
SNS topic/subscriptions that instruct AWS to send EC2 instance state
change notifications to the *dynk8s-provisioner*'s HTTP interface.
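Expressed as YAML, the event pattern for that rule looks roughly like
this; the exact set of states matched is an assumption:

```yaml
# EventBridge event pattern (JSON in the actual rule) matching EC2
# instance state-change notifications:
source:
  - aws.ec2
detail-type:
  - EC2 Instance State-change Notification
detail:
  state:  # which states matter here is an assumption
    - pending
    - running
    - shutting-down
    - terminated
```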