Setting Up a Kubernetes Cluster

Posted on Jan 19, 2023

Background

A while ago I wanted to learn how to set up my own Kubernetes cluster. I quickly found that there wasn’t a single “right” way to do it. I was looking for a guide that would walk me through the steps of standing up a generic cluster with the most commonly used components. However, each guide I found was different and tended to rely on tools to bootstrap the cluster, then leave it at that.

The goal of this post is to manually set up a highly available k8s cluster, which will give you a starting point to add other things such as persistent storage. In addition to guiding you through the process, I will include technical details beyond the typical “run x, y, z commands, now you have a cluster!” format. This post will not cover anything beyond setting up a cluster, basic troubleshooting, and background information. There are a lot of projects you can add and ways to configure the cluster. Creating a simple cluster and building on that will help both of us understand Kubernetes better.

Before I get into this, I want to address the difference between k8s and k3s. Most of the tutorials and guides I have found online use k3s and don’t explain why. The short of it is that k3s is smaller, requires fewer resources, can run on a lot more hardware, and is faster. K3s only requires 1 CPU and 512MB of RAM because it does not ship extra storage providers, alpha features, and legacy components.

The advantages of going with k3s over k8s are:

  • It’s lightweight, requiring fewer resources while still providing 100% compatibility and functionality
  • There’s less to learn because the excess stuff has been removed
  • It can work with k8s clusters
  • Faster and easier to deploy
  • Can run on the ARM architecture
  • Supports single-node clusters
  • Flexible
  • Easily turn it off and on

There isn’t much of a disadvantage to using k3s:

  • Does not come with a distributed database, which limits the control plane’s high availability
  • Does not support any database engine other than etcd
  • Will need to use an external database for high availability

So what’s the difference between k3s and k8s?

  • k3s is faster and lighter: k8s runs its components as separate processes, while all of k3s’s components live in a single 40-100MB binary.
  • In k8s you cannot switch off the embedded components to make it lightweight.

When should you use k3s?

  • Need something lightweight
  • Only have a single node cluster running locally or on edge devices
  • Need to support or have multiple CPU architectures
  • Running an on-premises cluster with no need for cloud provider extensions
  • Need to spin up jobs for cloud bursting, CI testing, etc.
  • Need to frequently scale

There isn’t much of a reason to choose k8s over k3s. If you need the extra features, including alpha features and providers, or high availability where data is spread out across multiple clusters and cloud providers, then go with k8s.

Setting up a Highly Available Cluster

Requirements

There are several ways of setting up a Kubernetes cluster, but we will be doing this manually to gain a deeper understanding of how everything comes together. The method we will use to bootstrap the cluster is the kubeadm toolbox. To make things easier, we’ll be doing this with virtual machines. The following is required for installing k8s on either a virtual machine or a physical machine:

  • Compatible Linux host
  • 2 or more GB RAM, the more the better in a production environment
  • 2 or more CPUs
  • Network connectivity between all machines in the cluster
    • If the system has multiple network cards, make sure routes are set up properly.
    • Port 6443 is open
  • Unique host name, MAC address, and product_uuid for every node (see the quick checks after this list)
    • cat /sys/class/dmi/id/product_uuid
  • Container runtime
  • Kubernetes tools:
    • kubeadm: toolbox for bootstrapping a cluster
    • kubelet: component that runs on all machines in the cluster and that manages pods and containers.
    • kubectl: Utility to manage the cluster
  • cgroup driver
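
If you want to confirm these requirements on a host before continuing, here is a quick sketch of the checks; the IP used is the first control plane address from the Vagrantfile further down.

# Verify the node has a unique product_uuid and MAC address
cat /sys/class/dmi/id/product_uuid
ip link show

# Confirm CPU count and memory
nproc
free -h

# Once the first control plane is up, check that port 6443 answers
nc -v 172.16.16.2 6443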

For our virtual machine cluster we need the following:

  • Enough system resources for 6 virtual machines (2 CPUs and 2GB of RAM each).
  • Vagrant
  • VirtualBox

Container Runtime Interface

A container runtime is required to be installed on each node for Kubernetes to work properly. This is also known as a container engine and is the software that runs the containers on the host operating system. Runtimes or engines are responsible for loading container images from a repository, monitoring resources, resource isolation, and managing the life cycle of the containers.

There are three types of container runtimes.

  1. Low-level: All engines implementing the Open Container Initiative (OCI) runtime specification, which standardizes how runtimes are implemented, are considered low-level container runtimes. Low-level container runtimes are an abstraction layer that only provides the facilities to create and run containers. The de-facto standard low-level container runtime is runC, originally developed by Docker and now maintained as an OCI project under the Linux Foundation. crun is an alternative developed by Red Hat to be lightweight and fast. Lastly, containerd is technically low-level but offers an API abstraction layer.
  2. High-Level: Container runtimes in this group include Docker’s containerd, which adds features for managing containers beyond creating and running them. containerd is the leading system and offers free and paid options. CRI-O is an open source, lightweight alternative to containerd.
  3. Windows Containers & Hyper-V Containers can be thought of as Microsoft’s equivalent of Docker. Windows Containers use kernel process and namespace isolation to create the environment for each container. Hyper-V containers are more secure because a VM is created for the containers to be deployed into. The VM can run a different operating system, allowing for greater flexibility.

In version 1.20, it was announced that direct integration with Docker Engine (the dockershim) would be removed in a later version. In version 1.24 it was finally removed. Kubernetes lists several supported container runtimes on its site:

  • Containerd
  • CRI-O
  • Docker Engine
  • Mirantis Container Runtime

Installing one of the above-mentioned runtimes is straightforward when using a package manager. Each of them requires some configuration, and I have personally found that containerd is the easiest of the bunch to install and configure. For more information about installing a particular runtime, see that project’s documentation.
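
On Rocky Linux, for example, a hand-run containerd install is roughly three commands (a sketch; dnf config-manager assumes the dnf-plugins-core package is present). The playbook later in this post automates the same thing:

# Add Docker's repository (it hosts containerd.io) and install containerd
dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
dnf install -y containerd.io
systemctl enable --now containerd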

cgroup drivers

The systemd cgroup driver is recommended for kubeadm setups; as of v1.22, kubeadm defaults to it when no driver is specified.
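
If you are configuring containerd by hand rather than through the playbook below, a minimal sketch of switching it to the systemd cgroup driver looks like this (the playbook makes the same change with lineinfile):

# Generate containerd's default configuration
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml

# Switch the runc cgroup driver to systemd
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml

# Restart containerd to pick up the change
systemctl restart containerd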

Container Network Interface

The CNI project consists of specifications and libraries for writing network plugins that configure network interfaces in Linux containers. In order to do any sort of networking with Pods, you must have a plugin installed. There are a number of available network plugins, and each has a different set of features. Some of these plugins work with others to allow you to create more advanced configurations. To get started, visit the CNI GitHub project page.

In this example we’re going to use Calico.
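
The playbook below applies Calico’s operator-based manifests for you, but the same install can be done by hand with two kubectl commands; note that Calico’s default custom resources expect the 192.168.0.0/16 pod CIDR used later in this post.

# Install the Tigera operator, then the Calico custom resources
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/tigera-operator.yaml
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/custom-resources.yaml

# Watch the Calico pods come up
kubectl get pods -n calico-system --watch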

Limitations

When setting up, running, and managing clusters, kubeadm, kubelet, and kubectl should be the same version. However, it is possible to mix and match to a certain degree. If you aren’t able to maintain consistent versions for one reason or another, it is possible to run different versions according to the version skew policy.
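
A quick way to confirm that the three tools on a node are all at the same version:

# All three should report the same version (see the skew policy for how much drift is allowed)
kubeadm version -o short
kubelet --version
kubectl version --client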

For our setup, we won’t need to worry about the limits of k8s. As a reference those limits are:

  • 110 pods/node
  • 5,000 nodes/cluster
  • 150,000 pods/cluster
  • 300,000 containers/cluster

Setup and Provisioning

Setup the Kubernetes Nodes

First we need to create our virtual machines. Below are the Vagrantfile and provisioning script that will get the VMs set up. For this, I will be using Rocky Linux, CentOS’s successor and a Red Hat Enterprise Linux compatible distribution.

During provisioning, we need to add Docker’s repository even though we will not be installing Docker. This is because containerd lives in that repository and does not have one of its own.

The following Vagrantfile is a bit more complex than a simple sample setup needs to be. This format allows us to easily add different types of nodes, such as storage, in the future.

Vagrantfile

# -*- mode: ruby -*-
# vi: set ft=ruby :

ENV['VAGRANT_NO_PARALLEL']              = 'yes'
ENV['VAGRANT_BOX_UPDATE_CHECK_DISABLE'] = 'yes'
ENV['VAGRANT_DEFAULT_PROVIDER']         = 'libvirt'

VAGRANT_BOX                             = "generic/rocky9"
VAGRANT_BOX_VERSION                     = "4.2.14"
VIRTUAL_CPUS                            = 2
VIRTUAL_MEMORY                          = 2048
VIRTUAL_NETWORK                         = "172.16.16"
VIRTUAL_DOMAIN                          = "example.com"

vms = {
  "nodes" => {
    "control-plane" => {
      "01" => {
        cpus: VIRTUAL_CPUS,
        memory: VIRTUAL_MEMORY,
        ip: "#{VIRTUAL_NETWORK}.2",
      }
    },
    "worker" => {
      "01" => {
        cpus: VIRTUAL_CPUS,
        memory: VIRTUAL_MEMORY,
        ip: "#{VIRTUAL_NETWORK}.12",
      },
      "02" => {
        cpus: VIRTUAL_CPUS,
        memory: VIRTUAL_MEMORY,
        ip: "#{VIRTUAL_NETWORK}.13",
      },
      "03" => {
        cpus: VIRTUAL_CPUS,
        memory: VIRTUAL_MEMORY,
        ip: "#{VIRTUAL_NETWORK}.14",
      },
    }
  }
}

inventory_groups = {
  "control_plane" => [
    "control-plane01"
  ],
  "worker" => [
    "worker01",
    "worker02",
    "worker03"
  ]
}

Vagrant.configure("2") do |config|

  config.vm.provision "ansible" do |ansible|
    ansible.groups = inventory_groups
    ansible.playbook = "setup.yml"
    ansible.become = true
    ansible.become_user = "root"
  end

  vms.each_pair do |vm_group_name, vm_group|

    if vm_group_name == "nodes"

      vm_group.each_pair do |node_type_name, node_type_group|

        node_type_group.each_pair do |node_name, node_config|

          config.vm.define node_type_name+node_name do |node|

            # Generic, global VM configuration
            node.vm.box = VAGRANT_BOX
            node.vm.box_version = VAGRANT_BOX_VERSION
            node.vm.hostname = "#{node_type_name}#{node_name}.#{VIRTUAL_DOMAIN}"
            node.vm.network "private_network", ip: node_config[:ip]

            if node_type_name == "control-plane"

              # Generic control-plane node specific configuration

              node.vm.provider :libvirt do |provider|
                # Do something specific to libvirt
                provider.nested = true
              end

              node.vm.provider :virtualbox do |provider|
                # Do something specific to virtualbox
              end

            end

            if node_type_name == "worker"

              # Generic worker node specific configuration

              node.vm.provider :libvirt do |provider|
                
              end

              node.vm.provider :virtualbox do |provider|
                
              end
            end
          end
        end
      end
    end
  end
end

The Playbook

Before we get to the playbook there are two templates that we need to create first.

kubernetes.repo.j2 is a file that will be copied to all nodes so the necessary packages can be installed.

[kubernetes]
name=Kubernetes
baseurl=http://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
exclude=kubelet kubeadm kubectl

kernel_modules.conf.j2 is used to ensure that the proper kernel modules are loaded each time.

{% for module in kernel_modules %}
{{ module }}
{% endfor %}

Finally, we have the actual playbook, setup.yml:

---
- hosts: all
  become: yes
  become_method: sudo
  vars:
    kernel_modules:
      - br_netfilter
      - overlay
      - ip_vs
      - ip_vs_rr
      - ip_vs_wrr
      - ip_vs_sh
      - nf_conntrack
    kubernetes_version: "1.26.0"
    main_control_plane: "control-plane01"
    main_control_plane_nic: "eth0"
    pod_network_cidr: "192.168.0.0/16"
    required_packages:
      - vim
      - wget
      - curl
    timezone: "America/Chicago"
  pre_tasks:
    - name: Set API Server Advertise IP
      set_fact:
        main_control_plane_ip: "{{ hostvars[inventory_hostname]['ansible_%s' | format(item)].ipv4.address }}"
      loop: "{{ ansible_interfaces }}"
      when: inventory_hostname == main_control_plane and (main_control_plane_nic is defined and item == main_control_plane_nic)

    - name: "Add API Server Advertise to kubeadm init parameters"
      set_fact:
        kubeadm_init_params: "--apiserver-advertise-address={{ main_control_plane_ip }}"
      when: inventory_hostname == main_control_plane and main_control_plane_ip is defined

    - name: "Add Pod network CIDR to kubeadm init parameters"
      set_fact:
        kubeadm_init_params: "{{ kubeadm_init_params }} --pod-network-cidr={{ pod_network_cidr }}"
      when: inventory_hostname == main_control_plane and pod_network_cidr is defined

  tasks:
    - name: "Disable SELinux completely"
      ansible.builtin.lineinfile:
        path: "/etc/sysconfig/selinux"
        regexp: "^SELINUX=.*"
        line: "SELINUX=disabled"

    - name: "Reboot system"
      ansible.builtin.reboot:
        reboot_timeout: 120

    - name: "Set timezone"
      ansible.builtin.shell: "timedatectl set-timezone {{ timezone }}"

    - name: "Enable NTP"
      ansible.builtin.shell: "timedatectl set-ntp 1"

    - name: "Turn off SWAP"
      ansible.builtin.shell: "swapoff -a"

    - name: "Disable SWAP in fstab"
      ansible.builtin.replace:
        path: "/etc/fstab"
        regexp: '^([^#].*?\sswap\s+.*)$'
        replace: '# \1'

    - name: "Stop and Disable firewall (firewalld)"
      ansible.builtin.service:
        name: "firewalld"
        state: stopped
        enabled: no

    - name: "Load required modules"
      community.general.modprobe:
        name: "{{ item }}"
        state: present
      with_items: "{{ kernel_modules }}"

    - name: "Enable kernel modules"
      ansible.builtin.template:
        src: "kernel_modules.conf.j2"
        dest: "/etc/modules-load.d/kubernetes.conf"

    - name: "Update kernel settings"
      ansible.posix.sysctl:
        name: "{{ item.name }}"
        value: "{{ item.value }}"
        sysctl_set: yes
        state: present
        reload: yes
      ignore_errors: yes
      with_items:
        - { name: net.bridge.bridge-nf-call-ip6tables, value: 1 }
        - { name: net.bridge.bridge-nf-call-iptables, value: 1 }
        - { name: net.ipv4.ip_forward, value: 1 }

    - name: "Update system packages"
      ansible.builtin.package:
       name: "*"
       state: latest

    - name: "Install required package"
      ansible.builtin.package:
        name: "{{ item }}"
      with_items:
        - vim
        - wget
        - curl
        - gnupg

    - name: "Add Docker repository"
      ansible.builtin.get_url:
        url: "https://download.docker.com/linux/centos/docker-ce.repo"
        dest: "/etc/yum.repos.d/docer-ce.repo"

    - name: "Install containerd"
      ansible.builtin.package:
        name: ['containerd.io']
        state: present

    - name: "Create containerd directories"
      ansible.builtin.file:
        path: "/etc/containerd"
        state: directory

    - name: "Configure containerd"
      ansible.builtin.shell: "containerd config default > /etc/containerd/config.toml"

    - name: "Enable cgroup driver as systemd"
      ansible.builtin.lineinfile:
        path: "/etc/containerd/config.toml"
        regexp: 'SystemdCgroup \= false'
        line: 'SystemdCgroup = true'

    - name: "Start and enable containerd service"
      ansible.builtin.systemd:
        name: "containerd"
        state: restarted
        enabled: yes
        daemon_reload: yes

    - name: "Add kubernetes repository"
      ansible.builtin.template:
        src: "kubernetes.repo.j2"
        dest: "/etc/yum.repos.d/kubernetes.repo"

    - name: "Install Kubernetes packages"
      ansible.builtin.yum:
        name: "{{ item }}-{{ kubernetes_version }}"
        disable_excludes: kubernetes
      with_items: ['kubelet', 'kubeadm', 'kubectl']

    - name: "Enable kubelet service"
      ansible.builtin.service:
        name: kubelet
        enabled: yes

    # Control Plane Tasks
    - name: Pull required containers
      ansible.builtin.shell: "kubeadm config images pull >/dev/null 2>&1"
      when: ansible_hostname == main_control_plane

    - name: Initialize Kubernetes Cluster
      ansible.builtin.shell: "kubeadm init {{ kubeadm_init_params }} >> /root/kubeinit.log 2> /dev/null"
      when: ansible_hostname == main_control_plane

    - name: Deploy Calico network
      ansible.builtin.shell: "kubectl --kubeconfig=/etc/kubernetes/admin.conf create -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/tigera-operator.yaml >/dev/null 2>&1"
      ignore_errors: yes
      when: ansible_hostname == main_control_plane

    - name: Install Calico by creating necessary custom resources
      ansible.builtin.shell: "kubectl --kubeconfig=/etc/kubernetes/admin.conf create -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/custom-resources.yaml >/dev/null 2>&1"
      ignore_errors: yes
      when: ansible_hostname == main_control_plane

    - name: Generate and save cluster join command
      ansible.builtin.shell: "kubeadm token create --print-join-command > /joincluster.sh 2>/dev/null"
      when: ansible_hostname == main_control_plane

    - name: Download join command
      ansible.builtin.fetch:
        dest: './'
        flat: yes
        src: '/joincluster.sh'
      when: ansible_hostname == main_control_plane

    - name: Download admin.conf
      ansible.builtin.fetch:
        dest: "./"
        flat: yes
        src: "/etc/kubernetes/admin.conf"
      when: ansible_hostname == main_control_plane


    # Worker Tasks
    - name: Upload join command
      ansible.builtin.copy:
        src: joincluster.sh
        dest: /joincluster.sh
        owner: root
        group: root
        mode: "0777"
      when: ansible_hostname != main_control_plane

    - name: Reset
      ansible.builtin.shell: "kubeadm reset -f"
      when: ansible_hostname != main_control_plane

    - name: Join node to cluster
      ansible.builtin.shell: "/joincluster.sh > /dev/null 2&>1"
      when: ansible_hostname != main_control_plane

Note: In many of the online tutorials that use Ubuntu, the apiserver-advertise-address was set to the private IP address, 172.16.16.12, for example. This worked fine for Ubuntu clusters, but did not work for RHEL-based clusters. This may simply be a Vagrant and routing issue and not something you would run into in a real environment.
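
With the Vagrantfile, the two templates, and the playbook (setup.yml) sitting in the same directory, bringing the whole environment up is one command; Vagrant creates each VM and runs the Ansible provisioner against it.

# Create the VMs and run the playbook against them (this takes a while)
vagrant up

# Re-run only the provisioning step later, if needed
vagrant provision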

Testing it out

Now that everything is up and running, you can play with the cluster in one of two ways. The first is logging into control-plane01 and issuing kubectl commands.

vagrant ssh control-plane01
sudo su
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes
NAME                          STATUS   ROLES           AGE     VERSION
control-plane01.example.com   Ready    control-plane   2d20h   v1.26.3
worker01.example.com          Ready    <none>          2d20h   v1.26.3
worker02.example.com          Ready    <none>          2d20h   v1.26.3
worker03.example.com          Ready    <none>          2d20h   v1.26.3

The second way is running kubectl from the host machine. When the playbook ran, it downloaded /etc/kubernetes/admin.conf from the main control plane node and placed it in the playbook’s directory.

To use the cluster:

export KUBECONFIG=$PWD/admin.conf
kubectl create deployment nginx-web --image=nginx
deployment.apps/nginx-web created
kubectl get deployments -o wide

NAME        READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES   SELECTOR
nginx-web   1/1     1            1           16s   nginx        nginx    app=nginx-web
kubectl get pods
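
If you want to go one step further and reach the web server from outside the cluster, one option is to expose the deployment through a NodePort service; the port Kubernetes assigns will fall in the 30000-32767 range listed under the worker node ports below, and the IP here is worker01 from the Vagrantfile.

# Expose the deployment on a NodePort and look up the assigned port
kubectl expose deployment nginx-web --port=80 --type=NodePort
kubectl get service nginx-web

# Then curl any worker node on the reported port
curl http://172.16.16.12:<node-port>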

Troubleshooting

Connection Refused

Things to check (example commands follow the list):

  1. The correct config is being used. The KUBECONFIG environment variable may be set to the wrong config file, or $HOME/.kube/config is incorrect.
  2. The config file has incorrect permissions; fix with chown $(id -u):$(id -g) $HOME/.kube/config
  3. Verify that the server element within the config file matches the control plane: DNS name, IP, and/or port.
  4. Ensure that the firewall is turned off or the necessary ports are open. For the control plane nodes: 6443, 2379-2380, 10250, 10259, and 10257. For worker nodes: 10250 and 30000-32767. The tables below list what each port is used for, with example firewall commands after them.
  5. SELinux is enforcing instead of permissive or disabled, or is otherwise misconfigured.
  6. Verify that the Kubernetes API server is running on the control plane nodes.
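
As a sketch of what a few of those checks look like (the IP is the first control plane from the Vagrantfile; adjust as needed):

# 1-3: confirm which kubeconfig is in use and where its server entry points
echo $KUBECONFIG
kubectl config view --minify | grep server

# 4 & 6: probe the API server port; any HTTP response means the port is open and the apiserver is answering
curl -k https://172.16.16.2:6443/healthz

# 5: on the node itself, check the SELinux mode and firewall state
getenforce
systemctl is-active firewalld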

Control plane ports

Protocol   Direction   Port/Range   Purpose                   Used By
TCP        IN          6443         Kubernetes API server     All
TCP        IN          2379-2380    etcd server client API    kube-apiserver, etcd
TCP        IN          10250        Kubelet API               Self, Control plane
TCP        IN          10259        kube-scheduler            Self
TCP        IN          10257        kube-controller-manager   Self

Worker node(s)

Protocol   Direction   Port/Range    Purpose             Used By
TCP        IN          10250         Kubelet API         Self, Control plane
TCP        IN          30000-32767   NodePort Services   All
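
If you would rather keep firewalld running instead of disabling it the way the playbook does, opening the ports from the tables above would look something like this on a Rocky Linux node:

# Control plane node: open the ports from the table above
firewall-cmd --permanent --add-port=6443/tcp
firewall-cmd --permanent --add-port=2379-2380/tcp
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=10257/tcp
firewall-cmd --permanent --add-port=10259/tcp
firewall-cmd --reload

# Worker nodes
firewall-cmd --permanent --add-port=10250/tcp
firewall-cmd --permanent --add-port=30000-32767/tcp
firewall-cmd --reload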

ImagePullBackOff

The ImagePullBackOff status means that the container cannot start because its image is unavailable for one reason or another. Often it is something as simple as a typo or an incorrect tag.
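
Describing the pod shows the exact pull error in its events, and a bad image reference can be fixed in place on the deployment; the tag below is only an example.

# The pod's events show exactly why the image pull failed
kubectl describe pod nginx-web-7fc79595fb-5rz9f | tail -n 20

# Fix a wrong image name or tag in place (tag is only an example)
kubectl set image deployment/nginx-web nginx=nginx:1.25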

Stuck on ContainerCreating

kubectl get pods
NAME                         READY   STATUS              RESTARTS   AGE                                                             
nginx-web-7fc79595fb-5rz9f   0/1     ContainerCreating   0          13m  
kubectl describe pod nginx
Name:           nginx-web-7fc79595fb-5rz9f
Namespace:      default
Priority:       0
Node:           worker02.example.com/192.168.121.111
Start Time:     Sat, 01 Apr 2023 00:06:54 -0500
Labels:         app=nginx-web
                pod-template-hash=7fc79595fb
Annotations:    <none>
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/nginx-web-7fc79595fb
Containers:
  nginx:
    Container ID:   
    Image:          nginx
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cvz4b (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-cvz4b:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                             From                           Message
  ----     ------                  ----                            ----                           -------
  Normal   Scheduled               <invalid>                       default-scheduler              Successfully assigned default/nginx-web-7fc79595fb-5rz9f to worker02.example.com
  Warning  FailedCreatePodSandBox  <invalid>                       kubelet, worker02.example.com  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "cbb3a673941ec7fbb2fd399ddacbb1d6b58bc338f673f66e1d4d6d2f7e1471c0": plugin type="calico" failed (add): stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
  Normal   SandboxChanged          <invalid> (x71 over <invalid>)  kubelet, worker02.example.com  Pod sandbox changed, it will be killed and re-created.
  
kubectl get events --sort-by=.metadata.creationTimestamp
LAST SEEN   TYPE      REASON                    OBJECT                             MESSAGE
25m         Normal    Starting                  node/control-plane03.example.com   Starting kubelet.
25m         Warning   InvalidDiskCapacity       node/control-plane01.example.com   invalid capacity 0 on image filesystem
25m         Warning   InvalidDiskCapacity       node/worker04.example.com          invalid capacity 0 on image filesystem
25m         Normal    Starting                  node/worker04.example.com          Starting kubelet.
25m         Normal    NodeHasSufficientMemory   node/worker03.example.com          Node worker03.example.com status is now: NodeHasSufficientMemory
25m         Normal    Starting                  node/worker03.example.com          Starting kubelet.
26m         Warning   InvalidDiskCapacity       node/worker02.example.com          invalid capacity 0 on image filesystem
25m         Normal    Starting                  node/control-plane01.example.com   Starting kubelet.
25m         Normal    Starting                  node/control-plane02.example.com   Starting kubelet.
25m         Warning   InvalidDiskCapacity       node/control-plane02.example.com   invalid capacity 0 on image filesystem
26m         Normal    Starting                  node/worker02.example.com          Starting kubelet.
25m         Normal    Starting                  node/worker01.example.com          Starting kubelet.
25m         Warning   InvalidDiskCapacity       node/worker01.example.com          invalid capacity 0 on image filesystem
25m         Warning   InvalidDiskCapacity       node/control-plane03.example.com   invalid capacity 0 on image filesystem
25m         Warning   InvalidDiskCapacity       node/worker03.example.com          invalid capacity 0 on image filesystem
25m         Normal    NodeHasSufficientPID      node/worker03.example.com          Node worker03.example.com status is now: NodeHasSufficientPID
25m         Normal    NodeHasSufficientPID      node/worker02.example.com          Node worker02.example.com status is now: NodeHasSufficientPID
25m         Normal    NodeHasNoDiskPressure     node/worker02.example.com          Node worker02.example.com status is now: NodeHasNoDiskPressure
25m         Normal    NodeHasSufficientMemory   node/control-plane03.example.com   Node control-plane03.example.com status is now: NodeHasSufficientMemory
25m         Normal    NodeHasNoDiskPressure     node/control-plane03.example.com   Node control-plane03.example.com status is now: NodeHasNoDiskPressure
25m         Normal    NodeHasSufficientPID      node/control-plane03.example.com   Node control-plane03.example.com status is now: NodeHasSufficientPID
25m         Normal    NodeHasSufficientMemory   node/worker02.example.com          Node worker02.example.com status is now: NodeHasSufficientMemory
25m         Normal    NodeHasNoDiskPressure     node/worker03.example.com          Node worker03.example.com status is now: NodeHasNoDiskPressure
25m         Normal    NodeHasSufficientPID      node/control-plane02.example.com   Node control-plane02.example.com status is now: NodeHasSufficientPID
25m         Normal    NodeHasSufficientPID      node/worker04.example.com          Node worker04.example.com status is now: NodeHasSufficientPID
25m         Normal    NodeHasNoDiskPressure     node/worker04.example.com          Node worker04.example.com status is now: NodeHasNoDiskPressure
25m         Normal    NodeHasSufficientMemory   node/worker04.example.com          Node worker04.example.com status is now: NodeHasSufficientMemory
25m         Normal    NodeHasSufficientMemory   node/control-plane01.example.com   Node control-plane01.example.com status is now: NodeHasSufficientMemory
25m         Normal    NodeHasNoDiskPressure     node/control-plane01.example.com   Node control-plane01.example.com status is now: NodeHasNoDiskPressure
25m         Normal    NodeHasNoDiskPressure     node/control-plane02.example.com   Node control-plane02.example.com status is now: NodeHasNoDiskPressure
25m         Normal    NodeHasSufficientMemory   node/control-plane02.example.com   Node control-plane02.example.com status is now: NodeHasSufficientMemory
25m         Normal    NodeHasSufficientMemory   node/worker01.example.com          Node worker01.example.com status is now: NodeHasSufficientMemory
25m         Normal    NodeHasNoDiskPressure     node/worker01.example.com          Node worker01.example.com status is now: NodeHasNoDiskPressure
25m         Normal    NodeHasSufficientPID      node/worker01.example.com          Node worker01.example.com status is now: NodeHasSufficientPID
25m         Normal    NodeHasSufficientPID      node/control-plane01.example.com   Node control-plane01.example.com status is now: NodeHasSufficientPID
25m         Normal    NodeAllocatableEnforced   node/worker03.example.com          Updated Node Allocatable limit across pods
25m         Normal    Starting                  node/control-plane03.example.com   
25m         Normal    NodeAllocatableEnforced   node/control-plane01.example.com   Updated Node Allocatable limit across pods
25m         Normal    NodeAllocatableEnforced   node/worker04.example.com          Updated Node Allocatable limit across pods
25m         Normal    NodeAllocatableEnforced   node/worker01.example.com          Updated Node Allocatable limit across pods
25m         Normal    NodeAllocatableEnforced   node/control-plane03.example.com   Updated Node Allocatable limit across pods
26m         Normal    NodeAllocatableEnforced   node/worker02.example.com          Updated Node Allocatable limit across pods
25m         Normal    NodeAllocatableEnforced   node/control-plane02.example.com   Updated Node Allocatable limit across pods
25m         Normal    Starting                  node/control-plane02.example.com   
25m         Normal    Starting                  node/control-plane01.example.com   
25m         Normal    Starting                  node/worker04.example.com          
25m         Normal    Starting                  node/worker02.example.com          
25m         Normal    Starting                  node/worker01.example.com          
25m         Normal    Starting                  node/worker03.example.com          
24m         Normal    RegisteredNode            node/worker03.example.com          Node worker03.example.com event: Registered Node worker03.example.com in Controller
24m         Normal    RegisteredNode            node/worker01.example.com          Node worker01.example.com event: Registered Node worker01.example.com in Controller
24m         Normal    RegisteredNode            node/worker02.example.com          Node worker02.example.com event: Registered Node worker02.example.com in Controller
24m         Normal    RegisteredNode            node/control-plane03.example.com   Node control-plane03.example.com event: Registered Node control-plane03.example.com in Controller
24m         Normal    RegisteredNode            node/control-plane01.example.com   Node control-plane01.example.com event: Registered Node control-plane01.example.com in Controller
24m         Normal    RegisteredNode            node/worker04.example.com          Node worker04.example.com event: Registered Node worker04.example.com in Controller
24m         Normal    RegisteredNode            node/control-plane02.example.com   Node control-plane02.example.com event: Registered Node control-plane02.example.com in Controller
17m         Normal    SuccessfulCreate          replicaset/nginx-web-7fc79595fb    Created pod: nginx-web-7fc79595fb-5rz9f
17m         Normal    ScalingReplicaSet         deployment/nginx-web               Scaled up replica set nginx-web-7fc79595fb to 1
17m         Normal    Scheduled                 pod/nginx-web-7fc79595fb-5rz9f     Successfully assigned default/nginx-web-7fc79595fb-5rz9f to worker02.example.com
17m         Warning   FailedCreatePodSandBox    pod/nginx-web-7fc79595fb-5rz9f     Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "cbb3a673941ec7fbb2fd399ddacbb1d6b58bc338f673f66e1d4d6d2f7e1471c0": plugin type="calico" failed (add): stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
2m7s        Normal    SandboxChanged            pod/nginx-web-7fc79595fb-5rz9f     Pod sandbox changed, it will be killed and re-created.
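
The FailedCreatePodSandBox message points at Calico not running on that node. With the operator-based install used earlier, the node agents run in the calico-system namespace, so a quick check looks like this:

# Check that a calico-node pod is running on the affected worker
kubectl get pods -n calico-system -o wide

# The Tigera operator also reports the overall Calico status
kubectl get tigerastatus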

Node Stuck on NotReady

kubectl get nodes
NAME                          STATUS     ROLES           AGE   VERSION
control-plane-01.ad.uwm.edu   Ready      control-plane   77m   v1.26.0
worker-01.ad.uwm.edu          NotReady   <none>          14m   v1.26.0
worker-02.ad.uwm.edu          NotReady   <none>          70m   v1.26.0
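
A node usually reports NotReady because the kubelet or its container runtime isn’t healthy on it, or because the CNI plugin hasn’t come up there yet. Describing the node shows the reason in its conditions, and the services can be checked on the node itself:

# The Ready condition's message explains why the node is NotReady
kubectl describe node worker-01.ad.uwm.edu | grep -A 10 Conditions

# On the affected node itself, check the kubelet and container runtime
systemctl status kubelet
systemctl status containerd
journalctl -u kubelet --no-pager | tail -n 50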

Pod Stuck on Terminating

error killing pod: failed to "KillPodSandbox" for "de84ae0e-d698-4d4b-bf1e-15b33c36b7ed" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"e48afbc42748bafe6aa369577381d1e346b2a5c023d20040ec05bc4cd3548119\": plugin type=\"calico\" failed (delete): error getting ClusterInformation: connection is unauthorized: Unauthorized"
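
If the pod never finishes terminating even after the underlying Calico problem is fixed, it can be force-deleted; the pod name below is just an example, and this skips graceful shutdown, so use it sparingly.

# Force-remove a pod stuck in Terminating (pod name is an example)
kubectl delete pod nginx-web-7fc79595fb-5rz9f --grace-period=0 --force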

Container runtime network not ready

container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

Verify that the /etc/cni/net.d directory isn’t empty, then try restarting the Calico pod on that node:

kubectl delete pod calico-node-hjcpn
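
For reference, the directory check from that step, run on the affected node (via vagrant ssh in this setup):

# Log into the affected worker and inspect the CNI configuration directory
vagrant ssh worker01
ls -l /etc/cni/net.d/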

Testing out the Cluster

It might not seem like much, but you now have a working Kubernetes cluster. You can interact with it in two ways. The first is by logging into the first control plane node and running the following commands to see that it’s working:

vagrant ssh control-plane01
sudo su
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes
NAME                          STATUS   ROLES           AGE   VERSION
control-plane01.example.com   Ready    control-plane   20d   v1.26.3
control-plane02.example.com   Ready    control-plane   20d   v1.26.3
control-plane03.example.com   Ready    control-plane   20d   v1.26.3
worker01.example.com          Ready    worker          20d   v1.26.3
worker02.example.com          Ready    worker          20d   v1.26.3
worker03.example.com          Ready    worker          20d   v1.26.3
worker04.example.com          Ready    worker          20d   v1.26.3

Another option is to run kubectl from your host machine. One of the tasks downloaded the admin.conf file into the same directory as the playbook.

export KUBECONFIG=admin.conf
kubectl get nodes
NAME                          STATUS   ROLES           AGE   VERSION
control-plane01.example.com   Ready    control-plane   20d   v1.26.3
control-plane02.example.com   Ready    control-plane   20d   v1.26.3
control-plane03.example.com   Ready    control-plane   20d   v1.26.3
worker01.example.com          Ready    worker          20d   v1.26.3
worker02.example.com          Ready    worker          20d   v1.26.3
worker03.example.com          Ready    worker          20d   v1.26.3
worker04.example.com          Ready    worker          20d   v1.26.3

To verify that the cluster is actually working and can create containers, run the following from the host machine:

kubectl create deployment nginx-web --image=nginx
deployment.apps/nginx-web created 

Troubleshooting

If all of the nodes are stuck in NotReady shortly after the cluster comes up, it usually means the CNI plugin (Calico here) has not finished deploying; give it a few minutes and check the pods in the calico-system namespace.

# kubectl get nodes
NAME                          STATUS     ROLES           AGE   VERSION
control-plane-01.ad.uwm.edu   NotReady   control-plane   39m   v1.26.0
worker-01.ad.uwm.edu          NotReady   <none>          36m   v1.26.0
worker-02.ad.uwm.edu          NotReady   <none>          35m   v1.26.0

What’s Next?

We just set up a Kubernetes cluster, and although it may look like a highly available cluster, it isn’t one yet. We need to add more control plane nodes and a load balancer for it to be an actual HA cluster.

The next step would be to add another control plane node and a load balancer. That will be in another post.
