OpenShift Cluster Node Tuning Operator – Nodes on steroids

What do you prefer: manual or automatic transmissions?
I like the control over a car that a manual transmission provides – using the engine to slow down without braking and being more efficient when overtaking. On the other hand, it’s nice not to involve the left leg all of the time and to keep both hands on the steering wheel. Using an automatic transmission is generally easier, and my family prefers it, so I have no choice.

Wouldn’t it be great to do things more efficiently and precisely without doing them manually? It would be great if an automatic transmission always behaved the way I wanted and needed at that exact moment.

Returning to an OpenShift scenario, I’ll ask again:
Wouldn’t it be great to tweak my RHEL CoreOS node only when I need to, and to not have to do it manually?

You can do this using the OpenShift Cluster Node Tuning Operator. This operator gives the user an interface to add custom tuning that is applied to nodes under specified conditions, and to configure the kernel according to the user’s needs. More information can be found on GitHub.
The Node Tuning Operator runs as a single pod and manages a tuned daemonset with a pod on every node in the cluster. Check it with the command:

$ oc get pods -n openshift-cluster-node-tuning-operator -o wide

NAME                                           READY   STATUS    RESTARTS   AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
cluster-node-tuning-operator-847984d77-f92tv   1/1     Running   0          6h51m   10.128.0.17   skordas0813-6p5bl-master-1                  <none>           <none>
tuned-2gz29                                    1/1     Running   0          6h51m   10.0.0.4      skordas0813-6p5bl-master-1                  <none>           <none>
tuned-5hkmr                                    1/1     Running   0          6h51m   10.0.0.7      skordas0813-6p5bl-master-2                  <none>           <none>
tuned-5jv59                                    1/1     Running   0          6h50m   10.0.32.4     skordas0813-6p5bl-worker-centralus1-tkbxs   <none>           <none>
tuned-gvlnt                                    1/1     Running   0          6h50m   10.0.32.5     skordas0813-6p5bl-worker-centralus3-nrh4t   <none>           <none>
tuned-nvfb5                                    1/1     Running   0          6h51m   10.0.0.6      skordas0813-6p5bl-master-0                  <none>           <none>
tuned-xhpfx                                    1/1     Running   0          6h49m   10.0.32.6     skordas0813-6p5bl-worker-centralus2-xm865   <none>           <none>

Also, you can check the tuned custom resources:

$ oc get tuned -n openshift-cluster-node-tuning-operator 
NAME      AGE
default   5h31m

Let’s take a closer look at the default tuning:

$ oc get tuned -o yaml -n openshift-cluster-node-tuning-operator

apiVersion: v1
items:
- apiVersion: tuned.openshift.io/v1
  kind: Tuned
  metadata:
    creationTimestamp: "2019-08-07T14:08:10Z"
    generation: 1
    name: default
    namespace: openshift-cluster-node-tuning-operator
    resourceVersion: "6878"
    selfLink: /apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/tuneds/default
    uid: c9f0361b-b91c-11e9-931e-000d3a9420dc
  spec:
    profile:
    - data: |
        [main]
        summary=Optimize systems running OpenShift (parent profile)
        include=${f:virt_check:virtual-guest:throughput-performance}

        [selinux]
        avc_cache_threshold=8192

        [net]
        nf_conntrack_hashsize=131072

        [sysctl]
        net.ipv4.ip_forward=1
        kernel.pid_max=>131072
        net.netfilter.nf_conntrack_max=1048576
        net.ipv4.neigh.default.gc_thresh1=8192
        net.ipv4.neigh.default.gc_thresh2=32768
        net.ipv4.neigh.default.gc_thresh3=65536
        net.ipv6.neigh.default.gc_thresh1=8192
        net.ipv6.neigh.default.gc_thresh2=32768
        net.ipv6.neigh.default.gc_thresh3=65536

        [sysfs]
        /sys/module/nvme_core/parameters/io_timeout=4294967295
        /sys/module/nvme_core/parameters/max_retries=10
      name: openshift
    - data: |
        [main]
        summary=Optimize systems running OpenShift control plane
        include=openshift

        [sysctl]
        # ktune sysctl settings, maximizing i/o throughput
        #
        # Minimal preemption granularity for CPU-bound tasks:
        # (default: 1 msec#  (1 + ilog(ncpus)), units: nanoseconds)
        kernel.sched_min_granularity_ns=10000000
        # The total time the scheduler will consider a migrated process
        # "cache hot" and thus less likely to be re-migrated
        # (system default is 500000, i.e. 0.5 ms)
        kernel.sched_migration_cost_ns=5000000
        # SCHED_OTHER wake-up granularity.
        #
        # Preemption granularity when tasks wake up.  Lower the value to
        # improve wake-up latency and throughput for latency critical tasks.
        kernel.sched_wakeup_granularity_ns=4000000
      name: openshift-control-plane
    - data: |
        [main]
        summary=Optimize systems running OpenShift nodes
        include=openshift

        [sysctl]
        net.ipv4.tcp_fastopen=3
        fs.inotify.max_user_watches=65536
      name: openshift-node
    - data: |
        [main]
        summary=Optimize systems running ES on OpenShift control-plane
        include=openshift-control-plane

        [sysctl]
        vm.max_map_count=262144
      name: openshift-control-plane-es
    - data: |
        [main]
        summary=Optimize systems running ES on OpenShift nodes
        include=openshift-node

        [sysctl]
        vm.max_map_count=262144
      name: openshift-node-es
    recommend:
    - match:
      - label: tuned.openshift.io/elasticsearch
        match:
        - label: node-role.kubernetes.io/master
        - label: node-role.kubernetes.io/infra
        type: pod
      priority: 10
      profile: openshift-control-plane-es
    - match:
      - label: tuned.openshift.io/elasticsearch
        type: pod
      priority: 20
      profile: openshift-node-es
    - match:
      - label: node-role.kubernetes.io/master
      - label: node-role.kubernetes.io/infra
      priority: 30
      profile: openshift-control-plane
    - priority: 40
      profile: openshift-node
  status: {}
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

The spec.profile section is a list of profile definitions in which we define the profile names and the tuning values that the operator will set on the node. Using the include key, it is possible to define a parent profile that is only meant to be included by other profiles; in the example above, the openshift profile is used this way. We can also add a summary to describe each profile.

The spec.recommend section is a list of profile-selection rules: it defines which conditions must be met for the operator to apply a given profile to a node. This part may not be so obvious, so let’s look deeper.

Each check needs three pieces of information:
match – the conditions that need to be met to apply the recommended profile. If the match part is omitted, the operator assumes the match is always true. More details below.
priority – smaller numbers mean higher priority. If more than one profile could be applied, the Node Tuning Operator applies the one with the highest priority (the lowest number).
profile – the name of the profile from spec.profile that should be applied.

If you want to apply more than one profile at the same time, you need to create a new profile that will include other profiles.
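For example, here is a hedged sketch of such a profile: a hypothetical openshift-node-custom profile that pulls in the shipped openshift-node profile via include and layers an extra (made-up) sysctl on top, the same way openshift-node-es builds on openshift-node:

    - data: |
        [main]
        summary=Custom profile: openshift-node tuning plus extra sysctls
        include=openshift-node

        [sysctl]
        # hypothetical example value, not part of the default tuning
        vm.swappiness=10
      name: openshift-node-custom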

What criteria need to be met to apply a specific profile? Everything is managed by labels on nodes and pods. All conditions go in the match section.

Each match entry can have four fields:
label – node or pod label.
value – node or pod label value – if it is omitted, the operator matches on the mere existence of the label.
type – either node or pod – it defines whether the operator checks node labels or pod labels. If it is omitted, the operator checks node labels.
match – an array of additional nested matches – the operator evaluates a nested match only when the parent entry returns true (see the example after this list).
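Putting these four fields together, a hedged example of a single match entry might look like the following; the labels, values, and profile name here are made up purely for illustration:

    - match:
      - label: app
        value: my-workload
        type: pod
        match:
        - label: node-role.kubernetes.io/worker
      priority: 15
      profile: my-custom-profile

Read it as: the profile my-custom-profile is recommended for a node when a pod labeled app=my-workload runs on it and the node itself carries the node-role.kubernetes.io/worker label.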

Reading the recommend section is much easier now. Let’s walk through the default recommendation. The operator checks each node independently to determine which profile should be applied to it.

    - match:
      - label: tuned.openshift.io/elasticsearch
        match:
        - label: node-role.kubernetes.io/master
        - label: node-role.kubernetes.io/infra
        type: pod
      priority: 10
      profile: openshift-control-plane-es

At the beginning, the operator checks whether the node has a pod running on it with the tuned.openshift.io/elasticsearch label. If this match is true, it evaluates the nested match: if the node (node is implied, because type is omitted) has the node-role.kubernetes.io/master or node-role.kubernetes.io/infra label, the operator applies the openshift-control-plane-es profile, because it is a control plane or infra node running an Elasticsearch pod.

If this second control plane/infra match is false, then the operator will move on and check the next match with lower priority:

    - match:
      - label: tuned.openshift.io/elasticsearch
        type: pod
      priority: 20
      profile: openshift-node-es

The openshift-node-es profile will be applied only when the previous control plane/infra match returns false and the node is running a pod with the tuned.openshift.io/elasticsearch label.

As before, if there is no match we continue to the next match in priority:

    - match:
      - label: node-role.kubernetes.io/master
      - label: node-role.kubernetes.io/infra
      priority: 30
      profile: openshift-control-plane

The openshift-control-plane profile will be applied only when the previous matches return false and the node is labeled node-role.kubernetes.io/master or node-role.kubernetes.io/infra.

Finally, if there were no matches by this point, the operator will apply the openshift-node profile:

    - priority: 40
      profile: openshift-node

Because there is no match array, it is always true.
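To pull the whole selection logic together, here is a rough sketch in Python of how I read the recommend/match semantics described above. It is purely illustrative, not the operator’s actual code, and it simplifies things by treating all pods on a node as one merged set of labels:

# Illustrative only: a simplified model of the recommend/match semantics.
def entry_matches(entry, node_labels, pod_labels):
    """Return True if a single match entry applies to this node."""
    labels = pod_labels if entry.get("type") == "pod" else node_labels
    if entry["label"] not in labels:
        return False
    # When "value" is omitted, the presence of the label is enough.
    if "value" in entry and labels[entry["label"]] != entry["value"]:
        return False
    # A nested match is evaluated only when its parent entry matched;
    # entries within a match list are OR'd together.
    nested = entry.get("match", [])
    return not nested or any(entry_matches(n, node_labels, pod_labels) for n in nested)

def recommended_profile(recommend, node_labels, pod_labels):
    """Pick the profile of the highest-priority (lowest number) matching rule."""
    for rule in sorted(recommend, key=lambda r: r["priority"]):
        entries = rule.get("match")
        # A rule without a match list always matches, like openshift-node above.
        if entries is None or any(entry_matches(e, node_labels, pod_labels) for e in entries):
            return rule["profile"]

With the default recommend list, a plain worker node ends up with openshift-node, while a worker running an Elasticsearch-labeled pod ends up with openshift-node-es.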

Now we can create our own profile:

  • Create a file containing the custom resource: cool_app_ip_port_range.yaml
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: ports
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=A custom profile to extend local port range

      [sysctl]
      net.ipv4.ip_local_port_range="1024 65535"

    name: port-range

  recommend:
  - match:
    - label: cool-app
      value: extended-range
      type: pod
    priority: 25
    profile: port-range
  • Create the new Tuned resource and verify that it is there
$ oc create -f cool_app_ip_port_range.yaml
tuned.tuned.openshift.io/ports created
$ oc get tuned -n openshift-cluster-node-tuning-operator
NAME      AGE
default   6h32m
ports     31s
  • Let’s check the value of net.ipv4.ip_local_port_range on each node:
$ for i in $(oc get nodes --no-headers -o=custom-columns=NAME:.metadata.name); do echo $i; oc debug node/$i -- chroot /host sysctl net.ipv4.ip_local_port_range; done

In my case each node has the same range:

net.ipv4.ip_local_port_range = 32768    60999
  • Create our own app and label it correctly
$ oc new-project my-cool-project
$ oc new-app django-psql-example
$ oc get pods -o wide -n my-cool-project | grep Running
django-psql-example-1-pgd67    1/1     Running     0          3m15s   10.128.2.10   skordas0813-6p5bl-worker-centralus3-nrh4t   <none>           <none>
postgresql-1-cw86k             1/1     Running     0          5m12s   10.131.0.14   skordas0813-6p5bl-worker-centralus1-tkbxs   <none>           <none>
$ oc label pod postgresql-1-cw86k -n my-cool-project cool-app=
$ oc label pod django-psql-example-1-pgd67 -n my-cool-project cool-app=extended-range
  • Check net.ipv4.ip_local_port_range once again on each node:
$ for i in $(oc get nodes --no-headers -o=custom-columns=NAME:.metadata.name); do echo $i; oc debug node/$i -- chroot /host sysctl net.ipv4.ip_local_port_range; done

On node skordas0813-6p5bl-worker-centralus3-nrh4t the value of net.ipv4.ip_local_port_range has been changed

net.ipv4.ip_local_port_range = 1024     65535

because a pod labeled cool-app=extended-range is running on this node!
If you change the matching label, or delete the pod, the project, or the ‘ports’ Tuned resource, the range will be set back to the default kernel values.
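For example, either of the following (using the pod name from the output above) would revert the node to the defaults once the operator re-evaluates its profile:

$ oc label pod django-psql-example-1-pgd67 -n my-cool-project cool-app-
$ oc delete tuned ports -n openshift-cluster-node-tuning-operator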

Everything is managed by the OpenShift Cluster Node Tuning Operator and the profiles you use, so you don’t need to tweak any values on the nodes’ operating system. This results in an automatic transmission-like experience for operators of OpenShift.
