This blog post explores node-level tuning in OpenShift, outlining both the default configurations managed by the platform and the customizable options available to cluster administrators for optimizing performance.
Red Hat OpenShift Container Platform (RHOCP) is designed with out-of-the-box settings that efficiently support most general-purpose workloads, often eliminating the need for manual node-level adjustments. However, specific use cases—such as resource-intensive applications or specialized environments—may require fine-tuning to enhance performance. While these optimizations are typically implemented post-deployment (referred to as “day 2” operations), certain scenarios might demand configuration changes during the initial cluster setup (“day 1”) to align with unique workload requirements.
Option 1: Use the default settings
Like Red Hat Enterprise Linux (RHEL), OpenShift is preconfigured with optimized settings for general-purpose workloads by default. A key component enabling this tuning is the Node Tuning Operator (NTO), a foundational OpenShift operator that automates system performance adjustments.
The NTO applies a “parent OpenShift profile” that modifies kernel parameters to improve performance under demanding workloads or in large-scale clusters. These adjustments—such as increasing kernel limits—enhance responsiveness and scalability during periods of high system load. However, this optimization comes with a tradeoff: the tuned configurations may consume more kernel memory to support these performance gains.
cat tuned.conf
#
# tuned configuration
#
[main]
summary=Optimize systems running OpenShift (parent profile)
include=${f:virt_check:virtual-guest:throughput-performance}
[selinux]
avc_cache_threshold=8192
[net]
nf_conntrack_hashsize=1048576
[sysctl]
kernel.pid_max=>4194304
fs.aio-max-nr=>1048576
net.netfilter.nf_conntrack_max=1048576
net.ipv4.conf.all.arp_announce=2
net.ipv4.neigh.default.gc_thresh1=8192
net.ipv4.neigh.default.gc_thresh2=32768
net.ipv4.neigh.default.gc_thresh3=65536
net.ipv6.neigh.default.gc_thresh1=8192
net.ipv6.neigh.default.gc_thresh2=32768
net.ipv6.neigh.default.gc_thresh3=65536
vm.max_map_count=262144
[sysfs]
/sys/module/nvme_core/parameters/io_timeout=4294967295
/sys/module/nvme_core/parameters/max_retries=10
[scheduler]
# see rhbz#1979352; exclude containers from aligning to house keeping CPUs
cgroup_ps_blacklist=/kubepods.slice/
# workaround for rhbz#1921738
runtime=0
To view the values of the default performance tuning profile that is applied, open a shell in the tuned pod running in the openshift-cluster-node-tuning-operator namespace and go to the /usr/lib/tuned directory.
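As a minimal sketch of that inspection (the pod name tuned-abcde is a placeholder, and the exact subdirectory under /usr/lib/tuned can vary between releases):

# List the tuned daemon pods managed by the Node Tuning Operator
oc get pods -n openshift-cluster-node-tuning-operator

# Open a shell in one of the tuned pods (substitute a pod name from the output above)
oc rsh -n openshift-cluster-node-tuning-operator tuned-abcde

# Inside the pod, the shipped profiles live under /usr/lib/tuned,
# each profile in its own subdirectory
ls /usr/lib/tuned
cat /usr/lib/tuned/openshift/tuned.conf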
The OpenShift platform’s performance tuning is rooted in the throughput-performance profile, the default configuration recommended for server environments (inherited from Red Hat Enterprise Linux). This baseline is enhanced with additional functional and performance-focused adjustments tailored to OpenShift’s needs:
Functional Tunables
- Resolve ARP failures: Address pod communication issues caused by ARP failures between nodes and pods.
- Increase vm.max_map_count: Ensures Elasticsearch pods start cleanly by adjusting memory mapping limits.
- Adjust kernel.pid_max: Supports higher pod density and stability for heavy workloads by raising the maximum process ID limit.
Performance Tunables
- Scale cluster capacity: Enable support for large clusters with thousands of routes.
- Optimize Netfilter conntrack hash table parameters: Adjust connection tracking limits to handle high network traffic.
- Modify AVC cache thresholds: Reduce CPU overhead and improve node performance by tuning SELinux cache behavior.
- Expand VM allocation: Allow more virtual machines (VMs) to run on Red Hat OpenShift Container Platform (RHOCP) nodes for workload flexibility.
- Tune CPU scheduling:
- Exclude containers from being aligned to housekeeping CPUs via the TuneD scheduler plugin's cgroup_ps_blacklist setting.
- Disable dynamic adjustments from the TuneD scheduler plugin (runtime=0) for predictable performance.
Recent Updates in RHOCP
- The openshift-control-plane profile now inherits directly from the base OpenShift profile.
- The openshift-node profile includes operational parameters like fs.inotify settings, which mirror values preconfigured by the Machine Config Operator (MCO) to ensure they apply before the kubelet initializes.
- Network latency reduction: Enable net.ipv4.tcp_fastopen=3 to allow data exchange during the initial TCP handshake (SYN phase), accelerating client-server connections.
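These defaults can be confirmed directly on a node. The sketch below uses oc debug to query a few of the tuned sysctls on a worker host (worker-0 is a placeholder; use a node name from your cluster):

# Query a few of the tuned sysctls on a worker node
oc debug node/worker-0 -- chroot /host \
  sysctl kernel.pid_max vm.max_map_count net.ipv4.tcp_fastopen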
Option 2: Custom Tuning
Following the guidance in the Linux kernel documentation for the ixgb driver, here is an example Tuned CR that optimizes a system with a 10 Gigabit Intel(R) network interface card for throughput.
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-network-tuning
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Increase throughput for NICs using ixgb driver
      include=openshift-node
      [sysctl]
      ### CORE settings (mostly for socket and UDP effect)
      # Set maximum receive socket buffer size.
      net.core.rmem_max = 524287
      # Set maximum send socket buffer size.
      net.core.wmem_max = 524287
      # Set default receive socket buffer size.
      net.core.rmem_default = 524287
      # Set default send socket buffer size.
      net.core.wmem_default = 524287
      # Set maximum amount of option memory buffers.
      net.core.optmem_max = 524287
      # Set number of unprocessed input packets before kernel starts dropping them.
      net.core.netdev_max_backlog = 300000
    name: openshift-network-tuning
  recommend:
  - match:
    - label: node-role.kubernetes.io/worker
    priority: 20
    profile: openshift-network-tuning
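Once the CR is created, the operator rolls the custom profile out to matching nodes. A minimal sketch of applying and verifying it, assuming the manifest is saved as openshift-network-tuning.yaml:

# Create the Tuned CR
oc create -f openshift-network-tuning.yaml

# Check which TuneD profile each node has applied
oc get profiles.tuned.openshift.io -n openshift-cluster-node-tuning-operator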
Option 3: Low-Latency Tuning for Specialized Workloads
Workloads such as Telco 5G Core User Plane Function (UPF), Financial Services Industry (FSI) applications, and High-Performance Computing (HPC) often require real-time, low-latency tuning. However, achieving this level of optimization involves trade-offs:
- Real-time kernels sacrifice overall throughput.
- Static partitioning of system resources (e.g., dividing CPUs into “housekeeping” and “workload” partitions) risks oversubscribing critical resources and undermines OpenShift’s ability to dynamically allocate compute resources.
- Power consumption may increase due to aggressive performance configurations.
Challenges of Static Partitioning
Effective partitioning requires coordination across multiple layers to avoid resource contention:
- Isolate management pods from workload pods.
- Use Guaranteed QoS pods for critical workloads.
- Dedicate specific CPUs to system processes and kernel threads.
- Redirect NIC interrupt requests (IRQs) to housekeeping CPUs.
Strategies to Reduce Software Latency
Beyond partitioning, additional optimizations include:
- Real-time kernel: Prioritize deterministic task scheduling.
- Huge Pages (per NUMA node): Minimize TLB misses for memory-intensive workloads.
- Disable CPU load balancing: For DPDK-based applications to avoid core migrations.
- Remove CPU CFS quotas: Eliminate throttling for latency-sensitive tasks.
- Hyperthreading disablement: Reduce latency variability.
- BIOS tuning: Optimize hardware-level settings (e.g., power management, prefetching).
Automating Configuration with the Performance Profile Controller
Manually implementing these optimizations is error-prone and requires meticulous coordination. The Node Tuning Operator’s (NTO) Performance Profile Controller simplifies this by orchestrating configurations across components such as:
- The Linux kernel (via boot parameters or real-time settings).
- TuneD profiles for system tuning.
- Kubelet (CPU pinning, topology, memory management).
- CRI-O (container runtime optimizations).
A PerformanceProfile custom resource defines the desired state, ensuring consistency across the stack. For example, a single-node OpenShift deployment might use a PerformanceProfile to align kernel, runtime, and workload settings with low-latency requirements.
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    isolated: "2-31,34-63"
    reserved: "0-1,32-33"
  globallyDisableIrqLoadBalancing: false
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      count: 32
      node: 0
  net:
    userLevelNetworking: false
    devices: []
  nodeSelector:
    node-role.kubernetes.io/master: ""
  numa:
    topologyPolicy: "best-effort"
  realTimeKernel:
    enabled: true
The PerformanceProfile allocates CPUs 2–31 and 34–63 for low-latency workloads, reserving the remaining four CPUs for system-level tasks. However, these reserved CPUs may not always handle device interrupts adequately. To mitigate this, the profile leaves IRQ load balancing enabled globally (globallyDisableIrqLoadBalancing: false), allowing interrupts to be processed on the isolated, latency-tuned CPUs. For granular control, administrators can still disable IRQ load balancing on specific pod CPUs using CRI-O annotations such as irq-load-balancing.crio.io: "disable" and cpu-quota.crio.io: "disable", ensuring critical workloads avoid interruptions.
The configuration includes 32 x 1 GiB huge pages on NUMA node 0 to reduce memory access overhead, alongside the Topology Manager's best-effort NUMA alignment policy to optimize resource locality. A real-time kernel is enabled to prioritize deterministic scheduling, and because userLevelNetworking is false, NIC queues are not restricted to the reserved CPUs, avoiding artificial bottlenecks for high-throughput networking.

Beyond these explicit settings, the Performance Profile Controller applies implicit system-level adjustments. For example, it activates the CPU Manager's static policy to enforce exclusive CPU assignments and mandates full physical core allocation under strict topology policies such as restricted or single-numa-node. To expedite reclamation of isolated CPUs from non-Guaranteed pods, the CPU Manager's reconcile period is shortened, albeit at the cost of increased resource consumption. The Memory Manager is configured to pin memory and huge pages near the allocated CPUs under these policies, while CRI-O is set up with a dedicated runtime class for low-latency workloads, referenced in pods via runtimeClassName. Additionally, TuneD manages the stalld daemon to preempt scheduler-induced latency spikes.
Despite these comprehensive adjustments, the PerformanceProfile alone does not guarantee low-latency operation. Pod specifications must explicitly reference the configured runtime class, apply relevant CRI-O annotations, and adhere to the Guaranteed QoS class by defining matching resource limits and requests. This ensures strict CPU isolation, prioritized scheduling, and alignment with the underlying tuned infrastructure.
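Before writing the pod specification, you can look up the runtime class the controller generated for the profile. This is a sketch, assuming the PerformanceProfile above is named performance; the generated class is typically named performance-<profile-name>, but verify it in your cluster:

# List runtime classes; the one created by the Performance Profile Controller
# is typically named after the profile (for example, performance-performance)
oc get runtimeclass

# If your version exposes it, the name can also be read from the profile status
oc get performanceprofile performance -o jsonpath='{.status.runtimeClass}{"\n"}'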
The following is a sample pod specification:
apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    # Disable CFS cpu quota accounting
    cpu-quota.crio.io: "disable"
    # Disable CPU load balancing with CRI-O
    cpu-load-balancing.crio.io: "disable"
    # Opt out of interrupt handling
    irq-load-balancing.crio.io: "disable"
spec:
  # Map to the correct performance runtime class
  runtimeClassName: get-from-performance-profile
  ...
  containers:
  - name: container-name
    image: image-registry/image
    resources:
      limits:
        memory: "2Gi"
        cpu: "16"
  ...
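Because only limits are set, Kubernetes defaults the requests to the same values, which places the pod in the Guaranteed QoS class. A quick check once the pod is running (using the pod name from the example above):

# Confirm the pod was admitted with the Guaranteed QoS class
oc get pod example -o jsonpath='{.status.qosClass}{"\n"}'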
It’s important to note that the Node Tuning Operator’s (NTO) Performance Profile controller automatically overrides any manual Kubelet configurations. To preserve custom Kubelet adjustments, administrators must explicitly annotate the PerformanceProfile with these settings. Similarly, additional TuneD configurations can be layered on top of or replace the defaults provided by the controller, offering flexibility to refine system tuning beyond the base profile.
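As an illustration of the annotation approach, the sketch below layers extra Kubelet settings onto the profile through the kubeletconfig.experimental annotation; treat the exact annotation name and the supported fields as assumptions that depend on your release, and check the documentation for your version before relying on them:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
  annotations:
    # Assumed annotation: inject additional Kubelet configuration (JSON)
    # on top of what the Performance Profile Controller generates
    kubeletconfig.experimental: |
      {"systemReserved": {"memory": "3Gi", "cpu": "500m"}}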
PerformanceProfiles introduce deeper partitioning into Red Hat OpenShift Container Platform (RHOCP), a strategy particularly relevant for latency-sensitive workloads like Data Plane Development Kit (DPDK) applications. These applications bypass kernel networking stacks to process packets directly in user space, requiring isolation from hardware interrupts. While such partitioning benefits real-time workloads, it carries inherent trade-offs. Reserved cores dedicated to workloads may leave insufficient resources for operating system processes or RHOCP management pods, risking instability or degraded cluster operations. For this reason, implementing granular partitioning demands rigorous planning and testing to balance performance gains against potential system-wide compromises.
Conclusion

Red Hat OpenShift Container Platform (RHOCP) administrators have several strategies to optimize node performance, but three critical considerations should guide their approach. First, while most tuning tasks—such as applying Node Tuning Operator (NTO) configurations or custom Tuned Profiles—can be implemented post-installation, administrators must decide whether adjustments are best made during or after cluster deployment. Second, strict resource partitioning via NTO’s PerformanceProfiles can isolate latency-sensitive workloads from “noisy neighbors,” but this comes at the cost of potentially inefficient CPU utilization. If minimizing resource contention is a priority despite this tradeoff, PerformanceProfiles offer a viable path. Finally, any node-level tuning strategy must account for the frequency of node reboots, as changes like kernel parameter adjustments or topology policies often require restarts. Balancing performance gains with cluster stability requires careful planning to minimize disruptions.