Breaking Down CPU Utilization

Ever looked at CPU usage and wondered why it's broken into user, system, idle, iowait, steal, nice, and more? Why not just a single "CPU utilization" metric?
Let's explore these metrics on a Linux system. Run the top command, and you'll see multiple CPU usage columns. Each one represents a different type of CPU activity.

Understanding CPU Usage Metrics

  1. us (User Time) – Time spent on user-space processes (application code).
  2. sy (System Time) – Time spent on kernel-space processes (OS tasks).
  3. ni (Nice Time) – Time spent on user processes with modified priority (nice values).
  4. id (Idle Time) – When the CPU is not doing any work. CPU utilization can be calculated as 100 - idle%.
  5. wa (IOWait Time) – Time waiting for I/O operations (e.g., slow disk or network).
  6. hi (Hardware Interrupts) – Time spent handling hardware events (e.g., disk, network, keyboard).
  7. si (Software Interrupts) – Time spent handling software-generated interrupts (e.g., timers, system calls).
  8. st (Steal Time) – CPU time "stolen" by the hypervisor in virtualized environments (time taken by other VMs).
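These values come from /proc/stat, where the kernel exposes each category as a cumulative tick counter; top turns the deltas into percentages. Two quick ways to look at the raw data (mpstat requires the sysstat package to be installed):
head -1 /proc/stat    # aggregate counters: user nice system idle iowait irq softirq steal ...
mpstat -P ALL 1 1     # per-CPU percentage breakdown (%usr, %sys, %iowait, %steal, %idle, ...)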
Now, let’s break down each metric with real-world examples and see how each one helps in troubleshooting.

CPU Idle and CPU Utilization

Formula:
CPU Utilization = 100 - id
Or alternatively:
CPU Utilization = us + sy + ni + wa + hi + si + st
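As a minimal sketch, here is how that formula can be computed by hand from two samples of /proc/stat, assuming the standard field order (user, nice, system, idle, iowait, irq, softirq, steal):
# read the aggregate "cpu" line twice, one second apart
read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat
idle=$((i2 - i1))
total=$(((u2 + n2 + s2 + i2 + w2 + q2 + sq2 + st2) - (u1 + n1 + s1 + i1 + w1 + q1 + sq1 + st1)))
echo "CPU utilization: $((100 * (total - idle) / total))%"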
Is high CPU utilization a problem?
If CPU utilization is high, it doesn't always mean there is an issue - it might just be a sign that the system is actively working. The metric can also help assess whether a machine is efficiently utilized or is underutilized and potentially wasting resources; in the latter case, we might consider downgrading the CPU or shutting the machine down.
It's also important to understand that the CPU utilization metric accounts for all clock cycles spent on active tasks, including both user-space and kernel-space processes. With that in mind, the CPU utilization number doesn't always correspond to a performance issue. Consider these cases:
  • High CPU utilization (> 80%) but the system is responsive → The utilization is likely part of normal operation, so there is no need to worry about the value.
  • High CPU utilization (> 80%) but the system is slow → In this case, we need to check other system performance metrics, such as load average and running processes. This could indicate the system is struggling to handle its tasks. Are there too many processes for the system to manage? Is a specific process consuming excessive CPU? Begin troubleshooting by correlating CPU utilization with the tasks and processes involved, as shown below.
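A few commands that help with that correlation (a starting point, not an exhaustive checklist):
uptime                                      # load average; compare it against the core count
nproc                                       # number of CPU cores available
ps -eo pid,comm,%cpu --sort=-%cpu | head    # the processes currently consuming the most CPU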
In summary, high CPU utilization on its own doesn't mean a system is malfunctioning. It's only one piece of the puzzle in the overall performance picture. CPU utilization is, however, most useful for assessing how well a machine is handling its assigned workload.

User Time vs Kernel Time

Next, let’s dive into User Time (us) versus Kernel Time (sy). User time (us) refers to the CPU time spent on application-level processing, while kernel time (sy) involves tasks like memory management, process scheduling, and handling I/O operations.
To better understand this, let's look at a couple of case examples:
Case 1: High User Time (us) – CPU-Bound Workload
For applications that are computationally intensive, like machine learning, data processing, or video rendering, you'll see a high user time. These workloads primarily run application-level code, resulting in a high user/kernel ratio.
Run:
openssl speed
Or a simple shell busy loop, which spends nearly all of its time executing user-space instructions:
while :; do :; done    # press Ctrl+C to stop
🔍 Observation in top:
  • us is high
  • sy is low
This indicates a CPU-heavy application that is focused on executing user-level code. The system spends very little time on kernel operations.
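To see the same user/kernel split per process rather than system-wide, pidstat from the sysstat package is handy (assuming it is installed):
pidstat -u 1    # prints %usr and %system per process every second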
Case 2: High System Time (sy) – Kernel Overhead
On the other hand, when the system spends a significant amount of time on kernel operations (like system calls, disk I/O, or managing memory), you'll see a high system time. This typically happens with I/O-heavy tasks or processes that involve extensive system calls.
Run (on server):
sudo iperf3 -s
The server is waiting for incoming data and processing it via the network stack.
Run (on client):
iperf3 -c <server-ip> -t 60 -b 1G
Here, the client is sending data at a high rate (1 Gbps) to the server over the network for 60 seconds.
🔍 Observation in top:
  • sy is high
  • us is low
When we run tools like iperf3, the network traffic generated forces the kernel to handle large volumes of network packets. These packets require the kernel’s attention for processing, leading to high system time. The kernel is responsible for managing network buffers, handling protocol layers (like TCP/IP), and performing various system calls related to data transmission. This can significantly increase system time, especially when the bandwidth demand is high.
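One rough way to confirm that the kernel is busy with packet processing is to watch the network softirq counters climb while the iperf3 test runs (assuming a standard /proc layout):
watch -d 'grep -E "NET_RX|NET_TX" /proc/softirqs'    # per-CPU counters should increase rapidly during the test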
By examining the ratio between user and kernel time, you can gain insights into the type of workload your system is handling. Understanding this helps you tailor optimizations for better overall performance.

I/O Wait

The I/O wait (wa) metric represents the time the CPU spends waiting for I/O operations - such as disk reads/writes or network transfers - to complete. A high wa value means the system is bottlenecked by slow I/O performance rather than CPU processing power.
Case: High I/O Wait (wa) – Disk Bottleneck
The following command simulates disk-heavy write operations:
dd if=/dev/zero of=testfile bs=1G count=5 oflag=direct
🔍 Observation in top:
  • wa is high (> 20%)
  • CPU is not overloaded but waiting for slow disk operations
High I/O wait indicates that performance improvements should target the storage or network subsystems rather than the CPU itself.
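To confirm that storage, not the CPU, is the bottleneck, iostat from the sysstat package shows per-device utilization alongside the CPU's iowait (assuming sysstat is installed):
iostat -x 1    # watch the %util and await columns climb on the device being written to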

Steal Time – Virtualization and CPU Contention

When I run the top command, I notice that the st value stays at 0. So, what does this number mean? Why does it matter, and when does it rise above 0?
Steal time is only relevant in virtualized environments, where most modern servers run. In such environments, multiple virtual machines (VMs) share the host’s CPU resources, competing for processing power. When CPU cycles are reallocated from one VM to another, the %st (steal time) metric indicates that the VM is losing CPU time to serve other workloads.
How does this happen? The hypervisor determines CPU allocation among VMs. When the host is under high CPU demand and lacks sufficient capacity to handle all running VMs, some VMs may be deprioritized, causing their %st value to rise, which leads to performance degradation.
To mitigate this, consider migrating the VM to another physical server or allocating additional CPU resources. For long-term improvements, optimizing inefficient code—such as reducing memory bloat or minimizing excessive SQL queries—can also enhance performance.
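From inside the guest, steal time is easy to keep an eye on over time; for example, vmstat reports it in its CPU columns on recent procps versions:
vmstat 1 5    # the "st" column is the percentage of CPU time stolen by the hypervisor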

Interrupt Time (hi and si) – Diagnosing CPU Interrupts

Interrupt time is the proportion of time the CPU spends handling interrupts. An interrupt is essentially a signal that tells the CPU to temporarily pause normal processing and handle an urgent task, much like someone suddenly calling you and expecting you to pick up the phone.
There are two types of CPU interrupts: those that come from hardware and those that come from software.
Hardware interrupts (hi) occur when physical devices like disks, network cards, or USB peripherals request CPU attention. High hi value often indicates heavy device activity, such as intense network traffic or disk reads/writes.
Software interrupts (si) happen when programs request kernel services, like file access or network operations. High si can mean frequent system calls, often caused by excessive disk operations or inefficient processes.
Monitoring hi and si helps diagnose performance issues related to hardware activity or system call overhead. Interrupt handling is sometimes grouped together with system time, since servicing hardware or software interrupts is considered kernel-level work.
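To see which devices and softirq types are behind a high hi or si value, the kernel exposes per-CPU counters (a quick look, assuming a standard /proc layout):
cat /proc/interrupts    # hardware interrupt counts per CPU, broken down by IRQ line / device
cat /proc/softirqs      # software interrupt counts per CPU (TIMER, NET_RX, BLOCK, RCU, ...)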

Nice Time (ni) — Managing Low-Priority CPU Tasks

Nice time (ni) represents CPU usage by low-priority processes, which have a “nice” value greater than 0. The Linux scheduler ensures these tasks don’t interfere with normal or high-priority processes.
A high ni value means many low-priority processes are running, consuming CPU but not slowing down critical tasks. This is common for background jobs like indexing, backups, or data processing.
If system performance drops and ni is high, consider adjusting priority levels (renice) or allocating more resources.
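For example, a background job can be started at a lower priority with nice, and an already-running process can be deprioritized with renice (the PID and command here are only placeholders):
nice -n 10 tar czf /tmp/backup.tar.gz /home    # runs at nice 10, so its CPU time shows up under ni
sudo renice -n 10 -p 1234                      # raise the nice value (lower the priority) of PID 1234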

Summary