AMDGPU KFD pauses and restores all user queues on a per-process basis during SVM invalidation, TTM eviction, system suspend, and similar events. The core problem is coarse-grained control: pausing every queue leaves the GPU idle, makes the restore traversal expensive, and introduces visible jitter.
This article focuses on the full AMDGPU KFD queue quiesce/restore mechanism: why it exists, what it costs, and where it can be optimized.
[Technical Snapshot]
| Parameter | Details |
|---|---|
| Target component | AMDGPU KFD / ROCm SVM subsystem |
| Implementation language | C |
| Runtime environment | Linux Kernel / ROCm |
| Key protocols/mechanisms | MMU Notifier, TTM, SVM, Retry Fault |
| Focus areas | Queue Quiesce/Restore, XNACK, Eviction |
| Supported hardware | GFX9, Aldebaran (MI200), MI300, and others |
| Core dependencies | amdgpu, kfd_svm.c, kfd_process.c, TTM |
KFD quiesce/restore is fundamentally a safety strategy built around “stop first, repair next, resume last.” As soon as the GPU address space of a process becomes inconsistent, the driver prefers to pause all user queues for that process instead of only pausing the queue that appears to be affected.
This is not a crude implementation. It is a conservative design shaped by the user queue model, shared page tables, and limited kernel visibility. Understanding that constraint is the prerequisite for evaluating both performance costs and optimization boundaries.
The nine quiesce/restore trigger scenarios can be grouped into four main sources
The most common source is the SVM path, especially MMU notifier invalidation when XNACK is off. The second major source is process-level eviction triggered by TTM under VRAM pressure. System suspend/resume and CRIU are lower-frequency control-plane operations.
| Category | Representative scenario | Quiesce entry | Restore entry | Frequency |
|---|---|---|---|---|
| SVM | MMU invalidation | svm_range_evict() | svm_range_restore_work() | Medium-high |
| SVM | Queue-vital buffer unmap | svm_range_unmap_from_cpu() | None | Extremely low |
| TTM | VRAM pressure eviction | evict_process_worker() | restore_process_worker() | Low-medium |
| System | Suspend/resume | kfd_suspend_all_processes() | kfd_resume_all_processes() | Extremely low |
| CRIU | Checkpoint/restore | criu_checkpoint() | Subsequent restore flow | Extremely low |
/* Only the first eviction triggers full-process quiesce to avoid repeated pauses */
evicted_ranges = atomic_inc_return(&svms->evicted_ranges);
if (evicted_ranges != 1)
return r; // Subsequent ranges only increase the counter and do not stop queues again
r = kgd2kfd_quiesce_mm(mm, KFD_QUEUE_EVICTION_TRIGGER_SVM);
This logic deduplicates repeated events through counting and compresses a burst of invalidations into a single queue stop.
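The restore side has a matching gate. A condensed sketch based on svm_range_restore_work() in kfd_svm.c (simplified here): queues are resumed only if the counter can be atomically cleared, meaning no new invalidation arrived while the worker was revalidating.
/* Resume queues only if no new eviction bumped the counter meanwhile */
if (atomic_cmpxchg(&svms->evicted_ranges, evicted_ranges, 0) != evicted_ranges)
    goto out_reschedule; /* A new eviction raced in; retry later */
r = kgd2kfd_resume_mm(mm); /* Restart all user queues of the process */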
Per-process granularity is determined jointly by the shared VM and the user queue submission model
All user queues within the same process share one GPU page table, which means they also share the VMID and page directory base. As a result, once the PTE for a BO or SVM range becomes invalid, any queue may, in principle, touch the invalid address.
The driver does not know which queue is definitely safe, so it can only choose to pause all of them. This is a textbook example of correctness taking priority over performance.
/* A queue is bound to a process-level page table, not a queue-private page table */
queue_input.process_id = pdd->pasid;
queue_input.page_table_base_addr = qpd->page_table_base; // All queues share the same page table
queue_input.process_va_start = 0;
queue_input.process_va_end = adev->vm_manager.max_pfn - 1;
This initialization code makes the isolation boundary explicit: the address space boundary exists at the process level, not at the queue level.
The kernel cannot reliably map BO access back to individual queues
In the graphics CS ioctl model, the kernel can inspect bo_list, so it can track resource dependencies per job. In the KFD user queue model, however, user space writes directly to the queue and doorbell, and the kernel cannot see the submitted dispatch content.
That means the kernel cannot answer two critical questions: which BOs a given queue is currently accessing, and which queues must be stopped when a specific BO is invalidated. Even if you attempted to build such tracking, dynamic map/unmap behavior, SVM migration, and shader pointer-based access patterns would break it.
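To make the visibility gap concrete, here is a hedged user-space sketch (not kernel code) of how an HSA dispatch reaches the GPU; queue_base, queue_size, wptr, and doorbell are illustrative stand-ins for state that the ROCr runtime actually manages.
/* User space fills an AQL packet in the ring buffer and rings the doorbell
 * with plain memory/MMIO writes; no ioctl carries a BO list, so the kernel
 * never learns which buffers this dispatch touches. */
hsa_kernel_dispatch_packet_t *pkt = &queue_base[wptr % queue_size];
pkt->kernel_object = kernel_code_va; // GPU VA of the shader, opaque to KFD
pkt->kernarg_address = kernarg_va; // May reference any mapped BO or SVM range
__atomic_store_n(&pkt->header, header, __ATOMIC_RELEASE);
*doorbell = ++wptr; // Mapped doorbell page; bypasses the kernel entirely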
The SVM restore path is the primary runtime performance bottleneck
Quiesce itself is often not the most expensive step. Restore is what usually hurts. During restore, the driver must acquire locks serially, traverse the range list, validate each entry, and rebuild GPU mappings one by one.
This becomes especially expensive for HPC workloads with large working sets. Even when only a small number of ranges are invalid, the implementation may still scan the entire SVM range list, which causes restore cost to scale linearly with total range count.
list_for_each_entry(prange, &svms->list, list) {
invalid = atomic_read(&prange->invalid);
if (!invalid)
continue; // Most ranges may only be visited and skipped
svm_range_validate_and_map(...); // Rebuild the GPU mapping for this range
}
This code exposes the core issue in the current implementation: restore complexity is closer to O(total_ranges) than O(evicted_ranges).
In XNACK-off mode, the main optimization space is concentrated in the restore path
The most valuable low-risk optimization is to introduce a dedicated evicted list. With that change, restore no longer needs to scan all ranges and only processes nodes that were actually marked by invalidation.
The second target is lock granularity. Today, process_info->lock, the mmap lock (taken via mmap_write_lock), and svms->lock are often held for too long, which increases contention and slows fault handling as well as other memory-management paths.
/* Before optimization: scan all ranges */
list_for_each_entry(prange, &svms->list, list) {
if (!atomic_read(&prange->invalid))
continue;
svm_range_validate_and_map(...);
}
/* After optimization: scan only evicted ranges */
list_for_each_entry_safe(prange, next, &svms->evicted_list, evict_link) {
svm_range_validate_and_map(...); // Restore only ranges that were actually invalidated
list_del(&prange->evict_link); // Remove from the evicted list after successful restore
}
The benefit is direct and measurable: restore complexity drops from full-list scanning to demand-driven scanning.
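For completeness, a minimal sketch of what both changes could look like; evicted_list and evict_link are illustrative names for fields that do not exist upstream today, and declarations and error handling are elided.
/* Marking side (MMU notifier path): record exactly which ranges were hit */
atomic_inc(&prange->invalid);
if (list_empty(&prange->evict_link))
    list_add_tail(&prange->evict_link, &svms->evicted_list);

/* Restore side: splice the list under svms->lock, then do the expensive
 * validate/map work outside the lock to shorten hold times */
LIST_HEAD(local_list);
mutex_lock(&svms->lock);
list_splice_init(&svms->evicted_list, &local_list);
mutex_unlock(&svms->lock);
list_for_each_entry_safe(prange, next, &local_list, evict_link) {
    list_del_init(&prange->evict_link);
    svm_range_validate_and_map(...); // On failure, re-link and reschedule
}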
Delaying quiesce to batch invalidations is not viable
Intuitively, adding a time window around invalidation seems like a way to reduce jitter. But the MMU notifier contract is strict: before the invalidate callback returns, the device side must stop accessing invalid pages.
That means quiesce cannot be delayed until “later.” Otherwise, the GPU may still access stale PTEs and violate consistency semantics. This is why many seemingly reasonable batching optimizations do not work here.
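The shape of this contract is visible in the notifier callback itself. A condensed sketch modeled on svm_range_cpu_invalidate_pagetables() in kfd_svm.c, with details elided:
/* The invalidate callback must guarantee, before returning, that the GPU no
 * longer touches the invalidated pages; deferring the quiesce past this
 * return would let in-flight waves read stale PTEs. */
static bool svm_range_cpu_invalidate_pagetables(struct mmu_interval_notifier *mni,
                                                const struct mmu_notifier_range *range,
                                                unsigned long cur_seq)
{
    ...
    mmu_interval_set_seq(mni, cur_seq);
    svm_range_evict(prange, mni->mm, start, last, range->event); // Quiesce runs synchronously here
    ...
    return true;
}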
XNACK-on is the fundamental solution for eliminating quiesce/restore on the SVM path
When XNACK is on, the GPU no longer treats an invalid PTE as fatal. Instead, it raises a retry fault. At that point, only the wavefront that hit the fault is paused, the kernel repairs the mapping on demand, and the hardware retries automatically.
The key change is structural: no full-process quiesce, no full restore pass, and no need to wait for all queues to resume before execution continues.
if (!p->xnack_enabled ||
(prange->flags & KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED)) {
evicted_ranges = atomic_inc_return(&svms->evicted_ranges);
if (evicted_ranges != 1)
return r;
r = kgd2kfd_quiesce_mm(mm, KFD_QUEUE_EVICTION_TRIGGER_SVM);
queue_delayed_work(..., &svms->restore_work, ...);
} else {
svm_range_unmap_from_gpus(prange, s, l, trigger); // Only invalidate GPU PTEs without stopping all queues
}
This branch almost defines the dividing line between the XNACK-on and XNACK-off performance models.
Retry fault shrinks restore granularity from the whole process to a single range
On the XNACK-on path, the fault handler restores only the address range that the current access actually hits. Other wavefronts can continue running, and the GPU does not go idle just because a small part of the address space changed.
That shifts SVM workloads from a “pause everything, then restore globally” model to an on-demand fault-and-map model, which is especially important for large-scale heterogeneous memory workloads.
prange = svm_range_from_addr(svms, addr, NULL);
...
r = svm_range_validate_and_map(mm, start, last, prange,
gpuidx, false, false, false); // Repair only the target range hit by the fault
The core value is not simply faster restore. It is narrower restore scope and lower runtime disruption.
XNACK-on still has limits and costs, but it remains the better default option
XNACK-on does not eliminate quiesce for TTM eviction, system suspend/resume, CRIU, or queue-vital buffer unmap scenarios. It primarily solves the most common runtime performance problem: SVM invalidation.
Its costs include retry fault interrupt overhead, more frequent TLB management, and hardware compatibility limits. A notable example is GFX10+, where XNACK is typically left disabled because the hardware cannot reliably preempt shaders while a page fault is being handled.
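The hardware gating is visible in the mode-selection logic. A condensed sketch of kfd_process_xnack_mode() in kfd_process.c (simplified; field layout follows recent kernels, and declarations are elided):
/* A process can run with XNACK on only if every GPU it uses tolerates retry faults */
for (i = 0; i < p->n_pdds; i++) {
    struct kfd_node *dev = p->pdds[i]->dev;

    if (supported && KFD_SUPPORT_XNACK_PER_PROCESS(dev))
        continue; // Aldebaran and newer select XNACK per process
    if (KFD_GC_VERSION(dev) >= IP_VERSION(10, 1, 1))
        return false; // GFX10+: no reliable shader preemption during faults
    if (dev->kfd->noretry)
        return false; // Otherwise the mode must match the boot-time setting
}
return true;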
Platforms that support per-process XNACK should enable it first
Platforms such as Aldebaran (MI200) and MI300 can configure XNACK per process, making them the most suitable AMDGPU environments for high-frequency SVM workloads. The gains are most obvious for tasks with large working sets and frequent CPU/GPU cooperative memory access.
If hardware or deployment constraints prevent XNACK from being enabled, then the most practical roadmap in XNACK-off mode is to prioritize the evicted-list optimization and improve lock granularity.
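In practice, ROCm selects the per-process mode at application launch through the HSA_XNACK environment variable (HSA_XNACK=1 requests retry-fault mode); the kernel-side checks sketched above then decide whether the request can be honored.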
[AI Visual Insight] Figure: the shift from “pause and restore the entire process” toward smarter, on-demand memory-consistency handling, reflecting the evolution of GPU drivers for AI and HPC workloads.
The conclusion is that the real optimization target is not quiesce itself, but the restore model
KFD uses per-process quiesce/restore not because of a design mistake, but because it is the inevitable result of combining user queues with a shared VM. What truly determines the performance ceiling is restore-path scan granularity, lock contention, and whether retry fault support is available.
The takeaway can be reduced to three points: optimize partial restore in XNACK-off mode, avoid delayed-batching schemes that violate MMU notifier semantics, and enable XNACK-on first on supported hardware.
FAQ
Q1: Why can’t KFD pause only the queues that access the problematic BO?
Because under the user queue model the kernel cannot see the dispatch content submitted from user space, it cannot build a reliable BO-to-queue access map. In addition, queues within the same process share the same page table, which forces the driver to conservatively pause every queue in the process.
Q2: Can XNACK-on completely eliminate all quiesce/restore operations?
No. It mainly removes full-process pauses on the SVM MMU invalidation path. Scenarios such as TTM BO eviction, system suspend/resume, CRIU, and queue-vital buffer unmap still require process-level quiesce.
Q3: If the current platform cannot enable XNACK, what optimization should come first?
The first priority is to introduce an evicted list, reducing restore from O(total_ranges) to O(evicted_ranges). The second priority is to shorten lock hold times during restore, particularly for the mmap lock and svms->lock.
[AI Readability Summary]
This article systematically reconstructs the AMDGPU KFD Queue Quiesce/Restore mechanism, covering nine trigger scenarios, the architectural reasons behind per-process granularity, the performance bottlenecks in SVM and TTM paths, and both practical XNACK-off optimizations and the long-term XNACK-on solution.