Which techniques enable efficient live migration of GPUs across cloud hosts?

Cloud providers and researchers pursue live GPU migration to support maintenance, scaling, and fault tolerance while preserving high-performance workloads. Efficient migration requires coordinating device virtualization, state transfer, and network transport so that application-visible downtime stays minimal and security and locality requirements are respected. Evidence from systems research and vendor documentation underpins the practical techniques below and their trade-offs.

Core techniques

Early work by Christopher Clark and colleagues at the University of Cambridge introduced and evaluated iterative pre-copy live migration for memory-dominated VMs; post-copy strategies followed in later research, and both ideas translate to GPU memory and context state. In pre-copy migration the bulk of GPU memory is copied iteratively while the guest continues running; dirty-tracking bitmaps limit repeated transfers, but the residual dirty state must still be synchronized during a brief pause before resume. Post-copy migration resumes execution at the destination immediately and fetches missing pages on demand, reducing total migration time at the cost of remote page faults and temporarily degraded performance.
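
To make the pre-copy control loop concrete, here is a minimal, self-contained Python sketch. It models GPU memory as fixed-size pages with a dirty-page set standing in for driver-maintained dirty-tracking bitmaps; SimulatedGpu, PAGE_COUNT, STOP_THRESHOLD, and MAX_ROUNDS are illustrative assumptions, not any vendor's API.

```python
import random

PAGE_COUNT = 1024          # pages of GPU memory in this toy model
STOP_THRESHOLD = 16        # switch to stop-and-copy when this few pages remain dirty
MAX_ROUNDS = 5             # bound the number of pre-copy iterations

class SimulatedGpu:
    """Stand-in for driver hooks: page access, dirty tracking, run state."""
    def __init__(self):
        self.pages = {i: b"\x00" * 4 for i in range(PAGE_COUNT)}
        self.dirty = set(self.pages)          # first round: everything is dirty
        self.running = True

    def run_workload(self):
        # Simulate the guest touching a few pages between copy rounds.
        if self.running:
            for page in random.sample(range(PAGE_COUNT), 32):
                self.pages[page] = random.randbytes(4)
                self.dirty.add(page)

    def collect_dirty(self):
        # Atomically read and clear the dirty set, like a dirty-bitmap read.
        dirty, self.dirty = self.dirty, set()
        return dirty

def precopy_migrate(src, dst):
    for _ in range(MAX_ROUNDS):
        to_send = src.collect_dirty()
        if len(to_send) <= STOP_THRESHOLD:
            src.dirty |= to_send              # defer the small residue to stop-and-copy
            break
        for page in to_send:
            dst.pages[page] = src.pages[page]
        src.run_workload()                    # guest keeps running, re-dirtying pages
    src.running = False                       # brief pause: stop-and-copy phase
    for page in src.collect_dirty():
        dst.pages[page] = src.pages[page]
    dst.running = True                        # destination resumes with full state

if __name__ == "__main__":
    source, destination = SimulatedGpu(), SimulatedGpu()
    destination.running = False
    precopy_migrate(source, destination)
    print("pages in sync:", source.pages == destination.pages)
```

The loop keeps copying while the guest runs and pauses only for the small residual set, which is what bounds application-visible downtime in the pre-copy approach.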

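The complementary post-copy path can be sketched the same way: execution resumes at the destination at once, and pages that have not yet arrived are fetched on demand while a background thread streams the rest. PostCopyDestination and fetch_remote below are hypothetical stand-ins for an RDMA read or RPC back to the source host.

```python
import threading

PAGE_COUNT = 1024

class PostCopyDestination:
    """Toy post-copy receiver: any access to a page that has not yet arrived
    triggers a demand fetch from the source (a simulated page fault)."""
    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote       # hypothetical remote-read hook
        self.pages = {}                        # pages resident so far
        self.lock = threading.Lock()

    def read_page(self, page_id):
        with self.lock:
            if page_id not in self.pages:      # "page fault": block on remote fetch
                self.pages[page_id] = self.fetch_remote(page_id)
            return self.pages[page_id]

    def background_push(self, order):
        # Source proactively streams remaining pages so faults become rarer.
        for page_id in order:
            with self.lock:
                if page_id not in self.pages:
                    self.pages[page_id] = self.fetch_remote(page_id)

if __name__ == "__main__":
    source_memory = {i: bytes([i % 256]) * 4 for i in range(PAGE_COUNT)}
    dst = PostCopyDestination(fetch_remote=lambda p: source_memory[p])

    pusher = threading.Thread(target=dst.background_push, args=(range(PAGE_COUNT),))
    pusher.start()
    print(dst.read_page(7))                    # served by demand fetch or prior push
    pusher.join()
    print("resident pages:", len(dst.pages))
```
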
At the device layer, vendor solutions from NVIDIA Corporation and platform work from VMware, Inc. implement vGPU and mediated-device models that present virtual GPUs to guests while enabling centralized management of GPU state. Techniques include serializing and checkpointing GPU contexts, copying application buffers, and preserving driver-managed metadata. For user-space and containerized stacks, CRIU (Checkpoint/Restore In Userspace), started by Pavel Emelyanov at Virtuozzo, enables checkpoint/restore of process state, which can be combined with device-state transfer to reconstruct a running GPU-accelerated process elsewhere.
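
A rough orchestration of that combination might look like the following Python sketch. The criu dump/restore invocations use real CRIU options (-t, -D, --shell-job); save_gpu_state and restore_gpu_state are hypothetical hooks for whatever vendor tooling exports the vGPU or driver context, since CRIU itself does not capture GPU device state.

```python
import pathlib
import subprocess

def save_gpu_state(pid: int, out_dir: pathlib.Path) -> None:
    # Hypothetical hook: quiesce the device and serialize contexts and
    # buffers for the target process here.
    (out_dir / "gpu_state.bin").write_bytes(b"")

def restore_gpu_state(in_dir: pathlib.Path) -> None:
    # Hypothetical hook: reload device contexts before the process resumes
    # issuing GPU work.
    _ = (in_dir / "gpu_state.bin").read_bytes()

def checkpoint(pid: int, images_dir: pathlib.Path) -> None:
    images_dir.mkdir(parents=True, exist_ok=True)
    save_gpu_state(pid, images_dir)                    # device state first
    subprocess.run(
        ["criu", "dump", "-t", str(pid), "-D", str(images_dir), "--shell-job"],
        check=True,
    )

def restore(images_dir: pathlib.Path) -> None:
    restore_gpu_state(images_dir)                      # then CPU-side process state
    subprocess.run(
        ["criu", "restore", "-D", str(images_dir), "--shell-job"],
        check=True,
    )

# Usage (requires root and a checkpointable target process):
#   checkpoint(pid=12345, images_dir=pathlib.Path("/tmp/ckpt"))
#   ... copy /tmp/ckpt to the destination host ...
#   restore(images_dir=pathlib.Path("/tmp/ckpt"))
```
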

Network and storage techniques accelerate transfers: zero-copy RDMA, compression and deduplication of GPU memory pages, delta encoding between successive checkpoints, and use of high-bandwidth links such as NVLink or intra-rack fabrics. Page-tracking hardware and driver cooperation are essential to avoid excessive re-transfer. Latency-sensitive inference workloads benefit from pre-warming and staged migration to keep tail latencies bounded.
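
As an illustration of the bandwidth-saving side, the following sketch hashes fixed-size pages to detect unchanged content, compresses only the pages that differ between two successive checkpoints, and reapplies the delta at the destination. The 64 KiB page size and SHA-256 digests are illustrative assumptions rather than any particular product's on-the-wire format.

```python
import hashlib
import zlib

PAGE_SIZE = 64 * 1024   # treat GPU memory as fixed-size pages for tracking

def page_digests(snapshot: bytes) -> list[bytes]:
    # Content hashes per page, used to skip unchanged (deduplicable) pages.
    return [hashlib.sha256(snapshot[i:i + PAGE_SIZE]).digest()
            for i in range(0, len(snapshot), PAGE_SIZE)]

def encode_delta(prev: bytes, curr: bytes) -> dict[int, bytes]:
    """Return {page_index: compressed_page} for pages that changed."""
    prev_hashes = page_digests(prev)
    delta = {}
    for idx, digest in enumerate(page_digests(curr)):
        if idx >= len(prev_hashes) or digest != prev_hashes[idx]:
            page = curr[idx * PAGE_SIZE:(idx + 1) * PAGE_SIZE]
            delta[idx] = zlib.compress(page)
    return delta

def apply_delta(prev: bytes, delta: dict[int, bytes]) -> bytes:
    # Rebuild the new checkpoint from the old one plus the changed pages.
    pages = [prev[i:i + PAGE_SIZE] for i in range(0, len(prev), PAGE_SIZE)]
    for idx, blob in delta.items():
        pages[idx] = zlib.decompress(blob)
    return b"".join(pages)

if __name__ == "__main__":
    old = bytes(4 * PAGE_SIZE)                            # baseline checkpoint
    new = bytearray(old)
    new[PAGE_SIZE:PAGE_SIZE + 16] = b"updated_weights!"   # one page changed
    delta = encode_delta(old, bytes(new))
    assert apply_delta(old, delta) == bytes(new)
    print("changed pages:", list(delta),
          "bytes sent:", sum(len(b) for b in delta.values()))
```
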

Causes, consequences, and broader nuances

The need for live GPU migration stems from operational causes: hardware maintenance, autoscaling, spot-instance reclamation, and load balancing across hosts. The consequences include transient performance degradation, increased network and energy use during migration windows, and potential residual data leakage in GPU memory unless zeroization and strict tenancy controls are applied. Jurisdictional factors matter when GPUs carry sensitive datasets; moving GPU-resident data across regions can conflict with data residency laws and local norms, so providers often restrict migration across legal boundaries. Human factors also influence adoption: operations teams need clear tooling, vendor-certified drivers, and documented procedures before they will trust live migration in production. Balancing performance, security, and regulatory constraints is central to deploying these techniques at cloud scale.