Near-term quantum processors are noisy, limited in qubit count, and sensitive to architecture-specific errors. Benchmarks translate those messy realities into actionable numbers that guide researchers, funders, and engineers. Different metrics answer different questions: gate-level accuracy, end-to-end algorithmic capability, or the fidelity with which a device samples from random quantum circuits. Clear benchmarks help prioritize the improvements that matter most for practical quantum advantage.
Core benchmarks
Randomized benchmarking measures average gate performance by applying random sequences of gates and observing the decay in success probability as sequences grow longer. E. Magesan, J. M. Gambetta, and J. Emerson, working at IBM Quantum and the University of Waterloo, developed formal randomized benchmarking protocols that are robust to state-preparation and measurement errors and scale far more favorably than full tomography; later variants can additionally separate coherent from incoherent error contributions. Gate fidelity extracted from randomized benchmarking is a bedrock metric for calibrating control electronics and pulse shaping.
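To make the fitting step concrete, here is a minimal Python sketch, assuming survival probabilities have already been measured at several sequence lengths. The function name rb_decay and all numbers are illustrative placeholders, not data from any real device.

```python
# Minimal sketch: extract average gate fidelity from randomized-benchmarking
# data by fitting the standard decay model  P(m) = A * p**m + B.
# Survival probabilities below are illustrative placeholders, not real data.
import numpy as np
from scipy.optimize import curve_fit

def rb_decay(m, A, p, B):
    """Standard RB decay model: survival probability after m random Cliffords."""
    return A * p**m + B

# Sequence lengths and (hypothetical) measured survival probabilities.
lengths = np.array([1, 5, 10, 25, 50, 100, 200], dtype=float)
survival = np.array([0.997, 0.988, 0.976, 0.941, 0.889, 0.803, 0.684])

(A, p, B), _ = curve_fit(rb_decay, lengths, survival, p0=[0.5, 0.99, 0.5])

d = 2                               # Hilbert-space dimension for one qubit
r = (d - 1) * (1 - p) / d           # average error per Clifford
print(f"decay parameter p      = {p:.5f}")
print(f"average gate fidelity  = {1 - r:.5f}")
```

Because the decay parameter p enters only through the ratio of successive survival probabilities, the fit is insensitive to state-preparation and measurement errors, which are absorbed into A and B.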
Quantum Volume is a holistic metric introduced and promoted by IBM Quantum, with contributions from Jay M. Gambetta, that folds qubit count, connectivity, and error rates into a single number. Quantum Volume is defined by the largest square circuit (equal width and depth) a machine can run while passing a heavy-output test, conventionally with success probability above two-thirds; it reflects not just raw qubit number but usable computational power, which matters when architectures have limited connectivity or high crosstalk.
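The sketch below illustrates only the decision rule: apply the two-thirds heavy-output threshold per circuit width and report 2 to the largest passing width. It omits the confidence-interval requirement of the full protocol, and the function name quantum_volume and all probabilities are invented for this example.

```python
# Minimal sketch of the Quantum Volume decision rule: a device "passes" width n
# if the measured heavy-output probability for random square (width = depth = n)
# model circuits exceeds 2/3, and log2(QV) is the largest passing width.
# Probabilities below are illustrative, not measurements from any device.

def quantum_volume(heavy_output_probs: dict[int, float],
                   threshold: float = 2 / 3) -> int:
    """Return 2**n for the largest contiguous width n passing the threshold."""
    qv_exponent = 0
    for n in sorted(heavy_output_probs):
        if heavy_output_probs[n] > threshold:
            qv_exponent = n
        else:
            break  # widths must pass contiguously, starting from the smallest
    return 2 ** qv_exponent

# Hypothetical heavy-output probabilities per circuit width.
probs = {1: 0.95, 2: 0.88, 3: 0.79, 4: 0.71, 5: 0.64}
print(quantum_volume(probs))  # -> 16: width 5 falls below the 2/3 threshold
```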
Cross-entropy benchmarking was central to Google Quantum AI’s Sycamore experiment led by John M. Martinis and Sergey Boixo. It compares the sampled output distribution of a quantum circuit against the ideal distribution and is sensitive to cumulative noise across deep circuits. Cross-entropy benchmarking excels at assessing devices aimed at sampling tasks and provided the direct route to Sycamore’s quantum-supremacy claim, but it is most informative for random-circuit sampling rather than structured algorithmic workloads.
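The linear variant reduces to a simple formula, F_XEB = 2^n * <P_ideal(x)> - 1, averaged over the sampled bitstrings. The sketch below evaluates it on synthetic data with a Porter-Thomas-like ideal distribution, typical of deep random circuits; the function linear_xeb and all inputs are illustrative assumptions, not part of any experiment.

```python
# Minimal sketch of linear cross-entropy benchmarking (XEB):
#   F_XEB = 2**n * mean(P_ideal(x_i)) - 1, over sampled bitstrings x_i.
# A perfect sampler scores ~1 under a Porter-Thomas distribution; fully
# depolarized (uniform) output scores ~0. All data here are synthetic.
import numpy as np

def linear_xeb(ideal_probs: np.ndarray, samples: np.ndarray, n_qubits: int) -> float:
    """ideal_probs[x] is the noiseless probability of bitstring x (as an int);
    samples holds the bitstrings actually drawn from the device."""
    return 2**n_qubits * ideal_probs[samples].mean() - 1.0

rng = np.random.default_rng(0)
n = 10
# Porter-Thomas-like ideal distribution, typical of deep random circuits.
ideal = rng.exponential(scale=1.0, size=2**n)
ideal /= ideal.sum()

perfect = rng.choice(2**n, size=50_000, p=ideal)   # noiseless sampler
noisy = rng.integers(0, 2**n, size=50_000)         # fully depolarized sampler

print(f"XEB (ideal sampler): {linear_xeb(ideal, perfect, n):.3f}")  # ~1
print(f"XEB (uniform noise): {linear_xeb(ideal, noisy, n):.3f}")    # ~0
```

The catch this makes visible: computing F_XEB requires the ideal probabilities, which must come from classical simulation, so the metric becomes harder to verify exactly where the devices become most interesting.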
Beyond these, gate-set tomography and process tomography give richer, though more resource-intensive, characterizations of error channels and have been advanced by several academic groups to diagnose systematic errors. Benchmarks are complemented by workload-specific tests that emulate variational algorithms or error-mitigation pipelines to show how hardware traits map to performance on intended applications.
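To show the kind of richer object tomography yields, here is a minimal sketch computing the Pauli transfer matrix of an analytically constructed single-qubit depolarizing channel. In real process or gate-set tomography this matrix is reconstructed from measured expectation values; the helper depolarize and the chosen error rate are illustrative assumptions.

```python
# Minimal sketch of one process-tomography quantity: the Pauli transfer
# matrix (PTM) of a single-qubit channel, R_ij = Tr[P_i E(P_j)] / 2.
# Here E is a depolarizing channel built analytically, standing in for a
# channel reconstructed from experimental data.
import numpy as np

I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
paulis = [I, X, Y, Z]

def depolarize(op: np.ndarray, p: float) -> np.ndarray:
    """Depolarizing channel, extended linearly to arbitrary operators."""
    return (1 - p) * op + p * np.trace(op) * I / 2

p = 0.1
ptm = np.array([[np.trace(Pi @ depolarize(Pj, p)).real / 2
                 for Pj in paulis] for Pi in paulis])
print(np.round(ptm, 3))  # diag(1, 0.9, 0.9, 0.9) for 10% depolarizing noise
```

A purely stochastic channel like this one gives a diagonal PTM; coherent errors show up as off-diagonal entries, which is precisely the diagnostic information randomized benchmarking averages away.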
Interpreting benchmarks in practice
Benchmarks do not exist in a vacuum; the limitations they measure have physical causes, and the numbers they produce carry scientific and social consequences. Measured limitations stem from decoherence, control noise, crosstalk, and the thermal-management demands of cryogenic platforms. The consequences are practical: funding priorities shift toward error mitigation, improved fabrication, or software optimization depending on whether errors are coherent or stochastic. Benchmarks also shape regional and institutional strategies: national labs and companies in different territories may emphasize superconducting circuits, trapped ions, or photonics based on local expertise, supply chains, and regulatory environments, which in turn influences workforce development and industrial ecosystems.
For policymakers and researchers, the most reliable assessment combines multiple benchmarks: gate-level fidelities and randomized benchmarking for control quality, quantum volume for overall capability, and cross-entropy or application-specific tests for end-to-end behavior. Interpreting these numbers requires understanding the measurement method and the target workload; otherwise improvements in a single metric can obscure unresolved practical barriers to deploying quantum algorithms.