NVIDIA Rubin CPX moves prefill to a GDDR7-backed accelerator, disaggregating inference to cut HBM exposure, densify liquid-cooled racks, and improve perf/W and TCO for million-token contexts, as outlined in NVIDIA’s own developer guidance on long-context inference workloads (NVIDIA developer blog).
Why Rubin CPX represents a strategic shift in AI inference architecture
Model contexts grew, batches widened, and latency targets tightened. Those pressures exposed a split: prefill is compute-heavy and cost-sensitive; decode is bandwidth-bound and latency-critical. Rubin CPX formalizes that division of labor. It concentrates FLOPS and local, cost-efficient memory on prefill, then hands off to HBM-rich GPUs for token-by-token decode. The practical implication: operators can right-size racks around mixed memory profiles instead of overprovisioning HBM across the entire workload.
The strategy goes beyond marketing. NVIDIA is shipping rack-level systems that dedicate some trays to prefill and others to decode, making the split tangible in procurement and deployment. This reframes how memory, cooling, and fabric capacity are allocated in production, especially when context windows stretch into the million-token class.
Disaggregated inference explained: prefill vs. decode
Prefill constructs the full context—prompt ingestion, retrieval, and attention over long windows. It is dominated by matrix math, benefits from high compute throughput, and tolerates lower per-device memory bandwidth than the decode phase. Decode then generates tokens autoregressively. It constantly reads and updates KV caches and activations, making it highly sensitive to HBM bandwidth and end-to-end latency.
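The asymmetry is easy to see in a toy attention loop. The sketch below (plain NumPy, illustrative sizes, no causal masking) runs prefill as one large matmul pass that builds the KV cache, then decodes token by token, re-reading the whole cache at each step: the access pattern that makes decode bandwidth-bound.

```python
import numpy as np

rng = np.random.default_rng(0)
d, prompt_len, gen_len = 64, 512, 4  # toy sizes

Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
prompt = rng.standard_normal((prompt_len, d))

# Prefill: one compute-dense pass over the whole prompt. Projections and
# attention run as large matmuls, so throughput is FLOPS-limited.
K_cache = prompt @ Wk
V_cache = prompt @ Wv
Q = prompt @ Wq
scores = Q @ K_cache.T / np.sqrt(d)
probs = np.exp(scores - scores.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
x = (probs @ V_cache)[-1]  # state handed off to the decode phase

# Decode: one token per step; every step re-reads the entire KV cache,
# so bytes moved per FLOP is high and throughput is bandwidth-limited.
for _ in range(gen_len):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = np.vstack([K_cache, k])  # cache grows every step
    V_cache = np.vstack([V_cache, v])
    s = K_cache @ q / np.sqrt(d)       # touches the full cache again
    p = np.exp(s - s.max())
    p /= p.sum()
    x = p @ V_cache
```

Rubin CPX's split maps directly onto these two loops: the first lands on compute-dense GDDR7 silicon, the second on bandwidth-dense HBM silicon.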
Rubin CPX targets prefill. By maximizing compute density and pairing it with fast GDDR7, CPX handles the context build efficiently, then transfers state to HBM-equipped Rubin GPUs optimized for decode. NVIDIA explicitly frames the design around very long contexts and multimodal inputs such as long-form video and large codebases, with software pathways to route prefill there and return decode to bandwidth-dense parts (NVIDIA developer blog).
Inside Rubin CPX: architecture and packaging choices
Rubin CPX swaps HBM stacks for GDDR7 on a wide bus to raise FLOPS-per-dollar on prefill workloads. Public coverage points to a configuration with large GDDR7 capacity on a 512-bit interface, delivering local bandwidth on the order of terabytes per second, and compute throughput in the tens of petaFLOPS at NVFP4 precision. NVFP4 is NVIDIA’s 4-bit floating-point format tuned for inference; it trades precision for much higher throughput on kernels where accuracy is preserved through quantization-aware training or post-training calibration.
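As a rough illustration of how a blockwise 4-bit float format works, the sketch below quantizes values to E2M1 code points with a per-block scale. The block size and scale handling here are simplifications for illustration, not the NVFP4 specification.

```python
import numpy as np

# Representable magnitudes of an E2M1 (2-bit exponent, 1-bit mantissa) element.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_block(x, block=16):
    """Blockwise 4-bit quantization: a per-block scale maps values into the
    E2M1 range, then each value snaps to the nearest code point."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / E2M1[-1]
    scale[scale == 0] = 1.0           # avoid dividing an all-zero block by 0
    scaled = x / scale
    idx = np.abs(np.abs(scaled)[..., None] - E2M1).argmin(-1)  # nearest code
    return np.sign(scaled) * E2M1[idx], scale

def dequantize(q, scale):
    return q * scale

x = np.random.default_rng(1).standard_normal(64).astype(np.float32)
q, s = quantize_fp4_block(x)
err = np.abs(dequantize(q, s).ravel() - x).max()  # quantization error
```

The point of the format is visible in the code: each weight costs 4 bits plus an amortized per-block scale, and the precision lost to the coarse code points is what calibration or quantization-aware training must recover.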
Packaging follows from the memory choice. GDDR7 avoids 2.5D interposers and HBM stack attach, which simplifies assembly, reduces exposure to interposer yield, and eases thermal design. That, in turn, allows denser trays and more predictable board-level power profiles. NVIDIA also highlights media encode/decode blocks that keep long-form audio/video ingress on-device during prefill before the resulting context shifts to decode GPUs. The trade-off is explicit: CPX concedes peak bandwidth per GPU against HBM parts but wins on cost, yield exposure, and compute density aligned to prefill kernels.
The net for operators is a cleaner mapping between work and silicon: compute-dense, GDDR-based devices for context construction, and bandwidth-dense, HBM-based devices for latency-critical token emission.
System-level impact: new rack SKUs and liquid-cooled trays
NVIDIA’s rack design makes the separation concrete. The Vera Rubin NVL144 CPX configuration pairs 144 Rubin CPX GPUs dedicated to prefill with 144 HBM-equipped Rubin GPUs for decode, orchestrated alongside Vera CPUs. Coverage describes rack-level figures on the order of multiple exaFLOPS of NVFP4 compute, pooled fast memory around the hundred-terabyte scale, and aggregate memory bandwidth in the multi-petabytes-per-second class—capabilities that move million-token inference into a single-rack conversation (SemiAnalysis).
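The rack arithmetic can be sanity-checked with assumed per-device figures. The numbers below are illustrative placeholders consistent with the ranges in public coverage, not official specifications, and pooled CPU memory on the Vera side (not counted here) would push the fast-memory total higher.

```python
# Back-of-envelope NVL144 CPX rack math under ASSUMED per-device figures.
cpx_gpus, decode_gpus = 144, 144
cpx_pflops_nvfp4 = 30        # assumed NVFP4 throughput per CPX GPU (PFLOPS)
decode_pflops_nvfp4 = 25     # assumed NVFP4 throughput per decode GPU (PFLOPS)
cpx_mem_gb = 128             # assumed GDDR7 capacity per CPX GPU
decode_mem_gb = 288          # assumed HBM capacity per decode GPU
decode_bw_tbs = 13           # assumed HBM bandwidth per decode GPU (TB/s)

rack_exaflops = (cpx_gpus * cpx_pflops_nvfp4
                 + decode_gpus * decode_pflops_nvfp4) / 1000
rack_fast_mem_tb = (cpx_gpus * cpx_mem_gb + decode_gpus * decode_mem_gb) / 1000
rack_bw_pbs = decode_gpus * decode_bw_tbs / 1000  # HBM side only
```

Even with placeholder inputs, the totals land in the classes the coverage describes: single-digit exaFLOPS of NVFP4 compute, tens of terabytes of GPU-attached fast memory, and petabytes per second of aggregate HBM bandwidth on the decode side.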
To reach those densities, NVIDIA leans on liquid-cooled trays and standardized modules designed for the prefill role. Early system views emphasize that CPX is positioned as a rack-level building block rather than a standalone add-in card, aligning cooling loops, power delivery, and network fabric with the split topology. Prefill nodes can run at tighter thermal envelopes, while NVLink and the broader fabric are prioritized for decode traffic where bandwidth remains the limiter (ServeTheHome).
For buyers, the system story is the point: fewer HBM dollars where bandwidth doesn’t dominate, and a fabric plan that preserves latency on decode while scaling prefill elastically.
Operator impact: scheduling, SLOs, and utilization
Rubin CPX changes resource planning for long-context services. Prefill-heavy jobs can be steered to CPX trays, with decode capacity sized to protect tail latency. The practical question becomes ratio-setting: how much prefill relative to decode, given context length distributions, batch sizes, and service-level objectives. The expected gains:
- Higher FLOPS-per-dollar for context builds by shifting to GDDR7 and simpler packaging
- Lower dependence on HBM supply, reducing memory ASP exposure at rack scale
- Denser prefill nodes under liquid cooling, with fabrics prioritized for decode
Two cautions follow. First, underprovisioning HBM decode nodes will inflate tail latency even if prefill is abundant. Second, long-context prompts drive steady KV-cache growth, so schedulers must avoid backplane contention by batching and routing with cache movement in mind. Teams should expect to expose CPX-aware placement controls in their inference stacks and to track utilization and token-latency histograms to tune the prefill/decode ratio over time.
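Ratio-setting can be framed as a simple balance calculation. The sketch below, with hypothetical per-node throughputs, sizes prefill nodes per decode node so both phases clear the same request rate; real sizing must also model queueing, batching, and tail SLOs.

```python
# Illustrative ratio-setting sketch; all throughput figures are hypothetical.
def prefill_decode_ratio(mean_ctx_tokens, mean_gen_tokens,
                         prefill_tps_per_node, decode_tps_per_node):
    """Prefill nodes needed per decode node so both phases sustain the same
    request rate (mean-value balance; ignores queueing and tail latency)."""
    prefill_work = mean_ctx_tokens / prefill_tps_per_node  # node-sec/request
    decode_work = mean_gen_tokens / decode_tps_per_node    # node-sec/request
    return prefill_work / decode_work

# A long-context service: 200k-token prompts, 1k-token answers.
r = prefill_decode_ratio(mean_ctx_tokens=200_000, mean_gen_tokens=1_000,
                         prefill_tps_per_node=400_000, decode_tps_per_node=4_000)
# With these inputs the balance point is 2 prefill nodes per decode node.
```

The same calculation run against observed context-length histograms, rather than means, is what the telemetry-driven tuning described above amounts to in practice.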
Economics: perf/W, cost structure, and where savings accrue
The economic thesis is straightforward. GDDR7 reduces memory $/GB versus HBM and sidesteps interposers and HBM stack attach. That lowers package cost, increases assembly throughput, and concentrates silicon budget on compute and register files—the elements that move prefill math. For long-context inference, the result is higher perf/W and FLOPS-per-dollar on the phase where most cycles are spent, with decode preserved on HBM for latency and bandwidth.
In practice, savings show up in three places: the memory bill of materials (more GDDR7, fewer HBM stacks purchased per unit of prefill work), cooling (liquid loops right-sized to lower prefill node heat flux), and network topology (fabrics aligned to KV-cache and activation movement rather than brute-force all-to-all). NVIDIA’s own framing and third-party system analyses converge on the same point: CPX complements, rather than replaces, bandwidth-first decode parts, and that complement unlocks rack-level efficiency gains without sacrificing end-to-end latency (NVIDIA developer blog).
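The memory line item can be made concrete with loudly hypothetical prices; real HBM and GDDR7 pricing is contract-specific and not public, so the figures below only illustrate the shape of the delta, not its size.

```python
# Hypothetical $/GB figures and capacities; assumptions, not quotes.
hbm_usd_per_gb, gddr7_usd_per_gb = 12.0, 3.0
cpx_mem_gb, decode_mem_gb = 128, 288

cpx_mem_bom = cpx_mem_gb * gddr7_usd_per_gb    # memory BOM per CPX device
hbm_mem_bom = decode_mem_gb * hbm_usd_per_gb   # memory BOM per decode device

# Prefill capacity added via CPX avoids the HBM line item entirely, which is
# where "fewer HBM dollars where bandwidth doesn't dominate" shows up: each
# incremental prefill device carries a far smaller memory bill.
savings_ratio = hbm_mem_bom / cpx_mem_bom
```

The ratio itself is an artifact of the assumed prices; the structural point is that prefill scaling stops being coupled to the HBM cost curve at all.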
Supply chain and capacity considerations
HBM availability and pricing have constrained many AI build-outs. CPX’s GDDR7 orientation opens a parallel supply channel through commodity DRAM ecosystems and mainstream OSAT flows. That spreads risk away from HBM stack yields and interposer capacity, letting buyers scale prefill capacity even when HBM remains the bottleneck. The implication is a more elastic capacity curve: prefill can expand with demand and budget, while decode stays gated by HBM and fabric availability (SemiAnalysis).
Coordination still matters. If workloads skew shorter, CPX utilization will dip; if contexts run longer than planned, decode pressure rises. These are planning problems, not architectural flaws, and they push operators toward finer-grained telemetry, adaptive batching, and explicit prefill routing in their schedulers.
Adoption outlook: where Rubin CPX lands first
Early deployments will target products where context length is a differentiator: code assistants over large repositories, retrieval-augmented systems with deep histories, and multimodal agents that span hour-scale audio/video. As pilots demonstrate utilization and scheduler maturity, hyperscalers will tune their CPX-to-decode ratios by observed context distributions and SLOs. The near-term draw is obvious: lower HBM exposure during prefill, better perf/W where the cycles accrue, and a rack SKU that mirrors how long-context inference actually runs at scale (SemiAnalysis).
Looking ahead, expect CPX-aware scheduling features to surface in mainstream inference stacks and for cloud providers to expose instance types or cluster profiles that call out prefill capacity explicitly. As second-wave racks arrive and operators gain confidence in utilization models, prefill-optimized nodes should take a growing share of incremental inference capex for long-context services. If HBM supply tightness persists beyond initial rollouts, CPX-like designs will likely propagate across vendors, cementing a two-tier topology where compute-dense GDDR nodes and bandwidth-dense HBM nodes scale semi-independently.