OpenAI Custom Chip Slashes LLM Costs by 50%

OpenAI Jalapeño: Broadcom-Designed ASIC Halves LLM Inference Costs

On June 24, 2026, OpenAI and Broadcom unveiled “Jalapeño,” an artificial intelligence processor designed strictly for large language model (LLM) inference. The chip marks a shift in the economics of generative AI, moving the industry away from general-purpose graphics processing units (GPUs). By tailoring the silicon to the mathematical structure of modern transformers, the partnership aims to avoid the memory and compute inefficiencies that drive up the cost of running frontier models. With engineering samples already operating at target power levels in San Francisco testbeds, the chip is now being prepared for large-scale data center deployment.

Key Takeaways

  • Jalapeño, OpenAI’s custom chip, is built on TSMC’s 3-nanometer process and tunes its memory access patterns for autoregressive decoding, avoiding general-purpose GPU overhead.
  • Early engineering samples running the GPT-5.3-Codex-Spark model indicate a 50% reduction in inference cost compared to standard graphics hardware.
  • Broadcom co-designed the processor’s specialized Ethernet interconnect and high-bandwidth memory interfaces to resolve the standard latency bottlenecks of scale-out clusters.
  • The platform shifts OpenAI’s hardware strategy toward vertical integration, reducing its long-term financial exposure to Nvidia’s margin premium.

Architecture & Packaging

Built on TSMC’s 3-nanometer (N3) process node, Jalapeño moved from initial architectural design to manufacturing tape-out in only nine months. This rapid timeline indicates how far automated design tooling has compressed standard semiconductor development cycles. The chip pairs a large compute chiplet with flanking high-bandwidth memory (HBM) stacks, and the compute die fills a significant portion of the standard 858 mm² reticle limit. Rather than using a monolithic die, the chiplet architecture splits processing and input/output functions across separate dies to maximize silicon yields on advanced lithography nodes.

Unlike typical GPUs that dedicate substantial die area to double-precision floating-point (FP64) execution units and rasterization pipelines, Jalapeño strips out all legacy graphics logic. Its silicon layout focuses entirely on low-precision matrix operations, specifically optimized for FP8, FP4, and narrow integer formats. This design choice saves critical die area. The saved space is reallocated to large, on-chip static random-access memory (SRAM) cache banks. This large local cache allows model weights and key-value tensors to remain physically closer to the arithmetic units, reducing the energy penalty of off-chip data transfers.

Memory bandwidth is the primary physical limit on transformer computation, and Jalapeño addresses it by integrating 144 gigabytes of high-bandwidth memory (HBM3E) on a silicon interposer using TSMC’s Chip-on-Wafer-on-Substrate (CoWoS) packaging. This 2.5D packaging format lets the compute die and HBM stacks communicate across an ultra-dense wiring layer. The physical distance between the processor and memory drops to fractions of a millimeter, reducing signal degradation and transmission power. The memory interface can therefore be tuned specifically for the rapid data movement that generative workloads require.

Broadcom integrated its proprietary PCIe Gen 6 interfaces and custom low-latency Ethernet switching tiles directly into the chip’s input/output die. Instead of relying on external network interface cards, Jalapeño nodes connect directly to a high-speed backplane. This direct integration eliminates multiple protocol conversion steps, shaving critical nanoseconds off inter-node communications. For multi-billion parameter models partitioned across multiple chips, this localized interconnect architecture ensures that communication latencies do not stall the computation pipeline.

The design team also implemented custom, programmable sequence decoders. These dedicated hardware units manage the token-generation cycles of generative models without requiring constant instruction updates from a host CPU. By scheduling memory transactions directly in hardware, the chip avoids the control-flow overhead that typically degrades GPU efficiency during low-batch-size inference workloads.


Perf/W & Benchmarks

Broadcom CEO Hock Tan confirmed that early laboratory runs of the GPT-5.3-Codex-Spark model on Jalapeño hit the target power levels while matching the performance thresholds of Nvidia’s Blackwell architecture. This benchmarking data represents a major milestone for custom application-specific integrated circuits (ASICs). While general-purpose processors struggle with the memory-bound nature of autoregressive generation, Jalapeño gains its efficiency by matching its on-chip storage and data paths to the behavior of the model’s kernels. Under standard operation, the chip limits the large power spikes associated with constant DRAM activation.

The core economic concept in LLM serving is “tokenomics”: the cost and power required to generate a single token. Autoregressive decoding requires the model to read the entire active context and generate one token at a time, making it an intensely memory-bound operation. Jalapeño implements hardware-level key-value (KV) cache compression and dynamic memory tiling. This ensures that the chip does not reload unchanged weights from the HBM stacks for every sequential token, a common inefficiency in generic hardware.

To understand how these engineering decisions translate into raw specifications, we can compare the physical and electrical configurations of the first-generation Jalapeño processor against dominant market alternatives in mid-2026:

| Specification / Metric | OpenAI Jalapeño (ASIC) | Nvidia Blackwell B200 (GPU) | AWS Trainium3 (ASIC) |
| --- | --- | --- | --- |
| Process Node | TSMC 3nm (N3) | TSMC 4NP (Custom 5nm) | TSMC 3nm (N3) |
| Memory Capacity | 144GB HBM3E | 192GB HBM3E | 144GB HBM3E |
| Memory Bandwidth | 5.5 TB/s (Target) | 8.0 TB/s | 4.9 TB/s |
| Primary Workload | LLM Inference | General AI Training/Inference | AI Training & Inference |
| Interconnect Fabric | Custom Broadcom Ethernet | NVLink 5 (1.8 TB/s) | Custom Trainium Fabric |
| Inference Cost Delta | -50% vs GPU baseline | Baseline (100%) | -40% vs GPU baseline |
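
One way to read the memory rows of this comparison is to ask how long each chip needs to stream its entire HBM once, and how many decode steps per second that bandwidth could sustain at best. The Python sketch below uses only the capacity and bandwidth figures from the table; the 70 GB read per decode step is a hypothetical workload chosen for illustration, and the function names are not from any vendor tool. It gives a bandwidth ceiling only and ignores compute, caching, and batching.

```python
# Memory figures copied from the comparison table: (capacity in GB, bandwidth in TB/s).
CHIPS = {
    "OpenAI Jalapeño": (144, 5.5),  # bandwidth is the stated target, not a measurement
    "Nvidia B200": (192, 8.0),
    "AWS Trainium3": (144, 4.9),
}


def sweep_time_ms(capacity_gb: float, bandwidth_tbs: float) -> float:
    """Time to read the whole HBM once, in milliseconds."""
    return capacity_gb / (bandwidth_tbs * 1000) * 1000


def max_decode_steps_per_s(gb_read_per_step: float, bandwidth_tbs: float) -> float:
    """Bandwidth-only upper bound on decode steps per second."""
    return bandwidth_tbs * 1000 / gb_read_per_step


# Hypothetical workload: each decode step streams 70 GB of weights and KV cache.
ASSUMED_GB_PER_STEP = 70.0

for name, (capacity, bandwidth) in CHIPS.items():
    sweep = sweep_time_ms(capacity, bandwidth)
    ceiling = max_decode_steps_per_s(ASSUMED_GB_PER_STEP, bandwidth)
    print(f"{name}: full sweep {sweep:.1f} ms, at most {ceiling:.0f} steps/s")
```

On these numbers Jalapeño needs about 26 ms to sweep its 144 GB, against 24 ms for the B200's 192 GB, which suggests the claimed cost advantage would have to come from price and power rather than from raw bandwidth.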

The absolute performance gains become visible during multi-user serving scenarios. When scaling concurrent user queries, typical GPUs experience sharp latency increases once the KV cache exceeds local SRAM limits. Because the OpenAI custom chip was designed with ChatGPT’s specific query distributions in mind, its memory controllers partition cache dynamically. This memory partitioning keeps execution queues full, preventing the processor idle states that waste power in general-purpose compute clusters.
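
To make the concurrency pressure concrete, the following Python sketch estimates how much KV cache one token occupies and how many full-length sequences fit in the memory left after the weights. Only the 144 GB HBM figure comes from the article; the model shape (80 layers, 8 KV heads, head dimension 128), the FP8 cache, the 70 GB of weights, and the 32k context are all assumptions made for illustration, and the function names are hypothetical.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: float) -> float:
    """Bytes of KV cache one token occupies: keys plus values across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_concurrent_sequences(hbm_gb: float, weights_gb: float,
                             per_token_bytes: float, context_tokens: int) -> int:
    """Number of full-length sequences that fit in the HBM left after the weights."""
    free_bytes = (hbm_gb - weights_gb) * 1e9
    return int(free_bytes // (per_token_bytes * context_tokens))


# Assumed model shape, not from the article: 80 layers, 8 KV heads, head dim 128, FP8 cache.
per_token = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                                     bytes_per_value=1)

# 144 GB is the article's HBM capacity; 70 GB of weights and a 32k context are assumptions.
fits = max_concurrent_sequences(144, 70, per_token, 32_768)
print(f"{per_token} bytes per token, {fits} concurrent 32k-token sequences")
```

Under these assumptions each token costs 163,840 bytes of cache and only 13 full-length sequences fit on one chip, which is why cache compression and dynamic partitioning matter for multi-user serving.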

In real-world testing, these hardware adaptations translate to a 50% drop in cost per output token. Performance per watt improves because the silicon does not spend power clocking unused compute blocks. Other players have tried various software workarounds to reach similar efficiency, but hardware-level optimization of the memory path remains the more effective approach. The strategic deployment of such specialized memory layouts mirrors other major hardware partnerships, such as the high-bandwidth integrations explored in the Micron Anthropic AI memory partnership.

By implementing direct hardware support for low-precision data formats, the OpenAI custom chip maintains model accuracy while scaling throughput. For example, its custom matrix math execution units run mixed-precision FP8 and FP4 computations without requiring inline casting. This hardware-native quantization eliminates compiler overhead, allowing OpenAI’s serving software to feed input sequences directly to the execution pipeline without conversion delays.
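
The idea behind low-precision formats can be illustrated in a few lines of Python. The sketch below uses symmetric 4-bit integer quantization with one shared scale per block, which stands in for the chip's FP4 and FP8 formats purely for illustration; it is not Jalapeño's actual number format, and the function names are hypothetical.

```python
def quantize_int4(values: list[float]) -> tuple[list[int], float]:
    """Symmetric 4-bit quantization: map floats to integers in [-7, 7] with one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int4(codes: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


block = [0.70, -0.30, 0.12, 0.00]
codes, scale = quantize_int4(block)
restored = dequantize_int4(codes, scale)
print(codes, [round(x, 2) for x in restored])
```

Each value now needs 4 bits plus a share of the block's scale, and the reconstruction error stays within half a quantization step. Doing this conversion in the math units themselves, as the paragraph above describes, is what removes the separate casting pass.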


Yield, Cost, and Capacity

The strategic partnership between OpenAI and Broadcom targets the deployment of up to 10 gigawatts of custom AI accelerators over the next five years. This immense power footprint highlights the scale of the capital expenditures required to compete in the frontier model space. However, physical infrastructure cost is determined by wafer starts and packaging yields. The OpenAI custom chip must navigate the tight capacity limits of TSMC’s advanced packaging lines to achieve meaningful deployment volume.

For a chip of Jalapeño’s physical scale, wafer yields are the primary cost driver. Fabricating on TSMC’s 3-nanometer line carries an estimated cost of $20,000 per wafer start. Assuming a die size of approximately 750 mm², a single 300mm wafer yields roughly 80 potential chips. If we assume a mature defect density rate of 0.1 defects per square centimeter, the physical yield of fully functional dies sits near 65%. Any packaging defect introduced during the subsequent CoWoS step further degrades this yield, driving up the effective cost of each completed processor assembly.
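
The arithmetic in this paragraph can be checked against two textbook approximations: a gross-dies-per-wafer estimate with an edge-loss correction, and a Poisson defect model. The Python sketch below is a rough check, not TSMC's or Broadcom's actual yield model; the function names are illustrative, and the inputs (750 mm² die, 0.1 defects per cm², $20,000 per wafer) are the article's own estimates. Under these simple formulas the inputs give about 69 gross dies and a yield near 47%, somewhat below the 80 dies and 65% quoted above, so the quoted figures presumably assume a more forgiving model or credit for the block-level redundancy described later in this section.

```python
import math


def gross_dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 300.0) -> int:
    """Common approximation: wafer area / die area, minus an edge-loss term."""
    radius = wafer_diameter_mm / 2
    area_term = math.pi * radius ** 2 / die_area_mm2
    edge_term = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(area_term - edge_term)


def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Poisson model: probability that a die contains zero killer defects."""
    return math.exp(-(die_area_mm2 / 100.0) * defects_per_cm2)


def cost_per_good_die(wafer_cost: float, gross_dies: int, die_yield: float) -> float:
    """Wafer cost spread over the dies that actually work."""
    return wafer_cost / (gross_dies * die_yield)


dies = gross_dies_per_wafer(750)
yld = poisson_yield(750, 0.1)
print(f"textbook estimate: {dies} dies, {yld:.0%} yield, "
      f"${cost_per_good_die(20_000, dies, yld):.0f} per good die")
print(f"article figures:   80 dies, 65% yield, "
      f"${cost_per_good_die(20_000, 80, 0.65):.0f} per good die")
```

With the article's own figures the wafer cost per good die comes to roughly $385; with the textbook estimates it is closer to $614. Both exclude HBM, CoWoS packaging, and test.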

Despite these manufacturing risks, the financial incentive to deploy this custom hardware is clear. Nvidia’s gross margins on its data center hardware routinely exceed 75%, reflecting its near-monopoly pricing power. By deploying the OpenAI custom chip, OpenAI bypasses this high premium. The estimated manufacturing cost of a completed Jalapeño server node is less than $12,000, whereas an equivalent Nvidia-based system commands retail prices exceeding $38,000. For an organization serving hundreds of millions of active users daily, this capital expenditure reduction directly alters the balance sheet.

How does OpenAI plan to scale this capacity? The company will rely on Microsoft and other cloud hosting partners to integrate the processors into their physical infrastructure. Under this model, OpenAI funds the chip design and silicon fabrication costs, while Microsoft provides the physical real estate, power distribution, and server racks. This division of labor allows OpenAI to scale its hardware footprint without taking on the massive debt of building raw utility substations from scratch.

Furthermore, thermal management plays a critical role in data center operating costs. While high-power GPUs require massive liquid-to-air heat exchangers, the highly optimized power draw of this OpenAI custom chip reduces the cooling infrastructure burden. Implementing targeted thermal solutions is a major focus across modern AI clusters, as discussed in our analysis of Nvidia’s liquid-cooling developments for high-density servers. By keeping the power envelope of individual inference nodes under 450 watts, OpenAI can utilize conventional air-cooling systems in existing facilities, avoiding the costly retrofitting required for higher-power hardware.
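
Combining the 450-watt node envelope with the 10-gigawatt deployment target stated earlier in this section gives a sense of the unit volumes involved. The Python sketch below is back-of-the-envelope arithmetic only: it treats the full 10 GW as available to nodes, and the power usage effectiveness (PUE) value of 1.3 is an assumption, not a figure from the article.

```python
def nodes_supported(total_power_w: float, node_power_w: float, pue: float = 1.0) -> int:
    """Nodes that fit in a power budget; pue scales up each node's draw
    to account for cooling and power-distribution overhead."""
    return int(total_power_w / (node_power_w * pue))


TARGET_POWER_W = 10e9   # 10 GW deployment target from the article
NODE_POWER_W = 450      # upper bound on node power from the article

ideal = nodes_supported(TARGET_POWER_W, NODE_POWER_W)
with_overhead = nodes_supported(TARGET_POWER_W, NODE_POWER_W, pue=1.3)  # assumed PUE
print(f"{ideal:,} nodes with no overhead, {with_overhead:,} at PUE 1.3")
```

Because 450 watts is an upper bound on node power, these counts (about 22 million and 17 million) are lower bounds on how many nodes the budget could power under the stated assumptions.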

To maximize packaging yields, Broadcom’s engineering team implemented a redundant core architecture. If a microscopic particle damages a local matrix-multiply block during fabrication, the chip’s internal firmware can permanently disable that block and reroute the compute threads to adjacent units. This fault-tolerant layout dramatically improves the usable yield per wafer, lowering the amortized cost per chip and shielding OpenAI from the steep pricing fluctuations of the merchant silicon market.
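
The effect of spare blocks on yield can be sketched with a binomial model. The Python example below assumes, purely for illustration, that the 750 mm² die estimated earlier in this section consists of 64 equal, independent blocks and that the chip ships if at least 60 of them work. Neither the block count nor the spare count comes from the article, and a real die also contains logic that cannot be made redundant.

```python
import math


def block_yield(block_area_mm2: float, defects_per_cm2: float) -> float:
    """Poisson probability that a single block has no killer defect."""
    return math.exp(-(block_area_mm2 / 100.0) * defects_per_cm2)


def yield_with_spares(total_blocks: int, required_blocks: int, p_good: float) -> float:
    """Probability that at least required_blocks of total_blocks are defect-free,
    assuming blocks fail independently (binomial model)."""
    return sum(
        math.comb(total_blocks, k) * p_good ** k * (1 - p_good) ** (total_blocks - k)
        for k in range(required_blocks, total_blocks + 1)
    )


# Hypothetical layout: 64 equal blocks on a 750 mm^2 die, 0.1 defects per cm^2.
p = block_yield(750 / 64, 0.1)
no_spares = yield_with_spares(64, 64, p)    # every block must work
four_spares = yield_with_spares(64, 60, p)  # up to four blocks may be disabled
print(f"no spares: {no_spares:.1%}, four spares: {four_spares:.1%}")
```

In this idealized model, tolerating four bad blocks lifts yield from about 47% to above 99%. Real gains would be smaller, but the sketch shows why redundancy is attractive for near-reticle-size dies.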


Supply Chain Dynamics

Electronics manufacturing services giant Celestica has been selected as the exclusive system integrator to build the server chassis and rack systems for Jalapeño. This supply chain agreement shifts the assembly and validation workload away from traditional tier-one server vendors. The delivery pipeline for this OpenAI custom chip relies on a delicate network of international suppliers, making it highly sensitive to manufacturing disruptions and material shortages.

The silicon supply chain is highly concentrated. After OpenAI’s engineering teams finalize the layout, the designs are handed to Broadcom, which acts as the physical design integrator. Broadcom utilizes its extensive IP library, packaging expertise, and established wafer allocation at TSMC to guide the chip through the manufacturing tape-out process. Once TSMC fabricates the 3nm wafers, they must undergo advanced CoWoS packaging at TSMC’s specialized backend facilities in Taiwan. Any bottleneck at these packaging facilities directly delays the physical availability of the processors.

The physical movement of the OpenAI custom chip through the global supply chain follows a precise sequence:

  1. Design & Verification: OpenAI uses in-house models to optimize chip layout.
  2. IP Integration: Broadcom integrates physical layer IP and networking tiles.
  3. Lithography: TSMC fabricates the compute dies on N3 wafers.
  4. Advanced Packaging: TSMC performs CoWoS 2.5D integration with HBM3E memory.
  5. System Assembly: Celestica builds the custom server chassis.
  6. Deployment: Microsoft Azure data centers host the operational racks.

This multi-step supply chain operates in a highly competitive market for custom silicon. OpenAI is not the only software company building proprietary chips. Google has scaled its TPU infrastructure, Microsoft continues to deploy its Maia series, and Meta is rolling out its MTIA processors to reduce reliance on third-party silicon. This collective rush toward ASICs has placed immense pressure on TSMC’s packaging capacity, forcing chip designers to bid aggressively for limited production slots.

To bypass the hardware constraints, software optimization has become a critical battleground. Compilers must translate high-level Python code into optimized machine instructions that match the exact physical paths of the new chip. If the compiler cannot schedule memory transfers efficiently, the custom silicon’s hardware advantages are lost. OpenAI’s hardware team has designed the OpenAI custom chip to natively support Triton, an open-source programming language that simplifies the writing of highly parallel code. This software-hardware co-design mirrors other industry efforts to build alternative software ecosystems, such as those analyzed in our report on Qualcomm and Modular’s joint software acquisitions.


Forward Vector

Initial field deployments of Jalapeño are scheduled to go live in Microsoft Azure data centers by the end of 2026. This deployment schedule leaves a narrow window for OpenAI’s software teams to finalize the compiler stacks and validate the hardware under real-world traffic patterns. The immediate commercial viability of this OpenAI custom chip depends on how smoothly these initial systems handle the live inference loads of millions of active ChatGPT users.

Over the next 6 to 18 months, several critical checkpoints will determine the success of this silicon transition:

  • Q4 2026: Delivery of the first production-run silicon from TSMC to Celestica for system-level integration.
  • Q1 2027: First phase of live traffic offloading, with Jalapeño clusters handling a target of 15% of standard API queries.
  • Q2 2027: Release of detailed benchmark reports comparing Jalapeño’s performance on non-transformer architectures.
  • Q3 2027: Tape-out of the second-generation “Titan” chip on TSMC’s advanced A16 (1.6nm) process node.

The primary technical risk during this rollout is the rapid pace of model architecture evolution. While Jalapeño was designed to be a flexible device capable of addressing future LLM innovations, radical departures from the standard transformer attention mechanism could render some hardwired silicon paths obsolete. If the research community shifts entirely toward alternative architectures like state-space models or liquid neural networks, the chip’s specialized matrix blocks may operate at reduced efficiency.

Strategically, the OpenAI custom chip alters OpenAI’s relationship with its primary hardware suppliers. While the company will continue to purchase Nvidia GPUs for its massive, multi-month training runs, it can now use Jalapeño as a powerful bargaining chip. By demonstrating a viable, high-volume alternative for inference, OpenAI can negotiate lower volume pricing for Nvidia’s Blackwell and Next-Gen platforms. This multi-vendor approach, combined with its usage of AMD Instinct GPUs and AWS Trainium clusters, establishes a highly diversified computing foundation.

Ultimately, the launch of Jalapeño demonstrates that software-hardware co-design is no longer optional for frontier AI labs. The ability to modify silicon registers based on model behavior provides a physical efficiency advantage that general-purpose processors cannot match. If the production yields hold steady and the compiler software matures, this custom processor will serve as the financial engine that makes high-throughput, agentic AI economically viable at a global scale.


Frequently Asked Questions

What is the OpenAI custom chip, and why did OpenAI build it?

The OpenAI custom chip, codenamed Jalapeño, is a specialized application-specific integrated circuit (ASIC) co-designed with Broadcom to run large language model inference. OpenAI developed this proprietary hardware to bypass the high cost and memory bottlenecks associated with running ChatGPT and its APIs on general-purpose Nvidia graphics processors.

How does Jalapeño achieve a 50% cost reduction compared to GPUs?

Jalapeño cuts operational costs by stripping away the legacy graphics and high-precision math pipelines found in standard GPUs, reallocating that silicon area to massive local SRAM caches and custom high-bandwidth memory interfaces. This hardware specialization optimizes the memory-bound “autoregressive decoding” phase of LLM generation, reducing power-wasting DRAM transfers.

When will the OpenAI custom chip be deployed in active data centers?

OpenAI and Broadcom are targeting the initial physical deployment of the Jalapeño processor in partner data centers, primarily Microsoft Azure, by the end of 2026. The server systems are being assembled by Celestica, with a second-generation chip codenamed Titan already planned on TSMC’s A16 (1.6nm) process node.

