Benchmarks & Methodology

Composition turns parity-shards into a 1.79× throughput win.

Two AI PCs serve faster than one at every concurrency level we tested.

Request a pilot →See the benchmarks

Testbed

SiliconLunar Lake (258V) · Panther Lake (358H)

ModelsLlama 3.1 8B · Gemma 4 E2B

Network802.11ax WiFi · 6 ms RTT

RuntimeOpenVINO 2026.1 · Windows 11 · Rust 1.95+ stable

Parity at 1 PC. Distributed wins at 2.

Llama 3.1 8B INT4, single-stream. Cascadia's v5_beam single-stage shard reaches 99.6% of openvino_genai's C++ runtime on the same hardware — parity. With async overlap pipelining stage compute and TCP transmission, two AI PCs together exceed mono throughput at 1.21× — distributed beats monolithic at single user, before any micro-batching or multi-user composition.

Explore the architecture →

Single-stream throughput · Llama 3.1 8B · INT4 · Panther Lake GPU

openvino_genai (C++)monolithic reference

24.54tok/s1.00×

openvino_genai (Python)monolithic compile_model loop

23.12tok/s0.94×

Cascadia v5_beam, 1-stageparity with C++ baseline

24.45tok/s1.00×

Cascadia fleet + async overlap*distributed, single-stream, TCP pipelined with compute

29.62tok/s1.21×

Why parity, then composition.

Monolithic INT4 is 4.1 GB; a 2-stage shard is 1.3–1.8 GB. Smaller working sets fit better in L3 — and crucially, leave idle compute time inside the per-token decode that micro-batching can fill with a second stream.

Parity shards composed with 2-stream micro-batching (1.80×) and K=3 speculative decoding (1.50×) reach 43.97 tok/s aggregate on two AI PCs, 1.79× the monolithic single-user baseline. The throughput story isn't shards beating mono — it's parity shards composing with techniques mono can't run.

Stage 1 reads the previous token's hidden state from the wire while stage 0 produces the next — TCP and compute pipeline instead of running in series. Two AI PCs serve one user faster than one AI PC can: 29.62 tok/s on Llama 3.1 8B at 1.21× the monolithic baseline. No micro-batching, no concurrent users — just overlapping the network with compute.

Same pattern. Different model.

The pattern isn't Llama-specific. Gemma 4 E2B (5.1B parameters, 35 layers, FP32) reproduces the same shape on the v2_beam export: 1-stage shard at 13.98 tok/s, 2-stage localhost at 91% of single-node, multi-node distributed at 10.40 tok/s.

1-stage KV-cachedsingle-node GPU baseline

13.98 tok/s

71 ms

Reference baseline for Gemma 4 E2B (FP32, v2_beam). ±0.35 across runs.

2-stage localhostGPU → GPU, same machine

12.78 tok/s

76 ms

91% of single-node; small in-process graph-dispatch cost.

2-stage multi-nodeGPU → GPU, over WiFi

10.40 tok/s

94 ms

+28% over the v1 export (8.12 → 10.40) on the same hardware after the rotary fix.

Where the 61 ms actually goes.

Network cost accounts for just 10% of per-token time.

Compute > network, on LAN.

Stage compute dominates at 69% of per-token time. TCP round-trip accounts for 10%. The remaining 21% is Rust plus dispatch overhead — the layer most amenable to optimization, and the one with the most headroom still on the table.

Faster silicon compounds throughput linearly. TCP overhead remains the smallest part of the equation.

One token · 2-stage distributed · WiFi

61ms

33%

36%

21%

10%

Stage 0 compute20 ms · coordinator-side (Panther Lake)

Stage 1 compute22 ms · worker-side (Lunar Lake)

Rust + dispatch13 ms · mask, serialization, scheduling

TCP round trip6 ms · WiFi 802.11ax

Filling pipeline bubbles with concurrent requests.

A 2-stage pipeline is 50% utilized on a single request — while stage 1 computes, stage 0 sits idle. Two InferRequests per shard, each with its own KV cache, interleave two users through the pipeline. No model changes, no memory management tricks: the coordinator alternates whose turn it is.

Llama 3.1 8BBalanced stages · 20 / 22 ms

1.80×

gain

Single-streamone user at a time

16.33tok/s

2-streamtwo users interleaved

29.34tok/s

Gemma 4 E2BAsymmetric stages · 26 / 42 ms

1.57×

gain

Single-streamone user at a time

10.40tok/s

2-streamtwo users interleaved

16.30tok/s

1.79×

Compose 2-stream micro-batching with K=3 speculative decoding and the two-node pipeline reaches 43.97 tok/s aggregate on Llama 3.1 8B — 1.79× the monolithic single-user baseline. Pipeline parallelism is a net throughput win when parity shards compose with techniques monolithic inference can't run.

Frequently asked questions

How do I choose the number of stages?

You don't — Cascadia resolves it automatically from the selected model and the memory available across the fleet. The default is the minimum number of stages that will fit the model across a distributed pipeline, which also lands at the throughput sweet spot: each extra TCP hop adds decode latency, and benchmarks show 4-stage splits introduce dispatch overhead that erodes the per-stage gain.

What happens if a node dies mid-generation?

In the current preview, a failed worker aborts the request — the coordinator doesn't re-route around it. Node health and automatic failover are on the roadmap. Until then, deployments should size fleet capacity with worker redundancy at the application layer. See the known limits for the full list.

Can I mix GPU and NPU nodes in a pipeline?

Not yet, cleanly. The Panther Lake NPU (50 TOPS) rejects OpenVINO's ReadValue ops, which back the stateful KV cache. Non-stateful NPU inference works but requires full-sequence recomputation each token, which nets slower than GPU despite the TOPS advantage. Heterogeneous GPU/NPU dispatch is tracked against Intel's OpenVINO stateful-NPU support.

What's the SDK surface beyond the CLI?

An OpenAI-compatible HTTP API over the coordinator ships first — point any existing /v1/chat/completions client at Cascadia and it works, no vendor SDK required. A Rust crate (cascadia) exposing the same primitives as the CLI (export, fleet, worker, serve) for build-pipeline and runtime use is on the roadmap.

How do I upgrade shards across the fleet?

Shards push live — no fleet stop required. The coordinator stages the new shard set across nodes, drains in-flight requests from the old pipeline, and cuts traffic over once every stage reports ready. Every node still has to run the same OpenVINO version, Cascadia version, and shard format; the coordinator won't negotiate mixed stacks, so runtime upgrades remain all-or-nothing.