Benchmarks & Methodology

Composition turns parity-shards into a 1.79× throughput win.

Two AI PCs serve faster than one at every concurrency level we tested.

Testbed
SiliconLunar Lake (258V) · Panther Lake (358H)
ModelsLlama 3.1 8B · Gemma 4 E2B
Network802.11ax WiFi · 6 ms RTT
RuntimeOpenVINO 2026.1 · Windows 11 · Rust 1.95+ stable

Parity at 1 PC. Distributed wins at 2.

Llama 3.1 8B INT4, single-stream. Cascadia's v5_beam single-stage shard reaches 99.6% of openvino_genai's C++ runtime on the same hardware — parity. With async overlap pipelining stage compute and TCP transmission, two AI PCs together exceed mono throughput at 1.21× — distributed beats monolithic at single user, before any micro-batching or multi-user composition.

Single-stream throughput · Llama 3.1 8B · INT4 · Panther Lake GPUINT4 symmetric, group 128 · GPU execution · 20-token standardized prompt set · Max bar = 29.62 tok/s. Single-stream only — multi-user composition gains shown below.
* Async overlap measured on 4096-token context · K=5 speculative decode.
openvino_genai (C++)monolithic reference
24.54tok/s1.00×
openvino_genai (Python)monolithic compile_model loop
23.12tok/s0.94×
Cascadia v5_beam, 1-stageparity with C++ baseline
24.45tok/s1.00×
Cascadia fleet + async overlap*distributed, single-stream, TCP pipelined with compute
29.62tok/s1.21×

Why parity, then composition.

Monolithic INT4 is 4.1 GB; a 2-stage shard is 1.3–1.8 GB. Smaller working sets fit better in L3 — and crucially, leave idle compute time inside the per-token decode that micro-batching can fill with a second stream.
Parity shards composed with 2-stream micro-batching (1.80×) and K=3 speculative decoding (1.50×) reach 43.97 tok/s aggregate on two AI PCs, 1.79× the monolithic single-user baseline. The throughput story isn't shards beating mono — it's parity shards composing with techniques mono can't run.
Stage 1 reads the previous token's hidden state from the wire while stage 0 produces the next — TCP and compute pipeline instead of running in series. Two AI PCs serve one user faster than one AI PC can: 29.62 tok/s on Llama 3.1 8B at 1.21× the monolithic baseline. No micro-batching, no concurrent users — just overlapping the network with compute.

Same pattern. Different model.

The pattern isn't Llama-specific. Gemma 4 E2B (5.1B parameters, 35 layers, FP32) reproduces the same shape on the v2_beam export: 1-stage shard at 13.98 tok/s, 2-stage localhost at 91% of single-node, multi-node distributed at 10.40 tok/s.

ConfigurationHardware · Single-node rows: HP OmniBook X 16 (Core Ultra X7 358H · Panther Lake · 32 GB DDR5), GPU, FP32. Multi-node pairs the OmniBook with an ASUS Zenbook S 14 (Core Ultra 7 258V · Lunar Lake · 32 GB LPDDR5X) over 802.11ax WiFi — Windows 11 · OpenVINO 2026.1.
Throughput
Decode
Notes
1-stage KV-cachedsingle-node GPU baseline
13.98 tok/s
71 ms
Reference baseline for Gemma 4 E2B (FP32, v2_beam). ±0.35 across runs.
2-stage localhostGPU → GPU, same machine
12.78 tok/s
76 ms
91% of single-node; small in-process graph-dispatch cost.
2-stage multi-nodeGPU → GPU, over WiFi
10.40 tok/s
94 ms
+28% over the v1 export (8.12 → 10.40) on the same hardware after the rotary fix.

Where the 61 ms actually goes.

Network cost accounts for just 10% of per-token time.

Compute > network, on LAN.

Stage compute dominates at 69% of per-token time. TCP round-trip accounts for 10%. The remaining 21% is Rust plus dispatch overhead — the layer most amenable to optimization, and the one with the most headroom still on the table.

Faster silicon compounds throughput linearly. TCP overhead remains the smallest part of the equation.

One token · 2-stage distributed · WiFi
61ms
33%
36%
21%
10%
Stage 0 compute20 ms · coordinator-side (Panther Lake)
Stage 1 compute22 ms · worker-side (Lunar Lake)
Rust + dispatch13 ms · mask, serialization, scheduling
TCP round trip6 ms · WiFi 802.11ax

Filling pipeline bubbles with concurrent requests.

A 2-stage pipeline is 50% utilized on a single request — while stage 1 computes, stage 0 sits idle. Two InferRequests per shard, each with its own KV cache, interleave two users through the pipeline. No model changes, no memory management tricks: the coordinator alternates whose turn it is.

Llama 3.1 8BTheoretical max ~2.0× · faster v5_beam single-stream baseline leaves less idle time for the second stream than the prior 14.51-baseline 2.03×.Balanced stages · 20 / 22 ms
1.80×
gain
Single-streamone user at a time
16.33tok/s
2-streamtwo users interleaved
29.34tok/s
Gemma 4 E2BMore imbalance = more idle to fill · theoretical max ~1.8×.Asymmetric stages · 26 / 42 ms
1.57×
gain
Single-streamone user at a time
10.40tok/s
2-streamtwo users interleaved
16.30tok/s
1.79×
Compose 2-stream micro-batching with K=3 speculative decoding and the two-node pipeline reaches 43.97 tok/s aggregate on Llama 3.1 8B — 1.79× the monolithic single-user baseline. Pipeline parallelism is a net throughput win when parity shards compose with techniques monolithic inference can't run.

Frequently asked questions

You don't — Cascadia resolves it automatically from the selected model and the memory available across the fleet. The default is the minimum number of stages that will fit the model across a distributed pipeline, which also lands at the throughput sweet spot: each extra TCP hop adds decode latency, and benchmarks show 4-stage splits introduce dispatch overhead that erodes the per-stage gain.
In the current preview, a failed worker aborts the request — the coordinator doesn't re-route around it. Node health and automatic failover are on the roadmap. Until then, deployments should size fleet capacity with worker redundancy at the application layer. See the known limits for the full list.
Not yet, cleanly. The Panther Lake NPU (50 TOPS) rejects OpenVINO's ReadValue ops, which back the stateful KV cache. Non-stateful NPU inference works but requires full-sequence recomputation each token, which nets slower than GPU despite the TOPS advantage. Heterogeneous GPU/NPU dispatch is tracked against Intel's OpenVINO stateful-NPU support.
An OpenAI-compatible HTTP API over the coordinator ships first — point any existing /v1/chat/completions client at Cascadia and it works, no vendor SDK required. A Rust crate (cascadia) exposing the same primitives as the CLI (export, fleet, worker, serve) for build-pipeline and runtime use is on the roadmap.
Shards push live — no fleet stop required. The coordinator stages the new shard set across nodes, drains in-flight requests from the old pipeline, and cuts traffic over once every stage reports ready. Every node still has to run the same OpenVINO version, Cascadia version, and shard format; the coordinator won't negotiate mixed stacks, so runtime upgrades remain all-or-nothing.