Composition turns parity-shards into a 1.79× throughput win.
Two AI PCs serve faster than one at every concurrency level we tested.
Parity at 1 PC. Distributed wins at 2.
Llama 3.1 8B INT4, single-stream. Cascadia's v5_beam single-stage shard reaches 99.6% of openvino_genai's C++ runtime on the same hardware — parity. With async overlap pipelining stage compute and TCP transmission, two AI PCs together exceed mono throughput at 1.21× — distributed beats monolithic at single user, before any micro-batching or multi-user composition.
* Async overlap measured on 4096-token context · K=5 speculative decode.
Why parity, then composition.
Same pattern. Different model.
The pattern isn't Llama-specific. Gemma 4 E2B (5.1B parameters, 35 layers, FP32) reproduces the same shape on the v2_beam export: 1-stage shard at 13.98 tok/s, 2-stage localhost at 91% of single-node, multi-node distributed at 10.40 tok/s.
Where the 61 ms actually goes.
Network cost accounts for just 10% of per-token time.
Compute > network, on LAN.
Stage compute dominates at 69% of per-token time. TCP round-trip accounts for 10%. The remaining 21% is Rust plus dispatch overhead — the layer most amenable to optimization, and the one with the most headroom still on the table.
Faster silicon compounds throughput linearly. TCP overhead remains the smallest part of the equation.
Filling pipeline bubbles with concurrent requests.
A 2-stage pipeline is 50% utilized on a single request — while stage 1 computes, stage 0 sits idle. Two InferRequests per shard, each with its own KV cache, interleave two users through the pipeline. No model changes, no memory management tricks: the coordinator alternates whose turn it is.
Frequently asked questions
/v1/chat/completions client at Cascadia and it works, no vendor SDK required. A Rust crate (cascadia) exposing the same primitives as the CLI (export, fleet, worker, serve) for build-pipeline and runtime use is on the roadmap.
