Runtime Architecture

A distributed inference runtime for Intel AI PC fleets.

Cascadia exports any HuggingFace transformer into self-contained OpenVINO IR shards, distributes them across your fleet, and coordinates inference through a raw TCP pipeline. Four components, one protocol.

Ships with support for
Llama
1B – 70B
Gemma
2B – 27B
Mistral
7B · 8×7B
Qwen
1.5B – 72B
Phi
3.8B – 14B
DeepSeek
7B – 67B
Kimi
K2 · 1T
+ HuggingFace transformers

Compile once. Distribute to the fleet. Serve on-prem.

Cascadia exports any HuggingFace transformer into self-contained OpenVINO IR shards, one per pipeline stage, quantized INT4. Deploy a shard to each Intel AI PC in your fleet; a coordinator chains them over your LAN.

Dig into the benchmarks

Model-agnostic export.

torch.jit.traceOpenVINO IRINT4 NNCF

A model-agnostic structure mapping covers Llama, Mistral, Qwen, Phi, Gemma, and DeepSeek families. Decoder layers load from safetensors; attention is rewritten with precomputed rotary embeddings for clean tracing.

Stateful pipeline runtime.

TCPTCP_NODELAY6.5 ms RTT

Each shard compiles with ReadValue / Assign ops backing its own KV cache. The coordinator drives the autoregressive loop; activations flow over TCP at 16 KB per hop with a 20-byte header.

Multi-tenant scheduling.

Per-request statestall-free decode

Independent InferRequests carry per-user KV state inside the same compiled graph, interleaving decode steps across stages. Throughput scales 1.38–1.66× per node without hardware changes.

Four components, one pipeline.

The system is deliberately simple. The protocol lives in the export format, not the runtime. Build once, run anywhere Intel silicon is available.

Export pipeline

Converts a HuggingFace transformer to INT4 OpenVINO IR shards. Runs once on your build machine.

HF → IR · NNCF · INT4

Workers

One per AI PC in the fleet. Loads its assigned shard and exposes a TCP endpoint.

TCP · 16 KB per hop

Shards

Self-contained compiled graphs, one per pipeline stage. Include stateful KV cache via ReadValue / Assign.

OpenVINO IR · stateful

Coordinator

Drives the autoregressive loop. Tokenizes input, runs stage 0, chains hidden states through the fleet.

Runs on one node · stateless relay

From HuggingFace handle to compiled shard, in four steps.

The per-stage export is the key technical contribution. Tracing standard transformer attention defeats PyTorch's export toolchain; Cascadia sidesteps it by rewriting attention with precomputed rotary embeddings and explicit KV tensors.

  1. Decoder layers load from safetensors via a model-agnostic structure mapping. Each stage gets only its assigned layers — memory usage scales with layer count, not model size.

    LlamaMistralQwenPhiGemmaDeepSeek
  2. HuggingFace's DynamicCache and LlamaRotaryEmbedding aren't traceable. Cascadia replaces them with a manual attention implementation taking precomputed cos/sin tensors and explicit KV tensors as inputs.

    Max numerical diff: 5×10⁻⁷ vs. native forward
  3. Each stage is traced with torch.jit.trace using real tensors. The traced graph converts to OpenVINO IR via ov.convert_model. KV cache tensors become stateful ReadValue / Assign pairs.

    torch.jit.traceov.convert_model
  4. INT4 symmetric weight compression via NNCF, group size 128. Each shard is a self-contained IR — XML graph plus BIN weights — taking token IDs or hidden states in, producing hidden states or logits out.

    NNCFgroup 128INT4 symmetric

A coordinator, a fleet, and a TCP socket.

The coordinator tokenizes input, runs stage 0 locally, and forwards hidden states to the next node. KV cache lives inside each compiled shard. Transport is raw TCP and runs on whatever your fleet already uses: WiFi, gigabit Ethernet, or Thunderbolt 4. The protocol is a thin swap point for RDMA and compression when fleet topology demands them.

Parity shards plus composition turn distribution into a 1.79× throughput win on two AI PCs.
— Llama 3.1 8B, 2-stream + K=3 spec decode, two-node fleet
Fleet
modelmeta-llama/Llama-3.1-8B
node-00
Lunar Lake · 258V · layers 0–15
21.99 tok/s
node-01
Panther Lake · 358H · layers 16–31
21.99 tok/s
FLEET TOTAL
43.97tok/s

From a HuggingFace handle to a serving fleet, in four commands.

Works on Windows 11 with Rust 1.95+ stable and OpenVINO 2026.1. Stock drivers; no kernel modules, no CUDA, no cloud credentials. The CLI wraps the export pipeline, fleet provisioning, and coordinator startup.

Silicon
Intel Core Ultra
OS
Windows 11 · 23H2+
Drivers
Stock · no kernel mods
Runtime
OpenVINO 2026.1
Toolchain
Rust 1.95+ stable
Network
TCP/IP · LAN
cascadia — zsh
# 1 · Export any HuggingFace decoder to INT4 OpenVINO shards
$ cascadia export meta-llama/Llama-3.1-8B \
    --stages 2 --precision int4 --out ./shards
   traced layers 0–15, 16–31 (INT4, group 128)
   stateful KV cache bound via ReadValue / Assign

# 2 · Push shards to the fleet
$ cascadia fleet push ./shards \
    node-00:stage-0 node-01:stage-1

# 3 · Start a worker on each node
$ cascadia worker start --shard /shards/stage-X

# 4 · Serve
$ cascadia serve --fleet node-00,node-01 --port 8080
  ready → 43.97 tok/s aggregate · 2-stream + spec K=3 · 2 workers

Frequently asked questions

You don't — Cascadia resolves it automatically from the selected model and the memory available across the fleet. The default is the minimum number of stages that will fit the model across a distributed pipeline, which also lands at the throughput sweet spot: each extra TCP hop adds decode latency, and benchmarks show 4-stage splits introduce dispatch overhead that erodes the per-stage gain.
In the current preview, a failed worker aborts the request — the coordinator doesn't re-route around it. Node health and automatic failover are on the roadmap. Until then, deployments should size fleet capacity with worker redundancy at the application layer. See the known limits for the full list.
Not yet, cleanly. The Panther Lake NPU (50 TOPS) rejects OpenVINO's ReadValue ops, which back the stateful KV cache. Non-stateful NPU inference works but requires full-sequence recomputation each token, which nets slower than GPU despite the TOPS advantage. Heterogeneous GPU/NPU dispatch is tracked against Intel's OpenVINO stateful-NPU support.
An OpenAI-compatible HTTP API over the coordinator ships first — point any existing /v1/chat/completions client at Cascadia and it works, no vendor SDK required. A Rust crate (cascadia) exposing the same primitives as the CLI (export, fleet, worker, serve) for build-pipeline and runtime use is on the roadmap.
Shards push live — no fleet stop required. The coordinator stages the new shard set across nodes, drains in-flight requests from the old pipeline, and cuts traffic over once every stage reports ready. Every node still has to run the same OpenVINO version, Cascadia version, and shard format; the coordinator won't negotiate mixed stacks, so runtime upgrades remain all-or-nothing.