Runtime Architecture

A distributed inference runtime for Intel AI PC fleets.

Cascadia exports any HuggingFace transformer into self-contained OpenVINO IR shards, distributes them across your fleet, and coordinates inference through a raw TCP pipeline. Four components, one protocol.

Request a pilot →See it run →

Ships with support for

Llama

1B – 70B

Gemma

2B – 27B

Mistral

7B · 8×7B

Qwen

1.5B – 72B

Phi

3.8B – 14B

DeepSeek

7B – 67B

Kimi

K2 · 1T

+ HuggingFace transformers

Compile once. Distribute to the fleet. Serve on-prem.

Cascadia exports any HuggingFace transformer into self-contained OpenVINO IR shards, one per pipeline stage, quantized INT4. Deploy a shard to each Intel AI PC in your fleet; a coordinator chains them over your LAN.

Dig into the benchmarks →

Model-agnostic export.

torch.jit.traceOpenVINO IRINT4 NNCF

A model-agnostic structure mapping covers Llama, Mistral, Qwen, Phi, Gemma, and DeepSeek families. Decoder layers load from safetensors; attention is rewritten with precomputed rotary embeddings for clean tracing.

Stateful pipeline runtime.

TCPTCP_NODELAY6.5 ms RTT

Each shard compiles with ReadValue / Assign ops backing its own KV cache. The coordinator drives the autoregressive loop; activations flow over TCP at 16 KB per hop with a 20-byte header.

Multi-tenant scheduling.

Per-request statestall-free decode

Independent InferRequests carry per-user KV state inside the same compiled graph, interleaving decode steps across stages. Throughput scales 1.38–1.66× per node without hardware changes.

Four components, one pipeline.

The system is deliberately simple. The protocol lives in the export format, not the runtime. Build once, run anywhere Intel silicon is available.

Export pipeline

Converts a HuggingFace transformer to INT4 OpenVINO IR shards. Runs once on your build machine.

HF → IR · NNCF · INT4

Workers

One per AI PC in the fleet. Loads its assigned shard and exposes a TCP endpoint.

TCP · 16 KB per hop

Shards

Self-contained compiled graphs, one per pipeline stage. Include stateful KV cache via ReadValue / Assign.

OpenVINO IR · stateful

Coordinator

Drives the autoregressive loop. Tokenizes input, runs stage 0, chains hidden states through the fleet.

Runs on one node · stateless relay

From HuggingFace handle to compiled shard, in four steps.

The per-stage export is the key technical contribution. Tracing standard transformer attention defeats PyTorch's export toolchain; Cascadia sidesteps it by rewriting attention with precomputed rotary embeddings and explicit KV tensors.

Decoder layers load from safetensors via a model-agnostic structure mapping. Each stage gets only its assigned layers — memory usage scales with layer count, not model size.
LlamaMistralQwenPhiGemmaDeepSeek
HuggingFace's DynamicCache and LlamaRotaryEmbedding aren't traceable. Cascadia replaces them with a manual attention implementation taking precomputed cos/sin tensors and explicit KV tensors as inputs.
Max numerical diff: 5×10⁻⁷ vs. native forward
Each stage is traced with torch.jit.trace using real tensors. The traced graph converts to OpenVINO IR via ov.convert_model. KV cache tensors become stateful ReadValue / Assign pairs.
torch.jit.traceov.convert_model
INT4 symmetric weight compression via NNCF, group size 128. Each shard is a self-contained IR — XML graph plus BIN weights — taking token IDs or hidden states in, producing hidden states or logits out.
NNCFgroup 128INT4 symmetric

A coordinator, a fleet, and a TCP socket.

The coordinator tokenizes input, runs stage 0 locally, and forwards hidden states to the next node. KV cache lives inside each compiled shard. Transport is raw TCP and runs on whatever your fleet already uses: WiFi, gigabit Ethernet, or Thunderbolt 4. The protocol is a thin swap point for RDMA and compression when fleet topology demands them.

Parity shards plus composition turn distribution into a 1.79× throughput win on two AI PCs.

— Llama 3.1 8B, 2-stream + K=3 spec decode, two-node fleet

Fleet

modelmeta-llama/Llama-3.1-8B

node-00

Lunar Lake · 258V · layers 0–15

21.99 tok/s

node-01

Panther Lake · 358H · layers 16–31

21.99 tok/s

FLEET TOTAL

43.97tok/s

From a HuggingFace handle to a serving fleet, in four commands.

Works on Windows 11 with Rust 1.95+ stable and OpenVINO 2026.1. Stock drivers; no kernel modules, no CUDA, no cloud credentials. The CLI wraps the export pipeline, fleet provisioning, and coordinator startup.

Silicon

Intel Core Ultra

Windows 11 · 23H2+

Drivers

Stock · no kernel mods

Runtime

OpenVINO 2026.1

Toolchain

Rust 1.95+ stable

Network

TCP/IP · LAN

cascadia — zsh

# 1 · Export any HuggingFace decoder to INT4 OpenVINO shards
$ cascadia export meta-llama/Llama-3.1-8B \
    --stages 2 --precision int4 --out ./shards
  ✓ traced layers 0–15, 16–31 (INT4, group 128)
  ✓ stateful KV cache bound via ReadValue / Assign

# 2 · Push shards to the fleet
$ cascadia fleet push ./shards \
    node-00:stage-0 node-01:stage-1

# 3 · Start a worker on each node
$ cascadia worker start --shard /shards/stage-X

# 4 · Serve
$ cascadia serve --fleet node-00,node-01 --port 8080
  ready → 43.97 tok/s aggregate · 2-stream + spec K=3 · 2 workers

Frequently asked questions

How do I choose the number of stages?

You don't — Cascadia resolves it automatically from the selected model and the memory available across the fleet. The default is the minimum number of stages that will fit the model across a distributed pipeline, which also lands at the throughput sweet spot: each extra TCP hop adds decode latency, and benchmarks show 4-stage splits introduce dispatch overhead that erodes the per-stage gain.

What happens if a node dies mid-generation?

In the current preview, a failed worker aborts the request — the coordinator doesn't re-route around it. Node health and automatic failover are on the roadmap. Until then, deployments should size fleet capacity with worker redundancy at the application layer. See the known limits for the full list.

Can I mix GPU and NPU nodes in a pipeline?

Not yet, cleanly. The Panther Lake NPU (50 TOPS) rejects OpenVINO's ReadValue ops, which back the stateful KV cache. Non-stateful NPU inference works but requires full-sequence recomputation each token, which nets slower than GPU despite the TOPS advantage. Heterogeneous GPU/NPU dispatch is tracked against Intel's OpenVINO stateful-NPU support.

What's the SDK surface beyond the CLI?

An OpenAI-compatible HTTP API over the coordinator ships first — point any existing /v1/chat/completions client at Cascadia and it works, no vendor SDK required. A Rust crate (cascadia) exposing the same primitives as the CLI (export, fleet, worker, serve) for build-pipeline and runtime use is on the roadmap.

How do I upgrade shards across the fleet?

Shards push live — no fleet stop required. The coordinator stages the new shard set across nodes, drains in-flight requests from the old pipeline, and cuts traffic over once every stage reports ready. Every node still has to run the same OpenVINO version, Cascadia version, and shard format; the coordinator won't negotiate mixed stacks, so runtime upgrades remain all-or-nothing.