A distributed inference runtime for Intel AI PC fleets.
Cascadia exports any HuggingFace transformer into self-contained OpenVINO IR shards, distributes them across your fleet, and coordinates inference through a raw TCP pipeline. Four components, one protocol.

Compile once. Distribute to the fleet. Serve on-prem.
Cascadia exports any HuggingFace transformer into self-contained OpenVINO IR shards, one per pipeline stage, quantized INT4. Deploy a shard to each Intel AI PC in your fleet; a coordinator chains them over your LAN.
Dig into the benchmarks →Four components, one pipeline.
The system is deliberately simple. The protocol lives in the export format, not the runtime. Build once, run anywhere Intel silicon is available.
Export pipeline
Converts a HuggingFace transformer to INT4 OpenVINO IR shards. Runs once on your build machine.
HF → IR · NNCF · INT4Workers
One per AI PC in the fleet. Loads its assigned shard and exposes a TCP endpoint.
TCP · 16 KB per hopShards
Self-contained compiled graphs, one per pipeline stage. Include stateful KV cache via ReadValue / Assign.
OpenVINO IR · statefulCoordinator
Drives the autoregressive loop. Tokenizes input, runs stage 0, chains hidden states through the fleet.
Runs on one node · stateless relayFrom HuggingFace handle to compiled shard, in four steps.
The per-stage export is the key technical contribution. Tracing standard transformer attention defeats PyTorch's export toolchain; Cascadia sidesteps it by rewriting attention with precomputed rotary embeddings and explicit KV tensors.
Decoder layers load from safetensors via a model-agnostic structure mapping. Each stage gets only its assigned layers — memory usage scales with layer count, not model size.
HuggingFace's DynamicCache and LlamaRotaryEmbedding aren't traceable. Cascadia replaces them with a manual attention implementation taking precomputed cos/sin tensors and explicit KV tensors as inputs.
Each stage is traced with torch.jit.trace using real tensors. The traced graph converts to OpenVINO IR via ov.convert_model. KV cache tensors become stateful ReadValue / Assign pairs.
INT4 symmetric weight compression via NNCF, group size 128. Each shard is a self-contained IR — XML graph plus BIN weights — taking token IDs or hidden states in, producing hidden states or logits out.
A coordinator, a fleet, and a TCP socket.
The coordinator tokenizes input, runs stage 0 locally, and forwards hidden states to the next node. KV cache lives inside each compiled shard. Transport is raw TCP and runs on whatever your fleet already uses: WiFi, gigabit Ethernet, or Thunderbolt 4. The protocol is a thin swap point for RDMA and compression when fleet topology demands them.
From a HuggingFace handle to a serving fleet, in four commands.
Works on Windows 11 with Rust 1.95+ stable and OpenVINO 2026.1. Stock drivers; no kernel modules, no CUDA, no cloud credentials. The CLI wraps the export pipeline, fleet provisioning, and coordinator startup.
# 1 · Export any HuggingFace decoder to INT4 OpenVINO shards $ cascadia export meta-llama/Llama-3.1-8B \ --stages 2 --precision int4 --out ./shards ✓ traced layers 0–15, 16–31 (INT4, group 128) ✓ stateful KV cache bound via ReadValue / Assign # 2 · Push shards to the fleet $ cascadia fleet push ./shards \ node-00:stage-0 node-01:stage-1 # 3 · Start a worker on each node $ cascadia worker start --shard /shards/stage-X # 4 · Serve $ cascadia serve --fleet node-00,node-01 --port 8080 ready → 43.97 tok/s aggregate · 2-stream + spec K=3 · 2 workers
Frequently asked questions
ReadValue ops, which back the stateful KV cache. Non-stateful NPU inference works but requires full-sequence recomputation each token, which nets slower than GPU despite the TOPS advantage. Heterogeneous GPU/NPU dispatch is tracked against Intel's OpenVINO stateful-NPU support./v1/chat/completions client at Cascadia and it works, no vendor SDK required. A Rust crate (cascadia) exposing the same primitives as the CLI (export, fleet, worker, serve) for build-pipeline and runtime use is on the roadmap.