# KostAI — LLM cost-reduction research pack

Single-file dump of everything a model needs to apply our cost-reduction findings to its own prompts. Rebuilt from `evidence/`, `docs/research/`, and the project README.

## Table of contents

- `README.md`
- `evidence/20_throughput_cost_ledger.md`
- `evidence/09_value_case.md`
- `evidence/14_determinism_ledger.md`
- `evidence/00_evidence_registry.md`
- `docs/research/2026-04-17/README.md`
- `docs/research/2026-04-17/r06-speculative-decoding.md`
- `docs/research/2026-04-17/r07-kv-cache-offload.md`
- `docs/research/2026-04-17/r08-multi-agent-verifier-swarm.md`
- `docs/research/2026-04-17/r09-streaming-early-stop.md`
- `docs/research/2026-04-17/r10-hierarchical-planner-executor.md`
- `docs/research/2026-04-17/r11-r35-ranked-prompt-pack.md`
- `docs/research/2026-04-17/r36-vqtoken-extreme-reduction.md`
- `docs/research/2026-04-17/r37-flattened-priors.md`

---

## `README.md`

# ai-cost

Local-only cost-and-waste instrumentation for LLM-powered apps — with
shadow-mode A/B testing, local-first routing, a multi-repo parallel-agent
orchestrator, and an MCP server for Claude Desktop / Claude Code.

> **🚀 v0.5.0 — BoilTheOcean orchestrator released** (2026-04-18). Multi-repo
> parallel-agent runner with per-task model routing, dynamic TABOO
> escalation, cost ledger + budget gate, bridge driver for data-sovereignty
> routing, and a strategic-brain layer that learns across waves. Measured:
> 41-task dogfood rollup on this codebase, mean +44.3% token savings vs naive
> Opus baseline. See [`docs/RELEASE_NOTES_v0.5.0.md`](docs/RELEASE_NOTES_v0.5.0.md)
> + [`docs/QUICKSTART.md`](docs/QUICKSTART.md). 30 new boil-specific tests.

> **v0.4.0 — Sprint 1 rollup:** router
> classifier v2 (+6.5 pts vs v1), scorer v2 with seven new detectors,
> native TLS/mTLS on the bridge, throughput-first dashboard, macOS DMG
> distribution, SQLite partitioning. Mean ~55% token reduction on the
> frozen benchmark suite. See [`docs/RELEASE_NOTES_v0.4.0.md`](docs/RELEASE_NOTES_v0.4.0.md).

> **Elastic reviewers, start here:** [`docs/ELASTIC_REVIEW.md`](docs/ELASTIC_REVIEW.md)
> is the end-to-end review and test plan (15 min single-machine walkthrough,
> two-machine bridge, production-deploy checklist, what-to-evaluate checklist).

**What it does:** wraps your LLM SDK (Anthropic, OpenAI, Google, Ollama, LM
Studio, OpenAI-compat) or slots in as an HTTP proxy, records every call,
scores it for waste across 9 categories, and — if you turn on shadow mode —
runs a cheaper/local path in parallel and grades the output so you can see
per-call what each optimized route would have saved.

Nothing leaves the machine. All data lives in a local JSONL file you can
`cat`.

## Table of contents

- [Install](#install)
- [Quick start](#quick-start)
- [Providers](#providers)
- [Shadow mode (A/B the frontier model against a cheaper one)](#shadow-mode)
- [Router (classify + downgrade simple tasks)](#router)
- [HTTP proxy (one env-var drop-in)](#http-proxy)
- [MCP server (Claude Desktop / Claude Code)](#mcp-server)
- [Bridge (multi-machine MCP — local↔frontier handoff)](#bridge)
- [Dashboard](#dashboard)
- [CLI commands](#cli-commands)
- [What it detects (waste categories)](#what-it-detects)
- [Configuration](#configuration)
- [Privacy](#privacy)
- [Extended docs](#extended-docs)

## Install

```bash
npm install @sapperjohn/kostai
# or
pnpm add @sapperjohn/kostai
```

The CLI binary is still called `ai-cost` — `npx ai-cost …` works after install.

## Quick start

### 1. Initialize

```bash
npx ai-cost init
```

Creates `ai-cost.config.json`.

### 2. Wrap your client

```ts
import Anthropic from "@anthropic-ai/sdk";
import { wrapAnthropic } from "@sapperjohn/kostai";

const client = wrapAnthropic(new Anthropic(), {
  appName: "my-app",
  route: "bugfix-agent",
});

await client.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Fix the auth bug" }],
});
```

### 3. Open the dashboard

```bash
npx ai-cost dashboard
```

Eight tabs: Overview, **Shadow Mode**, **Router**, **Local LLMs**,
**Bridge**, **Queue**, Calls, Trends. Everything runs on `http://localhost:3674`.

## Providers

```ts
import {
  wrapAnthropic,
  wrapOpenAI,
  wrapGoogle,            // @google/generative-ai
  wrapOllama,            // local Ollama HTTP client
  wrapOpenAICompat,      // LM Studio, Kimi, DeepSeek, vLLM, Moonshot
} from "@sapperjohn/kostai";
```

All wrappers use the same `wrap(client, { appName, route, workflow, tags })`
shape. Events are persisted whether the call succeeds or fails.

## Shadow mode

Every shadow run calls *both* the frontier model and a cheaper/local path in
parallel, returns the frontier result to the app, and writes a comparison
record with `baselineCostUsd`, `optimizedCostUsd`, `savedUsd`, and a Kimi-2.5
quality score (0–100).

```ts
import { runShadow, evaluateQuality, wrapAnthropic, wrapOllama } from "@sapperjohn/kostai";

const anthropic = wrapAnthropic(new Anthropic());
const ollama = wrapOllama({ baseUrl: "http://localhost:11434" });

const { baselineResult, comparison } = await runShadow({
  ask: userMessage,
  route: "ticket-classifier",
  baseline: async () => {
    const r = await anthropic.messages.create({
      model: "claude-opus-4-7",
      max_tokens: 256,
      messages: [{ role: "user", content: userMessage }],
    });
    return { event: { id: r._ai_cost_event_id, model: r.model, ... }, result: r };
  },
  optimized: async () => {
    const r = await ollama.chat({
      model: "llama3.2",
      messages: [{ role: "user", content: userMessage }],
    });
    return { event: { id: r._ai_cost_event_id, model: r.model, ... }, result: r };
  },
  qualityEvaluator: evaluateQuality,
});

// baselineResult flows back to the caller. The app never sees optimizedResult
// — shadow mode is read-only w.r.t. production.
```

The dashboard's **Shadow Mode** tab aggregates these as:
- total saved $
- average saved %
- average quality score
- by model pair
- by route
- recent A/B comparisons (click to see the diff)

## Router

Pure function. Given a call, it classifies the task, checks the model, and
emits one of four decisions with a USD-denominated savings estimate.

```ts
import { routeCall } from "@sapperjohn/kostai";

const decision = routeCall(
  {
    model: "claude-opus-4-7",
    messages: [{ role: "user", content: "Classify this ticket as bug | feature | question." }],
    inputTokens: 120,
    outputTokensEstimate: 20,
  },
  {
    router: {
      enabled: true,
      localProvider: "ollama",
      localModel: "llama3.2",
      cheapApiProvider: "anthropic",
      cheapApiModel: "claude-haiku-4-5",
    },
  },
);

// decision.decision:
//   "local_sufficient"        — route to ollama/llama3.2
//   "cheaper_api_sufficient"  — route to anthropic/claude-haiku-4-5
//   "frontier_required"       — keep on claude-opus-4-7
//   "cache_hit"               — (reserved; identical-prompt detection)
// decision.estimatedSavingsUsd, decision.reason, decision.confidenceLevel
```

The **Router** dashboard tab scans your recent traffic, runs the same
classifier offline, and shows the top 20 routable calls sorted by
annualized savings.

## HTTP proxy

One env-var adoption. The proxy speaks OpenAI's `/v1/chat/completions`
shape.

```bash
# Observe — record every call, never modify
npx ai-cost proxy --mode observe --port 4311

# Route — downgrade confidently routable calls
npx ai-cost proxy --mode route --port 4311

# Shadow — always run both paths, log the comparison
npx ai-cost proxy --mode shadow --port 4311
```

In your app:

```bash
OPENAI_BASE_URL=http://localhost:4311/v1
```

That's the entire integration. No code changes.

## MCP server

Expose ai-cost's primitives over the Model Context Protocol (spec
`2024-11-05`) so Claude Desktop, Claude Code, or any MCP client can call
them in-loop.

```bash
# Inspect the exposed tools
npx ai-cost mcp --list

# Run the server on stdio (register in Claude Desktop's config)
npx ai-cost mcp
```

Tools (20 total — same set served over both stdio and the bridge HTTP transport):
- `ai_cost_overview` — totals + waste breakdown
- `ai_cost_top_workflows` — ranked by avoidable spend
- `ai_cost_recommend_route` — run the router against an ask+model
- `ai_cost_record_call` — manual call logging
- `ai_cost_list_comparisons` — recent shadow-mode comparisons
- `ai_cost_ollama_chat` — pass-through to local Ollama, auto-recorded
- `ai_cost_shadow_compare` — run a live A/B from the caller
- `ai_cost_local_status` — detected local runtimes + config
- `ai_cost_anthropic_chat` — direct Anthropic Messages call (no SDK dep), auto-recorded
- `ai_cost_list_peers` — list configured bridge peers + reachability
- `ai_cost_escalate_to_frontier` — local node asks a frontier-role peer to run a prompt
- `ai_cost_delegate_to_local` — frontier node asks a local-role peer to run a prompt; records would-have-cost savings
- `ai_cost_route_cheap_api` — route to a cheap-API peer, or fall back to the frontier peer with a cheap model override
- `ai_cost_handoff` — router-driven smart dispatch across peers
- `ai_cost_preprocess` — distill a prompt locally before escalation
- `ai_cost_preprocess_then_escalate` — local preprocess + frontier escalation in one tool
- `ai_cost_queue_enqueue` — durable async enqueue for bridge work
- `ai_cost_queue_status` — queue counters + worker heartbeat
- `ai_cost_queue_list` — inspect queued/running/done/failed work
- `ai_cost_queue_cancel` — cancel a queued or running task

Claude Desktop config:

```json
{
  "mcpServers": {
    "ai-cost": {
      "command": "npx",
      "args": ["ai-cost", "mcp"]
    }
  }
}
```

## Bridge

The bridge runs the same MCP tools over an authenticated HTTP+SSE transport so
two machines can hand work to each other in-loop. Typical setup: a Mac Mini
running Ollama as the *local* node, a MacBook with the Anthropic API key as
the *frontier* node. Either side can call the other.

**On each machine:**

```bash
# Generate a shared secret (run once per pairing)
npx ai-cost bridge --gen-token
# → 64-char hex string. Put the same value in both machines' configs.

# Start the bridge listener
npx ai-cost bridge --listen
# → ai-cost bridge listening at http://0.0.0.0:4319/mcp/v1
#   tools: 20    transport: http+sse    auth: bearer

# Probe configured peers
npx ai-cost bridge --status
# → ✓ macbook  http://10.0.1.42:4319  role=frontier  (claude-opus-4-7)
#   ✓ mini     http://10.0.1.50:4319  role=local     (llama3.2, qwen2.5-coder, ...)
```

**Config block (added to `ai-cost.config.json`):**

```json
{
  "bridge": {
    "listenPort": 4319,
    "listenHost": "0.0.0.0",
    "authToken": "<64-char hex from --gen-token>",
    "probeTimeoutMs": 10000,
    "peers": [
      {
        "name": "macbook",
        "url": "http://10.0.1.42:4319",
        "token": "<the macbook's authToken>",
        "role": "frontier",
        "frontierModel": "claude-opus-4-7"
      }
    ]
  }
}
```

`bridge.probeTimeoutMs` keeps `bridge --status` and `bridge --doctor` snappy. If a real delegation or escalation regularly needs longer than 60s, raise `bridge.rpcTimeoutMs` globally or set `bridge.peers[*].rpcTimeoutMs` only on the slower peer.
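For example, keeping the global RPC timeout at its default while giving only the slower peer extra headroom (config fragment; values illustrative):

```json
{
  "bridge": {
    "probeTimeoutMs": 10000,
    "rpcTimeoutMs": 60000,
    "peers": [
      {
        "name": "mini",
        "url": "http://10.0.1.50:4319",
        "role": "local",
        "rpcTimeoutMs": 180000
      }
    ]
  }
}
```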

**Flows:**

- *Escalation* (local → frontier): Mac Mini's local agent calls
  `ai_cost_escalate_to_frontier` with `messages`, `reason`, optional
  `model`. The bridge POSTs to the MacBook peer's `ai_cost_anthropic_chat`,
  records the call locally with `route="escalation_request"` and
  `meta.bridge_peer="macbook"`, and returns the response.
- *Delegation* (frontier → local): MacBook calls
  `ai_cost_delegate_to_local` with the same `messages`. The bridge runs it on
  the Mini's `ai_cost_ollama_chat`, computes `wouldHaveCost` for the
  frontier model, and stores `meta.delegation_savings_usd` so the dashboard
  can total cumulative savings.
- *Handoff* (smart dispatch): `ai_cost_handoff` runs the router against
  the `messages`; `local_sufficient` → delegate, `frontier_required` → escalate.
  Force with `force: "local" | "frontier"`.

The dashboard's **Bridge** tab shows configured peers, reachability,
delegation count, savings to date, and recent escalations with peer + reason
metadata. Endpoint: `GET /api/bridge`.

**Wire formats** (all over `POST /mcp/v1/rpc` with
`Authorization: Bearer <token>`):

```bash
# Health (no auth)
curl http://10.0.1.42:4319/mcp/v1/health

# JSON-RPC tools/list
curl -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
     http://10.0.1.42:4319/mcp/v1/rpc

# Server→client notifications stream
curl -H "Authorization: Bearer $TOKEN" \
     http://10.0.1.42:4319/mcp/v1/sse
```

See [`docs/MAC_MINI_HANDOFF.md`](docs/MAC_MINI_HANDOFF.md) for the full
two-machine setup walk-through.

## Dashboard

```bash
npx ai-cost dashboard
```

Eight tabs:
- **Overview** — total spend, avoidable spend, shadow-saved, efficiency
  score, top waste categories, top repeated prompts, budget banner.
- **Shadow Mode** — A/B comparisons, saved totals, quality scores.
- **Router** — recent routable calls and annualized savings.
- **Local LLMs** — detected Ollama/LM Studio runtimes, local vs.
  cloud spend, configuration.
- **Bridge** — peer reachability, delegation count, cumulative savings,
  recent escalations. Reflects this node's `bridge` config block.
- **Queue** — durable 24h task queue: queued/running/done/failed counts
  and per-task inspection.
- **Calls** — searchable/filterable list of every recorded call. Click
  for full detail.
- **Trends** — daily spend and daily avoidable spend charts.

Auto-refreshes every 5 seconds. Local-only. No TLS. No login.

## CLI commands

| Command | Description |
|---|---|
| `npx ai-cost init` | Create config file |
| `npx ai-cost dashboard` | Start local dashboard on port 3674 |
| `npx ai-cost scan [--repo <path>]` | Detect local LLM runtimes + LLM usage in a repo |
| `npx ai-cost mcp [--list]` | Start MCP server over stdio |
| `npx ai-cost bridge --listen [--port 4319] [--host 0.0.0.0]` | Start the HTTP+SSE MCP bridge |
| `npx ai-cost bridge --status` | Probe configured peers — reachable, models, errors |
| `npx ai-cost bridge --gen-token` | Generate a 64-char hex shared secret |
| `npx ai-cost proxy --mode <observe\|route\|shadow>` | Drop-in OpenAI-compat proxy |
| `npx ai-cost compare --limit <n>` | Summarize shadow-mode comparisons |
| `npx ai-cost report --last 7d` | Print markdown report |
| `npx ai-cost export --format <json\|csv>` | Export events |
| `npx ai-cost doctor` | Check configuration |
| `npx ai-cost reset [--comparisons-only]` | Clear all stored data |

## What it detects

ai-cost runs **9 waste heuristics** on every call (was 8; added
`local_routable` for explicit local-downgrade flagging):

| Category | Confidence | Catches |
|---|---|---|
| `duplicate_block` | High | Same content repeated within a call |
| `replayed_history` | Medium | Long conversation replay for narrow asks |
| `repeated_artifact` | High | Same large block sent across recent calls |
| `low_relevance_large_block` | Low | Large blocks with weak link to the ask |
| `oversized_logs` | Medium | Raw logs that could be summarized |
| `oversized_code_context` | Low | Too many code files for the scope |
| `cacheable_system_prompt` | High | Stable system prompt resent unchanged |
| `model_overkill` | Low | Frontier model on a task the router flagged simple |
| `local_routable` | Medium | Call that would execute correctly on local inference |

Every finding carries `estimatedTokens`, `estimatedCostUsd`, and a
`confidence` level. The dashboard's "Top Waste Categories" is a
prioritized remediation list.
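To illustrate the finding shape, here is a minimal sketch of the highest-confidence detector, `duplicate_block`. The names and the rough 4-chars-per-token estimate are assumptions for the sketch, not the shipped scorer:

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape mirroring the finding fields named above.
interface WasteFinding {
  category: "duplicate_block";
  estimatedTokens: number;
  estimatedCostUsd: number;
  confidence: "high" | "medium" | "low";
}

const OPUS_INPUT_USD_PER_TOKEN = 15 / 1_000_000; // $15/M input

// Rough heuristic: ~4 chars per token.
const estimateTokens = (s: string) => Math.ceil(s.length / 4);

// Flag any block >= minTokens that appears more than once within one call.
export function detectDuplicateBlocks(
  blocks: string[],
  minTokens = 100,
): WasteFinding[] {
  const seen = new Map<string, number>();
  const findings: WasteFinding[] = [];
  for (const block of blocks) {
    const key = createHash("sha256").update(block.trim()).digest("hex");
    const count = (seen.get(key) ?? 0) + 1;
    seen.set(key, count);
    const tokens = estimateTokens(block);
    if (count > 1 && tokens >= minTokens) {
      findings.push({
        category: "duplicate_block",
        estimatedTokens: tokens,
        estimatedCostUsd: tokens * OPUS_INPUT_USD_PER_TOKEN,
        confidence: "high",
      });
    }
  }
  return findings;
}
```

Hashing the trimmed content keeps the check cheap enough to run on every call; the `minTokens` floor stops it flagging short boilerplate.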

## Configuration

```json
{
  "appName": "my-app",
  "storeDir": ".ai-cost-data",
  "port": 3674,
  "captureMode": "metadata_only",
  "redactSecrets": true,
  "redactPatterns": [],
  "providers": {
    "anthropic":    { "enabled": true },
    "openai":       { "enabled": true },
    "google":       { "enabled": true },
    "ollama":       { "enabled": true, "baseUrl": "http://localhost:11434", "defaultModel": "llama3.2", "powerWatts": 60, "electricityCostPerKwh": 0.15 },
    "lmstudio":     { "enabled": true, "baseUrl": "http://localhost:1234/v1", "defaultModel": "lmstudio-community/llama-3.2-3b-instruct" },
    "openaiCompat": { "enabled": false, "baseUrl": "https://api.moonshot.cn/v1", "defaultModel": "kimi-2.5" }
  },
  "thresholds": {
    "largeBlockTokens": 500,
    "logBlockTokens": 300,
    "repeatedHistoryTurns": 6,
    "efficiencyWarnPct": 70
  },
  "shadowMode": {
    "enabled": false,
    "recordSamplePct": 100,
    "optimizedProvider": "ollama",
    "optimizedModel": "llama3.2",
    "runQualityEval": true
  },
  "router": {
    "enabled": false,
    "simpleTaskMaxTokens": 800,
    "maxLocalLatencyMs": 8000,
    "localProvider": "ollama",
    "localModel": "llama3.2",
    "frontierProvider": "anthropic",
    "frontierModel": "claude-opus-4-7",
    "cheapApiProvider": "anthropic",
    "cheapApiModel": "claude-haiku-4-5"
  },
  "evaluator": {
    "enabled": false,
    "provider": "ollama",
    "model": "kimi-2.5"
  },
  "budget": {
    "monthlyUsd": 500,
    "warnAtPct": 80
  }
}
```

## Privacy

- **Default capture mode is `metadata_only`** — no content is stored, only
  hashes, token counts, cost, and scores.
- **`redacted_body`** stores truncated previews with PII patterns scrubbed.
- **`full_body`** is opt-in for local debugging only.
- **No network egress.** Everything runs on the local machine.
- **No telemetry.** No usage reporting to any external service.

## Extended docs

- [`docs/ELASTIC_REVIEW.md`](docs/ELASTIC_REVIEW.md) — step-by-step review
  and test plan for the Elastic team.
- [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) — full architecture, data
  model, extension points.
- [`docs/RUNBOOK.md`](docs/RUNBOOK.md) — operational guide: ports, logs,
  launchd persistence, disk management, upgrade, incident cheatsheet.
- [`docs/TWO_WAY_BRIDGE.md`](docs/TWO_WAY_BRIDGE.md) — two-machine
  local↔frontier bridge walkthrough.
- [`docs/MAC_MINI_SETUP.md`](docs/MAC_MINI_SETUP.md) — Mac-Mini-side
  install for the bridge peer.
- [`docs/BUSINESS_PLAN.md`](docs/BUSINESS_PLAN.md) — pricing, unit
  economics, go-to-market.
- [`docs/ELASTIC_STRATEGY.md`](docs/ELASTIC_STRATEGY.md) — the
  deck-revision strategy doc with per-pair cost multipliers, empirical
  savings math, and the slide-by-slide proposal.
- [`docs/OPENCLAW.md`](docs/OPENCLAW.md) — command-node orientation for
  any Claude Code instance picking up this repo (macmini ↔ macbook routing,
  Kimi-first model cascade, cowork boundary).

## Known limitations

1. Waste estimates are **heuristic** — likely waste, not certainty.
2. Context relevance is estimated via lexical overlap, not semantic.
3. Router rules are regex-based by design (auditable). Replaceable by a
   trained classifier once a comparison corpus exists.
4. Only Node.js / TypeScript SDKs are supported today. Other languages
   adopt via the HTTP proxy.
5. Token counts may be estimated when providers don't return usage.
6. Store is append-only JSONL; for > 10k events/day a SQLite backend
   behind the same `EventStore` interface is the next step.

## License

MIT

---

## `evidence/20_throughput_cost_ledger.md`

# 20 — Throughput & Cost Improvement Ledger

Append-only, dated record of every measurable cost-reduction or throughput-increase
lever landed in the repo. Each row is one shippable change. Numbers are either:

- **MEASURED** — from `kostai evidence` or an existing test fixture
- **MODELED** — derived from published pricing and a reproducible assumption
- **EST** — a documented back-of-envelope projection; upgrade to MEASURED on first run

New rows go at the **top**. Do not edit old rows — amend with a follow-up row.

## How to read a row

Each row has: **date · lever · grade · baseline → after · savings/speedup · verify-by**.
*baseline* and *after* are in the most natural unit for that lever (tokens, seconds,
dollars per call, hit-rate percent, etc). `verify-by` is the exact file/command a
skeptical reviewer can run.

## Ledger

### 2026-04-17 — LLMLingua-2 prompt compressor (tool-compress upgrade) · EST
- **Baseline:** `src/core/tool-compress.ts` sends tool-results >1 500 tokens
  to a local Ollama summariser → ~30% of original length at ~8 ms latency.
  Works; crude; can hallucinate under load.
- **After:** Python sidecar running `microsoft/llmlingua-2-bert-base-multilingual-cased`
  (BERT-base, ~380 MB, CPU-viable). Token-level importance classifier trained by
  distilling GPT-4 compression labels; **3.1× compression on MeetingBank
  (1 362 → 444 tokens)** with quality parity, **2.96× on NarrativeQA**,
  **3.06× on TriviaQA** (verbatim from LLMLingua-2 notebook in the
  flattened source).
- **Why it beats Ollama summarisation:** deterministic (perplexity-ranking,
  no sampling), sub-second on 10 K-token contexts, preserves *original*
  tokens instead of paraphrasing — safer for code/tool output.
- **Grade:** EST until the sidecar lands. First MEASURED pass on the
  existing `.ai-cost-data/events.jsonl` transcript will upgrade the row.
- **Failure modes (documented upstream):** exact-quote tasks, math/code
  generation, indentation-sensitive output. Mitigated by `force_tokens=[...]`
  and `force_reserve_digit=True`.
- **Verify by:** `docs/research/2026-04-17/r37-flattened-priors.md#3-microsoftllmlingua--prompt-compression-direct-dvp-precedent`;
  source at `docs/research/2026-04-17/flat/microsoft__LLMLingua/repo.cxml`
  (`prompt_compressor.py:1523-1750`).

### 2026-04-17 — Prompt-caching sticky routing (capture the 10% cache-read rate) · MODELED
- **Baseline:** Requests hash to any available deployment; Anthropic's
  `cache_control` hits only when the same deployment happens to serve
  two consecutive calls with the same prefix. Realised cache-read rate
  ≈ opportunistic, far below the headline 10% price.
- **After:** Pre-call filter maintains `{prompt-hash → last-cached-deployment}`
  (LiteLLM pattern at `router_utils/pre_call_checks/prompt_caching_deployment_check.py:19-100`);
  bias routing to the deployment that already holds the cached prefix.
- **Economics:** Anthropic cache-read = **10% of base input cost** (cache
  writes themselves carry a 25% premium), OpenAI cache-read = 50%
  (per-provider pricing in
  `BerriAI__litellm/src/litellm/cost_calculator.py:270-322`). On a
  20 K-token system prompt + tool catalogue at Opus input ($15/M), that's
  **$0.30 full-price input → $0.03 cache read**, i.e. **$0.27/call saved**
  whenever the sticky cache holds.
- **Grade:** MODELED against published Anthropic + OpenAI cache-read pricing.
- **Failure mode:** 5-minute Anthropic cache TTL means sticky-routing only
  pays out on burst traffic; sessions that pause > 5 min lose the cache and
  pay cache-write again. Mitigation: session-scoped keep-alive, or move to
  the 1-hour cache beta when eligible.
- **Verify by:** `docs/research/2026-04-17/r37-flattened-priors.md#4-berriailitellm--provider-routing--response-cache`;
  target implementation at `src/core/router-prompt-cache.ts` (not yet written).
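A sketch of the sticky-routing pre-call check this row describes. Names are illustrative (`src/core/router-prompt-cache.ts` is not written yet); the savings helper uses the 10%-of-input cache-read rate from the economics bullet:

```typescript
import { createHash } from "node:crypto";

// {prompt-prefix hash -> deployment that last wrote the cached prefix}
const lastCachedDeployment = new Map<string, string>();

// Bias routing toward the deployment that already holds the cached prefix.
export function pickDeployment(
  promptPrefix: string,
  available: string[],
): string {
  const key = createHash("sha256").update(promptPrefix).digest("hex");
  const sticky = lastCachedDeployment.get(key);
  const chosen = sticky && available.includes(sticky) ? sticky : available[0];
  lastCachedDeployment.set(key, chosen);
  return chosen;
}

// Per-hit economics: a cache read at 10% of the input price saves 90% of
// the prefix's input cost ($0.30 -> $0.03 on a 20 K-token prefix at $15/M).
export function stickyHitSavingsUsd(
  prefixTokens: number,
  inputUsdPerM = 15,
): number {
  const fullPrice = (prefixTokens / 1_000_000) * inputUsdPerM;
  return fullPrice - fullPrice * 0.1;
}
```

The map is the whole trick: without it, a load balancer scatters identical prefixes across deployments and the cache never warms.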

### 2026-04-17 — Drafter quantisation (qwen2.5-coder F16 → Q4_K_M) · MODELED
- **Baseline:** qwen2.5-coder:7b in F16 on the Mac Mini drafter.
  ~14.96 GiB VRAM, text-generation ~40 t/s @ 128 tokens (reference
  numbers from `ggerganov__llama.cpp/src/tools/quantize/README.md:143-145`
  on equivalent Llama-3.1-8B bench).
- **After:** Q4_K_M build: **4.58 GiB VRAM (–69%), generation ~72 t/s (+77%)**
  on the same hardware. Quality recovered via `llama-imatrix` calibration
  on a domain-matched dataset (upstream PRs #4861/#4930 show 0.05-0.15
  PPL-point gains over naive Q4_K_M).
- **Why it's cheap to land:** one-line Ollama modelfile change plus an
  imatrix calibration pass. Existing `evidence/14_determinism_ledger.md`
  suite (6 tasks × 3 iterations) re-runs the quality check in < 30 min.
- **Grade:** MODELED. Upgrade to MEASURED after the determinism suite
  re-runs on the quantised drafter.
- **Failure mode:** Perplexity degrades on long-tail tokens; imatrix
  calibration mitigates but can't eliminate. Mitigation: keep the F16
  build on disk and swap back if shadow-mode quality drops > 5% on the
  golden set.
- **Verify by:** `docs/research/2026-04-17/r37-flattened-priors.md#2-ggerganovllamacpp--quantization--cpu-kv-cache`;
  retention re-measurement via `kostai evidence run --all && kostai evidence report`.

### 2026-04-17 — Deterministic prose compressor (caveman-mode) · MEASURED
- **Baseline:** A long system prompt / `CLAUDE.md` / agent memory file flows
  into every frontier call unchanged. On the five test fixtures shipped by
  `JuliusBrussee/caveman` (the MIT upstream this module credits) the
  baseline averaged **898 chars/file**; 46% of the bytes in those fixtures
  are filler, hedging, pleasantries, and verbose connective phrasing.
- **After:** `src/core/prose-compress.ts` runs a deterministic pure-TS
  ruleset over any `system` / `user` message (or a file on disk via
  `kostai compress`). Three levels:
  - **lite** — drop filler + pleasantries + hedging (articles kept).
  - **full** (default) — lite + drop articles mid-sentence, drop
    connective fluff (however/furthermore/in addition), swap redundant
    phrases ("in order to" → "to", "due to the fact that" → "because").
  - **ultra** — full + aggressive short-synonym swaps
    (utilize → use, numerous → many, approximately → about, implement
    a solution for → fix) + drop nag imperatives.
- **Preservation invariants (hard-tested, 11 tests):** fenced code blocks,
  inline code, URLs, file paths, email addresses, version numbers, dates,
  HTML tags, headings, and frontmatter pass through byte-exact. Idempotent:
  `compress(compress(x)) === compress(x)` is a unit test, not a promise.
- **Measured:** on a synthetic CLAUDE.md-shape preferences block
  (filler-heavy prose + code + URL + headings, ~350 chars), `full` level
  hit **24% char reduction** with structure 100% intact. On the pure-prose
  ultra test ("You should utilize the approximately numerous features…"),
  `ultra` cuts ~56% of chars via verb/adjective substitution alone.
- **Per-call savings (MODELED, Opus 4.7 input at $15/M):** a 2 000-token
  system prompt with 10–15% compressible filler → **200–300 tokens saved
  per send → $0.003–$0.0045/call**. At 200 frontier calls/dev/day that's
  **$0.60–$0.90/day ≈ $12–$18/mo/dev** on the system prompt alone. Net-new
  savings additive to DVP, semantic-cache, and tool-compress because this
  runs on the *prose side* of the prompt that those levers don't touch.
- **Zero-network:** unlike preprocess.ts and tool-compress.ts (both
  Ollama-backed), prose-compress runs in-process with zero network calls.
  Safe to run on every call — the overhead is string regexes.
- **New waste category:** `verbose_prose_input` in `src/core/types.ts` +
  `score.ts:detectVerboseProseInput`. Fires at medium confidence on any
  system/user prose ≥ 400 tokens where
  `estimateCompressibility().likelyReductionPct ≥ 8`. Feeds straight into
  `llm.avoidable_context_cost_usd`.
- **CLI:** `kostai compress <file> [--level lite|full|ultra] [--dry-run]`.
  Writes compressed output in place, backs up the original to
  `FILE.original.md`. Refuses code/config extensions and sensitive
  filenames (credentials, keys, .env, .pem, known_hosts, ...) before read.
- **Attribution:** the rule set is adapted from
  [JuliusBrussee/caveman](https://github.com/JuliusBrussee/caveman) (MIT).
  Caveman calls Claude to rewrite — we do not, because that burns the
  tokens we're trying to save. Our contribution: pure-TS determinism,
  hard preservation invariants, and a waste-category integration.
- **Grade:** MEASURED on unit fixtures; MODELED on per-dev $ projection.
  Upgrade to MEASURED on $ after the first 50-call shadow run that applies
  compression to the system prompt.
- **Verify by:** `src/core/prose-compress.ts`; `tests/unit/prose-compress.test.ts`
  (25 tests — 12 preservation invariants + 5 correctness + 2 savings
  estimation + 2 compressMessages + 3 estimateCompressibility + 1
  real-world CLAUDE.md shape); `src/core/score.ts` (+1 category, 2 tests);
  `kostai compress examples/example.md --dry-run`.
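A toy subset of the deterministic ruleset this row describes: a few of the phrase swaps listed above plus the fenced-code preservation invariant. This is a sketch, not the shipped `src/core/prose-compress.ts`:

```typescript
// A handful of the deterministic swaps; the real ruleset is much larger.
const SWAPS: Array<[RegExp, string]> = [
  [/\bin order to\b/gi, "to"],
  [/\bdue to the fact that\b/gi, "because"],
  [/\butilize\b/gi, "use"],
  [/\bnumerous\b/gi, "many"],
  [/\bapproximately\b/gi, "about"],
];

export function compressProse(text: string): string {
  // Preservation invariant: fenced code blocks pass through byte-exact.
  const parts = text.split(/(```[\s\S]*?```)/);
  return parts
    .map((part) =>
      part.startsWith("```")
        ? part
        : SWAPS.reduce((s, [re, sub]) => s.replace(re, sub), part),
    )
    .join("");
}
```

Because every rule rewrites toward words that no rule targets, the key invariant falls out for free: `compressProse(compressProse(x)) === compressProse(x)`.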

### 2026-04-17 — Cost-aggressive LLM Council (6 stacked wedges) · MODELED
- **Baseline (naive Karpathy council):** 4× Opus 4.7 drafts + 4×3 peer reviews
  + 1× Opus chairman at 500 in / 500 out each. Input ≈ $0.045, output ≈ $0.525.
  **≈ $0.57/query**.
- **After (cost-aggressive variant):** six stacked wedges in one orchestrator:
  1. **Semantic-cache hit** (0.95 cosine) → $0 on ~75% of repeated routes.
  2. **Escalation gate** on cache-miss: simple asks bypass the council
     entirely → single chairman call ≈ **$0.0075/call**.
  3. **Free-tier drafters**: council defaults to 3× local Ollama models
     (qwen2.5-coder:7b, qwen2.5:7b, mistral:7b) @ **$0 draft cost**.
  4. **Consensus short-circuit**: pairwise cosine ≥ 0.92 on stage-1 drafts
     returns the best answer immediately, skipping stages 2 & 3
     (~**40% of council-eligible calls**, modeled).
  5. **Cheap reviewer tier**: Haiku 4.5 @ $1/$5 per M with shorthand
     ranking ("B,A,D,C") instead of verbose FINAL RANKING. Review stage:
     **~$0.003** total vs **$0.18** for 12 Opus reviews.
  6. **DVP chairman**: shorthand A/P/R on Sonnet 4.6 (approve path = 1
     token), synthesis only when drafts disagree.
- **Weighted expected** on a mixed traffic mix (75% cache hit · 15% gate-skip
  · 4% consensus · 6% full council):
  - Cache hit: $0
  - Gate-skip (Sonnet chairman only): ~$0.0075
  - Consensus: ~$0.0005 (local only + embed)
  - Full council: $0 draft + Haiku review + Sonnet chairman ≈ **$0.014**
  - Blended: **~$0.0021/query → 99.6% reduction vs naive Karpathy.**
- **Observability:** `council.*` bus events (`started`, `cache_hit`,
  `gate_skipped`, `stage1_complete`, `consensus_reached`, `stage2_complete`,
  `completed`) and a per-call `savingsVsNaiveCouncilUsd` field on
  `CouncilResult`.
- **Grade:** MODELED. Upgrade to MEASURED on first 50-call shadow run.
- **Verify by:** `src/core/council.ts:1`; `tests/unit/council.test.ts`
  (29 tests — all six wedges + failure modes + persistence).
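The blended figure is plain expectation arithmetic over the traffic mix. A sketch using the rounded per-path costs from this row (so it lands a hair under the ~$0.0021 headline):

```typescript
// Expected cost per query over the modeled traffic mix.
const paths = [
  { share: 0.75, costUsd: 0 },      // semantic-cache hit
  { share: 0.15, costUsd: 0.0075 }, // escalation-gate skip (chairman only)
  { share: 0.04, costUsd: 0.0005 }, // consensus short-circuit (local + embed)
  { share: 0.06, costUsd: 0.014 },  // full council (free drafts, Haiku review)
];

export const blendedUsd = paths.reduce(
  (sum, p) => sum + p.share * p.costUsd,
  0,
);
// ≈ $0.002/query, i.e. roughly a 99.6% cut vs the naive all-Opus council.
```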

### 2026-04-17 — Draft-Verify-Patch (shorthand protocol) · MODELED
- **Baseline:** Opus 4.7 direct answer, ~500 output tokens @ $75/M = **$0.0375/call**.
- **After:** Local drafter (qwen2.5-coder:7b, free) + Opus shorthand verifier.
  - Approve path: verifier emits `A` (1 token) → **$0.000075 output cost**.
  - Patch path (typical): `P\n<find>\n<replace>` ≈ 15 output tokens → **$0.001125**.
  - Rewrite path: full frontier answer ≈ 500 tokens → **$0.0375** (unchanged).
- **Weighted expected** (60% approved, 25% patched, 15% rewritten):
  - Output cost **≈ $0.006 / call → 84% output-side reduction**.
- **Stacking with shorthand vs JSON:** A/P/R ≈ 1 / 15 / 500 tokens. The JSON
  variant of this protocol adds ~20 tokens to every response (`{"decision":"approved"}`)
  — shorthand cuts the approve-path response from ~20 → 1, a further **95%**
  output-cost reduction on the hottest path.
- **Grade:** MODELED. Upgrade to MEASURED after 100 shadow-graded calls.
- **Verify by:** `src/core/draft-verify.ts:1`; `tests/unit/draft-verify.test.ts`
  (25 tests, including an `A`-path assertion that `verifyOutputTokens === 1`).
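A sketch of parsing the shorthand verifier reply. The A/P/R wire shapes come from this row; the function names are illustrative, not the shipped `src/core/draft-verify.ts` API:

```typescript
type Verdict =
  | { kind: "approved" }                               // "A": ship the local draft
  | { kind: "patched"; find: string; replace: string } // "P\n<find>\n<replace>"
  | { kind: "rewritten"; text: string };               // "R" + full frontier answer

export function parseVerdict(raw: string): Verdict {
  const lines = raw.split("\n");
  switch (lines[0].trim()) {
    case "A":
      return { kind: "approved" };
    case "P":
      return { kind: "patched", find: lines[1] ?? "", replace: lines[2] ?? "" };
    default:
      return { kind: "rewritten", text: raw.replace(/^R\n?/, "") };
  }
}

export function applyVerdict(draft: string, verdict: Verdict): string {
  if (verdict.kind === "approved") return draft;  // 1-token hot path
  if (verdict.kind === "patched")
    return draft.replace(verdict.find, verdict.replace);
  return verdict.text;                            // full rewrite, rare path
}
```

The whole economic argument lives in the first branch: the frontier model's entire contribution on the majority path is one output token.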

### 2026-04-17 — Tool-result compression · MODELED
- **Baseline:** Raw tool output fed back to frontier. Shell dumps, file reads,
  and API responses routinely exceed 1 500 tokens; Claude Code sessions show
  tool_result blocks averaging ~3 000–12 000 tokens.
- **After:** Threshold (1 500 tokens, configurable) → local Ollama compressor
  → factual summary averaging ~30% of original length.
- **Savings (per block):** on a 6 000-token `tool_result` at Opus input price
  ($15/M), **$0.090 → $0.027 = $0.063 saved per call**. At 50 tool_results/day
  per dev, that's **~$3.15/day ≈ $65/mo** on Opus alone.
- **Grade:** MODELED. First MEASURED pass lands the moment shadow mode runs on
  a recorded claude-code transcript — the transcript file already exists at
  `.ai-cost-data/events.jsonl`.
- **Verify by:** `src/core/tool-compress.ts:1`; `tests/unit/tool-compress.test.ts`
  (6 tests covering Anthropic content-blocks, string tool messages, no-op on
  unreachable Ollama, and refusal to substitute a non-smaller summary).
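
The gating logic reduces to a small wrapper. A sketch that mirrors the behaviors the unit tests describe (no-op when the compressor is unreachable, refusal to substitute a non-smaller summary) — `estimateTokens` and the injected `compress` function are stand-ins for the real tokenizer and Ollama call:

```typescript
// Threshold-gated tool-result compression (sketch).
const THRESHOLD_TOKENS = 1_500; // configurable, per this ledger row

// Crude ~4-chars-per-token estimate; the shipped module uses a tokenizer.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

async function maybeCompress(
  toolResult: string,
  compress: (text: string) => Promise<string>,
): Promise<string> {
  if (estimateTokens(toolResult) < THRESHOLD_TOKENS) return toolResult;
  try {
    const summary = await compress(toolResult);
    // Refuse to substitute a summary that isn't actually smaller.
    return estimateTokens(summary) < estimateTokens(toolResult)
      ? summary
      : toolResult;
  } catch {
    return toolResult; // no-op when the local compressor is unreachable
  }
}
```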

### 2026-04-17 — Semantic cache (cosine 0.95, local embeddings) · EST
- **Baseline:** Every prompt goes to the frontier, including near-duplicates
  ("what's my bill?" / "tell me my bill" / "show my bill please").
- **After:** Local Ollama embedding (`nomic-embed-text`, 768-dim) → cosine
  similarity vs prior cached answers → hit if ≥ 0.95.
- **Published hit rates** on agent workloads: 67–90% (GPTCache, Redis LangCache
  benchmarks). Conservative mid-range: **75% hit rate on repetitive routes**.
- **Per-hit savings** on an Opus 500-token in / 500-token out call: **$0.045**.
  At 75% hit rate on 100 such calls/day → **$3.375/day → ~$100/mo per dev**
  on a single repetitive surface.
- **Grade:** EST. Needs a MEASURED pass on real traffic; the harness is ready
  (`src/core/semantic-cache.ts` + `cacheOrCall()` wrapper) — any
  `kostai.app`-installed user can produce a measurement overnight.
- **Verify by:** `src/core/semantic-cache.ts:1`; `tests/unit/semantic-cache.test.ts`
  (14 tests; cosine math, JSONL persistence, cacheOrCall savings estimate).
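
The hit test itself is just cosine similarity against the 0.95 bar. A minimal sketch — the shipped `semantic-cache.ts` additionally handles JSONL persistence and savings accounting:

```typescript
// Cosine-similarity hit test over `nomic-embed-text` embeddings (768-dim).
const HIT_THRESHOLD = 0.95;

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function isCacheHit(query: number[], cached: number[]): boolean {
  return cosine(query, cached) >= HIT_THRESHOLD;
}
```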

### 2026-04-17 — Output-side waste heuristics (3 new categories) · MEASURED
- **verbose_output_preamble** — regex-matches stock openers ("Sure! Here's…",
  "Of course!", "I hope this helps") and trailers. Averages 14–18 tokens per
  hit on frontier models; at $75/M output that's **$0.00105 per call** — small
  per call, large per fleet. Not flagged on free local models.
- **dvp_candidate** — flags calls with ≥ 200 output tokens on models priced
  ≥ $5/M output where the user ask is verifiable (explain / rewrite / summarize /
  review). Quantifies the DVP opportunity **per call** in USD.
- **oversized_tool_result** — fires on any `tool_result` block ≥ 1 500 tokens.
  Paired with the compressor (see above row) for a one-click fix path.
- **Grade:** MEASURED. Detection logic is a pure function with deterministic
  output given a normalized call; covered by 11 tests in
  `tests/unit/output-side-waste.test.ts`.
- **Verify by:** `src/core/score.ts:492-619`;
  `tests/unit/output-side-waste.test.ts`.
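
The preamble detector reduces to a regex plus a rate table. A sketch with an illustrative pattern set and the low end of the measured 14–18-token range — the shipped detector in `src/core/score.ts` covers more openers and trailers:

```typescript
// Sketch of the verbose_output_preamble detector. Patterns here are
// illustrative, not the full set in src/core/score.ts.
const PREAMBLE = /^(Sure!|Of course!|Certainly!|Great question!|I hope this helps)/;

function preambleWasteUsd(
  output: string,
  outputRatePerMTokens: number, // e.g. 75 for Opus output
): number {
  if (!PREAMBLE.test(output.trimStart())) return 0;
  const AVG_PREAMBLE_TOKENS = 14; // low end of the measured 14-18 range
  return AVG_PREAMBLE_TOKENS * (outputRatePerMTokens / 1_000_000);
}

console.log(preambleWasteUsd("Sure! Here's the fix:", 75)); // 0.00105 per call
console.log(preambleWasteUsd("The diff below fixes it.", 75)); // 0 — no preamble
```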

### 2026-04-17 — Preprocess (distillation) · MEASURED
- **Baseline:** 6 000-token conversation replayed in full on every turn.
- **After:** Local model emits `summary_of_history + local_attempt + help_needed`
  JSON; only the distilled version goes to the frontier.
- **Typical reduction** on replayed-history turns: **60–80% of input tokens**.
- **Grade:** MEASURED. Prior ledger entry — the module shipped pre-0.3.0 and
  has 3 end-to-end tests.
- **Verify by:** `src/core/preprocess.ts:1`; `tests/unit/preprocess.test.ts`.
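
The distilled packet's shape, as a type sketch. Field names follow this row; the prompt assembly shown is illustrative, not the shipped `preprocess.ts`:

```typescript
// Shape of the distilled packet the local model emits before the
// frontier sees anything.
interface DistilledTurn {
  summary_of_history: string; // compressed replay of prior turns
  local_attempt: string;      // what the local model already tried
  help_needed: string;        // the narrow question for the frontier
}

// Hypothetical assembly: only these three fields reach the frontier,
// never the raw 6 000-token history.
function toFrontierPrompt(d: DistilledTurn): string {
  return [
    `History (distilled): ${d.summary_of_history}`,
    `Local attempt: ${d.local_attempt}`,
    `Help needed: ${d.help_needed}`,
  ].join("\n");
}
```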

### 2026-04-17 — Shadow mode A/B grading · MEASURED
- **Baseline:** Guessing whether a cheaper model is "good enough" in principle.
- **After:** Both models called in parallel; a judge model scores both outputs;
  per-call would-have-saved in USD.
- **Typical quality retention** on dev-01..dev-03 + cowork-01..cowork-03
  benchmark set: local (qwen2.5-coder) retains **> 85% quality vs Opus** on
  code-context compression.
- **Grade:** MEASURED via `evidence/14_determinism_ledger.md` — 6 tasks × 3
  iterations each, all deterministic.
- **Verify by:** `kostai evidence run --all && kostai evidence report`.

---

## Aggregate — per-developer projection (MODELED, 2026-04-17)

Layered on a single developer running agentic coding workflows on Opus 4.7:

| Lever                      | Daily savings | Monthly |
|----------------------------|---------------|---------|
| Preprocess distillation    | $0.80         | $16     |
| DVP (shorthand protocol)   | $1.20         | $24     |
| Tool-result compression    | $3.15         | $65     |
| Semantic cache (75% hit)   | $3.37         | $100    |
| Output-preamble trim       | $0.10         | $2      |
| Cost-aggressive council    | $2.90         | $60     |
| **Total (additive cap)**   | **~$11.50**   | **~$267** |

The levers are **not fully independent** — DVP and tool-compress both bite into
`avoidableContextCostUsd` for overlapping calls, and semantic-cache hits pre-empt
both. A conservative combined figure with overlap discount: **~$150/mo/dev**
on a heavy agentic workload, **~$40/mo/dev** on a lighter chat workload.

Scale-01 (single-developer upper bound from `scenarios.yaml`) sits at ~$39/dev/mo
— a figure computed before DVP and the semantic cache shipped, which makes it the
**floor** of the new range, not the ceiling.

## How to update this file

1. Ship a lever. Open a row at the top of "Ledger" with a dated heading.
2. Mark grade honestly — **EST** if you haven't yet run a measurement.
3. Point `verify by` at an exact file or command.
4. Never edit historical rows — add a **follow-up row** that cites the old one.
5. Re-compute the aggregate table when any row's numbers change.

This file is checked in so `git log -- evidence/20_throughput_cost_ledger.md`
is the authoritative history of the optimization program.

---

## `evidence/09_value_case.md`

# 09 — Value Case (Elastic Internal)

> **Audience:** Value Engineering leadership and the eventual pilot sponsor.
> **Status:** Draft for internal review. Every numeric claim in this
> document carries a grade (MEASURED / MODELED / ASSUMED / NEEDS_VERIFICATION).
> If a number isn't graded, it isn't in this document.

## One-paragraph summary

KostAI is a deterministic, developer-first instrumentation layer for LLM
spend. On a reproducible benchmark suite that reviewers can run in
seconds without API keys, the current scoring pipeline identifies an
average of **54.2% avoidable input tokens** across three winning
dev-workspace tasks and **69.3% / 100% / 0%** across the three
co-workspace tasks (one adversarial loss case is reported honestly).
Those are **MEASURED** compression rates on synthetic but
byte-deterministic fixtures. The dollar translation is **MODELED** — a
conservative, fully-cited projection of what those compression rates
imply per developer per month if they hold up in real traffic. The
purpose of a shadow-mode pilot is to measure whether they do.

## What is actually proven (MEASURED)

A reviewer who runs `kostai evidence refresh` locally sees identical
numbers to the ones below. Every row of [master_metrics.csv](metrics/master_metrics.csv)
includes an `input_packet_sha` and a `pricing_table_version` so a
future re-run can be compared byte-for-byte.

| Claim | Value | Citation | Grade |
|---|---|---|---|
| Repo-dump code compression (dev-01) | 47.5% of input tokens flagged avoidable | [bench-01](10_deck_claim_bank.yaml) | MEASURED |
| Conversation-history replay (dev-02) | 45.0% of history tokens flagged replayed | [bench-02](10_deck_claim_bank.yaml) | MEASURED |
| Raw-log triage (dev-03) | 70.0% of log-block tokens flagged collapsible | [bench-03](10_deck_claim_bank.yaml) | MEASURED |
| Unrelated-context attached (cowork-01) | 69.3% of attached doc flagged low-relevance | [bench-04](10_deck_claim_bank.yaml) | MEASURED |
| Repeated boilerplate across sections (cowork-02) | 100% (capped) of repeated header deduplicated | [bench-05](10_deck_claim_bank.yaml) | MEASURED |
| Adversarial transcript (cowork-03) | **0% — honest loss reported** | [bench-06](10_deck_claim_bank.yaml) | MEASURED |

### Why the hostile case matters

`cowork-03-hostile-transcript` is a transcript where every line is a
unique speaker on a unique topic — no structural redundancy to
compress. Its expected outcome is a loss, and the benchmark harness
reports it as a loss. The decision to publish this failure inside the
proof pack is deliberate: without it, every heuristic looks like a
hammer that works on every nail. With it, the pack is defensible under
adversarial review.

## What the dollars look like (MODELED)

All figures below are re-evaluated from [scenarios.yaml](scenarios.yaml)
every time `kostai evidence verify` runs. They are **not** invoiced
charges. They estimate what the measured compression rates would save
if production traffic at Elastic behaved like the benchmarks.

Inputs (all graded inside [scenarios.yaml](scenarios.yaml)):

- `dev_lane_avg_compression = 54.2%` — MEASURED, mean of
  dev-01/dev-02/dev-03.
- `calls_per_developer_per_day = 80` — ASSUMED; [assm-01](10_deck_claim_bank.yaml).
- `avg_input_tokens_per_call = 3,000` — ASSUMED; [assm-02](10_deck_claim_bank.yaml).
- `baseline_model_rate_per_1m_input = $15.00` — MEASURED from
  [pricing_tables.yaml](pricing_tables.yaml) claude-opus-4-7.
- `working_days_per_month = 20` — ASSUMED.
- `conservatism_multiplier = 0.4` — ASSUMED; [assm-03](10_deck_claim_bank.yaml).

Formulas (verbatim from scenarios.yaml):

```
scale-01 monthly_saved_usd =
    calls_per_developer_per_day
  * working_days_per_month
  * avg_input_tokens_per_call
  * dev_lane_avg_compression
  * (baseline_model_rate_per_1m_input / 1_000_000)

scale-02 monthly_saved_usd = scale-01 * conservatism_multiplier

scale-03 monthly_saved_usd = scale-02 * pilot_team_size
```

Evaluated:

| Scenario | Result | Grade | Citation |
|---|---|---|---|
| scale-01 upper-bound per-developer monthly | **≈ $39.00/dev/month** | MODELED | [scale-01-single-developer](scenarios.yaml) |
| scale-02 conservative per-developer monthly | **≈ $15.60/dev/month** | MODELED | [model-01](10_deck_claim_bank.yaml) |
| scale-03 10-engineer 1-month pilot savings | **≈ $156/month** | MODELED | [model-02](10_deck_claim_bank.yaml) |
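
The same formulas, transcribed into TypeScript so a reviewer can re-derive the table without running the harness. Inputs are copied verbatim from the graded list above:

```typescript
// Re-evaluating the scenarios.yaml formulas with the graded inputs.
const callsPerDevPerDay = 80;        // ASSUMED (assm-01)
const workingDaysPerMonth = 20;      // ASSUMED
const avgInputTokensPerCall = 3_000; // ASSUMED (assm-02)
const devLaneAvgCompression = 0.542; // MEASURED (mean of dev-01..dev-03)
const ratePerMInput = 15.0;          // MEASURED (claude-opus-4-7)
const conservatismMultiplier = 0.4;  // ASSUMED (assm-03)
const pilotTeamSize = 10;

const scale01 =
  callsPerDevPerDay * workingDaysPerMonth * avgInputTokensPerCall *
  devLaneAvgCompression * (ratePerMInput / 1_000_000);
const scale02 = scale01 * conservatismMultiplier;
const scale03 = scale02 * pilotTeamSize;

console.log(scale01.toFixed(2)); // 39.02 — upper-bound per-developer monthly
console.log(scale02.toFixed(2)); // 15.61 — conservative per-developer monthly
console.log(scale03.toFixed(2)); // 156.10 — 10-engineer pilot monthly
```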

### Why the dollar number is deliberately small

The conservatism multiplier is 0.4 because the honest pitch at this
stage is **"the compression rates are real; the scale factor is what
we need a pilot to measure."** A larger number without a pilot behind
it would be an overclaim. The purpose of the pilot is to replace
`assm-01`, `assm-02`, and `assm-03` with measured values — if those
measurements come in higher than the assumptions (likely on workflows
with large repo dumps or raw log attachments), the $/dev/month number
grows with them, not from a change in methodology.

## What still needs verification (NEEDS_VERIFICATION)

These gates must clear before shipping to Elastic at scale — not before
shipping the proof pack:

| Gate | Owner | Blocks |
|---|---|---|
| Pricing table matches current public rates | John Bradley | Any dollar figure on an external slide |
| Elastic procurement rate card | VE / Procurement | Only shifts the answer in our favor |
| Security review of shadow-mode capture | Security | Any live pilot |
| Executive sponsor identified | Open | The pilot itself |
| Budget owner for pilot time | Open | The pilot itself |

See [13_open_questions_and_blockers.md](13_open_questions_and_blockers.md)
for the live list.

## What a pilot looks like

1. One Elastic engineering team installs `@sapperjohn/kostai` via
   `npm install` and wraps their Anthropic/OpenAI client.
2. Kostai runs in `metadata_only` capture mode for 30 days — no prompt
   bodies are stored, only hashes, token counts, and waste findings.
3. At day 30, the pilot team runs `kostai evidence refresh` against
   their own local data. This regenerates the same 3+3 benchmark pack
   plus a new pair of lane reports that re-evaluate the MEASURED
   compression rate on real traffic.
4. Value Engineering reviews the delta between the pilot's MEASURED
   compression rate and the current benchmark-lane rate (54.2%). If
   the pilot-measured rate is ≥ 30%, scale-02 is updated and a
   rollout decision is made. If the pilot-measured rate is < 30%, the
   root cause is investigated before further claims.

## Success criteria for this proof pack

- [x] Every numeric claim in this document is traceable to a file in
      this directory.
- [x] `kostai evidence verify` passes locally.
- [x] The benchmark harness is byte-deterministic across re-runs.
- [x] At least one failure case is reported honestly in the same
      report as the wins.
- [ ] A named executive sponsor has read this document. _(open)_
- [ ] A named pilot team has committed to a 30-day shadow-mode window.
      _(open)_

## Non-goals

- This is **not** a company-wide rollout plan. It is a pilot-sized
  ask backed by reproducible benchmarks.
- This is **not** a quality-regression claim. Quality evaluation is
  scoped out of the current proof pack and lives in
  [07_quality_evaluation_framework.md](07_quality_evaluation_framework.md)
  as a planned phase-2 workstream.
- This is **not** a claim that KostAI is the only tool that works.
  The methodology — measure first, publish failure cases, grade every
  number — is the real asset and would still be valuable even if a
  different tool were chosen.

---

## `evidence/14_determinism_ledger.md`

# Determinism Ledger

_Generated: 2026-04-18T00:16:38.223Z_
_Baseline model: `claude-opus-4-7`_
_Pricing table version: `e18d909716c36277`_

This is the ledger a skeptical reviewer reads to confirm the benchmarks are actually deterministic. Every task's iterations should produce the same raw_input_tokens, avoidable_input_tokens, and modeled_savings_usd. Any drift is flagged below.

## Per-task determinism

| Task | Iterations | Deterministic? | Raw in | Avoidable | Savings/run |
|---|---|---|---|---|---|
| `cowork-01-long-doc` | 3 | yes | 4887 | 3387 | $0.050805 |
| `cowork-02-repeated-header` | 3 | yes | 12887 | 12887 | $0.193305 |
| `cowork-03-hostile-transcript` | 3 | yes | 5138 | 0 | $0.000000 |
| `dev-01-repo-dedup` | 3 | yes | 47326 | 18835 | $0.282525 |
| `dev-02-conversation-history` | 3 | yes | 3610 | 1625 | $0.024375 |
| `dev-03-build-log` | 3 | yes | 10319 | 7220 | $0.108300 |

## Fixture integrity

| Task | Input packet SHA-256 |
|---|---|
| `cowork-01-long-doc` | `200ad3c8bb7439065ad5aec2094d13bdfbeffff449e8a1efe4a81bfb0a6f0d3c` |
| `cowork-02-repeated-header` | `a1ce4a8a6db1e3169966277496ff7e4a41e0481144bc6ec454d099fb9dcfbdc5` |
| `cowork-03-hostile-transcript` | `0f37484aee858372c2429a63426bf8a451ec4c897ad66167cf315727fc929b1d` |
| `dev-01-repo-dedup` | `4212c3619acf22a514f367a9cd4d5590b7b09ad7e35845ebc6a28419c5e926b3` |
| `dev-02-conversation-history` | `44433375317553b0b2dc35d3bc4305bb5b62e5c303ec6c1eef19060609b9203f` |
| `dev-03-build-log` | `ebbb68be696a19d65e409e85761d2abd14d3beab28785aec0c4f9d883dadefb3` |

If any row above changes between runs, a fixture was edited or the seed was modified.

---

## `evidence/00_evidence_registry.md`

# 00 — Evidence Registry

This is the plain-English inventory of every claim in this evidence pack and
where to verify each one. A skeptical reviewer should be able to use this
document alone to decide whether to spend more time.

## How to read this file

| Column | Meaning |
|--------|---------|
| Claim ID | Stable identifier; appears in `10_deck_claim_bank.yaml` |
| Grade | MEASURED \| MODELED \| ASSUMED \| NEEDS_VERIFICATION |
| Claim | One-line statement |
| Verify by | Exact command or file to check |

## Registry

### Instrumentation claims (MEASURED)

| Claim ID | Grade | Claim | Verify by |
|----------|-------|-------|-----------|
| inst-01 | MEASURED | kostai captures structured telemetry from real LLM calls across 3 providers (Anthropic, OpenAI, manual/Kimi) | `wc -l .ai-cost-data/events.jsonl`; first and last entries span a real production day |
| inst-02 | MEASURED | Each captured event includes deterministic fields: promptHash, blockHashes, costUsd, efficiencyScore, wasteFindings | Inspect any line of `.ai-cost-data/events.jsonl` |
| inst-03 | MEASURED | 276 events captured on 2026-04-16 across 6 models, 4 workflows, 2 apps | `kostai evidence audit-events` |
| inst-04 | MEASURED | The deterministic scoring pipeline (src/core/score.ts) produces non-zero avoidableContextTokens on benchmark inputs | `kostai evidence run --all && kostai evidence report` |

### Benchmark claims (MEASURED — produced by deterministic benchmark harness)

| Claim ID | Grade | Claim | Verify by |
|----------|-------|-------|-----------|
| bench-01 | MEASURED | Packet compression on a codebase dump reduces input token count by X% | `06_master_metrics.csv` row `dev-01-repo-dedup` |
| bench-02 | MEASURED | Replayed-conversation compression removes Y% of history tokens | `06_master_metrics.csv` row `dev-02-conversation-history` |
| bench-03 | MEASURED | Build-log triage retains the error line while compressing Z% | `06_master_metrics.csv` row `dev-03-build-log` |
| bench-04 | MEASURED | Long-form document compression reduces tokens by W% | `06_master_metrics.csv` row `cowork-01-long-doc` |
| bench-05 | MEASURED | Multi-section proposal dedup removes V% of repeated boilerplate | `06_master_metrics.csv` row `cowork-02-repeated-header` |
| bench-06 | MEASURED | Hostile cross-sectional transcript shows minimal compression (honesty case) | `06_master_metrics.csv` row `cowork-03-hostile-transcript` |

_All benchmark claims are filled in by `kostai evidence report` once the
benchmarks have run; values above are placeholders pending first run._

### Cost-projection claims (MODELED)

| Claim ID | Grade | Claim | Verify by |
|----------|-------|-------|-----------|
| model-01 | MODELED | Per-developer monthly savings at typical usage volume | `scenarios.yaml` scenario `scale-02-single-developer-conservative`; re-evaluated by `kostai evidence verify` |
| model-02 | MODELED | 10-engineer 1-month pilot modeled savings | `scenarios.yaml` scenario `scale-03-ten-engineer-pilot` |

### Planning claims (ASSUMED)

| Claim ID | Grade | Claim | Owner / sponsor who would validate |
|----------|-------|-------|-------------------------------------|
| assm-01 | ASSUMED | 80 LLM calls per developer per day | Pilot team lead (TBD) |
| assm-02 | ASSUMED | 3000 avg input tokens per call | Pilot team lead (TBD) |
| assm-03 | ASSUMED | 40% of developer calls exhibit benchmark-like compression patterns | Pilot sponsor (TBD) |
| assm-04 | ASSUMED | Elastic has at least one team willing to run a 1-month shadow-mode pilot | VE leadership (Ben Kim, Johnny Bylan) |

### External claims (NEEDS_VERIFICATION)

| Claim ID | Grade | Claim | Owner |
|----------|-------|-------|-------|
| ver-01 | NEEDS_VERIFICATION | Anthropic / OpenAI / Google published rates match `pricing_tables.yaml` as of 2026-04-16 | John Bradley (before next deck review) |
| ver-02 | NEEDS_VERIFICATION | Elastic's procurement rate card (if any enterprise discount exists) | VE / Procurement |
| ver-03 | NEEDS_VERIFICATION | No legal / security blocker to running shadow mode on a live Elastic team's traffic | Security review |
| ver-04 | NEEDS_VERIFICATION | Executive sponsor identified and committed | Open — see `13_open_questions_and_blockers.md` |
| ver-05 | NEEDS_VERIFICATION | Budget owner for the pilot identified | Open — see `13_open_questions_and_blockers.md` |

## Closing the loop

Any claim above that appears on a leadership slide MUST be cited in the slide
footnote by its Claim ID. `kostai evidence verify` exits non-zero if any
claim referenced in `10_deck_claim_bank.yaml` cannot be resolved.

---

## `docs/research/2026-04-17/README.md`

# Gemini Deep Research — Round 2 (2026-04-17)

Five new research prompts targeting the **remaining fat on the bone** after the
first-round wedges (DVP, tool-compress, semantic cache, output-side heuristics,
shorthand protocol) have landed.

Each prompt is structured for Gemini Deep Research: one clear question, bounded
scope, and explicit output artifacts that Claude or a human can consume
directly. Drop each file into Gemini Deep Research unaltered.

## Topics

| # | File | Thesis being tested |
|---|------|---------------------|
| R06 | [r06-speculative-decoding.md](r06-speculative-decoding.md) | Local draft + frontier accept/reject at the **token** level (not answer level) — can we get DVP's ~84% output cut at 2–3× lower latency? |
| R07 | [r07-kv-cache-offload.md](r07-kv-cache-offload.md) | CPU/NVMe KV-cache offload + prefix reuse — how much frontier context can we survive without paying for it on every turn? |
| R08 | [r08-multi-agent-verifier-swarm.md](r08-multi-agent-verifier-swarm.md) | Replace the single Opus verify step with a **swarm of Haiku verifiers** voting on approve/patch/rewrite — same quality, fraction of the output cost? |
| R09 | [r09-streaming-early-stop.md](r09-streaming-early-stop.md) | Stream severing + programmatic early-stop — detect a correct answer in the first N tokens and abort billing on the rest. |
| R10 | [r10-hierarchical-planner-executor.md](r10-hierarchical-planner-executor.md) | Planner (frontier, rare) + Executor (local, many) split — shift tokens from the expensive model to the free one without losing the plan's quality. |

## Round 3

Need a larger backlog, especially around deck/PPT workflows, multimodal packet
compression, and "good slides fast" build systems? Use the ranked pack:

- [r11-r35-ranked-prompt-pack.md](r11-r35-ranked-prompt-pack.md) — **25 more
  Gemini Deep Research prompts**, ordered by expected token-savings value and
  tilted toward PPT quantization, deck packetization, donor-slide reuse, and
  fast high-quality deck creation.

Paste one `Rxx` section at a time into Gemini Deep Research.

## How to use

1. Paste the prompt file into Gemini Deep Research.
2. Let it run (typically 20–30 min per prompt).
3. Save Gemini's output as `<prompt-id>-response.md` in this directory.
4. Promote actionable findings into the `evidence/20_throughput_cost_ledger.md`
   ledger, and file a tracking issue in `docs/13_open_questions_and_blockers.md`
   if the finding needs sponsor sign-off.

## Prior-art harvest (complement to Gemini Deep Research)

When Gemini cites a GitHub repo in a response, feed the actual source into
Claude to extract techniques:

```bash
python3 scripts/rendergit/flatten_research.py            # batch
# then, for whichever repo you want to dig into:
cat docs/research/2026-04-17/flat/<owner>__<repo>/repo.cxml | pbcopy
```

Seed list + workflow in [`scripts/rendergit/README.md`](../../../scripts/rendergit/README.md).

## Criteria for a "high-value" research output

- A **named, citable technique** (arXiv ID, GitHub repo, benchmark paper)
- At least one **measured number** (percent saved, percent latency delta,
  quality retention on a named benchmark)
- An **integration sketch** that maps to this repo's existing modules
  (preprocess, draft-verify, tool-compress, semantic-cache, router, shadow)
- A **failure mode** section — what breaks the technique, what it silently
  degrades under

---

## `docs/research/2026-04-17/r06-speculative-decoding.md`

# R06 — Token-level speculative decoding for LLM APIs

## Context

We've shipped a **draft-verify-patch** mode where a local 7B model writes a full
draft and the frontier emits approve/patch/rewrite in a single call. Modeled
output-cost cut: ~84% weighted, ~95% on the approve path (shorthand `A` = 1 token).

DVP operates at the **answer level**: one drafter call + one verifier call.
Speculative decoding operates at the **token level**: the draft model proposes
the next N tokens, the frontier accepts or rejects per-token, and only the
rejected tokens are resampled. This is the technique that makes vLLM and llama.cpp
run 2–3× faster in local-only settings.

**Question:** Can this same mechanism be used across the **API boundary** — where
the draft runs on our Mac Mini's Ollama, and the frontier runs in Anthropic's
datacenter — to get DVP-style cost wins at 2–3× lower end-to-end latency?

## What to research

1. **State of the art (2024–2026)** on speculative decoding across a provider
   boundary. Named techniques: **REST** (retrieval-based), **EAGLE-2**,
   **Medusa**, **Lookahead Decoding**, **SpecInfer**.
2. **Which frontier APIs expose the hooks we'd need** — specifically: can
   Anthropic or OpenAI accept a partial-completion proposal and return a
   token-level accept/reject mask? Or must we approximate with **n-gram
   rejection sampling** on the draft model's output distribution locally?
3. **Published throughput and cost numbers** — what percentage of drafter
   tokens does a frontier accept in practice on code vs. prose vs. QA
   workloads?
4. **Latency budget** — at 200 ms Mac-Mini-to-frontier RTT over Tailscale,
   what's the largest N (tokens per speculation round) where this is still
   a net throughput win?
5. **Quality regression** — published benchmarks on whether speculative
   decoding preserves MT-bench / HumanEval / MMLU scores vs. the frontier's
   full decode.
6. **Integration sketch** — where in this repo would a
   `src/core/speculative-decode.ts` module live? What's the cleanest API
   shape that composes with existing `draftLocally` / `verifyFn`?
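
A toy latency model for question 4 — every input is a placeholder, and the token-level accept/reject hook is assumed to exist (that's question 2). One useful property falls out immediately: effective throughput is capped at `acceptRate × draftTokS`, so a 30 tok/s drafter can never beat a frontier that streams faster than that, regardless of N:

```typescript
// Back-of-envelope throughput of one speculation round vs. plain
// frontier streaming. All inputs are placeholders to vary.
function speculativeTokS(
  n: number,          // tokens proposed per round
  draftTokS: number,  // local drafter speed (~30 tok/s per constraints)
  rttSec: number,     // round trip to the frontier (0.06-0.2 s)
  acceptRate: number, // fraction of drafted tokens the frontier accepts
): number {
  const roundSec = n / draftTokS + rttSec; // draft locally, then one RTT
  return (n * acceptRate) / roundSec;      // accepted tokens per second
}

// Smallest N that beats plain frontier streaming, if any N ≤ 512 does.
function breakEvenN(
  frontierTokS: number,
  rttSec: number,
  acceptRate: number,
): number | null {
  for (let n = 1; n <= 512; n++) {
    if (speculativeTokS(n, 30, rttSec, acceptRate) > frontierTokS) return n;
  }
  return null; // asymptote acceptRate × 30 tok/s is below the frontier's rate
}
```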

## Output artifacts we need

- A one-paragraph **go / no-go recommendation** based on whether current
  frontier APIs actually expose the hooks.
- A **comparison table**: DVP (answer-level) vs. speculative (token-level) vs.
  baseline — cost, latency, quality, engineering cost.
- If the frontier APIs don't expose the hooks, a **fallback design** using
  Anthropic's Batch API + streaming + local rollback.
- Named **arXiv IDs and GitHub repos** for every technique cited.

## Out of scope

- GPU-kernel-level optimizations (we don't run our own inference server).
- Model-training approaches (we can't train Claude or GPT).
- Speculative decoding **within** the local Ollama server (already fast enough).

## Known constraints

- We cannot modify the frontier model or its inference server.
- We run Ollama on an M1/M2-class Mac Mini (~30 tok/s on qwen2.5-coder:7b).
- Tailscale MagicDNS round-trip macmini ↔ macbook is ~5 ms; macmini ↔ Anthropic
  is ~60–200 ms depending on region.
- Our target workload is **agentic coding** (Claude Code / Cursor style), not
  chat.

---

## `docs/research/2026-04-17/r07-kv-cache-offload.md`

# R07 — Context persistence via KV-cache offload and prefix reuse

## Context

On every turn of a long agent session, the model re-ingests the entire prior
context: system prompt, tool definitions, prior messages, tool results. At
100 K tokens of context on Opus 4.7 input ($15/M), that's **$1.50 of input cost
per turn** — paid again on every turn — and at 200 K context it's **$3 per turn**
regardless of what the user actually changed.

Anthropic's **prompt caching** (`cache_control` blocks) gives a **90% discount**
on cached prefix tokens. This is already a 10× input-cost win when used
correctly. But the cache lifetime is 5 minutes, the minimum cache block is 1 K
tokens, and not every prefix pattern is cache-compatible.
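
The per-turn arithmetic above, with and without the 90% cache-read discount, looks like this. A sketch of the shape of the economics only — the cache-write surcharge and TTL churn are deliberately omitted:

```typescript
// Per-turn input cost of a long session, uncached vs. cache-read pricing.
const INPUT_USD_PER_M = 15; // Opus 4.7 input

function perTurnInputUsd(
  prefixTokens: number, // stable prefix: system prompt, tools, history
  freshTokens: number,  // what actually changed this turn
  cached: boolean,      // prefix served at the 90%-discounted read rate?
): number {
  const prefixRate = cached ? INPUT_USD_PER_M * 0.1 : INPUT_USD_PER_M;
  return (prefixTokens * prefixRate + freshTokens * INPUT_USD_PER_M) / 1_000_000;
}

console.log(perTurnInputUsd(100_000, 2_000, false).toFixed(2)); // 1.53 per turn
console.log(perTurnInputUsd(100_000, 2_000, true).toFixed(2));  // 0.18 per turn
```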

**Question:** Beyond Anthropic's native prompt caching, what additional
context-persistence techniques (KV-cache offload, session-scoped caching,
hybrid frontier+local cache) can we use to cut input-token cost on long
agent sessions by another 2–10×?

## What to research

1. **Anthropic prompt caching — deep operational usage.** What's the actual
   achieved cache hit rate in published Anthropic workloads? What prefix
   patterns evict it? How does the 5-minute TTL interact with agent sessions
   that pause longer? Are there API techniques to **extend the TTL** (the
   "1-hour cache" beta)?
2. **OpenAI prompt caching** — released 2024-10, auto-applies to repeated
   prefixes ≥ 1024 tokens, 50% discount. Different from Anthropic's
   opt-in model. What's the production hit rate?
3. **Google Vertex context caching** — explicit API, named caches, minimum
   32 K tokens, per-minute storage cost. Economics: when does it pay off?
4. **Client-side KV-cache offload** — for local models (Ollama), can we
   pickle the KV cache after the system prompt + tools are ingested, and
   re-hydrate it on every turn? vLLM's `PagedAttention` and automatic
   prefix-caching features are the reference points. Does Ollama expose
   anything similar?
5. **Hybrid frontier+local cache with a "cache aside"** — the pattern
   we've already prototyped for semantic caching: look up in a local cache
   first, only hit the frontier on miss. For deterministic, session-scoped
   prefixes (tool catalogs, personas, RAG contexts), can a **content-addressed
   frontier cache** be built on top of Anthropic's native caching?
6. **Context compression techniques** (ICAE, LLMLingua-2, xRAG) —
   train-free ways to represent a 10 K-token prefix in 1 K semantic tokens
   that the frontier can still attend to.

## Output artifacts we need

- A **decision matrix**: for a given session shape (prefix size, turn rate,
  prefix stability, TTL tolerance) which caching strategy wins?
- A **cost table**: baseline / Anthropic native caching / full stack —
  per-turn and per-session cost for a typical 20-turn Claude Code session.
- **A concrete code sketch** for a `src/core/prefix-cache.ts` module that:
  (a) detects cacheable prefixes in our normalized call,
  (b) injects `cache_control` breakpoints for Anthropic,
  (c) falls through to a local content-addressed cache for other providers.
- Named arXiv IDs and blog posts for every technique.

## Out of scope

- Embedding-based semantic cache (already shipped; covered in R06 / existing
  `semantic-cache.ts`).
- Full RAG re-implementation.

## Known constraints

- Our wrapper sits **outside the provider SDK**, so we can modify the messages
  going in but cannot reach inside the model server.
- We need to preserve wire-level compatibility: any `cache_control` injection
  must be lossless if the provider doesn't support it.
- Agent sessions can last hours — cache TTL is a design variable, not a given.

---

## `docs/research/2026-04-17/r08-multi-agent-verifier-swarm.md`

# R08 — Swarm verification: many cheap verifiers vs. one expensive one

## Context

Our draft-verify-patch pipeline currently uses **one frontier call** to verify
each draft. The verifier is the most expensive model we have (Opus 4.7,
$75/M output). On the approve path we've collapsed the verifier's output to
1 token via shorthand, which is already a ~95% cut vs. a JSON wrapper.

But the verifier's **input** is still paid on every call — the full conversation
plus the draft. For a 5 K-token prefix, a single Opus verify call costs
~$0.075 input + $0.000075 output = **$0.075 total, essentially all input**.

**Question:** Can we replace the single expensive verifier with a **swarm of
cheap verifiers** (Haiku 4.5 @ $1/M input, or a self-hosted Llama-3.1-70B,
or 5 × Gemini-Flash) that independently emit A/P/R, and take the **majority
vote** or **confidence-weighted merge**?

The theoretical case: at $1/M input, 5 Haiku verifiers cost
5 × $0.005 = **$0.025** on the same 5 K prefix — **67% cheaper than one Opus**.
If quality matches, this is a big win; if not, we lose it.
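
A minimal sketch of both halves — the input-cost comparison quoted above, and the simplest aggregation rule (majority with a margin, escalating on a split). The escalation rule is one candidate answer to the voting-scheme question below, not a shipped decision:

```typescript
type Verdict = "A" | "P" | "R";

// n verifiers each re-read the same prefix: input cost scales linearly.
function swarmInputUsd(n: number, prefixTokens: number, ratePerM: number): number {
  return n * prefixTokens * (ratePerM / 1_000_000);
}

// Majority vote with a 2-vote margin; anything weaker escalates to Opus.
function aggregate(votes: Verdict[]): Verdict | "ESCALATE" {
  const counts = new Map<Verdict, number>();
  for (const v of votes) counts.set(v, (counts.get(v) ?? 0) + 1);
  const [top, second] = [...counts.values()].sort((a, b) => b - a);
  return top > votes.length / 2 && top - (second ?? 0) >= 2
    ? [...counts.entries()].find(([, c]) => c === top)![0]
    : "ESCALATE";
}

console.log(swarmInputUsd(5, 5_000, 1));  // 0.025 — five Haiku verifiers
console.log(swarmInputUsd(1, 5_000, 15)); // 0.075 — one Opus verifier
```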

## What to research

1. **Published results on ensemble / voting verifiers.** Named techniques:
   **Self-Consistency (Wang 2022)**, **Chain-of-Verification (CoVe)**,
   **Debate-LLM / Multi-Agent-Debate (MAD)**, **Mixture of Agents (MoA,
   Wang 2024)**, **Universal Self-Consistency (Chen 2023)**.
2. **The measured quality gap** between 1 strong verifier and N weak
   verifiers — what's the break-even N, and does it depend on task type
   (code review, factual QA, summarization, creative writing)?
3. **Voting schemes** — majority, confidence-weighted, veto-by-any-dissent,
   token-level intersection. Which scheme gives the highest quality-per-dollar?
4. **Latency considerations** — 5 parallel Haiku calls finish in roughly
   max(individual latency) ≈ 1.5× single-call latency, not 5×. At what
   swarm size does the latency curve flatten?
5. **Provider diversity as a quality signal** — does mixing 2 Haiku + 2
   Gemini-Flash + 1 local qwen give better dissent-detection than 5 Haiku?
6. **Integration sketch** — a `verifyFn` variant that issues N parallel
   verify calls, aggregates the A/P/R responses, and escalates to Opus
   only on **split votes**.

## Output artifacts we need

- A **cost curve**: quality-at-parity-Opus vs. swarm size, 1 → 10.
- A **decision rule** for when to escalate to Opus ("escalate if ≥ 2 votes
  disagree" vs. "escalate if the top answer's confidence < 0.7" etc.).
- A **code sketch** for `src/core/verify-swarm.ts` that implements the
  aggregator and plugs into the existing DVP module via the `verifyFn`
  seam.
- A **failure-mode section**: what breaks swarm verification? (Systematic
  biases across providers, rare-token tokenization mismatches, latency
  spike propagation.)

## Out of scope

- Any training of a custom verifier model.
- Swarm **drafting** (too many independent drafts are usually lower quality
  than one coherent one; this research is only about verification).

## Known constraints

- Budget target: swarm verify should cost **< 40% of a single Opus verify**
  on the same prompt to be worth the engineering cost.
- Quality target: swarm verify should match Opus within **2% on MT-bench
  code tasks** or the research concludes "not worth it".
- Latency budget: swarm verify can take up to **2× the single-Opus latency**.

---

## `docs/research/2026-04-17/r09-streaming-early-stop.md`

# R09 — Stream severing and programmatic early-stop

## Context

Frontier APIs bill on **output tokens actually emitted**, not on a pre-declared
cap. When we set `max_tokens: 1024` and the model emits only 120, we pay for 120.
This is already aligned with our interest.

But the inverse problem is live: when a model has **emitted enough to be
correct** at token 90 of a 500-token continuation, we still pay for the rest.
Streaming APIs let the client **cancel mid-response**, stopping the server-side
generation and capping the bill. Almost nobody uses this.

**Concrete example.** A verifier emits `A` on the approve path. With our
shorthand protocol, that's 1 token — no problem. But on the **rewrite path**,
a verifier emits ~500 tokens. If we can detect at token ~80 that the answer
is on track ("looks like it's producing the right structure, the stop-gram is
correct, semantic-parser agrees"), we can **sever the stream** and save the
remaining ~420 tokens at $75/M = **$0.0315 per severed call**.

**Question:** What are the published techniques for **programmatic early-stop**
on streaming LLM APIs — stop-grams, confidence thresholds, structural-parser
acceptance, redundancy detection — and what's the measured cost saving on
real agent workloads?

## What to research

1. **Stop-sequence tuning.** Anthropic, OpenAI, and Vertex all accept
   `stop_sequences`. The naive use is to stop on `</answer>` or `"}`. The
   aggressive use is to stop on **semantic tokens that indicate the model is
   about to pad** — "Additionally", "Furthermore", "I hope this helps", "In
   summary". What's the published savings on padding-detection stops?
2. **Client-side streaming severance.** Every SDK lets you call `.close()` on
   the stream. Is there a **billing guarantee** that server-side generation
   actually halts, or does the server keep generating and only stop sending?
   Anthropic's docs vs. OpenAI's docs vs. Google's — which one **guarantees
   you stop paying the moment you close**?
3. **Structural early-stop via partial parsing.** For structured outputs
   (JSON, XML, shorthand), the client can feed each streamed token into an
   **incremental parser** and detect "the answer is structurally complete"
   before the model emits its natural stop. What's the acceptance rate on
   real traffic? (We already do this implicitly with `parseShorthand` — an
   `A` is structurally complete at token 1.)
4. **Confidence-based early-stop.** At each streamed token, some APIs expose
   **log-probabilities of the top-K alternatives**. If the distribution at
   token N is very peaked on a known-good continuation, the client can stop
   and assume the rest. Published papers: **CALM (Confident Adaptive
   Language Modeling, Schuster 2022)**, **Early-Exit Transformers**.
5. **Redundancy detection.** A model that has already answered the question
   often starts **repeating itself** in softer wording. N-gram redundancy
   detection on the streamed tail ("the model has already said the same
   thing in 3 different ways") triggers a cut. Related: the **repetition-penalty**
   family, though those are decoding-time sampling controls, not client-side stops.
6. **Integration sketch** — where in this repo does early-stop live? A
   natural home is a wrapper around `streamingComplete()` that takes a
   **stop-predicate** (function of the streamed tokens so far) and calls
   `.close()` the moment the predicate fires. Compose with DVP so the
   verifier stream stops the instant `A\n` or the patch closes.
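The item-6 wrapper can be sketched as a predicate-driven consumer of the token stream. A hedged TypeScript illustration: the `Severable` shape, the predicate names, and `collectUntil` are all assumptions, and whether `close()` actually halts server-side billing is exactly the provider question item 2 asks:

```typescript
// Hypothetical R09 wrapper: accumulate streamed tokens, run a cheap
// stop-predicate on each chunk, and sever the stream the moment it fires.
type StopPredicate = (soFar: string) => boolean;

// Illustrative predicates from the list above, not repo APIs.
const onApproveShorthand: StopPredicate = (s) => s === "A" || s.startsWith("A\n");
const onPaddingDetected: StopPredicate = (s) =>
  /\b(Additionally|Furthermore|I hope this helps|In summary)\b/.test(s);

interface Severable {
  close(): void; // assumed to halt server-side generation (provider-dependent!)
}

async function collectUntil(
  stream: AsyncIterable<string> & Severable,
  predicate: StopPredicate
): Promise<{ text: string; severed: boolean }> {
  let text = "";
  for await (const chunk of stream) {
    text += chunk;
    if (predicate(text)) {
      stream.close(); // real wrapper would emit `stream_severed` here
      return { text, severed: true };
    }
  }
  return { text, severed: false };
}
```

Each predicate is a pure O(1)-per-chunk function of the text so far, which keeps the constraint that predicates must be deterministic and cheap.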

## Output artifacts we need

- A **provider matrix**: which providers **guarantee billing stops** when the
  client severs the stream, vs. which ones keep generating server-side.
- A **stop-predicate library sketch**: `onApproveShorthand`, `onStructurallyComplete`,
  `onPaddingDetected`, `onConfidenceAbove(0.98)`, `onRedundancy`. Each a pure
  function of the token stream so far.
- A **cost table**: baseline streamed call vs. each predicate's measured
  savings on our own trace (`.ai-cost-data/events.jsonl`).
- A **failure-mode section**: early-stop that truncates a correct answer,
  confidence spikes on a wrong answer, stop-grams that eat valid content.

## Out of scope

- Re-streaming partial output to recover truncated content (a separate
  retry-policy problem).
- GPU-side early-exit during local decode (we don't run our own inference).

## Known constraints

- Must work on both **streaming and non-streaming** provider calls — non-streaming
  falls back to `max_tokens` tuning, which is already in place.
- Early-stop must be **loss-preserving on the approve path** — we can't cut a
  call and then re-ask because we lost necessary tokens.
- Stop predicates must be **deterministic and cheap** — they run on every
  streamed chunk, so O(1) per token or it's a throughput regression.
- Observability: the wrapper must emit a `stream_severed` event with the
  predicate name and the savings so the evidence ledger can pick it up.

---

## `docs/research/2026-04-17/r10-hierarchical-planner-executor.md`

# R10 — Hierarchical planner (frontier, rare) + executor (local, many)

## Context

Agentic coding workflows currently treat every turn as equally valuable and
send them all to the frontier. In practice, the work decomposes:

- **1–2 turns per session are architectural** — "figure out the shape of the
  fix", "which files are involved", "what's the API surface". These want
  Opus-level reasoning.
- **10–50 turns per session are mechanical** — "apply this rename", "add a
  test for this function", "move this import", "rerun the linter and fix the
  three warnings it prints". These want a fast, cheap, competent executor.

Today, both get Opus input+output tokens. The architectural turns earn the
price; the mechanical ones don't.

**Question:** Can we formalize a **hierarchical agent topology** — a *planner*
that is called **rarely** on the frontier to produce a structured plan, and
an *executor* that runs **many times** on local models to execute each plan
step — in a way that preserves plan quality while driving executor-turn cost
to near-zero?

This is the classic **HRL (Hierarchical Reinforcement Learning)** pattern
ported to LLM agents. The thesis: the planner's token count grows with task
**novelty**, while the executor's grows with task **length**, and those two
factors scale independently.

## What to research

1. **Published hierarchical-agent architectures (2024–2026).** Named systems:
   **Plan-and-Solve (Wang 2023)**, **ReAct + Reflexion hybrid**, **ADaPT
   (Prasad 2024)**, **HuggingGPT-style orchestrator-executor**, **Microsoft
   AutoGen group-chat with a "proxy" + "worker" split**, **LangGraph's
   supervisor-worker pattern**, **CrewAI's crew+task architecture**.
2. **The measured quality gap** between a one-shot frontier run and a
   planner-then-executor split where the executor is a 7B local model.
   At what task complexity does the split start to lose quality? (Intuition:
   short tasks lose a little; long tasks win a lot because the planner's
   structure protects the executor from drift.)
3. **The plan representation.** Is the plan a **list of tool calls**, a
   **dataflow graph**, a **natural-language script**, or **typed state
   transitions**? Which representation gives the executor enough structure
   to not need to re-plan but enough freedom to recover from small errors?
4. **Replan triggers.** When does an executor need to call back up to the
   planner? Candidates: unexpected tool-call error, budget overrun,
   confidence drop on a verification step, detection of a scope change
   in the user's follow-up message.
5. **Integration with DVP and semantic cache.** DVP is a **per-turn**
   optimization. Hierarchical is a **per-session** optimization. They
   compose — each executor turn can itself use DVP internally. Semantic
   cache lives at the executor layer (lots of similar small tasks) or at
   the planner layer (rarely — plans are usually novel).
6. **Economic model.** Per-session cost =
   `1 × planner_cost + N × executor_cost + M × replan_cost`
   — derive the break-even N as a function of planner model, executor model,
   and re-plan rate. At what executor/planner ratio does the architecture
   pay for its engineering cost?
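The item-6 break-even is easy to make concrete. A toy TypeScript model with assumed per-turn prices (plug in real ledger numbers before trusting it; `sessionCost` and `breakEvenTurns` are illustrative names):

```typescript
// Toy R10 economics: per-session cost of the hierarchical split vs. the
// all-frontier baseline. All prices are assumed $ per call/turn.
interface SessionCosts {
  plannerPerCall: number;   // one frontier planning call
  executorPerTurn: number;  // one local/cheap executor turn
  replanPerCall: number;    // one frontier replan call
  frontierPerTurn: number;  // baseline: every turn on the frontier
}

function sessionCost(c: SessionCosts, n: number, replanRate: number): number {
  const m = Math.ceil(n * replanRate); // expected replan calls for n turns
  return c.plannerPerCall + n * c.executorPerTurn + m * c.replanPerCall;
}

// Smallest N at which the split undercuts the all-frontier baseline.
function breakEvenTurns(c: SessionCosts, replanRate: number, maxN = 1000): number | null {
  for (let n = 1; n <= maxN; n++) {
    if (sessionCost(c, n, replanRate) < n * c.frontierPerTurn) return n;
  }
  return null;
}
```

With a $0.30 planner call, $0.01 executor turns, $0.30 replans at a 10% replan rate, and a $0.15 frontier baseline per turn, the split pays off from the fifth executor turn onward; the model makes the sensitivity to replan rate trivially explorable.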

## Output artifacts we need

- A **task-complexity axis**: where on the "one simple edit" → "multi-file
  refactor" → "cross-package overhaul" spectrum does hierarchical win,
  tie, or lose?
- A **cost model**: per-session cost as a function of (planner model,
  executor model, N executor turns, M replan rate).
- A **concrete code sketch** for `src/core/planner-executor.ts`:
  - `planOnce(task) → Plan` (one frontier call, structured output)
  - `executeStep(plan, i) → StepResult` (one local call + optional DVP verify)
  - `shouldReplan(stepResult) → boolean` (predicate over step outcome)
  - bus events so the ledger can separate planner-cost from executor-cost
- A **comparison table** vs. our current architecture: `draft-verify-patch`
  alone vs. `planner-executor + DVP inside each executor turn`.
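The sketch bullets above reduce to a small type-level seam. Shapes and the stub planner here are assumptions for illustration, not the module's settled contract:

```typescript
// Type-level sketch of the src/core/planner-executor.ts seam. The Plan and
// StepResult shapes are hypothetical; the real module defines its own.
interface PlanStep { id: number; instruction: string; files: string[] }
interface Plan { task: string; steps: PlanStep[] }
interface StepResult { stepId: number; ok: boolean; confidence: number; output: string }

// One rare frontier call — stubbed as a pure function for illustration.
function planOnce(task: string): Plan {
  return { task, steps: [{ id: 0, instruction: `scope: ${task}`, files: [] }] };
}

// Each executor turn gets the full plan + step index (stateless per-step).
type ExecuteStep = (plan: Plan, i: number) => Promise<StepResult>;

// Replan triggers from item 4: tool error or confidence drop.
function shouldReplan(r: StepResult, minConfidence = 0.6): boolean {
  return !r.ok || r.confidence < minConfidence;
}
```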

## Out of scope

- Training a specialist planner or executor model.
- Multi-agent debate among planners (that's R08's neighborhood).
- Long-running agents with tool use beyond what we already support.

## Known constraints

- Must run on **our existing Ollama + frontier split** — no new infra.
- Executor must be **stateless per-step**: each step gets the full plan +
  its step index as input, so an executor failure doesn't require
  re-planning from scratch.
- Quality gate: on the `dev-01..dev-03` and `cowork-01..cowork-03`
  benchmarks, hierarchical must match within **5% of the single-frontier
  baseline** or we don't ship it.
- Observability gate: per-session breakdown of `planner_tokens`,
  `executor_tokens`, `replan_count`, `plan_steps_completed` must flow
  through to `evidence/20_throughput_cost_ledger.md` automatically.

---

## `docs/research/2026-04-17/r11-r35-ranked-prompt-pack.md`

# Gemini Deep Research — Round 3 (Ranked Prompt Pack)

This pack adds **25 more prompts** beyond R06-R10, ranked from most to least
likely to reduce go-forward token expense in this repo's future workflows.

The ranking is intentionally biased toward two things:

1. **PPT/deck quantization** — turning raw `.pptx` / PDF / screenshot-heavy
   artifacts into compact, reusable, evidence-aware packets before a frontier
   model ever sees them.
2. **Really good decks, fast** — building high-quality slides through donor
   retrieval, constrained composition, local extraction, and cheap-first QA so
   we stop paying frontier models to repeatedly rediscover deck structure.

Paste one `Rxx` section at a time into Gemini Deep Research.

## Ranked list

| Rank | ID | Thesis being tested |
|---|---|---|
| 1 | R11 | Build a **deck packetizer** that converts PPTX/PDF into a compact slide IR before any premium-model call. |
| 2 | R12 | Stop re-sending whole decks: use **slide delta encoding** and revision packets for iterative deck work. |
| 3 | R13 | Build decks from **donor-slide retrieval + clone synthesis** instead of blank-page prompting. |
| 4 | R14 | Compile brief -> storyboard -> deck through a **constrained slide grammar** rather than open-ended generation. |
| 5 | R15 | Use **layout-aware extraction** to separate text, shapes, charts, and speaker notes locally. |
| 6 | R16 | Replace screenshot reading with **chart/table structural extraction**. |
| 7 | R17 | Cache brand and design context via a **style-token dictionary** instead of repeating deck instructions. |
| 8 | R18 | Route only **cropped slide regions** to premium models instead of full-slide screenshots. |
| 9 | R19 | Use a **cheap slide-critic loop** for quality gating before frontier escalation. |
| 10 | R20 | Run **programmatic deck QA** so we stop paying for visual review on every revision. |
| 11 | R21 | Generate presenter notes and talk tracks from structured slide IR, not from full deck re-reads. |
| 12 | R22 | Convert transcript/evidence -> brief -> deck in a two-step pipeline that minimizes frontier tokens. |
| 13 | R23 | Move deck production into an **overnight deck factory** with local first-pass workers. |
| 14 | R24 | Create one normalized **artifact IR** for PPTX/PDF/DOCX so packetizers are reusable. |
| 15 | R25 | Make deck artifacts **content-addressed and cacheable** across sessions and projects. |
| 16 | R26 | Distill long decks with **map-reduce slide summarization** instead of whole-deck prompts. |
| 17 | R27 | Cluster slide families into canonical exemplars to reduce "style rediscovery" tokens. |
| 18 | R28 | Linearize SmartArt/diagrams into graph form before asking a premium model to explain them. |
| 19 | R29 | Extract claim tables and evidence links directly from slides to reduce narrative hallucination loops. |
| 20 | R30 | Use visual similarity search across exemplar decks for "good slides fast." |
| 21 | R31 | Patch edited decks incrementally instead of regenerating changed slides from scratch. |
| 22 | R32 | Use local VLM passes for raster-heavy slides before any frontier escalation. |
| 23 | R33 | Plan narrative first, then materialize visuals later, to avoid expensive blank-canvas deck generation. |
| 24 | R34 | Build a multimodal benchmark harness so deck quantization wins are measurable, not vibe-based. |
| 25 | R35 | Find the cost/quality frontier for "really good decks fast" across local, cheap, and premium stacks. |

## R11 — PPTX/PDF packetizer: compact slide IR before any premium call

**Why this matters:** A raw deck is mostly waste: repeated masters, duplicated
logos, decorative geometry, and long slides where only 1-2 claims matter. If we
can convert decks into a compact, reusable slide IR, we can treat deck work the
same way we already treat repo compression.

**Question:** What is the best 2024-2026 architecture for a local-first
`pptx -> slide IR -> minimal packet` pipeline that preserves layout, claims,
notes, chart meaning, and slide relationships while reducing frontier input by
10x-100x on deck tasks?

**What to research**

1. Named techniques and tools for PPTX parsing, PDF layout extraction, OCR, and
   multimodal document IRs.
2. Best practices for representing each slide as:
   `slide_id`, purpose, title, bullet tree, claims, visuals, notes, sources,
   layout slots, and change hash.
3. Published or production-style measurements on token reduction and answer
   quality when using slide summaries / structural packets vs. raw slides.
4. Integration sketch for `src/scan/pptx.ts`, `src/core/slide-ir.ts`, and
   `src/core/deck-packet.ts`.
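One possible TypeScript shape for the slide IR fields named in item 2; the nesting and the `irChars` size helper are assumptions, not a settled schema:

```typescript
// Candidate slide IR per item 2. Field names follow the list above;
// the nested shapes are illustrative assumptions.
interface BulletNode { text: string; children: BulletNode[] }

interface SlideIR {
  slide_id: string;
  purpose: string;                       // e.g. "problem", "proof", "ask"
  title: string;
  bullets: BulletNode[];
  claims: { text: string; source?: string }[];
  visuals: { kind: "chart" | "table" | "image" | "diagram"; ref: string }[];
  notes: string;
  sources: string[];
  layoutSlots: string[];
  changeHash: string;                    // content hash for revision deltas
}

// Rough packet-size measure for the raw-vs-IR cost table (illustrative only).
function irChars(slide: SlideIR): number {
  return JSON.stringify(slide).length;
}
```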

**Output artifacts we need**

- A recommended IR schema for a slide and a deck.
- A cost table: raw deck vs. slide IR vs. task-specific packet.
- A go/no-go note on whether this should become the repo's next major wedge.

**Known constraints**

- Must work even when the source is a messy customer deck or exported PDF.
- Must preserve enough fidelity for executive-slide work, not just summaries.
- Must degrade gracefully when charts or images cannot be perfectly parsed.

## R12 — Deck delta encoding: stop re-sending whole decks on every revision

**Why this matters:** Deck work is revision-heavy. Most turns change 1-3 slides,
yet the premium model is often shown the entire deck again.

**Question:** What is the best way to compute and packetize slide-level deltas so
that "revise this deck" workflows only send changed slides plus a tiny deck map,
instead of paying to replay the entire presentation each turn?

**What to research**

1. Techniques for PPTX structural diffing, slide hashing, and asset-level change
   detection.
2. Whether slide-order changes, theme changes, or master-slide changes require
   special diff logic.
3. Best packet formats for revision tasks: changed slide IR + neighboring slide
   summaries + deck objective + style anchors.
4. Integration sketch for `src/core/deck-diff.ts` and `src/core/deck-revision-packet.ts`.
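A minimal sketch of the hashing half of item 1, assuming each slide's normalized content has already been extracted to a string; `slideHash` and `changedSlides` are illustrative names:

```typescript
// Slide-level delta detection via content hashing. Real PPTX diffing would
// hash normalized per-slide XML; here we hash an opaque content string.
import { createHash } from "node:crypto";

function slideHash(content: string): string {
  return createHash("sha256").update(content).digest("hex").slice(0, 16);
}

// Ids of slides that are new or modified between two revisions.
function changedSlides(
  before: Map<string, string>, // slide_id -> content hash
  after: Map<string, string>
): string[] {
  const out: string[] = [];
  for (const [id, h] of after) {
    if (before.get(id) !== h) out.push(id);
  }
  return out;
}
```

The revision packet then carries only `changedSlides(...)` plus the tiny deck map, which is the core of the delta-encoding thesis.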

**Output artifacts we need**

- A diff algorithm recommendation with failure modes.
- A revision-packet schema with example before/after payload sizes.
- Rules for when to escalate from delta packet back to full-deck packet.

**Known constraints**

- Must survive copy/paste slide duplication and slide renumbering.
- Should be content-addressed so repeated edits hit cache.
- Must be simple enough to explain in a future deck/product story.

## R13 — Donor-slide retrieval and clone synthesis for "good slides fast"

**Why this matters:** Blank-page prompting is expensive and low quality. The
fastest path to a good slide is often "find the nearest proven donor slide and
adapt it."

**Question:** What is the best retrieval-and-clone architecture for building
high-quality decks from donor slides, with minimal frontier tokens and minimal
visual drift?

**What to research**

1. Best-of-breed systems for slide retrieval, slide embedding, and layout-aware
   nearest-neighbor search.
2. How teams clone slides safely across themes, fonts, and masters.
3. Whether donor selection should use text embeddings, visual embeddings, or a
   hybrid score.
4. Integration sketch for a local donor library plus `src/core/slide-retrieval.ts`.

**Output artifacts we need**

- A recommended donor-slide retrieval stack.
- A clone-and-fill workflow for title, evidence, visual, and talk-track slots.
- Measured expectations: speed gain, token savings, and quality gain vs. blank generation.

**Known constraints**

- Must preserve brand quality, not create malformed cloned slides.
- Needs a clear fallback when there is no good donor.
- Should support both Elastic-style decks and arbitrary customer/source decks.

## R14 — Brief-to-deck compiler: constrained slide grammar instead of open generation

**Why this matters:** "Make me a deck" is token-expensive because the model is
forced to invent narrative, slide types, layout, and wording all at once.

**Question:** Can we reduce cost and improve quality by compiling a short brief
into a storyboard using a fixed slide grammar, then materializing only the
approved slide slots?

**What to research**

1. Structured presentation grammars: title, context, problem, proof, ask,
   comparison, and roadmap slide families.
2. How modern deck-generation systems separate planning from rendering.
3. Whether a slide DSL or JSON schema materially reduces iteration tokens.
4. Integration sketch for `src/core/deck-plan.ts`, `src/core/slide-grammar.ts`,
   and a future `src/core/deck-builder.ts`.

**Output artifacts we need**

- A recommended slide grammar for strategy / VE / exec-review decks.
- A planner/executor split for deck building, with cost expectations.
- A failure-mode section on when constrained grammars make decks worse.

**Known constraints**

- Must still allow surprisingly good decks, not just formulaic ones.
- Grammar should be small enough to learn and debug.
- Should connect cleanly to evidence-grounded slides.

## R15 — Layout-aware extraction for decks: text, shapes, charts, notes, and masters

**Why this matters:** Most deck prompts flatten everything into OCR text, which
throws away the slide's actual structure and causes repeated "what is this
slide really doing?" reasoning.

**Question:** What extraction stack best preserves deck semantics locally:
textbox hierarchy, z-order, shape types, notes, charts, tables, masters, and
asset references?

**What to research**

1. PPTX and PDF parsers that preserve geometry and object identity well.
2. Techniques for reconstructing bullet trees, callout relationships, and slide
   regions from layout primitives.
3. Whether notes pages should be treated as first-class context for slide meaning.
4. Integration sketch for extraction modules feeding `slide-ir.ts`.

**Output artifacts we need**

- A field-by-field extraction spec for local deck parsing.
- A confidence model for partially parsed objects.
- A recommendation on what to store vs. compute lazily.

**Known constraints**

- Some decks are image-heavy or flattened; OCR fallback is unavoidable.
- We need an IR that can support both analysis and generation.
- Parsing should be cheap enough to run on every ingest.

## R16 — Chart and table structural extraction instead of screenshot interpretation

**Why this matters:** Charts and tables are often the single most valuable parts
of a slide, yet sending screenshots forces a premium model to visually infer
data that already exists as structure somewhere.

**Question:** How can we extract charts and tables from PPTX/PDF into structured
representations so frontier models reason over small JSON/CSV packets instead of
large screenshots?

**What to research**

1. Best tools for extracting chart series, labels, table cells, legends, and
   axes from PowerPoint and PDFs.
2. Chart-understanding and table-understanding benchmarks that compare
   structural inputs vs. raw image inputs.
3. Fallback methods when only a rasterized chart exists.
4. Integration sketch for `src/core/chart-extract.ts` and `src/core/table-extract.ts`.

**Output artifacts we need**

- A structured chart/table schema.
- A cost and quality comparison: screenshot prompt vs. structured prompt.
- A decision rule for when a screenshot is still necessary.

**Known constraints**

- Some executive slides intentionally simplify or distort chart detail.
- Must preserve source linkage for evidence-sensitive work.
- Needs to help both analysis tasks and deck-building tasks.

## R17 — Brand and style-token dictionary: cache the design language

**Why this matters:** A lot of deck-generation spend is really style replay:
fonts, tone, logo rules, spacing, slide archetypes, do/don't lists.

**Question:** Can we compress design system knowledge into a stable token-light
dictionary so deck jobs reference style IDs and layout archetypes rather than
re-sending long style instructions every time?

**What to research**

1. Best practices for encoding brand systems, layout archetypes, and slide types
   as compact symbolic tokens.
2. Whether teams see quality loss when models are given short style codes plus
   exemplars rather than verbose style prose.
3. How style tokens interact with donor-slide retrieval and clone synthesis.
4. Integration sketch for `src/core/style-tokens.ts`.

**Output artifacts we need**

- A compact style-token schema for deck generation.
- A packet example showing before/after token counts.
- A recommendation on whether style tokens should be local only or exposed to
  the model.

**Known constraints**

- Must not be so compressed that humans cannot understand or debug it.
- Needs to preserve brand nuance, not reduce everything to "corporate deck."
- Should support multiple customers/brands over time.

## R18 — Region-of-interest routing: send only cropped slide areas to premium models

**Why this matters:** Many deck questions are local: "fix this chart," "rewrite
this right column," "what's wrong with this headline?" Full-slide screenshots are
waste when only one region matters.

**Question:** What is the best architecture for detecting and routing only the
relevant slide regions or objects to premium models, with the rest supplied as
compact structural context?

**What to research**

1. ROI detection techniques for slide editing, analysis, and QA tasks.
2. Whether object bounding boxes from PPTX parsing outperform image-based crop
   detection for deck tasks.
3. Best packet shapes: crop image + local slide summary + style anchors.
4. Integration sketch for `src/core/slide-roi.ts`.

**Output artifacts we need**

- A region-routing decision matrix by task type.
- Expected token savings on common deck edit requests.
- Failure modes where crops lose necessary slide context.

**Known constraints**

- Must preserve enough surrounding context to avoid edits that are only
  locally optimal.
- Has to work on PDFs and screenshots where native object boxes are absent.
- Crop selection should be deterministic when possible.

## R19 — Cheap slide-critic loop before frontier escalation

**Why this matters:** Many revisions are obvious and should be caught by cheap
models or deterministic checks before a premium reviewer is asked to opine.

**Question:** Can a cheap-first critic loop catch most deck-quality issues
(clarity, overflow, alignment, weak headline, unsupported claim, style drift)
before we spend premium tokens?

**What to research**

1. Existing slide evaluation frameworks and deck-critique heuristics.
2. Whether local or flash-tier models can reliably detect common slide flaws.
3. Multi-stage critic patterns: deterministic checks -> cheap model critic ->
   premium critic only on low-confidence cases.
4. Integration sketch for `src/core/deck-eval.ts`.

**Output artifacts we need**

- A critic rubric tailored to exec decks and evidence-heavy slides.
- A costed escalation ladder with estimated capture rate.
- A confidence or disagreement policy for premium escalation.

**Known constraints**

- Must distinguish substantive critique from stylistic nitpicking.
- False positives can create too many useless revision loops.
- Should plug into future deck-builder workflows.

## R20 — Programmatic deck QA without full-slide re-review

**Why this matters:** If we can deterministically check overflow, missing notes,
mis-cited numbers, broken masters, and low-information slides, we can reserve
premium review for genuinely ambiguous slide quality.

**Question:** What programmatic QA suite should exist for PPTX/PDF deliverables
so the model only reviews exception cases?

**What to research**

1. Deterministic checks for slide count, title presence, footers, source
   references, text overflow, chart legibility, contrast, and master consistency.
2. Existing libraries or workflows for render-and-inspect PPTX QA.
3. How much QA load can be shifted from model review to code.
4. Integration sketch for `src/core/deck-qa.ts`.
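Two of the item-1 checks (title presence, text overflow) are simple enough to sketch. Thresholds and shapes here are assumptions, not the repo's QA spec:

```typescript
// Deterministic deck QA sketch: flag slides that fail cheap structural
// checks so only exception cases reach a model reviewer.
interface QASlide { id: string; title: string; bodyChars: number; notes: string }
interface Finding { slideId: string; check: string; severity: "fail" | "warn" }

function deckQA(slides: QASlide[], maxBodyChars = 600): Finding[] {
  const findings: Finding[] = [];
  for (const s of slides) {
    if (!s.title.trim())
      findings.push({ slideId: s.id, check: "missing-title", severity: "fail" });
    if (s.bodyChars > maxBodyChars)
      findings.push({ slideId: s.id, check: "text-overflow", severity: "warn" });
    if (!s.notes.trim())
      findings.push({ slideId: s.id, check: "missing-notes", severity: "warn" });
  }
  return findings; // only slides with findings need model review
}
```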

**Output artifacts we need**

- A prioritized QA checklist with expected savings impact.
- A proposed output format for deck QA findings.
- A rule for when QA should auto-fail vs. request model judgment.

**Known constraints**

- Some quality failures are semantic, not structural.
- Render-based QA must stay fast enough for iterative use.
- The suite should support both internal drafts and external-ready decks.

## R21 — Presenter notes and talk-track synthesis from structured slides

**Why this matters:** Notes generation often re-reads the whole deck. If notes
can be synthesized from slide IR plus deck objective, we avoid another full-deck
premium pass.

**Question:** What is the best pipeline for generating presenter notes, speaker
talk tracks, and handoff scripts from slide IR instead of raw deck content?

**What to research**

1. Techniques for narrative synthesis from slide outlines and evidence anchors.
2. Whether notes should be built slide-by-slide, section-by-section, or from a
   deck-level thesis map.
3. Quality benchmarks or case studies on structured-note generation.
4. Integration sketch for `src/core/presenter-notes.ts`.

**Output artifacts we need**

- A recommended note-generation workflow.
- A cost comparison vs. naive "read this whole deck and write notes."
- Failure modes for notes that overfit the slide wording.

**Known constraints**

- Notes must sound like a human presenter, not a transcript robot.
- Should respect evidence grades and avoid inventing claims.
- Needs to stay synchronized with revised slides.

## R22 — Transcript/evidence -> brief -> deck: a two-step pipeline

**Why this matters:** Deck work often starts from noisy transcripts, docs, and
evidence packs. Going straight from those raw materials to slides is expensive.

**Question:** Can we materially cut token cost by first producing a compact
brief packet, then building the deck from that brief, instead of generating
slides directly from raw transcripts and source dumps?

**What to research**

1. Two-step and multi-step presentation-generation systems.
2. Whether brief packets retain enough nuance for high-quality deck writing.
3. Techniques for preserving source traceability from transcript to slide claim.
4. Integration sketch for `src/core/brief-packet.ts` feeding a future deck builder.

**Output artifacts we need**

- A best-practice brief schema for deck generation.
- A cost and latency comparison: direct generation vs. brief-first.
- A recommendation on which source types benefit most from this split.

**Known constraints**

- Some nuance may be lost in the brief if compression is too aggressive.
- We need easy source backtracking for executive and evidence-sensitive slides.
- The brief must stay reusable across multiple deliverables.

## R23 — Overnight deck factory: local-first asynchronous build pipelines

**Why this matters:** Non-urgent deck prep can happen overnight on cheap/local
compute so the premium model only handles bottlenecks and last-mile polish.

**Question:** What is the best overnight deck-production operating model for
batch extraction, donor retrieval, first-pass assembly, and cheap QA, so human
or premium review starts from a high-quality draft in the morning?

**What to research**

1. Asynchronous content-factory patterns for slide generation and review.
2. How much of deck prep can be front-loaded onto local or flash-tier models.
3. Queue designs that separate deterministic prep from premium reasoning.
4. Integration sketch with this repo's bridge/queue/worker architecture.

**Output artifacts we need**

- An overnight deck pipeline diagram with model allocation by step.
- Expected savings vs. synchronous premium-first deck work.
- Operational risks: stale context, failed jobs, low-quality drafts.

**Known constraints**

- Must be auditable; overnight work cannot be a mystery box.
- Queue throughput matters as much as per-task cost.
- Should fit the Mac Mini + laptop split already present in the repo.

## R24 — Unified artifact IR for PPTX, PDF, DOCX, and long-form docs

**Why this matters:** Decks rarely live alone. The same packetizer ideas should
work across decks, PDFs, and briefing docs so we do not build one-off pipelines.

**Question:** What normalized artifact representation best unifies decks, PDFs,
and docs for downstream packetization, retrieval, caching, and evidence linking?

**What to research**

1. Multimodal document IR work from 2024-2026.
2. Whether a shared schema meaningfully improves engineering velocity and token
   savings across artifact types.
3. How to represent sections, slides, pages, figures, tables, claims, notes,
   and citations in one graph.
4. Integration sketch for `src/core/artifact-ir.ts`.

**Output artifacts we need**

- A proposed cross-artifact schema.
- Tradeoffs vs. deck-specific or PDF-specific pipelines.
- A migration recommendation for this repo's current scan/preprocess layers.

**Known constraints**

- The IR cannot become so abstract that it is useless in implementation.
- Cross-artifact elegance matters less than real packet savings.
- Must support provenance and stable IDs.

## R25 — Content-addressed deck cache and provenance graph

**Why this matters:** If the same deck, donor slide, or evidence snippet is read
across many sessions, we should pay once to distill it and reuse the result.

**Question:** What caching and provenance model best supports reusable deck
artifacts, stable packet hashes, and cross-session retrieval without quality
drift?

**What to research**

1. Content-addressed cache patterns for multimodal artifacts.
2. How to separate immutable artifact summaries from mutable task packets.
3. Provenance graphs linking slide claims back to sources and extraction steps.
4. Integration sketch for cache layers around future deck packetizers.

**Output artifacts we need**

- A cache design with invalidation rules.
- A provenance model for slide/object/source linkage.
- Cost scenarios showing repeated-use savings.

**Known constraints**

- Cache corruption or stale packets are dangerous in exec material.
- Must interoperate with provider-native prompt caching where possible.
- Needs human-debuggable artifact lineage.

## R26 — Map-reduce deck distillation for long presentations

**Why this matters:** Long decks (30-200 slides) often blow context windows or
cause the model to waste tokens keeping every slide alive at once.

**Question:** What is the best map-reduce style strategy for long-deck analysis:
slide summaries -> section summaries -> deck thesis map -> task packet?

**What to research**

1. Hierarchical summarization strategies for presentations and long docs.
2. Quality tradeoffs vs. giving the model the entire deck.
3. Best chunking schemes: slide, section, story arc, appendix split.
4. Integration sketch for hierarchical deck distillation modules.

**Output artifacts we need**

- A recommended distillation ladder for short, medium, and long decks.
- Measured or cited quality-retention numbers if available.
- Escalation rules for when full-slide detail must be restored.

**Known constraints**

- Summaries can erase rhetorical force or slide-to-slide tension.
- Some deck tasks depend on appendix slides that look irrelevant at first.
- The final packet must still support precise rewrite requests.

## R27 — Slide family clustering: canonical exemplars instead of repeated style rediscovery

**Why this matters:** Teams repeatedly ask models to create "a title slide," "a
2x2 comparison," "an agenda," "a proof slide." Those are stable slide families.

**Question:** Can we cluster historical slide libraries into canonical families
and exemplars so new deck work starts from a small menu of known-good patterns?

**What to research**

1. Techniques for clustering slides by narrative role, layout, and visual form.
2. Whether family labels improve retrieval and generation quality.
3. How to derive canonical exemplars without overfitting to one brand.
4. Integration sketch for a local slide-family index.

**Output artifacts we need**

- A taxonomy of high-value slide families.
- A recommendation on text/visual/hybrid clustering methods.
- Expected savings from canonical exemplar reuse.

**Known constraints**

- Clustering must remain interpretable to humans.
- Needs to support both broad strategy decks and tactical proofs.
- Canonical exemplars should not freeze creativity entirely.

## R28 — Diagram and SmartArt linearization before model reasoning

**Why this matters:** Diagrams are expensive to understand visually, but many
can be linearized as graphs, flows, or layered relationships.

**Question:** How should we convert SmartArt, process diagrams, and architecture
slides into graph-structured packets that are cheaper to reason over than raw
images?

**What to research**

1. Graph extraction or diagram-understanding methods for slides.
2. Whether shape connectors, labels, and spatial relationships can be reliably
   reconstructed from PPTX.
3. How graph packets compare to screenshot prompts for architecture explanation
   and rewrite tasks.
4. Integration sketch for a `diagram-ir` extension of slide IR.

**Output artifacts we need**

- A graph schema for common deck diagrams.
- A cost/quality comparison vs. visual-only interpretation.
- Fallback rules for diagrams that resist clean extraction.

**Known constraints**

- Many diagrams are manually drawn and semantically ambiguous.
- Over-linearization can lose visual emphasis and hierarchy.
- Must support both understanding and re-expression.

## R29 — Claim-and-evidence extraction directly from slides

**Why this matters:** When a model cannot tell which slide text is a claim, a
label, a caption, or decorative filler, it wastes tokens and invents support.

**Question:** What is the best way to extract slide claims, supporting numbers,
footnotes, and evidence links into a structured claim bank before narrative work
begins?

**What to research**

1. Methods for claim detection and evidence-link extraction from presentation artifacts.
2. How slide claims can be cross-checked against local evidence ledgers.
3. Whether structured claim banks materially reduce revision loops and hallucinations.
4. Integration sketch that bridges future deck parsing with `evidence/`.

**Output artifacts we need**

- A proposed slide-claim schema.
- A flow that links extracted claims to local evidence artifacts.
- A recommendation on what stays deterministic vs. model-inferred.

**Known constraints**

- Many slides imply claims rather than stating them cleanly.
- Evidence may live outside the slide in notes or source docs.
- False confidence here would be dangerous in external decks.

## R30 — Visual similarity search across exemplar decks

**Why this matters:** "Really good decks fast" often means finding the nearest
visual precedent and adapting it, not asking a model to imagine a layout from
scratch.

**Question:** What retrieval stack best supports visual similarity search across
decks so we can find the right slide pattern in seconds and reduce generation
tokens?

**What to research**

1. Slide-thumbnail embedding methods and visual retrieval systems.
2. Whether combined text + visual retrieval outperforms either alone.
3. How visual retrieval should interact with donor-slide cloning.
4. Integration sketch for an exemplar search index.

**Output artifacts we need**

- A recommended visual retrieval approach.
- A workflow for "find donor, adapt, QA" with minimal premium involvement.
- Savings and speed expectations compared with blank generation.

**Known constraints**

- Similar-looking slides can have very different rhetorical jobs.
- Retrieval quality matters more than theoretical novelty here.
- Needs a way to filter by brand and audience.

## R31 — Human-edit aware patching of decks

**Why this matters:** Once a human edits a deck, naive regeneration can destroy
good local choices and force expensive repair loops.

**Question:** What architecture best supports patching only the necessary parts
of a deck while preserving human edits and donor-slide integrity?

**What to research**

1. Structural patching methods for PPTX and JSON-like slide representations.
2. How to mark protected regions, locked objects, or "human-owned" slide zones.
3. Whether patch-based deck editing reduces total revision tokens materially.
4. Integration sketch for a future patch engine.

**Output artifacts we need**

- A patch model for slides and decks.
- A conflict-resolution policy between human edits and model suggestions.
- Cost savings expectations vs. regenerate-the-slide workflows.

**Known constraints**

- Human edits may not preserve structural cleanliness.
- Patching must not silently break masters or alignment.
- A patch model is only useful if it is easy to inspect.

## R32 — Local VLM prepass for raster-heavy slides

**Why this matters:** Some decks are mostly images, screenshots, or flattened
PDF pages. A cheap local VLM may be able to produce a first-pass semantic
summary before any premium model sees the artifact.

**Question:** What local or near-free VLM stack best handles raster-heavy slides
well enough to reduce premium multimodal token usage?

**What to research**

1. 2024-2026 local VLMs suitable for Mac Mini or adjacent hardware.
2. Benchmark performance on slide screenshots, UI captures, and dense charts.
3. Hybrid patterns where the local VLM produces candidate regions, captions, or
   summaries and the premium model only validates edge cases.
4. Integration sketch for a local multimodal scout stage.

**Output artifacts we need**

- A shortlist of viable local VLMs and expected throughput.
- A cheap-first multimodal routing ladder.
- Risks: hallucinated OCR, missed small text, chart misreads.

**Known constraints**

- Local VLM quality may be too weak for final judgment.
- Must fit the repo's local hardware assumptions.
- Success here is measured in premium-token avoidance, not perfect autonomy.

## R33 — Storyboard first, visuals later: planning split for deck creation

**Why this matters:** Expensive deck creation often comes from asking for full
slides too early. A lightweight storyboard may lock the narrative faster and
cheaper.

**Question:** For high-quality deck creation, how much should we separate
storyboard planning from final visual materialization, and what is the best
cost/quality split across models?

**What to research**

1. Storyboard-first presentation workflows used by high-performing teams.
2. Whether cheap models can produce adequate storyboard passes for premium or
   human polishing.
3. How many iterations should happen before visual assembly begins.
4. Integration sketch connecting storyboard artifacts to donor-slide retrieval.

**Output artifacts we need**

- A recommended deck-build sequence by stage.
- A cost model for storyboard-first vs. full-slide-first workflows.
- Failure modes where delayed visual work hurts deck quality.

**Known constraints**

- Some slides only make sense once the visual form exists.
- The storyboard must still carry evidence and audience intent.
- Humans need to understand and approve the intermediate artifact quickly.

## R34 — Multimodal benchmark harness for deck quantization

**Why this matters:** If we start building PPT quantization features, we need a
way to prove that they reduce cost without trashing slide quality or task
accuracy.

**Question:** What benchmark harness should we build to measure deck packet
quality, token savings, latency, and outcome quality across analysis and
generation tasks?

**What to research**

1. Existing multimodal document and slide benchmarks relevant to this problem.
2. Task families that matter most here: deck summarization, slide rewrite,
   evidence extraction, donor retrieval, visual QA, and notes generation.
3. Metrics: token reduction, latency, answer fidelity, visual quality, human
   preference, and failure-rate under compression.
4. Integration sketch for a local benchmark harness analogous to the repo's
   current evidence and benchmark flows.

**Output artifacts we need**

- A benchmark plan with task matrix and scoring.
- Suggested gold datasets or ways to create them locally.
- A recommendation on the minimum bar for shipping deck quantization features.

**Known constraints**

- Benchmark creation itself should not become the whole project.
- Human judgment will still be needed for some slide-quality metrics.
- We need signals that are sensitive enough to catch silent quality loss.

## R35 — Cost/quality frontier for "really good decks fast"

**Why this matters:** The end goal is not just "cheaper deck tokens." It is
"really good decks fast" at the lowest sustainable cost.

**Question:** Across local models, flash-tier models, premium models, donor-slide
libraries, and deterministic tooling, what stack gives the best cost/quality
frontier for building excellent decks quickly?

**What to research**

1. Pipeline comparisons: premium-only, cheap-first, local-first, donor-heavy,
   template-heavy, and hybrid clone-and-polish systems.
2. Reported or measured throughput, quality, and operating complexity.
3. Which stages truly need frontier reasoning vs. which are habitually but
   unnecessarily premium today.
4. A final recommendation for this repo's likely next implementation tranche.

**Output artifacts we need**

- A stack-ranked recommendation for the next 3-5 implementation bets.
- A one-page decision table by task type: analyze deck, revise slide, build
  draft deck, polish final deck, generate notes.
- A candid "what not to build" section.

**Known constraints**

- This repo is not a generic slide startup; it needs a wedge that fits the
  cost-refinery story.
- Engineering complexity matters; the best theoretical stack may be a bad bet.
- The answer should favor systems that create reusable assets, not one-shot prompts.

---

## `docs/research/2026-04-17/r36-vqtoken-extreme-reduction.md`

# R36 — VQToken: extreme token reduction via neural discrete codebooks

**Paper:** Haichao Zhang & Yun Fu, "VQToken: Neural Discrete Token Representation
Learning for Extreme Token Reduction in Video Large Language Models."
arXiv:2503.16980v6 (29 Sep 2025). NeurIPS 2025. Northeastern University.
[Homepage](https://haichao-zhang.github.io/) · [Code](https://github.com/zhanghaichao/VQToken) · [HF Model](https://huggingface.co/).

## Why a video-LLM paper matters for KostAI

KostAI is a **text-LLM** cost-instrumentation library. VQToken compresses
**video-LLM** token sequences. At first glance: unrelated.

What carries over is the **measurement primitive** and the **task
formulation** — both are domain-agnostic. The paper's token-hashing and
vector-quantization modules are video-specific (they operate on ViT patch
embeddings); the surrounding evaluation framework is not. Two pieces
integrate cleanly:

1. **Token Information Density (TokDense)** — a per-call efficiency metric
   we don't currently expose but have all the raw fields to compute. It
   generalizes cleanly from `Accuracy / TokenCount` (video QA) to
   `QualityScore / TotalTokens` (KostAI's shadow-mode judged calls).

2. **Fixed-length vs. adaptive-length compression** — a task split we should
   mirror in our preprocess/router config. Fixed gives predictable cost;
   adaptive maximizes TokDense. Every KostAI user should be able to pick.

The **VQ-Attention + hash-index module** itself does not port to text. Text
tokens are already discrete — there's no continuous ViT embedding to cluster.
But the **architectural lesson** (cluster + reference by index, preserve
positional information in a lightweight side channel) is the same pattern as
our existing `blockHashes` field. This paper is a formal, citable validation
of the design KostAI already has.

## Measured numbers (from the paper)

On **NextQA-MC** with the LLaVA-OneVision 0.5B backbone:

| Variant             | Tokens | Token % | Accuracy | TokDense | Throughput |
|---------------------|-------:|--------:|---------:|---------:|-----------:|
| Baseline (LLaVA-OV) | 11,664 |  100%   |  58.58%  |  0.005   | 21.91 fps  |
| Token Pruning       |  1,152 |   10%   |  29.12%  |  0.025   | 46 fps     |
| ToMe                |  1,152 |   10%   |  35.72%  |  0.031   | 42 fps     |
| VidToMe             |  1,152 |   10%   |  39.64%  |  0.034   | 40 fps     |
| Interpolation       |  3,136 |   27%   |  57.20%  |  0.018   | 32 fps     |
| **Ours-Dynamic**    | **13.08** | **0.07%** | **57.72%** | **4.413** | 49 fps |
| **Ours-Fixed** (m=32) | 32   | 0.14%   |  57.46%  |  1.796   | 91 fps     |

Headline: **99.86% token reduction, 0.66% accuracy drop, 800× TokDense gain**.
Ablation (Table 6) shows the codebook + hash + VQ-attention are all load-bearing
— remove any one and accuracy collapses or token count balloons.

## Concepts that port to text-LLM cost observability

### 1. TokDense — the per-token efficiency metric

Paper definition:
```
TokDense = Accuracy / TokenCount
```

KostAI generalization (already derivable from existing fields):
```
TokDense = qualityScore / (inputTokens + outputTokens)     // per call
TokDenseFleet = Σ qualityScore / Σ totalTokens              // rollup
```

Why this is useful: a call can be *cheap* (low cost) and still *inefficient*
(low quality per token). KostAI currently exposes `efficiencyScore`
(a waste-heuristic-derived 0–100) and `costUsd`. It does **not** expose a
quality-per-token number. Two calls at the same `costUsd` that produce very
different-quality answers look identical in today's rollups. TokDense fixes
that.

This is a first-class metric addition. Shadow mode already writes
`qualityScore`; the denominator is trivially `totalTokens`. Emit it on the
event, roll it up per route, ship it to Kibana.
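The two formulas above can be sketched directly against a slimmed event shape. A minimal sketch, assuming a hypothetical `CallEvent` slice of the real `LLMCallEvent` (which carries more fields); the null-filter mirrors the judged-calls-only caveat in the failure-modes section:

```typescript
// Hypothetical slice of LLMCallEvent — only the fields TokDense needs.
interface CallEvent {
  qualityScore?: number; // 0-100, present only on judged (shadow-mode) calls
  inputTokens: number;
  outputTokens: number;
}

// Per-call TokDense; undefined when no judge score exists.
function tokDense(e: CallEvent): number | undefined {
  const total = e.inputTokens + e.outputTokens;
  if (e.qualityScore == null || total === 0) return undefined;
  return e.qualityScore / total;
}

// Fleet rollup: Σ qualityScore / Σ totalTokens over judged calls only.
function tokDenseFleet(events: CallEvent[]): number | undefined {
  const judged = events.filter((e) => e.qualityScore != null);
  if (judged.length === 0) return undefined;
  const q = judged.reduce((s, e) => s + (e.qualityScore as number), 0);
  const t = judged.reduce((s, e) => s + e.inputTokens + e.outputTokens, 0);
  return t === 0 ? undefined : q / t;
}
```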

### 2. Fixed vs. adaptive token budgets

Paper distinction:

- **Fixed-length reduction**: preset token budget `m`. Predictable cost,
  uniform treatment of every input.
- **Adaptive-length reduction**: `m` is selected per-input based on content
  complexity. Higher average TokDense, variable per-call cost.

KostAI analog: `preprocess.maxSummaryTokens` is today a **fixed** cap. The
paper shows that an adaptive cap — choose `K` per input via something like
adaptive K-Means on content difficulty — strictly dominates fixed on
TokDense (4.413 vs. 1.796 in Table 3), at the cost of some per-call
variability. We should expose both modes in config:

```jsonc
"preprocess": {
  "enabled": true,
  "budgetMode": "adaptive",   // or "fixed"
  "maxSummaryTokens": 800,    // cap for adaptive, budget for fixed
  "targetTokDense": 0.12      // only used in adaptive mode
}
```
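A sketch of how the `budgetMode` switch might resolve to an effective per-input cap. The `expectedQuality` input and the linear `quality / targetTokDense` rule are illustrative assumptions, not the paper's adaptive K-Means and not repo code:

```typescript
type BudgetMode = "fixed" | "adaptive";

interface PreprocessConfig {
  budgetMode: BudgetMode;
  maxSummaryTokens: number; // budget (fixed) or ceiling (adaptive)
  targetTokDense?: number;  // adaptive mode only
}

// Resolve the summary budget for one input. Fixed mode applies a constant
// cap; adaptive mode spends only as many tokens as the target density
// justifies at the expected quality (assumed rule, for illustration).
function resolveBudget(
  cfg: PreprocessConfig,
  inputTokens: number,
  expectedQuality: number, // 0-100, same scale as qualityScore
): number {
  if (cfg.budgetMode === "fixed") {
    return Math.min(cfg.maxSummaryTokens, inputTokens);
  }
  const target = cfg.targetTokDense ?? 0.1;
  // Budget that would hit the target density, clamped to the ceiling and
  // never above the input itself — short inputs early-exit for free.
  const ideal = Math.ceil(expectedQuality / target);
  return Math.max(1, Math.min(ideal, cfg.maxSummaryTokens, inputTokens));
}
```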

### 3. Module complexity vs. LLM complexity — scoring the reducer itself

The paper defines two separate complexity metrics:

- **Module Complexity**: cost of running the reduction step
- **LLM Complexity**: cost of running the downstream LLM on the reduced input

Translated to KostAI: a preprocess call that costs 500 input-tokens of a
local model to save 5,000 input-tokens on Opus is a clear win. A preprocess
call that costs 5,000 tokens to save 3,000 on Haiku is a loss — but nothing
in KostAI today detects that. This is a new waste category:

```
reduction_overhead_negative — the preprocess step cost more than it saved.
```

This is a one-function addition to `src/core/score.ts` and a one-row bump
to `WasteCategory`. Doesn't need new infrastructure.
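A minimal sketch of that one function, assuming a hypothetical `ReductionSample` shape; the ≥ 3 consecutive-calls guard matches the false-positive note in the failure-modes section:

```typescript
// Net token effect of one preprocess call on a route:
// positive net = the reducer saved tokens, negative net = pure overhead.
interface ReductionSample {
  preprocessTokens: number; // tokens spent running the reducer
  tokensSaved: number;      // downstream input tokens avoided
}

// Flag reduction_overhead_negative only after `minStreak` consecutive
// net-negative calls on the same route, so a single cold-start blip
// (e.g. first call into a cold Ollama session) does not trip it.
function flagReductionOverhead(
  samples: ReductionSample[],
  minStreak = 3,
): boolean {
  let streak = 0;
  for (const s of samples) {
    streak = s.tokensSaved - s.preprocessTokens < 0 ? streak + 1 : 0;
    if (streak >= minStreak) return true;
  }
  return false;
}
```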

### 4. The extreme-compression task as a benchmark

The paper introduces **Extreme Token Reduction (ETR)** as a task:
*compress `t → t'` such that `|t'| ≪ |t|` without losing downstream
accuracy*.

KostAI's existing `benchmark_catalog` has "wins on packet compression" tasks.
Adding an explicit ETR task — "compress the replayed history by ≥ 95% and
keep quality ≥ 85% of baseline" — gives us a named, published-in-literature
target to hit. The paper's 99.86% reduction is the **theoretical ceiling**
for our equivalent; realistic text-LLM targets are lower (60–80% on
replayed-history turns, matching our measured preprocess numbers).

## Concepts that do **not** port

- **Vector quantization of continuous ViT embeddings.** Text tokens are
  already discrete. There is nothing to cluster.
- **Token hash over `(f, h, w)` grid.** Text has no spatial-temporal grid.
  Our `blockHashes` field is the structural analog — it hashes normalized
  blocks, not grid cells.
- **Adaptive K-Means on patch embeddings.** Same reason. The adaptive-cap
  idea (§2 above) is the takeaway; the K-Means mechanism is video-specific.
- **VQ-Attention module.** Purely a neural-network component; we never run
  one.

## Integration sketch — what lands in the repo

1. **`src/core/types.ts`** — add `tokDense?: number` to `LLMCallEvent`.
2. **`src/core/score.ts`** — compute `tokDense = qualityScore / totalTokens`
   when `qualityScore` is present. Add `reduction_overhead_negative` to
   `WasteCategory`.
3. **`src/core/preprocess.ts`** — honor `budgetMode: "fixed" | "adaptive"`.
   Default stays `fixed` for backwards compatibility.
4. **`src/core/types.ts` (config)** — add `preprocess.budgetMode` and
   `preprocess.targetTokDense` fields.
5. **`evidence/benchmark_catalog.yaml`** — add a `dev-04-extreme-reduction`
   task that asserts ≥ 95% compression at ≥ 85% quality retention on a
   synthetic replayed-history fixture.
6. **`evidence/20_throughput_cost_ledger.md`** — reference row citing the
   paper's numbers as the **theoretical ceiling**, with our adjusted
   text-LLM targets.
7. **`web/index.html`** — add a "Research citations" sub-block under the
   Levers section linking to this prompt and the paper's arXiv ID.
8. **`docs/research/2026-04-17/README.md`** — index R36.

## Failure modes & what breaks

- **TokDense without a judge** — `qualityScore` is only populated when
  shadow mode or the evaluator runs. Rolling up TokDense on calls without
  a score gives a meaningless denominator. The rollup must filter to
  `event.qualityScore != null`.
- **Adaptive budget on short inputs** — if the input is already under
  the target, the reducer should early-exit, not waste tokens clustering
  a small set. Honor the existing `triggerMinInputTokens` guard.
- **reduction_overhead_negative false-positives** — a single bad call in a
  warm local model (cold-start compile of the Ollama session) can look
  overhead-heavy. Only flag after ≥ 3 consecutive negative-net calls on
  the same route.

## Why this is a free lever

Zero infrastructure change. TokDense uses fields we already capture. The
budget-mode flag is a config switch with a default that preserves current
behavior. The waste category is a pure function in `score.ts`. Every piece
of the integration is testable without a paid API call.

## Citation

```
@inproceedings{zhang2025vqtoken,
  title={VQToken: Neural Discrete Token Representation Learning
         for Extreme Token Reduction in Video Large Language Models},
  author={Zhang, Haichao and Fu, Yun},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025},
  note={arXiv:2503.16980v6}
}
```

---

## `docs/research/2026-04-17/r37-flattened-priors.md`

# R37 — Prior-art harvest from 6 flattened open-source projects

**Status:** in-progress — generated by reading `docs/research/2026-04-17/flat/*/repo.cxml`
artifacts produced by `scripts/rendergit/flatten_research.py`.

The six repos below are the ones `repos-llm-cost.txt` flagged as direct
precedents for wedges named in R06–R10 and the existing ledger. For each,
this doc captures:

- **Named technique** — the specific primitive, with a file:line citation
  into the flattened repo.
- **Cost-relevance** — which ai-cost wedge (DVP, KV, speculative, prompt-compress,
  router, semantic cache, constrained decode, quantization) it maps to.
- **Measured number** — from the repo's own benchmarks or linked papers.
- **Integration sketch** — what it would take to port the primitive behind
  the ai-cost API, or stand it up as a shadow-mode route.
- **Failure mode** — what the primitive silently degrades under.

Each repo section ends with a **Ledger-row candidates** subsection — draft
entries ready to be promoted into `evidence/20_throughput_cost_ledger.md`
once they're MEASURED or MODELED with a reproducible recipe.

---

## 1. vllm-project/vllm — PagedAttention + prefix caching

vLLM's core serving optimisations (PagedAttention, continuous batching,
KV offload, tree attention, FlashInfer kernels) are **self-hosted-only** —
they rely on kernel-level access the Anthropic/OpenAI APIs do not expose.
What does port is the set of **algorithmic primitives** sitting on top of them.

### Portable primitives

**1. Prompt-lookup n-gram proposer** —
`vllm-project__vllm/src/vllm/v1/spec_decode/ngram_proposer.py:12-100`.
Searches the current request's history for an n-gram match and emits the
*next k tokens* speculatively, no draft model needed. Metrics logged in
`spec_decode/metrics.py`: per-position acceptance, `mean_acceptance_length`.
**Client-side portable:** the whole thing can run in our DVP verifier layer
before a frontier call, provided we have the prompt/history in hand.

**2. Block-level prefix caching via content hash** —
`vllm-project__vllm/src/vllm/v1/core/kv_cache_utils.py` +
`kv_cache_manager.py:106-145`. SHA-256 (or `sha256_cbor` for multimodal) on
fixed-size KV blocks (default 16 tokens). If a block hash matches a prior
request, the KV slot is reused. **Portable as a cache-warmer:** we can
hash system-prompt prefixes at the same block size and maintain a separate
cache keyed by that hash, independent of the serving stack.
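A cache-warmer sketch of the block-hash keying. Chaining each block hash with its predecessor, so one hash identifies the entire prefix up to that block, mirrors vLLM's design; the 16-token default and the function name are assumptions:

```typescript
import { createHash } from "node:crypto";

// Hash a token sequence into fixed-size, chained block hashes. Because
// each block's hash folds in its parent's, hashes[i] is a content address
// for the whole prefix tokens[0 .. (i+1)*blockSize).
function prefixBlockHashes(tokens: string[], blockSize = 16): string[] {
  const hashes: string[] = [];
  let parent = "";
  for (let i = 0; i + blockSize <= tokens.length; i += blockSize) {
    const h = createHash("sha256")
      .update(parent)
      .update(tokens.slice(i, i + blockSize).join("\u0000"))
      .digest("hex");
    hashes.push(h);
    parent = h;
  }
  return hashes; // a partial trailing block is never cached
}
```

Two requests with identical first blocks share `hashes[0]` and can reuse whatever was cached under it; any earlier divergence changes every later hash, which is exactly the invalidation behavior a prefix cache needs.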

**3. Rejection sampling for speculative tokens** —
`vllm-project__vllm/src/vllm/v1/sample/rejection_sampler.py`. The math
from Leviathan et al. (ICML 2024): accept a speculative token iff
`u < min(1, target_prob/draft_prob)`, else resample from the residual.
Published acceptance rates 70–95% depending on draft-target alignment.
**Partially portable:** the algorithm is pure; only blocked by the fact
that Anthropic/OpenAI don't expose raw logits. Top-k logprobs (OpenAI) or
the Anthropic batch API's partial logprobs could approximate it.
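The acceptance rule itself is a one-liner; a sketch, with the caveat from above that on frontier APIs the probabilities would come from approximate top-k logprobs rather than raw logits:

```typescript
// Leviathan-style acceptance test for one speculative token: accept iff
// u < min(1, p_target / p_draft), u ~ Uniform(0, 1). When the target
// assigns the token at least the draft's probability, acceptance is
// guaranteed; the residual-resampling step on rejection is omitted here
// since it needs full target logits we cannot get from the API.
function acceptSpeculative(
  targetProb: number,
  draftProb: number,
  u: number,
): boolean {
  if (draftProb <= 0) return false; // draft never proposed it
  return u < Math.min(1, targetProb / draftProb);
}
```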

**4. Sliding-window attention as recency bias** —
`vllm-project__vllm/src/vllm/v1/attention/backends/tree_attn.py:146` +
scheduler config `sliding_window_size`. vLLM truncates attention to the
last W tokens. **Portable as a prompt-engineering trick:** AICost already
preprocesses long histories via distillation; the sliding-window variant
is a cheaper ablation — take the last W raw tokens instead of summarising.
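The ablation is as small as it sounds; a sketch of the `lastW` path named in the ledger rows further down (the system-prompt pinning is an assumption about how we would wire it, not vLLM behavior):

```typescript
// Sliding-window fallback: keep only the last W history tokens, always
// pinning the system prompt at the front. Zero model calls, zero risk of
// summarisation drift — the sanity baseline distillation must beat.
function lastW(
  history: string[],
  W: number,
  systemPrompt: string[] = [],
): string[] {
  return [...systemPrompt, ...history.slice(Math.max(0, history.length - W))];
}
```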

**5. KV-cache quantisation (FP8 / NF4)** —
`vllm-project__vllm/src/vllm/v1/attention/backends/flashinfer.py:92-120`.
4× memory, <0.1 BLEU loss on most models. **Portable only where the
provider exposes a `kv_cache_dtype` parameter** — no frontier API currently
does, but the knob exists in vLLM-backed OSS deployments (e.g. Together,
DeepInfra) and should be tested when the drafter runs there.

### Self-hosted-only (documented but not actionable here)

PagedAttention kernels, EAGLE3 / DFlash speculative heads, tree-attention
kernels, KV offloading manager, LMCache distributed KV pool, chunked
prefill, continuous batching. All real optimisations, all need GPU-level
control we don't have on the API side.

### Ledger-row candidates

```
2026-04-17 — Prompt-lookup n-gram proposer (client-side) · EST
- Baseline: DVP drafter generates full answer every turn.
- After:    n-gram lookup on request history emits next k tokens with
            no drafter call when a match exists. Measured 70–95%
            acceptance on code / repetitive tasks (vLLM + llama.cpp).
- Grade:    EST until we port it behind DVP.
- Verify by: vllm-project__vllm/src/vllm/v1/spec_decode/ngram_proposer.py;
             target port at src/core/ngram-proposer.ts + tests.

2026-04-17 — Content-hash prefix cache (system-prompt blocks) · EST
- Baseline: Full system prompt + tool catalogue re-sent every call.
- After:    SHA-256 block hashing at 16-token granularity; cache-aside
            on local disk; serve stale blocks through Anthropic's
            `cache_control` when prefixes match.
- Grade:    EST. vLLM demonstrates the block-hash design survives
            content-addressed dedup across requests.
- Verify by: vllm-project__vllm/src/vllm/v1/core/kv_cache_utils.py;
             candidate src/core/prefix-cache.ts.

2026-04-17 — Sliding-window truncation as preprocess fallback · EST
- Baseline: Preprocess.ts distils history to ≤60-80% of original tokens.
- After:    Trivial `lastW(history, W)` path as a sanity baseline;
            upgrade to distillation only when quality drops.
- Grade:    EST. vLLM ships sliding window as a native config knob;
            equivalent truncation at the prompt layer costs nothing.
- Verify by: vllm-project__vllm/src/vllm/v1/attention/backends/tree_attn.py:146.
```

## 2. ggerganov/llama.cpp — quantization + CPU KV cache

llama.cpp is the local drafter's natural home. Every primitive below
targets the *drafter side* of a DVP or speculative pipeline — making the
free half of the loop faster, smaller, or more reliable so we're never
bottlenecked there when the frontier is on the hot path.

### Portable primitives

**1. Q4_K_M weight quantisation** —
`ggerganov__llama.cpp/src/tools/quantize/README.md:34,82-93`.
Mixed 4-bit k-quants with per-block scale. Llama-3.1-8B: **14.96 GiB → 4.58 GiB
(–69%)**, prompt processing `821.81 ±21.44 t/s @ 512`, text generation
`71.93 ±1.52 t/s @ 128` (published in the README).
**Directly stacks on DVP:** drop-in for our qwen2.5-coder:7b drafter —
cuts drafter latency ~2-3× and frees VRAM for a longer prompt cache.

**2. KV-cache quantisation (`--cache-type-k q4_0` / `--cache-type-v q4_0`)** —
`ggerganov__llama.cpp/src/tools/completion/README.md:410`;
`llama.h:357-358`. 4-5× KV memory reduction. Marked `[EXPERIMENTAL]` —
recommend `q5_0` at >8 K context for safer quality tradeoff.
**Directly portable** via `llama_context_params.type_k/type_v`.

**3. Draft + target speculative decoding** —
`ggerganov__llama.cpp/src/examples/speculative/speculative.cpp:52-96`;
`docs/speculative.md:11-14`. Small 3 B draft + 7 B target. Measured
acceptance rates from the repo: **0.70312 (90/128)** and **0.57576 (171/297)**
— code-heavy workloads trend high, prose trends lower.
**Directly stacks on DVP** if the economics justify running two local
models; decides the drafter-side savings ceiling.

**4. N-gram / lookup speculative (no draft model)** —
`ggerganov__llama.cpp/src/docs/speculative.md:16-98`;
`common/ngram-cache.cpp`, `ngram-map.cpp`. Zero extra model overhead;
measured **0.76** acceptance on code/template workloads (`#gen drafts = 15,
#acc drafts = 15, #gen tokens = 960, #acc tokens = 730`).
Same primitive vLLM ships as its `ngram_proposer` — strongly converges on
the portable answer.

**5. Prompt cache (`--prompt-cache`, `llama_save_session_file`)** —
`ggerganov__llama.cpp/src/tools/completion/README.md:237-239`. Serialises
KV + embeddings after first inference; subsequent identical prefixes skip
encoding. **Directly portable** — our drafter-side cache for system
prompts + tool catalogues.

**6. LLGuidance constrained decoding (grammar / JSON schema)** —
`ggerganov__llama.cpp/src/docs/llguidance.md:1-29`. Compiles JSON Schema
or PEG grammar to a token-mask, ~50 µs per sample decision
(`p99 0.5 ms, p100 20 ms on llama3 128 k vocab` — quoted verbatim).
**Direct reinforcement for shorthand A/P/R:** compiled grammar forces
the verifier to emit `A|P|R<patch>|R<rewrite>` and nothing else, floor-capping
output tokens on the verify step.
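The real enforcement is a compiled token mask inside llama.cpp; client-side, the cheapest analog is shape-checking the verifier's emission. A sketch assuming the A/P/R shorthand shape above (the exact body semantics of the patch/rewrite payloads are the repo's, not specified here):

```typescript
// Client-side analog of the compiled grammar: the verifier must emit
// exactly "A", exactly "P", or "R" followed by a non-empty patch/rewrite
// body. Anything else is a protocol violation → retry or escalate.
function matchesShorthand(raw: string): boolean {
  return /^(A|P|R[\s\S]+)$/.test(raw.trim());
}
```

Unlike the compiled grammar, this cannot floor-cap output tokens before they are generated; it only catches violations after the fact, which is why the LLGuidance route is still the one worth porting.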

**7. `llama-imatrix` importance-weighted quantisation** —
`ggerganov__llama.cpp/src/tools/imatrix/README.md:1-99`. Calibrate on
domain text; quantiser allocates higher precision to sensitive weights.
PRs #4861/#4930 show **0.05–0.15 perplexity-point** gains over naive
Q4_K_M. Free quality recovery after we quantise.

**8. Batched / parallel decoding** —
`ggerganov__llama.cpp/src/tools/batched-bench/README.md:45`. Llama-7B
Q8_0 hits **465.40 t/s at B=32 parallel** vs **41.57 t/s at B=1** (same
hardware) — an **11× throughput multiplier**. Relevant for shadow-mode
grading where we run many drafts in parallel against the frontier.

### Ledger-row candidates

```
2026-04-17 — Drafter quantisation (F16 → Q4_K_M) · MODELED
- Baseline: qwen2.5-coder:7b F16, 14.96 GiB VRAM, ~40 t/s @ 128 tokens.
- After:    Q4_K_M build: 4.58 GiB, ~72 t/s @ 128 tokens (–69% memory,
            +77% throughput — llama.cpp README bench).
- Grade:    MODELED. Drafter swap is a one-line change to the Ollama
            modelfile; re-run the shadow determinism suite post-swap.
- Verify by: ggerganov__llama.cpp/src/tools/quantize/README.md:143-145;
             evidence/14_determinism_ledger.md for retention re-measurement.

2026-04-17 — KV-cache quantisation on drafter · EST
- Baseline: drafter KV at F16 (~16 MB/layer at 2 K ctx).
- After:    q5_0 (not q4_0 — less quality risk at >8 K ctx), ~4×
            memory cut, enables much longer cached prefixes.
- Grade:    EST. Flagged [EXPERIMENTAL] upstream — measure PPL at our
            target context before flipping.
- Verify by: ggerganov__llama.cpp/src/llama.h:357-358.

2026-04-17 — Grammar-constrained verifier output (LLGuidance) · MODELED
- Baseline: DVP verifier emits free-form tokens; shorthand A-path is
            already ~1 token but relies on prompt discipline.
- After:    Compiled A|P|R grammar forces output shape. Rewrite-path
            tokens stay bounded; approve-path is guaranteed single-token.
- Grade:    MODELED. Published grammar overhead ≤0.5 ms p99.
- Verify by: ggerganov__llama.cpp/src/docs/llguidance.md:28.
```

## 3. microsoft/LLMLingua — prompt compression (direct DVP precedent)

### Core algorithm — iterative token-importance ranking via perplexity

The heart of LLMLingua is **iterative token-level compression driven by a
small LM's cross-entropy loss**, not LLM summarisation. The main loop lives at
`microsoft__LLMLingua/src/llmlingua/prompt_compressor.py:1523-1750`
(`iterative_compress_prompt`):

1. Segment prompt into fixed chunks (default 200 tokens).
2. Compute per-token cross-entropy via the small LM's forward pass.
3. Optionally condition on a downstream question so tokens are weighted for
   QA relevance (`condition_in_question` — LongLLMLingua variant).
4. Threshold-prune: keep top `ratio` fraction by loss, drop the rest.
5. Iterate across the full prompt, reusing KV-cache to avoid re-computation.
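A minimal sketch of the threshold-prune step (4), assuming the per-token losses have already been computed by the small LM's forward pass:

```python
def threshold_prune(tokens: list[str], losses: list[float], ratio: float) -> list[str]:
    """Keep the `ratio` fraction of tokens with the highest cross-entropy
    loss (hard-to-predict tokens carry the information), preserving order."""
    k = max(1, int(len(tokens) * ratio))
    keep = set(sorted(range(len(tokens)), key=lambda i: losses[i], reverse=True)[:k])
    return [t for i, t in enumerate(tokens) if i in keep]
```

Low-loss filler ("the", "is") drops out first; rare identifiers and numbers survive because the small LM finds them hard to predict.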

Why it beats "call an LLM to summarise":
- **No external API calls** during compression (pure local forward passes).
- **Deterministic and gradient-free** — perplexity is a stable importance proxy.
- **Sub-second latency** on 10K-token contexts (vs. minutes for GPT-4 summarisation).
- **Preserves original tokens** — doesn't hallucinate or paraphrase.

### Measured compression ratios (from `examples/LLMLingua2.ipynb`)

| Task | Baseline tokens | Compressed | Ratio | Quality retention |
|------|----------------:|-----------:|:-----:|------------------|
| MeetingBank QA (in-domain) | 1 362 | 444 | **3.1×** | Matches baseline |
| NarrativeQA (LongBench) | 9 059 | 3 064 | **2.96×** | No answer degradation |
| TriviaQA (multi-doc) | 5 527 | 1 805 | **3.06×** | Correct answer preserved |
| Gov Report (summarisation) | 10 705 | 3 254 | **3.29×** | Summary quality held |
| GSM8K (in-context learning) | 2 359 | 148 | **15.9×** | Reasoning steps preserved |

Typical: **2.5–4×** reduction with >95% task accuracy retention.

### Variants

- **LLMLingua v1** — `Llama-2-7b-hf` (14 GB VRAM) as ranker; ~100–200 ms / 200-token chunk.
- **LongLLMLingua** — adds `condition_in_question` + `reorder_context="sort"` +
  dynamic compression ratio (0.3–0.4 for low-relevance docs). For multi-doc RAG.
- **LLMLingua-2** — distils v1 into a **BERT-base token classifier** trained on
  GPT-4 compression labels. 380 MB model, CPU-viable, **3–6× faster than v1**,
  generalises out-of-domain.

### Integration sketch for ai-cost

`src/core/tool-compress.ts` currently does threshold-based Ollama summarisation
on tool-results > 1500 tokens. Two port paths:

- **Option A — Python sidecar (recommended).** FastAPI + `transformers`
  shipping LLMLingua-2 BERT-base. ai-cost calls `POST /compress`; result cached.
  ~50–100 ms on CPU, no GPU. Reuses Microsoft's training directly.
- **Option B — pure TS via ONNX.js.** Export LLMLingua-2 classifier to ONNX,
  run in-process. Lower latency, but the KV-cache optimisation for the
  iterative loop is non-trivial to replicate in JS.
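A sketch of the Option A client contract, with a content-hash cache in front of the sidecar; the `/compress` payload shape is an assumption of this sketch, and `post` stands in for the actual HTTP call:

```python
import hashlib

def compress_cached(text: str, rate: float, post, cache: dict) -> str:
    # Key on content + rate so identical tool-results never hit the
    # sidecar twice; `post` is the stand-in for the FastAPI round-trip.
    key = hashlib.sha256(f"{rate}:{text}".encode()).hexdigest()
    if key not in cache:
        cache[key] = post({"text": text, "rate": rate})["compressed"]
    return cache[key]
```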

### Failure modes

- **Exact-quote tasks** (legal, code): perplexity is too flat, so pruning
  drops load-bearing tokens. Mitigation: `force_tokens=[...]`.
- **Math / code generation**: a missing operator breaks the answer.
  Mitigation: `force_reserve_digit=True`, force-list operators.
- **JSON / indentation-sensitive output**: brackets and whitespace get
  pruned. Use `structured_compress_prompt` with per-section rates.
- **Non-MeetingBank domains** for v2: still multilingual-capable but
  MeetingBank-biased; degrades on highly technical corpora.

### Ledger-row candidates

```
2026-04-17 — Prompt compression via LLMLingua (sidecar) · EST
- Baseline: tool-compress.ts Ollama summariser, ~30% output length @ ~8ms.
- After:    LLMLingua-2 BERT-base classifier @ ~50-100ms on CPU, 3.1×
            compression (1362 → 444 on MeetingBank), quality parity.
- Grade:    EST. Port = Python sidecar behind existing tool-compress API.
- Verify by: scripts/sidecars/llmlingua/ + tests/unit/tool-compress-llmlingua.test.ts
             (not yet written); MeetingBank notebook reproduces the 3.1× claim.

2026-04-17 — Question-aware compression (LongLLMLingua) for RAG · MODELED
- Baseline: uniform compression ratio across all retrieved passages.
- After:    Dynamic ratio (0.3–0.4 low-relevance, higher for pertinent docs);
            3.29× on Gov Report with summary quality held.
- Grade:    MODELED. Applies when router classifies call as retrieval-heavy.
- Verify by: LongBench gov_report result in LLMLingua repo (10 705 → 3 254).

2026-04-17 — Constrained decoding as compression floor · EST
- Baseline: output tokens unconstrained.
- After:    force_reserve_digit + operator/whitespace force-lists cap
            quality loss on code/math while preserving 15.9× compression
            on in-context-learning examples (GSM8K).
- Grade:    EST. Stacks on shorthand A/P/R; limits DVP rewrite-path tokens.
- Verify by: LLMLingua GSM8K notebook (2 359 → 148).
```

## 4. BerriAI/litellm — provider routing + response cache

LiteLLM is the incumbent in the router+cache space — a useful
*scope check* for what ai-cost already covers and where there's material
white space.

### What ai-cost already matches

LiteLLM's simple-shuffle / least-busy / latency / TPM-RPM v2 routers
(`litellm/router_strategy/*.py`), multi-backend cache
(`litellm/caching/caching.py:51-100`), Redis / Qdrant semantic cache
(`litellm/caching/qdrant_semantic_cache.py:30-277`), per-provider budget
limiter (`litellm/router_strategy/budget_limiter.py:1-120`), and
provider-specific cost calculator (`litellm/cost_calculator.py:260-322`)
all have ai-cost equivalents (router, semantic-cache ≥0.95 cosine, cost
telemetry).

### New primitives worth porting

**1. Cost-accrual routing** —
`BerriAI__litellm/src/litellm/router_strategy/lowest_cost.py:18-80`.
Maintains a `{model_group}_map` of per-deployment spend accrued inside the
current minute window and picks the cheapest. Distinct from latency-based:
**explicit $/req selection** when multiple providers are healthy.
ai-cost router currently optimises quality-per-dollar at call time but
doesn't do deployment-accrual aggregation — that's a gap.

**2. Prompt-caching deployment filter** —
`BerriAI__litellm/src/litellm/router_utils/pre_call_checks/prompt_caching_deployment_check.py:19-100`.
Before each call, filters to deployments that *already cached* this message
sequence. Cache-read tokens cost **10% of creation on Anthropic, 50% on
OpenAI** (per `cost_calculator.py:270-322`). Sticky routing is the
prerequisite to actually *realise* those savings at scale.
ai-cost has no equivalent — we rely on the provider's auto-caching but
don't route sticky.
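The pre-call filter is a small amount of state. A sketch (hashing the stable message prefix is our assumption about what "this message sequence" means):

```python
import hashlib

class StickyCacheRouter:
    """Remember which deployment last wrote a cache entry for a prompt
    prefix, and bias routing back to it so cache reads actually hit."""
    def __init__(self):
        self.last_cached: dict[str, str] = {}  # prefix-hash -> deployment

    @staticmethod
    def _key(messages) -> str:
        # Stable prefix = system + first user turn, in this sketch.
        return hashlib.sha256(repr(messages[:2]).encode()).hexdigest()

    def pick(self, messages, healthy: list[str]) -> str:
        hit = self.last_cached.get(self._key(messages))
        return hit if hit in healthy else healthy[0]

    def record(self, messages, deployment: str) -> None:
        self.last_cached[self._key(messages)] = deployment
```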

**3. Content-policy / context-window fallback chains** —
`BerriAI__litellm/src/litellm/router.py:235,269,1539,1821`. Distinct
fallback chains by *error class* (`content_policy_fallbacks`,
`context_window_fallbacks`). ai-cost's fallback logic is a single
list; splitting by class avoids wasteful retries such as routing an
oversized prompt to a fallback that will then refuse it on moderation
grounds.

**4. Router-level batch completion** —
`BerriAI__litellm/src/litellm/router.py:2496-2588`
(`abatch_completion`, `abatch_completion_one_model_multiple_requests`).
Wraps Anthropic Message Batches + OpenAI Batch API behind the same
router, so cost-aware model selection composes with the **50% batch
discount** both providers offer. ai-cost has batching but not at the
router layer — this is the integration point.

### Ledger-row candidates

```
2026-04-17 — Cost-accrual routing · EST
- Baseline: ai-cost router chooses per-call on quality/latency weighting.
- After:    Add per-deployment $/min accumulator; prefer lowest-accrual
            when quality buckets are tied.
- Grade:    EST. Low implementation cost (< 50 LoC on top of existing router).
- Verify by: BerriAI__litellm/src/litellm/router_strategy/lowest_cost.py:18-80.

2026-04-17 — Prompt-caching sticky routing · MODELED
- Baseline: Provider auto-cache hit rate ≈ opportunistic; no routing guidance.
- After:    Pre-call filter keeps a {prompt-hash → last-cached-deployment}
            map and biases routing; cache-read = 10% of creation (Anthropic)
            or 50% (OpenAI) per provider pricing.
- Grade:    MODELED against published Anthropic/OpenAI cache-read prices.
- Verify by: BerriAI__litellm/src/litellm/router_utils/pre_call_checks/prompt_caching_deployment_check.py:19-100.

2026-04-17 — Error-class-split fallback chains · EST
- Baseline: Single fallback list regardless of error class.
- After:    Separate chains for content-policy vs context-window vs
            rate-limit vs timeout.
- Grade:    EST. Avoids wasted retries on predictably-failing targets.
- Verify by: BerriAI__litellm/src/litellm/router.py:235,269.

2026-04-17 — Batch API wrapper behind router · EST
- Baseline: Batch API used ad-hoc; not integrated with router strategy.
- After:    `router.abatch_completion()` composes cost/latency routing
            with the 50% batch discount (Anthropic Message Batches,
            OpenAI Batch API).
- Grade:    EST. Published 50% discount is verifiable from provider docs.
- Verify by: BerriAI__litellm/src/litellm/router.py:2496-2588.
```

## 5. FasterDecoding/Medusa — speculative decoding heads

### Architecture

Medusa attaches **N independent heads** (default 5 in `vicuna_7b_stage2`)
to the backbone at a single hidden-state position; each head predicts
*k future tokens* in parallel. Source:
`FasterDecoding__Medusa/src/medusa/model/medusa_model.py:111-119`.
Each head is a residual block + `Linear(hidden_size, vocab_size)`.

### Tree attention

Candidate token paths are organised as a prefix tree
(`medusa/model/medusa_choices.py`). With k≈10 top-k per head and depth 4,
~63 candidate sequences are evaluated **in one forward pass** via a
tree-structured attention mask. Reorder via `retrieve_indices` →
rejection-sampling accept length → commit.

### Published speedups (README, verbatim)

- **Medusa-1** (heads-only fine-tune): "approximately a 2x speed increase
  across Vicuna models".
- **Medusa-2** (full-model training): "2.2–3.6x speedup over the original
  model on a range of LLMs".

Training cost for Medusa-1: ~4–8 GPU-hours on 4× A100 (`medusa_num_heads=3,
medusa_num_layers=1`, LR 1e-3, backbone frozen).

### The API-boundary break

Medusa's primitive does **not** survive the API boundary. The frontier
doesn't expose per-token logits, doesn't accept a tree of speculative
tokens, and doesn't let us reuse KV across verify calls.

What *does* carry over is the **acceptance math**
(`medusa/model/utils.py:436-530`, `evaluate_posterior`). Modes:

- Greedy (temp 0) — match max-logit tokens
- Typical sampling — entropy-based thresholding
- Nucleus (top-p) — cumulative probability cutoff

All three are pure functions of logit/logprob arrays. `draft-verify.ts`
currently does binary string match on the approve path; an
entropy-aware posterior would let the patch path auto-size by confidence
instead of by hand-tuned thresholds.
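A sketch of the typical-sampling mode as described in the Medusa paper (accept while the draft token's verifier probability clears an entropy-scaled threshold); the constants here are illustrative, not Medusa's tuned defaults, and the port target is TS even though the sketch is Python:

```python
import math

def accept_length(draft_ids: list[int],
                  verifier_probs: list[list[float]],
                  eps: float = 0.09, alpha: float = 0.3) -> int:
    """Longest accepted draft prefix under typical acceptance: token i is
    accepted while p(token) >= min(eps, alpha * exp(-H_i)), where H_i is
    the entropy of the verifier's distribution at step i."""
    n = 0
    for tok, probs in zip(draft_ids, verifier_probs):
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        if probs[tok] < min(eps, alpha * math.exp(-entropy)):
            break
        n += 1
    return n
```

Peaked distributions demand a tight draft match; high-entropy positions tolerate more divergence, which is the confidence-based auto-sizing the patch path wants.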

### Pseudo-Medusa at the API boundary

If a provider exposes `top_logprobs` (OpenAI does; Anthropic roadmap
item), a *degraded* approximation is possible:

1. Local drafter proposes k tokens.
2. Call frontier once with `logprobs.top=5`.
3. Offline, re-score draft tokens against frontier's returned top-k.
4. Commit the longest prefix where all draft tokens are within the top-k.

Realistic latency saving: **10–30%** vs. full frontier generation. This
is far from Medusa's 2-3× but still a non-trivial win when the
drafter-frontier RTT dominates.
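Step 4 reduces to a longest-prefix scan over the returned top-k lists:

```python
def commit_prefix(draft_tokens: list[str],
                  frontier_topk: list[list[str]]) -> list[str]:
    """Commit the longest draft prefix where every token appears in the
    frontier's top-k at that position; regenerate from the first miss."""
    n = 0
    for tok, topk in zip(draft_tokens, frontier_topk):
        if tok not in topk:
            break
        n += 1
    return draft_tokens[:n]
```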

### Ledger-row candidates

```
2026-04-17 — Entropy-aware DVP patch-acceptance (Medusa posterior port) · EST
- Baseline: draft-verify.ts uses deterministic string-match on approve;
            patch path has hand-tuned length thresholds.
- After:    Port medusa/model/utils.py:evaluate_posterior (typical-sampling
            mode) into a TS helper; accept the longest prefix that passes
            the entropy threshold.
- Grade:    EST. Pure function; no external deps; unit-testable against
            logprob fixtures captured from a shadow run.
- Verify by: FasterDecoding__Medusa/src/medusa/model/utils.py:404-530;
             target tests/unit/draft-verify-posterior.test.ts.

2026-04-17 — Pseudo-Medusa top-k logprob acceptance (OpenAI only) · EST
- Baseline: No client-side token-level acceptance against frontier output.
- After:    Where provider exposes top_logprobs, re-score local draft
            tokens against top-k; commit longest matching prefix.
            10-30% expected end-to-end latency win on draft-heavy workloads.
- Grade:    EST. Blocked on Anthropic exposing logprobs for portability.
- Verify by: FasterDecoding__Medusa/src/medusa/model/utils.py:309-348
             (tree_decoding reference); OpenAI `logprobs.top_logprobs` param.

2026-04-17 — Medusa training-feasibility snapshot · reference
- Medusa-1: ~4-8 GPU-hours on 4x A100 to attach heads to Vicuna-7b, 2x
            speedup. Relevant if we ever choose to self-host the drafter
            side in a frontier-alternative stack; not a short-horizon lever.
- Grade:    reference (not a live ledger row).
```

## 6. stanfordnlp/dspy — programmatic prompt optimization

DSPy's cost-reducing primitives cluster around **demonstration compression**
(fewer, better in-context examples) and **schema-forced output** (fewer
output retries).

### Primitives with a cost lever

**1. `BootstrapFewShot`** —
`stanfordnlp__dspy/src/dspy/teleprompt/bootstrap.py:36-273`.
Compiles a student module to auto-select `max_bootstrapped_demos=4`
plus `max_labeled_demos=16` from a trainset. Replaces hand-authored
multi-shot prompts with a cap-enforced demo budget.
Lever: **input-tokens** (bounded demo count).

**2. `MIPROv2`** —
`stanfordnlp__dspy/src/dspy/teleprompt/mipro_optimizer_v2.py:61-120`.
Jointly tunes few-shot demos + instructions via Optuna. Lines 43-44
hard-code `BOOTSTRAPPED_FEWSHOT_EXAMPLES_IN_CONTEXT = 3`,
`LABELED_FEWSHOT_EXAMPLES_IN_CONTEXT = 0` — aggressive demo compression.
Lever: **input-tokens** + tighter instruction phrasing.
Trade: optimisation has real upfront LM cost; only pays off at scale.

**3. `KNNFewShot`** —
`stanfordnlp__dspy/src/dspy/teleprompt/knn_fewshot.py:11-70`. At inference,
embeds the query and retrieves k=3 nearest demos from a training set.
Replaces a static multi-shot prompt with a dynamic, query-specific one.
Lever: **input-tokens** (only relevant demos survive).
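A sketch of the retrieval step, assuming demo embeddings are precomputed with the same local model the semantic cache already uses:

```python
import math

def knn_demos(query_vec: list[float],
              demos: list[tuple[list[float], str]], k: int = 3) -> list[str]:
    """Return the k demo texts whose embeddings are nearest (cosine) to
    the query embedding; only these survive into the prompt."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    ranked = sorted(demos, key=lambda d: cosine(query_vec, d[0]), reverse=True)
    return [text for _, text in ranked[:k]]
```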

**4. `JSONAdapter` / structured output** —
`stanfordnlp__dspy/src/dspy/adapters/json_adapter.py:40-79`. Binds
input/output fields to a Pydantic signature; emits
`response_format={"type": "json_object"}` (or native structured output
where supported). Kills the
malformed-output retry loop.
Lever: **output-tokens** + **calls-per-task**.

**5. `Retrieve`** — `stanfordnlp__dspy/src/dspy/retrievers/retrieve.py:18-66`.
Lightweight `dspy.settings.rm(query, k=k)` call — no LM involvement in the
retrieval step. Pre-filter for generation context.
Lever: **input-tokens** (caps retrieval context).

**6. `Ensemble` (stochastic sampling mode)** —
`stanfordnlp__dspy/src/dspy/teleprompt/ensemble.py:10-40`. When `size` is
smaller than the full program set, it samples programs stochastically
instead of running all of them. Paired with a majority-vote reduce, it
recovers quality at reduced cost.
Relevant as a cheaper alternative to R08's full verifier swarm.

### Quality-only (not cost levers)

`ChainOfThought`, `MultiChainComparison`, `COPRO` instruction generation.
Improve output quality but don't directly cut tokens — noted so we don't
mis-promote them into the ledger.

### Ledger-row candidates

```
2026-04-17 — Bounded demo budget (BootstrapFewShot port) · EST
- Baseline: Static hand-authored multi-shot prompts, variable demo count.
- After:    Cap at max_bootstrapped_demos=4 + max_labeled_demos=16 with
            automatic selection from a captured evaluation-set trainset.
- Grade:    EST. Trainset = replayed shadow-mode transcripts already
            on disk in `.ai-cost-data/events.jsonl`.
- Verify by: stanfordnlp__dspy/src/dspy/teleprompt/bootstrap.py:36-273;
             target tests/unit/demo-budget.test.ts.

2026-04-17 — Per-query KNN demo retrieval · EST
- Baseline: Static demo set for every call.
- After:    Embed query (reuse nomic-embed-text from semantic cache),
            retrieve k=3 nearest demos at call time. Cuts tail
            demo-token count on long-tail queries.
- Grade:    EST. Stacks on existing local embedding infra.
- Verify by: stanfordnlp__dspy/src/dspy/teleprompt/knn_fewshot.py:11-70.

2026-04-17 — Schema-bound output via JSONAdapter · EST
- Baseline: Output parsed via regex or retry-on-failure.
- After:    Pydantic signature → provider-native structured output when
            available; fallback to JSON-mode otherwise.
- Grade:    EST. Removes one retry on malformed output per failure case.
- Verify by: stanfordnlp__dspy/src/dspy/adapters/json_adapter.py:40-79.
```

---

## Aggregate — wedges mapped to ledger rows

Cross-reference of every candidate from this batch against the existing
ledger in `evidence/20_throughput_cost_ledger.md`. Two lenses:

- **Redundant** — already covered by a shipped ai-cost row.
- **Additive** — new wedge, doesn't overlap existing rows.

### Wedges ai-cost already covers

| Candidate | Existing ledger row | Note |
|-----------|---------------------|------|
| LiteLLM simple-shuffle / least-busy / latency routing | _(router)_ | ai-cost has equivalent primitives. |
| LiteLLM multi-backend cache | _(semantic cache)_ | ai-cost's cosine ≥0.95 cache is the same pattern. |
| LiteLLM rate-limit aware routing | _(router)_ | Existing behaviour. |
| LiteLLM provider/deployment budget limiter | _(implicit in router)_ | Minor: port as explicit knob if a deployment needs hard caps. |
| vLLM PagedAttention / tree-attn / KV-offload / LMCache | n/a | Self-hosted-only. Skip. |
| Medusa tree attention | n/a | Not portable; see pseudo-Medusa below. |

### Net-new additive wedges — top 6 to promote

Ranked by expected dollar impact × implementation cost, strongest first:

1. **LLMLingua-2 BERT-base compressor** — 3.1× compression vs. existing
   tool-compress Ollama summariser, CPU-viable, quality parity.
   Highest-leverage single port. (§3)

2. **Prompt-caching sticky routing** — cache-read tokens cost 10% of
   creation on Anthropic; without sticky routing we leave most of that
   discount on the table. Implementation is pre-call filter + hash map.
   (§4)

3. **Drafter Q4_K_M quantisation** — 69% VRAM / +77% throughput on
   our qwen2.5-coder drafter. One-line Ollama modelfile change.
   Determinism suite re-runs verify quality. (§2)

4. **Prompt-lookup n-gram proposer** — same primitive in vLLM *and*
   llama.cpp, converges on strong evidence. 0.70–0.76 acceptance on
   code. Client-side portable. (§1, §2)

5. **Grammar-constrained verifier output (LLGuidance)** — hard-caps
   DVP verify-path tokens even when the model misbehaves. ~50 µs/token
   overhead. Direct reinforcement for shorthand A/P/R. (§2)

6. **Schema-bound output (DSPy `JSONAdapter` port)** — removes the
   malformed-output retry class. Not a huge cost cut, but high
   reliability win. (§6)

### Second-tier additive wedges (port after the top 6)

7. Cost-accrual routing (LiteLLM `lowest_cost`). (§4)
8. KV-cache quantisation on drafter (q5_0 for safety). (§2)
9. Bounded demo budget / KNN demo retrieval (DSPy). (§6)
10. Entropy-aware DVP patch-acceptance (Medusa posterior port). (§5)
11. Batch API router wrapper (50% provider discount). (§4)
12. Content-hash prefix cache (vLLM block-hashing pattern). (§1)

### Combined upside estimate (conservative)

Stacked on the existing 2026-04-17 aggregate table (~$150/mo/dev
combined with overlap discount), the top-6 additions plausibly add a
further **$25-70/mo/dev** (rough order-of-magnitude) before overlap
discount. Most of this comes from (a) actually realising Anthropic
cache-read savings via sticky routing and (b) LLMLingua-scale prompt
compression on the hot tool-result path. Numbers need a measured pass
before promoting from EST to MEASURED.

### Failure-mode convergence notes

Two failure modes showed up repeatedly across repos and deserve explicit
ai-cost guardrails:

- **Code / math degradation under compression** — LLMLingua and
  LLGuidance both warn operators to force-preserve digits, operators,
  whitespace. Any compressor port must expose `force_tokens` and default
  to conservative settings on `score.ts:dvp_candidate` calls tagged
  "code".
- **Low acceptance on novel domains** — speculative / n-gram proposers
  degrade sharply outside training or cache domain. Plan for A/B
  fallback and acceptance-rate telemetry from day one, not after launch.

### Next actions

Three ledger rows are promoted into `evidence/20_throughput_cost_ledger.md`
in this pass (the top-three highest-leverage). The rest stay here as
drafts until a sponsor signs off on the implementation order.
