# hypr
Automates Hyperstack GPU VM lifecycle: create, bootstrap, WireGuard tunnel, and vLLM inference.
Runs two A100 VMs concurrently — each serving a different model — with [Pi](https://pi.dev) coding agents connected to each.
## Architecture
```
┌─────────────┐
│ Laptop │
└──────┬──────┘
│ SSH (into bhyve VM, not the host)
│
FreeBSD physical host (earth)
┌─────────────────────────────────────────────────────────────────────────┐
│ │
│ FreeBSD bhyve VM (isolation layer) 192.168.3.2 / wg1 │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ ▲ │ │
│ │ │ SSH │ │
│ │ tmux session (tmux attach) │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ window 0 │ │ │
│ │ │ ┌───────────────────────┬─────────────────────────┐ │ │ │
│ │ │ │ pane 0: pi-nemotron │ pane 1: pi-coder │ │ │ │
│ │ │ │ │ │ │ │ │
│ │ │ │ Pi │ Pi │ │ │ │
│ │ │ │ Nemotron-3-Super │ Qwen3-Coder-Next │ │ │ │
│ │ │ └──────────┬────────────┘└────────────┬───────────┘ │ │ │
│ │ │ │ OpenAI API │ OpenAI API │ │ │
│ │ │ │ /v1/chat/completions │ /v1/chat/completions│ │ │
│ │ └─────────────┼──────────────────────────┼────────────────────┘ │ │
│ │ │ │ │ │
│ └──────────────┼───────────────────────────┼────────────────────────┘ │
│ │ WireGuard wg1 │ WireGuard wg1 │
└─────────────────┼───────────────────────────┼───────────────────────────┘
│ 192.168.3.0/24 │ 192.168.3.0/24
│ UDP :56710 │ UDP :56710
▼ ▼
┌──────────────────────────┐ ┌──────────────────────────┐
│ VM1 (A100 80GB) │ │ VM2 (A100 80GB) │
│ 192.168.3.1 │ │ 192.168.3.3 │
│ hyperstack1.wg1 │ │ hyperstack2.wg1 │
│ │ │ │
│ vLLM :11434 │ │ vLLM :11434 │
│ Nemotron-3-Super 120B │ │ Qwen3-Coder-Next 80B │
│ (Mamba+MoE, AWQ-4bit) │ │ (MoE, AWQ-4bit) │
└──────────────────────────┘ └──────────────────────────┘
```
**WireGuard topology:**
- Interface `wg1` on earth carries traffic to **both** VMs simultaneously
- earth is `192.168.3.2`; VM1 is `.1`; VM2 is `.3`; tunnel port is `56710/udp`
- Adding VM2 to an existing wg1 tunnel: `wg1-setup.sh` adds a second `[Peer]` block without disturbing VM1
- vLLM on each VM listens on `0.0.0.0:11434`, firewalled to `192.168.3.0/24` (WireGuard subnet only)
- Pi connects directly to each VM's vLLM over the tunnel — no proxy or load balancer
## Why Pi
- **Bring-your-own model** — connects to any OpenAI-compatible endpoint; no translation proxy needed between Pi and vLLM
- **Custom providers via `models.json`** — define `hyperstack`, `hyperstack1`, and `hyperstack2` providers once; fish abbreviations route to the right VM
- **Project-local config** — symlink this repo's `pi/` directory to `~/.pi`; Pi picks up `models.json`, `settings.json`, extensions, and skills automatically
- **TypeScript extensions** — custom behaviour (web search, loop scheduler, ask-mode) lives in `pi/agent/extensions/` and loads from the symlink
- **Minimal core** — no built-in sub-agents, plan mode, or permission popups; fast TUI with mid-session model switching via `Ctrl+L`
## Prerequisites
- Hyperstack account with API key in `~/.hyperstack`
- SSH key registered in Hyperstack as `earth` (or change `ssh.hyperstack_key_name` in the TOML)
- Review `[network].allowed_ssh_cidrs` and `[network].allowed_wireguard_cidrs` in your TOML.
The secure default is `["auto"]`, which resolves your current public egress IP to `/32`.
Set explicit CIDRs or `HYPERSTACK_OPERATOR_CIDR` if you deploy from a different network.
- WireGuard setup script: `wg1-setup.sh` (present in this directory)
- Ruby with `toml-rb` gem: `bundle install`
- [Pi](https://pi.dev) coding agent installed
## WireGuard setup
`hyperstack.rb` runs `wg1-setup.sh` automatically during `create` / `create-both`.
This section explains the tunnel design for reference and manual troubleshooting.
### Tunnel design
```
earth (192.168.3.2)
/etc/wireguard/wg1.conf
[Interface] Address = 192.168.3.2/24
[Peer]      # VM1 — AllowedIPs = 192.168.3.1/32, Endpoint = <VM1 public IP>:56710
[Peer]      # VM2 — AllowedIPs = 192.168.3.3/32, Endpoint = <VM2 public IP>:56710
```
A single `wg1` interface on earth carries traffic to both VMs. Each VM is a separate `[Peer]`
block. Adding VM2 to an existing tunnel with VM1 already running leaves VM1's peer untouched.
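The `.wg1` hostnames used throughout this README (`hyperstack1.wg1`, `hyperstack2.wg1`) are assumed here to be simple static name mappings on earth. `wg1-setup.sh` normally takes care of them; the sketch below is only for manual recovery if name resolution breaks after a hand-edit:
```bash
# Assumption: the .wg1 names are plain /etc/hosts entries on earth.
# wg1-setup.sh normally manages these; shown only for manual recovery.
printf '%s\n' \
  '192.168.3.1 hyperstack1.wg1' \
  '192.168.3.3 hyperstack2.wg1' | sudo tee -a /etc/hosts
```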
### Manual setup
```bash
# VM1 (first VM — generates fresh keys, writes /etc/wireguard/wg1.conf from scratch)
./wg1-setup.sh
# VM2 (additional VM — adds a [Peer] block to the existing wg1.conf)
./wg1-setup.sh 192.168.3.3 hyperstack2.wg1
```
### Verify the tunnel
```bash
# Show active peers and handshake times (both VMs should appear)
sudo wg show wg1
# Ping each VM through the tunnel
ping -c 3 192.168.3.1 # VM1
ping -c 3 192.168.3.3 # VM2
# Check vLLM is reachable over the tunnel
curl http://hyperstack1.wg1:11434/v1/models
curl http://hyperstack2.wg1:11434/v1/models
```
### Restart / recover
```bash
# Restart tunnel locally (e.g. after network change)
sudo systemctl restart wg-quick@wg1
# Restart tunnel on VM after a reboot (ssh via public IP since WireGuard is down)
ssh ubuntu@<vm-public-ip> 'sudo systemctl start wg-quick@wg1'
# Re-run setup when VM IP changes (e.g. after delete + recreate)
./wg1-setup.sh
./wg1-setup.sh 192.168.3.3 hyperstack2.wg1
```
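After a restart, a recent handshake for each peer confirms the tunnel came back up:
```bash
# One line per peer: public key and the epoch time of the last handshake
sudo wg show wg1 latest-handshakes
```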
## Quickstart (two-VM setup)
```bash
# Deploy both VMs in parallel, set up WireGuard + vLLM (~10 min)
ruby hyperstack.rb create-both
# Verify both VMs are working
ruby hyperstack.rb --config hyperstack-vm1.toml test
ruby hyperstack.rb --config hyperstack-vm2.toml test
# Launch Pi coding agents — one per terminal (fish abbreviations from hyperstack.fish)
pi-hyperstack-nemotron # Nemotron-3-Super 120B on VM1
pi-hyperstack-coder # Qwen3-Coder-Next on VM2
# Tear down both VMs
ruby hyperstack.rb delete-both
```
## Using Pi
[Pi](https://pi.dev) is the coding agent frontend used with this setup.
Each Hyperstack VM runs a vLLM instance; Pi connects to it directly over the WireGuard tunnel.
### Installation
Install Pi from [pi.dev](https://pi.dev), then link the project-local config into place:
```bash
ln -s /path/to/hyperstack/pi ~/.pi
```
This symlink makes Pi pick up `pi/agent/models.json` and `pi/agent/settings.json`
from this repo as its agent configuration, so the Hyperstack providers and model
definitions are available without any manual config editing.
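A quick sanity check that the symlink resolves into this repo:
```bash
# Both paths should resolve through the ~/.pi symlink to the repo's pi/ directory
ls -l ~/.pi/agent/models.json ~/.pi/agent/settings.json
```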
### Fish shell abbreviations
Source `hyperstack.fish` or copy the abbreviations into your Fish config:
```fish
abbr pi-hyperstack pi --model hyperstack/openai/gpt-oss-120b
abbr pi-hyperstack-nemotron pi --model hyperstack1/cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit
abbr pi-hyperstack-coder pi --model hyperstack2/bullpoint/Qwen3-Coder-Next-AWQ-4bit
```
Then launch a session after the VM(s) are up:
```fish
pi-hyperstack # single-VM → GPT-OSS 120B on hyperstack.wg1
pi-hyperstack-nemotron # two-VM → Nemotron-3-Super 120B on VM1
pi-hyperstack-coder # two-VM → Qwen3-Coder-Next 80B on VM2
```
### Model configuration (`pi/agent/models.json`)
Three providers are defined, one per setup, each pointing at its vLLM endpoint over WireGuard:
| Provider | Base URL | Primary model |
|----------|----------|---------------|
| `hyperstack` | `http://hyperstack.wg1:11434/v1` | GPT-OSS 120B (single-VM) |
| `hyperstack1` | `http://hyperstack1.wg1:11434/v1` | Nemotron-3-Super 120B |
| `hyperstack2` | `http://hyperstack2.wg1:11434/v1` | Qwen3-Coder-Next 80B |
All model presets from the TOML configs are registered under each provider, so any
model can be run on any VM after a `model switch` (see [Switching models](#switching-models)).
### Settings (`pi/agent/settings.json`)
```json
{
"defaultProvider": "openai",
"defaultModel": "gpt-4.1"
}
```
The default provider/model is OpenAI so that bare `pi` uses OpenAI rather than a Hyperstack VM.
Use the fish abbreviations above to route to a specific VM.
### Hot-switching models within Pi
After loading a different model on a VM with `model switch` (see [Switching models](#switching-models)),
tell Pi to use it without restarting the session:
```
model switch hyperstack1/openai/gpt-oss-120b
```
Pi sends subsequent requests to the new model ID immediately; the provider base URL stays the same.
## Extensions
Custom extensions live in `pi/agent/extensions/` and are loaded automatically via the `~/.pi` symlink.
| Extension | Purpose |
|-----------|---------|
| `web-search` | `web_search` and `web_fetch` tools — DuckDuckGo search + page fetching, no API key |
| `ask-mode` | `/ask` command — restricts the model to read-only exploration tools |
| `loop-scheduler` | `/loop` command — re-sends a prompt on a recurring interval |
| `inline-bash` | `!{cmd}` syntax — expands shell output inline before sending to the model |
| `session-name` | Auto-names sessions from the first message |
| `modal-editor` | Opens an external editor (`$VISUAL`) for composing long prompts |
| `handoff` | Compacts and hands off context to a fresh session |
| `fresh-subagent` | Spawns a sub-agent in a clean context for isolated tasks |
| `reload-runtime` | `/reload-runtime` command — hot-reloads extensions without restarting Pi |
| `nemotron-tool-repair` | Repairs malformed tool calls from Nemotron models |
| `agent-plan-mode` | Integrates task management into Pi sessions |
### Web search
The `web-search` extension registers two LLM-callable tools:
- **`web_search`** — searches DuckDuckGo and returns up to 8 results (title, URL, snippet)
- **`web_fetch`** — fetches a URL and returns up to 12,000 characters of readable text
Example prompts:
```
Search for the vLLM 0.9.0 changelog
Find the Qwen3-Coder model card and summarize the recommended vLLM flags
```
No API key or account required. Uses DuckDuckGo's free HTML endpoint.
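For a rough idea of what the tool does under the hood (an assumption about the extension's internals, not a documented interface), the same kind of query can be issued by hand:
```bash
# Hypothetical sketch: query DuckDuckGo's HTML (non-JS) endpoint directly.
# The extension's exact URL and parameters may differ.
curl -s 'https://html.duckduckgo.com/html/?q=vllm+prefix+caching' | head -c 500
```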
## Single-VM setup
A single VM can be deployed with the default config (GPT-OSS 120B):
```bash
ruby hyperstack.rb create # uses hyperstack-vm.toml
ruby hyperstack.rb test
pi-hyperstack # fish abbreviation → hyperstack/openai/gpt-oss-120b
ruby hyperstack.rb delete
```
## VM configuration
| Config file | Default model | WireGuard IP | Hostname |
|---|---|---|---|
| `hyperstack-vm1.toml` | Nemotron-3-Super 120B (AWQ-4bit) | `192.168.3.1` | `hyperstack1.wg1` |
| `hyperstack-vm2.toml` | Qwen3-Coder-Next 80B (AWQ-4bit) | `192.168.3.3` | `hyperstack2.wg1` |
| `hyperstack-vm.toml` | GPT-OSS 120B (single-VM mode) | `192.168.3.1` | `hyperstack.wg1` |
Each VM has independent state files so they can be managed separately:
```bash
ruby hyperstack.rb --config hyperstack-vm1.toml status
ruby hyperstack.rb --config hyperstack-vm2.toml status
```
## Switching models
Each VM has named model presets in its TOML config. Hot-switch without reprovisioning:
```bash
ruby hyperstack.rb --config hyperstack-vm1.toml model switch qwen3-coder-next
ruby hyperstack.rb --config hyperstack-vm2.toml model switch nemotron-super
```
Available presets (both VMs share the same set):
| Preset | Model | VRAM | Context |
|---|---|---|---|
| `nemotron-super` | Nemotron-3-Super 120B (Mamba+MoE, 12B active) | ~60 GB | 131K |
| `qwen3-coder-next` | Qwen3-Coder-Next 80B (MoE, AWQ-4bit) | ~45 GB | 262K |
| `gpt-oss-120b` | GPT-OSS 120B (MoE, MXFP4) | ~65 GB | 131K |
| `gpt-oss-20b` | GPT-OSS 20B (MoE, MXFP4) | ~14 GB | 65K |
| `qwen25-coder-32b` | Qwen2.5-Coder-32B-Instruct (AWQ) | ~18 GB | 32K |
| `qwen3-coder-30b` | Qwen3-Coder-30B-A3B (MoE, AWQ) | ~18 GB | 65K |
| `deepseek-r1-32b` | DeepSeek-R1-Distill-Qwen-32B (AWQ) | ~18 GB | 32K |
| `qwen3-32b` | Qwen3-32B (AWQ) | ~18 GB | 32K |
| `devstral` | Devstral-Small-2507 (AWQ-4bit) | ~15 GB | 32K |
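After a switch completes, the vLLM endpoint on that VM should report the new model ID, which can be confirmed over the tunnel:
```bash
# VM1 should now list the newly loaded model
curl -s http://hyperstack1.wg1:11434/v1/models | python3 -m json.tool
```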
## CLI reference
```
ruby hyperstack.rb [--config path] [options]
Commands:
create Deploy a new VM and run full provisioning
create-both Deploy VM1 + VM2 in parallel (uses hyperstack-vm1/vm2.toml)
delete Destroy the tracked VM
delete-both Destroy both VM1 and VM2
status Show VM and WireGuard status
watch Live dashboard: vLLM + GPU stats for all active VMs (refreshes every 5 s)
test Run end-to-end inference tests (vLLM)
model switch Hot-switch the running vLLM model
create / create-both options:
--replace Delete existing tracked VM before creating
--dry-run Print the plan without making changes
--vllm / --no-vllm Override config: enable/disable vLLM setup
--ollama / --no-ollama Override config: enable/disable Ollama setup
```
## Configuration
Edit `hyperstack-vm1.toml` / `hyperstack-vm2.toml` (or `hyperstack-vm.toml` for single-VM).
Key sections:
| Section | Purpose |
|---------|---------|
| `[vm]` | Flavor, image, environment name |
| `[vllm]` | Model, container settings, and vLLM runtime options |
| `[vllm.presets.*]` | Named model presets for hot-switching |
| `[ollama]` | Ollama settings (disabled by default; set `install = true` to use instead) |
| `[network]` | Ports, WireGuard subnet, allowed CIDRs |
| `[wireguard]` | Auto-setup script path |
`allowed_ssh_cidrs` and `allowed_wireguard_cidrs` accept either explicit CIDRs such as
`["203.0.113.4/32"]` or `["auto"]`. `auto` resolves the current public operator IP at runtime;
set `HYPERSTACK_OPERATOR_CIDR` to override that detection when needed.
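For example, when deploying from a network whose egress IP should not be auto-detected, the override can be set for a single run (the CIDR below is a documentation placeholder):
```bash
# Pin the operator CIDR explicitly instead of relying on "auto" detection,
# and preview the resulting plan with --dry-run before applying.
HYPERSTACK_OPERATOR_CIDR=203.0.113.4/32 ruby hyperstack.rb create --dry-run
```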
SSH host keys are pinned per state file in `.known_hosts`. `delete` and `--replace`
clear that trust file for intentional reprovisioning; an unexpected host key change fails closed.
## Automated setup reference
`hyperstack.rb` handles the full VM lifecycle automatically. All steps below
(VM creation, WireGuard tunnel, vLLM Docker container) run in a single command.
### Single-VM setup
```bash
# Deploy VM, configure WireGuard tunnel, pull and start vLLM (~10 min)
ruby hyperstack.rb create
# Run end-to-end inference test over the tunnel
ruby hyperstack.rb test
# Launch Pi coding agent connected to GPT-OSS 120B on the VM
pi-hyperstack # fish abbreviation from hyperstack.fish
# Tear down the VM and remove WireGuard peer
ruby hyperstack.rb delete
```
### Two-VM setup
```bash
# Deploy both VMs in parallel, set up tunnel and vLLM on each (~10 min)
ruby hyperstack.rb create-both
# Test each VM individually
ruby hyperstack.rb --config hyperstack-vm1.toml test
ruby hyperstack.rb --config hyperstack-vm2.toml test
# Launch Pi coding agents — one per terminal
pi-hyperstack-nemotron # fish abbreviation → Nemotron-3-Super 120B on VM1
pi-hyperstack-coder # fish abbreviation → Qwen3-Coder-Next 80B on VM2
# Tear down both VMs
ruby hyperstack.rb delete-both
```
### Hot-switching models without reprovisioning
```bash
# Switch the running vLLM container to a different model preset
ruby hyperstack.rb --config hyperstack-vm1.toml model switch qwen3-coder-next
ruby hyperstack.rb --config hyperstack-vm2.toml model switch nemotron-super
```
See the [VM configuration](#vm-configuration) and [Switching models](#switching-models)
sections for available presets and config options.
## Manual vLLM Docker setup
This section covers manual vLLM deployment for debugging or running outside the
automation. The `hyperstack.rb` provisioner handles all of this automatically.
### Prerequisites
- VM with NVIDIA GPU, CUDA ≥ 12.x, driver ≥ 535, and Docker with `nvidia-container-toolkit`
- WireGuard `wg1` tunnel configured (see `wg1-setup.sh`)
- If Ollama was previously running: `sudo systemctl stop ollama && sudo systemctl disable ollama`
### Storage setup
Model cache on ephemeral NVMe (fast; re-downloads if lost on VM restart):
```bash
sudo mkdir -p /ephemeral/hug
sudo chmod -R 0777 /ephemeral/hug
```
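The AWQ checkpoints in the preset table run into tens of GB, so it is worth confirming the ephemeral volume has room before the first pull:
```bash
# The ephemeral NVMe should have comfortably more free space than the model size
df -h /ephemeral
```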
### Run the vLLM container
The model downloads on first start (~45 GB, ~2.5 min). Cold start after download: ~4–5 min.
```bash
docker pull vllm/vllm-openai:latest
docker run -d \
--gpus all \
--ipc=host \
--network host \
--name vllm_qwen3 \
--restart always \
-v /ephemeral/hug:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
--tensor-parallel-size 1 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--max-model-len 262144 \
--host 0.0.0.0 \
--port 11434
```
Key flags:
| Flag | Purpose |
|------|---------|
| `--gpus all` | Expose all GPUs to the container |
| `--ipc=host` | Host IPC namespace so PyTorch can use the host's shared memory (avoids `/dev/shm` limits) |
| `--network host` | Host networking so vLLM's port 11434 is reachable directly over the WireGuard tunnel |
| `--restart always` | Auto-restart the container on VM reboot |
| `-v /ephemeral/hug:...` | Model cache on fast ephemeral NVMe |
| `--tensor-parallel-size 1` | Single GPU (use 2/4 for multi-GPU) |
| `--enable-auto-tool-choice` | Enable function/tool calling |
| `--tool-call-parser qwen3_coder` | Parser for Qwen3-Coder tool format |
| `--enable-prefix-caching` | Block-level KV cache reuse across requests |
| `--gpu-memory-utilization 0.92` | Use 92% of VRAM; rest for OS/overhead |
| `--max-model-len 262144` | Full 256k context window |
| `--host 0.0.0.0` | Bind to all interfaces (WireGuard access requires this) |
| `--port 11434` | Reuse Ollama port for firewall compatibility |
### Verify startup
```bash
# Wait for "Application startup complete"
docker logs -f vllm_qwen3 2>&1 | grep -E "startup complete|Error"
# Confirm model is loaded
curl -s http://localhost:11434/v1/models | python3 -m json.tool
# Quick inference test
curl -s http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
"messages":[{"role":"user","content":"Hello"}],
"max_tokens":50}'
```
### Firewall
```bash
sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp comment 'vLLM via wg1'
```
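To confirm the rule is active and nothing broader exposes the port:
```bash
# Only the WireGuard subnet (192.168.3.0/24) should be allowed to reach 11434
sudo ufw status numbered | grep 11434
```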
### Client configuration
Use the VM's WireGuard IP (`.1` for VM1, `.3` for VM2):
```bash
# VM1 (hyperstack1.wg1 = 192.168.3.1)
OPENAI_BASE_URL=http://192.168.3.1:11434/v1 OPENAI_API_KEY=EMPTY pi
# VM2 (hyperstack2.wg1 = 192.168.3.3)
OPENAI_BASE_URL=http://192.168.3.3:11434/v1 OPENAI_API_KEY=EMPTY pi
```
### Replacing the running container
To serve a different model, stop the current container and start a new one:
```bash
docker stop vllm_qwen3 && docker rm vllm_qwen3
# Example: smaller 30B model (fits easily, faster)
docker run -d \
--gpus all --ipc=host --network host \
--name vllm_qwen3_30b --restart always \
-v /ephemeral/hug:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model Qwen/Qwen3-Coder-30B-AWQ \
--tensor-parallel-size 1 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--enable-prefix-caching \
--gpu-memory-utilization 0.92 --max-model-len 131072 \
--host 0.0.0.0 --port 11434
```
## Why vLLM instead of Ollama
- **FlashAttention v2**: ~1.5–2× faster prefill for long prompts
- **Block-level prefix caching**: partial KV cache reuse even when the prompt changes mid-sequence (Ollama requires an exact prefix match from token 0)
- **Chunked prefill**: can interleave prefill and decode
- **Marlin kernels** for AWQ MoE quantization
## Monitoring vLLM
The `watch` command provides a built-in terminal dashboard that polls all active VMs every 5 seconds:
```bash
ruby hyperstack.rb watch
```
When two VMs are active the panels are shown side-by-side; a single VM uses a vertical layout.
Press `Ctrl-C` to exit.
Each VM panel shows:
| Row | Source | What it means |
|-----|--------|---------------|
| GPU header | `nvidia-smi` | Device index, name, temperature, power draw |
| **util** bar | `nvidia-smi` | GPU compute utilisation % |
| **VRAM** bar | `nvidia-smi` | GPU memory used / total |
| **throughput** | vLLM engine log | Rolling-average prefill tok/s and decode tok/s |
| **requests** | vLLM engine log | Running / waiting / swapped request counts |
| **KV cache** bar | vLLM engine log | GPU KV-cache fill % |
| **cache hits** bar | vLLM engine log | Prefix-cache hit rate % |
Stats are collected via a single SSH call per VM over the WireGuard tunnel (`hyperstack1.wg1` etc.).
`nvidia-smi` provides hardware metrics; vLLM engine stats are read from `docker logs --tail 200`
filtered to the "Engine 0" line that vLLM emits every few seconds.
For lower-level ad-hoc inspection:
```bash
# Live engine stats (throughput, KV cache, prefix cache hit rate)
ssh ubuntu@hyperstack1.wg1 'docker logs -f vllm_nemotron_super 2>&1 | grep "Engine 0"'
# GPU stats (every 5 s)
ssh ubuntu@hyperstack1.wg1 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5'
# Last-minute stats (one-shot, no follow)
ssh ubuntu@hyperstack1.wg1 'docker logs --since 1m vllm_nemotron_super 2>&1 | grep "Engine 0"'
# Request-level monitoring
ssh ubuntu@hyperstack1.wg1 'docker logs -f vllm_nemotron_super 2>&1 | grep "POST"'
```
Engine metrics key fields:
| Field | Meaning |
|-------|---------|
| Avg prompt throughput | Prefill speed (tokens/s) — higher is faster |
| Avg generation throughput | Decode speed (tokens/s) |
| GPU KV cache usage | % of KV cache memory in use (proportional to active context vs max capacity) |
| Prefix cache hit rate | % of prompt tokens served from cache |
| Running / Waiting | Active and queued request counts |
Healthy baseline (H100 SXM 80GB, Nemotron-3-Super-120B AWQ):
| Metric | Expected |
|--------|----------|
| Prefill throughput | 5,000–11,000 tok/s |
| Decode throughput | 20–100 tok/s (varies with batch size) |
| KV cache usage | 2–5% for typical sessions |
| Temperature | 50–70°C under load, <50°C idle |
| Power | ~100 W idle, 300–350 W under load per GPU |
Warning signs:
- **Waiting > 0 for extended periods** — requests queuing, model overloaded
- **KV cache usage near 100%** — context too long, reduce `--max-model-len`
- **Decode throughput < 20 tok/s sustained** — possible thermal throttling (see the check after this list)
- **Prefill throughput < 2,000 tok/s** — check for CPU offload or driver issues
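To distinguish thermal throttling from plain load, the driver's throttle state can be inspected directly on the VM (hostname as used elsewhere in this README):
```bash
# Look for active SW/HW thermal slowdown flags under "Clocks Throttle Reasons"
ssh ubuntu@hyperstack1.wg1 'nvidia-smi -q -d PERFORMANCE'
```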
## Troubleshooting
| Problem | Fix |
|---------|-----|
| OOM on startup with `--max-model-len 262144` | Reduce to `131072` or `65536` |
| Prefix cache hit rate stays at 0% | Normal when prompts vary heavily turn-to-turn |
| vLLM container won't start (CUDA mismatch) | Check `nvidia-smi`; vLLM requires CUDA ≥ 12.x and driver ≥ 535 |
| Still OOM after reducing context | Lower `gpu_memory_utilization` to `0.85` or use a smaller model |
## VRAM sizing guide
Rule of thumb for a single A100 80 GB at 92% utilization (~75 GiB usable):
| Model size (params) | AWQ 4-bit VRAM | Max context (remaining for KV) |
|---|---|---|
| 7–8B | ~5 GiB | 262k+ (plenty of KV headroom) |
| 14B | ~9 GiB | 262k+ (plenty of KV headroom) |
| 30–32B | ~18 GiB | 262k (~57 GiB for KV cache) |
| 70–80B (MoE, 3B active) | ~45 GiB | 262k (~27 GiB for KV cache) |
| 70B (dense) | ~38 GiB | 131k (~37 GiB for KV cache) |
| 120B+ | won't fit | use multi-GPU or smaller quant |
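As a rough check of these numbers: 4-bit weights take about 0.5 bytes per parameter, so a dense 32B model is roughly 16 GiB of weights plus runtime overhead, in line with the ~18 GiB row; whatever remains of the ~75 GiB budget is available for KV cache.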
Supported quantization formats:
- **AWQ** (recommended): fast Marlin kernels, good quality
- **GPTQ**: similar to AWQ, widely available
- **FP8**: 8-bit, needs Hopper+ GPUs (H100/H200)
- **BF16/FP16**: full precision, needs more VRAM
Search Hugging Face for vLLM-compatible quantized models by adding `awq` to the search term, e.g.
`https://huggingface.co/models?search=qwen3+awq`
## Performance characteristics
Measured on A100 80 GB PCIe (single GPU) with Qwen3-Coder-Next AWQ 4-bit:
| Metric | vLLM (AWQ 4-bit) | Ollama (Q4_K_M) |
|--------|-------------------|-----------------|
| Prefill throughput | 5,000–11,000 tok/s | ~1,000 tok/s (est.) |
| Decode throughput | 40–99 tok/s | ~40 tok/s |
| Per-turn latency | ~10–15 s | ~28 s (32k ctx) |
| Context window | 262k (full, no truncation) | 32k (was truncating) |
| VRAM usage | 75 GiB (more KV cache) | 52–61 GiB |
## Photo enhancement (ComfyUI)
A separate VM setup (`hyperstack-vm-photo.toml`) runs [ComfyUI](https://github.com/comfyanonymous/ComfyUI)
on an L40 GPU for Photolemur-style automatic photo enhancement. No prompts needed — drop photos in,
get enhanced photos out.
### How it works
The pipeline runs Real-ESRGAN x4plus in "enhance in place" mode:
upscale 4× (noise reduction, sharpening, colour correction) → scale back to the original resolution.
Output is saved as JPEG at quality 92, so file sizes stay close to the originals.
### Quickstart
```sh
# Provision the L40 VM (~$1/hr, ~8 min first-time setup including model download)
ruby hyperstack.rb --config hyperstack-vm-photo.toml create
# Check connectivity
ruby photo-enhance.rb --test
# Enhance all photos in a directory (outputs _enhanced.jpg alongside originals)
ruby photo-enhance.rb --indir ~/Pictures/my-album
# Watch mode: process new arrivals automatically
ruby photo-enhance.rb --indir ~/Pictures/my-album --watch
# Destroy VM when done
ruby hyperstack.rb --config hyperstack-vm-photo.toml delete
```
### Configuration (`hyperstack-vm-photo.toml`)
| Key | Default | Description |
|-----|---------|-------------|
| `[vm].flavor_name` | `n3-L40x1` | Hyperstack GPU flavor (L40 48 GB, ~$1/hr) |
| `[network].wireguard_server_ip` | `192.168.3.4` | WireGuard IP (after VM1=.1, VM2=.3) |
| `[comfyui].port` | `8188` | ComfyUI REST API port (WireGuard subnet only) |
| `[comfyui].models_dir` | `/ephemeral/comfyui/models` | Model weights (ephemeral NVMe) |
| `[comfyui].models` | `["RealESRGAN_x4plus"]` | Pre-downloaded models |
### Custom workflows
The workflow JSON lives at `workflows/photo-enhance.json`. The `NODE_INPUT_IMAGE` placeholder
is substituted at runtime by `photo-enhance.rb` with the uploaded filename.
Swap in any ComfyUI-compatible workflow (e.g. add SUPIR for deeper restoration) by editing the JSON
or passing `--workflow path/to/other.json`.
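For example (the workflow filename below is hypothetical):
```bash
# Run the same directory through a custom workflow instead of the default
ruby photo-enhance.rb --indir ~/Pictures/my-album --workflow workflows/supir-restore.json
```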
### Performance (L40 48 GB)
| Operation | Time per photo |
|-----------|---------------|
| Real-ESRGAN enhance + scale back | ~50–60 s |
| Upload + download overhead | ~3 s |