# hypr
Automates Hyperstack GPU VM lifecycle: create, bootstrap, WireGuard tunnel, and vLLM inference.
Runs two A100 VMs concurrently — each serving a different model — with [Pi](https://pi.dev) coding agents connected to each.
## Architecture
```
┌─────────────┐
│ Laptop │
└──────┬──────┘
│ SSH (into bhyve VM, not the host)
│
FreeBSD physical host (earth)
┌─────────────────────────────────────────────────────────────────────────┐
│ │
│ FreeBSD bhyve VM (isolation layer) 192.168.3.2 / wg1 │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ ▲ │ │
│ │ │ SSH │ │
│ │ tmux session (tmux attach) │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ window 0 │ │ │
│ │ │ ┌───────────────────────┬─────────────────────────┐ │ │ │
│ │ │ │ pane 0: pi-nemotron │ pane 1: pi-coder │ │ │ │
│ │ │ │ │ │ │ │ │
│ │ │ │ Pi │ Pi │ │ │ │
│ │ │ │ Nemotron-3-Super │ Qwen3-Coder-Next │ │ │ │
│ │ │ └──────────┬────────────┘└────────────┬───────────┘ │ │ │
│ │ │ │ OpenAI API │ OpenAI API │ │ │
│ │ │ │ /v1/chat/completions │ /v1/chat/completions│ │ │
│ │ └─────────────┼──────────────────────────┼────────────────────┘ │ │
│ │ │ │ │ │
│ └──────────────┼───────────────────────────┼────────────────────────┘ │
│ │ WireGuard wg1 │ WireGuard wg1 │
└─────────────────┼───────────────────────────┼───────────────────────────┘
│ 192.168.3.0/24 │ 192.168.3.0/24
│ UDP :56710 │ UDP :56710
▼ ▼
┌──────────────────────────┐ ┌──────────────────────────┐
│ VM1 (A100 80GB) │ │ VM2 (A100 80GB) │
│ 192.168.3.1 │ │ 192.168.3.3 │
│ hyperstack1.wg1 │ │ hyperstack2.wg1 │
│ │ │ │
│ vLLM :11434 │ │ vLLM :11434 │
│ Nemotron-3-Super 120B │ │ Qwen3-Coder-Next 80B │
│ (Mamba+MoE, AWQ-4bit) │ │ (MoE, AWQ-4bit) │
└──────────────────────────┘ └──────────────────────────┘
```
**WireGuard topology:**
- Interface `wg1` on earth carries traffic to **both** VMs simultaneously
- earth is `192.168.3.2`; VM1 is `.1`; VM2 is `.3`; tunnel port is `56710/udp`
- Adding VM2 to an existing wg1 tunnel: `wg1-setup.sh` adds a second `[Peer]` block without disturbing VM1
- vLLM on each VM listens on `0.0.0.0:11434`, firewalled to `192.168.3.0/24` (WireGuard subnet only)
- Pi connects directly to each VM's vLLM over the tunnel — no proxy or load balancer
## Why Pi
- **Bring-your-own model** — connects to any OpenAI-compatible endpoint; no translation proxy needed between Pi and vLLM
- **Custom providers via `models.json`** — define `hyperstack`, `hyperstack1`, and `hyperstack2` providers once; fish abbreviations route to the right VM
- **Project-local config** — symlink this repo's `pi/` directory to `~/.pi`; Pi picks up `models.json`, `settings.json`, extensions, and skills automatically
- **TypeScript extensions** — custom behaviour (web search, loop scheduler, ask-mode) lives in `pi/agent/extensions/` and loads from the symlink
- **Minimal core** — no built-in sub-agents, plan mode, or permission popups; fast TUI with mid-session model switching via `Ctrl+L`
## Prerequisites
- Hyperstack account with API key in `~/.hyperstack`
- SSH key registered in Hyperstack as `earth` (or change `ssh.hyperstack_key_name` in the TOML)
- Review `[network].allowed_ssh_cidrs` and `[network].allowed_wireguard_cidrs` in your TOML.
The secure default is `["auto"]`, which resolves your current public egress IP to `/32`.
Set explicit CIDRs or `HYPERSTACK_OPERATOR_CIDR` if you deploy from a different network.
- WireGuard setup script: `wg1-setup.sh` (present in this directory)
- Ruby with `toml-rb` gem: `bundle install`
- [Pi](https://pi.dev) coding agent installed
## WireGuard setup
`hyperstack.rb` runs `wg1-setup.sh` automatically during `create` / `create-both`.
This section explains the tunnel design for reference and manual troubleshooting.
### Tunnel design
```
earth (192.168.3.2)
/etc/wireguard/wg1.conf
[Interface] Address = 192.168.3.2/24
[Peer]      # VM1 — AllowedIPs = 192.168.3.1/32, Endpoint = <VM1 public IP>:56710
[Peer]      # VM2 — AllowedIPs = 192.168.3.3/32, Endpoint = <VM2 public IP>:56710
```
A single `wg1` interface on earth carries traffic to both VMs. Each VM is a separate `[Peer]`
block. Adding VM2 to an existing tunnel with VM1 already running leaves VM1's peer untouched.
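The `.wg1` hostnames used throughout this README (`hyperstack1.wg1`, `hyperstack2.wg1`) are assumed here to be simple static name mappings on earth. `wg1-setup.sh` normally takes care of them; the sketch below is only for manual recovery if name resolution breaks after a hand-edit:
```bash
# Assumption: the .wg1 names are plain /etc/hosts entries on earth.
# wg1-setup.sh normally manages these; shown only for manual recovery.
printf '%s\n' \
  '192.168.3.1 hyperstack1.wg1' \
  '192.168.3.3 hyperstack2.wg1' | sudo tee -a /etc/hosts
```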
### Manual setup
```bash
# VM1 (first VM — generates fresh keys, writes /etc/wireguard/wg1.conf from scratch)
./wg1-setup.sh
# VM2 (additional VM — adds a [Peer] block to the existing wg1.conf)
./wg1-setup.sh 192.168.3.3 hyperstack2.wg1
```
### Verify the tunnel
```bash
# Show active peers and handshake times (both VMs should appear)
sudo wg show wg1
# Ping each VM through the tunnel
ping -c 3 192.168.3.1 # VM1
ping -c 3 192.168.3.3 # VM2
# Check vLLM is reachable over the tunnel
curl http://hyperstack1.wg1:11434/v1/models
curl http://hyperstack2.wg1:11434/v1/models
```
### Restart / recover
```bash
# Restart tunnel locally (e.g. after network change)
sudo systemctl restart wg-quick@wg1
# Restart tunnel on VM after a reboot (ssh via public IP since WireGuard is down)
ssh ubuntu@<vm-public-ip> 'sudo systemctl start wg-quick@wg1'
# Re-run setup when VM IP changes (e.g. after delete + recreate)
./wg1-setup.sh
./wg1-setup.sh 192.168.3.3 hyperstack2.wg1
```
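After a restart, a recent handshake for each peer confirms the tunnel came back up:
```bash
# One line per peer: public key and the epoch time of the last handshake
sudo wg show wg1 latest-handshakes
```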
## Quickstart (two-VM setup)
```bash
# Deploy both VMs in parallel, set up WireGuard + vLLM (~10 min)
ruby hyperstack.rb create-both
# Verify both VMs are working
ruby hyperstack.rb --config hyperstack-vm1.toml test
ruby hyperstack.rb --config hyperstack-vm2.toml test
# Launch Pi coding agents — one per terminal (fish abbreviations from hyperstack.fish)
pi-hyperstack-nemotron # Nemotron-3-Super 120B on VM1
pi-hyperstack-coder # Qwen3-Coder-Next on VM2
# Tear down both VMs
ruby hyperstack.rb delete-both
```
## Using Pi
[Pi](https://pi.dev) is the coding agent frontend used with this setup.
Each Hyperstack VM runs a vLLM instance; Pi connects to it directly over the WireGuard tunnel.
### Installation
Install Pi from [pi.dev](https://pi.dev), then link the project-local config into place:
```bash
ln -s /path/to/hyperstack/pi ~/.pi
```
This symlink makes Pi pick up `pi/agent/models.json` and `pi/agent/settings.json`
from this repo as its agent configuration, so the Hyperstack providers and model
definitions are available without any manual config editing.
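A quick sanity check that the symlink resolves into this repo:
```bash
# Both paths should resolve through the ~/.pi symlink to the repo's pi/ directory
ls -l ~/.pi/agent/models.json ~/.pi/agent/settings.json
```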
### Fish shell abbreviations
Source `hyperstack.fish` or copy the abbreviations into your Fish config:
```fish
abbr pi-hyperstack pi --model hyperstack/openai/gpt-oss-120b
abbr pi-hyperstack-nemotron pi --model hyperstack1/cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit
abbr pi-hyperstack-coder pi --model hyperstack2/bullpoint/Qwen3-Coder-Next-AWQ-4bit
```
Then launch a session after the VM(s) are up:
```fish
pi-hyperstack # single-VM → GPT-OSS 120B on hyperstack.wg1
pi-hyperstack-nemotron # two-VM → Nemotron-3-Super 120B on VM1
pi-hyperstack-coder # two-VM → Qwen3-Coder-Next 80B on VM2
```
### Model configuration (`pi/agent/models.json`)
Three providers are defined, one per setup, each pointing at its vLLM endpoint over WireGuard:
| Provider | Base URL | Primary model |
|----------|----------|---------------|
| `hyperstack` | `http://hyperstack.wg1:11434/v1` | GPT-OSS 120B (single-VM) |
| `hyperstack1` | `http://hyperstack1.wg1:11434/v1` | Nemotron-3-Super 120B |
| `hyperstack2` | `http://hyperstack2.wg1:11434/v1` | Qwen3-Coder-Next 80B |
All model presets from the TOML configs are registered under each provider, so any
model can be run on any VM after a `model switch` (see [Switching models](#switching-models)).
### Settings (`pi/agent/settings.json`)
```json
{
"defaultProvider": "openai",
"defaultModel": "gpt-4.1"
}
```
The default provider/model is OpenAI so that bare `pi` uses OpenAI rather than a Hyperstack VM.
Use the fish abbreviations above to route to a specific VM.
### Hot-switching models within Pi
After loading a different model on a VM with `model switch` (see [Switching models](#switching-models)),
tell Pi to use it without restarting the session:
```
model switch hyperstack1/openai/gpt-oss-120b
```
Pi sends subsequent requests to the new model ID immediately; the provider base URL stays the same.
## Extensions
Custom extensions live in `pi/agent/extensions/` and are loaded automatically via the `~/.pi` symlink.
| Extension | Purpose |
|-----------|---------|
| `web-search` | `web_search` and `web_fetch` tools — DuckDuckGo search + page fetching, no API key |
| `ask-mode` | `/ask` command — restricts the model to read-only exploration tools |
| `loop-scheduler` | `/loop` command — re-sends a prompt on a recurring interval |
| `inline-bash` | `!{cmd}` syntax — expands shell output inline before sending to the model |
| `session-name` | Auto-names sessions from the first message |
| `modal-editor` | Opens an external editor (`$VISUAL`) for composing long prompts |
| `handoff` | Compacts and hands off context to a fresh session |
| `fresh-subagent` | Spawns a sub-agent in a clean context for isolated tasks |
| `reload-runtime` | `/reload-runtime` command — hot-reloads extensions without restarting Pi |
| `nemotron-tool-repair` | Repairs malformed tool calls from Nemotron models |
| `agent-plan-mode` | Integrates task management into Pi sessions |
### Web search
The `web-search` extension registers two LLM-callable tools:
- **`web_search`** — searches DuckDuckGo and returns up to 8 results (title, URL, snippet)
- **`web_fetch`** — fetches a URL and returns up to 12,000 characters of readable text
Example prompts:
```
Search for the vLLM 0.9.0 changelog
Find the Qwen3-Coder model card and summarize the recommended vLLM flags
```
No API key or account required. Uses DuckDuckGo's free HTML endpoint.
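For a rough idea of what the tool does under the hood (an assumption about the extension's internals, not a documented interface), the same kind of query can be issued by hand:
```bash
# Hypothetical sketch: query DuckDuckGo's HTML (non-JS) endpoint directly.
# The extension's exact URL and parameters may differ.
curl -s 'https://html.duckduckgo.com/html/?q=vllm+prefix+caching' | head -c 500
```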
## Single-VM setup
A single VM can be deployed with the default config (GPT-OSS 120B):
```bash
ruby hyperstack.rb create # uses hyperstack-vm.toml
ruby hyperstack.rb test
pi-hyperstack # fish abbreviation → hyperstack/openai/gpt-oss-120b
ruby hyperstack.rb delete
```
## VM configuration
| Config file | Default model | WireGuard IP | Hostname |
|---|---|---|---|
| `hyperstack-vm1.toml` | Nemotron-3-Super 120B (AWQ-4bit) | `192.168.3.1` | `hyperstack1.wg1` |
| `hyperstack-vm2.toml` | Qwen3-Coder-Next 80B (AWQ-4bit) | `192.168.3.3` | `hyperstack2.wg1` |
| `hyperstack-vm.toml` | GPT-OSS 120B (single-VM mode) | `192.168.3.1` | `hyperstack.wg1` |
Each VM has independent state files so they can be managed separately:
```bash
ruby hyperstack.rb --config hyperstack-vm1.toml status
ruby hyperstack.rb --config hyperstack-vm2.toml status
```
## Switching models
Each VM has named model presets in its TOML config. Hot-switch without reprovisioning:
```bash
ruby hyperstack.rb --config hyperstack-vm1.toml model switch qwen3-coder-next
ruby hyperstack.rb --config hyperstack-vm2.toml model switch nemotron-super
```
Available presets (both VMs share the same set):
| Preset | Model | VRAM | Context |
|---|---|---|---|
| `nemotron-super` | Nemotron-3-Super 120B (Mamba+MoE, 12B active) | ~60 GB | 131K |
| `qwen3-coder-next` | Qwen3-Coder-Next 80B (MoE, AWQ-4bit) | ~45 GB | 262K |
| `gpt-oss-120b` | GPT-OSS 120B (MoE, MXFP4) | ~65 GB | 131K |
| `gpt-oss-20b` | GPT-OSS 20B (MoE, MXFP4) | ~14 GB | 65K |
| `qwen25-coder-32b` | Qwen2.5-Coder-32B-Instruct (AWQ) | ~18 GB | 32K |
| `qwen3-coder-30b` | Qwen3-Coder-30B-A3B (MoE, AWQ) | ~18 GB | 65K |
| `deepseek-r1-32b` | DeepSeek-R1-Distill-Qwen-32B (AWQ) | ~18 GB | 32K |
| `qwen3-32b` | Qwen3-32B (AWQ) | ~18 GB | 32K |
| `devstral` | Devstral-Small-2507 (AWQ-4bit) | ~15 GB | 32K |
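After a switch completes, the vLLM endpoint on that VM should report the new model ID, which can be confirmed over the tunnel:
```bash
# VM1 should now list the newly loaded model
curl -s http://hyperstack1.wg1:11434/v1/models | python3 -m json.tool
```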
## CLI reference
```
ruby hyperstack.rb [--config path] [options]
Commands:
create Deploy a new VM and run full provisioning
create-both Deploy VM1 + VM2 in parallel (uses hyperstack-vm1/vm2.toml)
delete Destroy the tracked VM
delete-both Destroy both VM1 and VM2
status Show VM and WireGuard status
watch Live dashboard: vLLM + GPU stats for all active VMs (refreshes every 5 s)
test Run end-to-end inference tests (vLLM)
model switch Hot-switch the running vLLM model
create / create-both options:
--replace Delete existing tracked VM before creating
--dry-run Print the plan without making changes
--vllm / --no-vllm Override config: enable/disable vLLM setup
--ollama / --no-ollama Override config: enable/disable Ollama setup
```
## Configuration
Edit `hyperstack-vm1.toml` / `hyperstack-vm2.toml` (or `hyperstack-vm.toml` for single-VM).
Key sections:
| Section | Purpose |
|---------|---------|
| `[vm]` | Flavor, image, environment name |
| `[vllm]` | Model, container settings, and vLLM runtime options |
| `[vllm.presets.*]` | Named model presets for hot-switching |
| `[ollama]` | Ollama settings (disabled by default; set `install = true` to use instead) |
| `[network]` | Ports, WireGuard subnet, allowed CIDRs |
| `[wireguard]` | Auto-setup script path |
`allowed_ssh_cidrs` and `allowed_wireguard_cidrs` accept either explicit CIDRs such as
`["203.0.113.4/32"]` or `["auto"]`. `auto` resolves the current public operator IP at runtime;
set `HYPERSTACK_OPERATOR_CIDR` to override that detection when needed.
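For example, when deploying from a network whose egress IP should not be auto-detected, the override can be set for a single run (the CIDR below is a documentation placeholder):
```bash
# Pin the operator CIDR explicitly instead of relying on "auto" detection,
# and preview the resulting plan with --dry-run before applying.
HYPERSTACK_OPERATOR_CIDR=203.0.113.4/32 ruby hyperstack.rb create --dry-run
```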
SSH host keys are pinned per state file in `.known_hosts`. `delete` and `--replace`
clear that trust file for intentional reprovisioning; an unexpected host key change fails closed.
## Automated setup reference
`hyperstack.rb` handles the full VM lifecycle automatically. All steps below
(VM creation, WireGuard tunnel, vLLM Docker container) run in a single command.
### Single-VM setup
```bash
# Deploy VM, configure WireGuard tunnel, pull and start vLLM (~10 min)
ruby hyperstack.rb create
# Run end-to-end inference test over the tunnel
ruby hyperstack.rb test
# Launch Pi coding agent connected to GPT-OSS 120B on the VM
pi-hyperstack # fish abbreviation from hyperstack.fish
# Tear down the VM and remove WireGuard peer
ruby hyperstack.rb delete
```
### Two-VM setup
```bash
# Deploy both VMs in parallel, set up tunnel and vLLM on each (~10 min)
ruby hyperstack.rb create-both
# Test each VM individually
ruby hyperstack.rb --config hyperstack-vm1.toml test
ruby hyperstack.rb --config hyperstack-vm2.toml test
# Launch Pi coding agents — one per terminal
pi-hyperstack-nemotron # fish abbreviation → Nemotron-3-Super 120B on VM1
pi-hyperstack-coder # fish abbreviation → Qwen3-Coder-Next 80B on VM2
# Tear down both VMs
ruby hyperstack.rb delete-both
```
### Hot-switching models without reprovisioning
```bash
# Switch the running vLLM container to a different model preset
ruby hyperstack.rb --config hyperstack-vm1.toml model switch qwen3-coder-next
ruby hyperstack.rb --config hyperstack-vm2.toml model switch nemotron-super
```
See the [VM configuration](#vm-configuration) and [Switching models](#switching-models)
sections for available presets and config options.
## Manual vLLM Docker setup
This section covers manual vLLM deployment for debugging or running outside the
automation. The `hyperstack.rb` provisioner handles all of this automatically.
### Prerequisites
- VM with NVIDIA GPU, CUDA ≥ 12.x, driver ≥ 535, and Docker with `nvidia-container-toolkit`
- WireGuard `wg1` tunnel configured (see `wg1-setup.sh`)
- If Ollama was previously running: `sudo systemctl stop ollama && sudo systemctl disable ollama`
### Storage setup
Model cache on ephemeral NVMe (fast; re-downloads if lost on VM restart):
```bash
sudo mkdir -p /ephemeral/hug
sudo chmod -R 0777 /ephemeral/hug
```
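The AWQ checkpoints in the preset table run into tens of GB, so it is worth confirming the ephemeral volume has room before the first pull:
```bash
# The ephemeral NVMe should have comfortably more free space than the model size
df -h /ephemeral
```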
### Run the vLLM container
The model downloads on first start (~45 GB, ~2.5 min). Cold start after download: ~4–5 min.
```bash
docker pull vllm/vllm-openai:latest
docker run -d \
--gpus all \
--ipc=host \
--network host \
--name vllm_qwen3 \
--restart always \
-v /ephemeral/hug:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
--tensor-parallel-size 1 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--max-model-len 262144 \
--host 0.0.0.0 \
--port 11434
```
Key flags:
| Flag | Purpose |
|------|---------|
| `--gpus all` | Expose all GPUs to the container |
| `--ipc=host` | Host IPC namespace so PyTorch can use the host's shared memory (avoids `/dev/shm` limits) |
| `--network host` | Host networking so vLLM's port 11434 is reachable directly over the WireGuard tunnel |
| `--restart always` | Auto-restart the container on VM reboot |
| `-v /ephemeral/hug:...` | Model cache on fast ephemeral NVMe |
| `--tensor-parallel-size 1` | Single GPU (use 2/4 for multi-GPU) |
| `--enable-auto-tool-choice` | Enable function/tool calling |
| `--tool-call-parser qwen3_coder` | Parser for Qwen3-Coder tool format |
| `--enable-prefix-caching` | Block-level KV cache reuse across requests |
| `--gpu-memory-utilization 0.92` | Use 92% of VRAM; rest for OS/overhead |
| `--max-model-len 262144` | Full 256k context window |
| `--host 0.0.0.0` | Bind to all interfaces (WireGuard access requires this) |
| `--port 11434` | Reuse Ollama port for firewall compatibility |
### Verify startup
```bash
# Wait for "Application startup complete"
docker logs -f vllm_qwen3 2>&1 | grep -E "startup complete|Error"
# Confirm model is loaded
curl -s http://localhost:11434/v1/models | python3 -m json.tool
# Quick inference test
curl -s http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
"messages":[{"role":"user","content":"Hello"}],
"max_tokens":50}'
```
### Firewall
```bash
sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp comment 'vLLM via wg1'
```
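To confirm the rule is active and nothing broader exposes the port:
```bash
# Only the WireGuard subnet (192.168.3.0/24) should be allowed to reach 11434
sudo ufw status numbered | grep 11434
```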
### Client configuration
Use the VM's WireGuard IP (`.1` for VM1, `.3` for VM2):
```bash
# VM1 (hyperstack1.wg1 = 192.168.3.1)
OPENAI_BASE_URL=http://192.168.3.1:11434/v1 OPENAI_API_KEY=EMPTY pi
# VM2 (hyperstack2.wg1 = 192.168.3.3)
OPENAI_BASE_URL=http://192.168.3.3:11434/v1 OPENAI_API_KEY=EMPTY pi
```
### Replacing the running container
To serve a different model, stop the current container and start a new one:
```bash
docker stop vllm_qwen3 && docker rm vllm_qwen3
# Example: smaller 30B model (fits easily, faster)
docker run -d \
--gpus all --ipc=host --network host \
--name vllm_qwen3_30b --restart always \
-v /ephemeral/hug:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model Qwen/Qwen3-Coder-30B-AWQ \
--tensor-parallel-size 1 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--enable-prefix-caching \
--gpu-memory-utilization 0.92 --max-model-len 131072 \
--host 0.0.0.0 --port 11434
```
## Why vLLM instead of Ollama
- **FlashAttention v2**: ~1.5–2× faster prefill for long prompts
- **Block-level prefix caching**: partial KV cache reuse even when the prompt changes mid-sequence (Ollama requires an exact prefix match from token 0)
- **Chunked prefill**: can interleave prefill and decode
- **Marlin kernels** for AWQ MoE quantization
## Monitoring vLLM
The `watch` command provides a built-in terminal dashboard that polls all active VMs every 5 seconds:
```bash
ruby hyperstack.rb watch
```
When two VMs are active the panels are shown side-by-side; a single VM uses a vertical layout.
Press `Ctrl-C` to exit.
Each VM panel shows:
| Row | Source | What it means |
|-----|--------|---------------|
| GPU header | `nvidia-smi` | Device index, name, temperature, power draw |
| **util** bar | `nvidia-smi` | GPU compute utilisation % |
| **VRAM** bar | `nvidia-smi` | GPU memory used / total |
| **throughput** | vLLM engine log | Rolling-average prefill tok/s and decode tok/s |
| **requests** | vLLM engine log | Running / waiting / swapped request counts |
| **KV cache** bar | vLLM engine log | GPU KV-cache fill % |
| **cache hits** bar | vLLM engine log | Prefix-cache hit rate % |
Stats are collected via a single SSH call per VM over the WireGuard tunnel (`hyperstack1.wg1` etc.).
`nvidia-smi` provides hardware metrics; vLLM engine stats are read from `docker logs --tail 200`
filtered to the "Engine 0" line that vLLM emits every few seconds.
For lower-level ad-hoc inspection:
```bash
# Live engine stats (throughput, KV cache, prefix cache hit rate)
ssh ubuntu@hyperstack1.wg1 'docker logs -f vllm_nemotron_super 2>&1 | grep "Engine 0"'
# GPU stats (every 5 s)
ssh ubuntu@hyperstack1.wg1 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5'
# Last-minute stats (one-shot, no follow)
ssh ubuntu@hyperstack1.wg1 'docker logs --since 1m vllm_nemotron_super 2>&1 | grep "Engine 0"'
# Request-level monitoring
ssh ubuntu@hyperstack1.wg1 'docker logs -f vllm_nemotron_super 2>&1 | grep "POST"'
```
Engine metrics key fields:
| Field | Meaning |
|-------|---------|
| Avg prompt throughput | Prefill speed (tokens/s) — higher is faster |
| Avg generation throughput | Decode speed (tokens/s) |
| GPU KV cache usage | % of KV cache memory in use (proportional to active context vs max capacity) |
| Prefix cache hit rate | % of prompt tokens served from cache |
| Running / Waiting | Active and queued request counts |
Healthy baseline (H100 SXM 80GB, Nemotron-3-Super-120B AWQ):
| Metric | Expected |
|--------|----------|
| Prefill throughput | 5,000–11,000 tok/s |
| Decode throughput | 20–100 tok/s (varies with batch size) |
| KV cache usage | 2–5% for typical sessions |
| Temperature | 50–70°C under load, <50°C idle |
| Power | ~100 W idle, 300–350 W under load per GPU |
Warning signs:
- **Waiting > 0 for extended periods** — requests queuing, model overloaded
- **KV cache usage near 100%** — context too long, reduce `--max-model-len`
- **Decode throughput < 20 tok/s sustained** — possible thermal throttling (see the check after this list)
- **Prefill throughput < 2,000 tok/s** — check for CPU offload or driver issues
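To distinguish thermal throttling from plain load, the driver's throttle state can be inspected directly on the VM (hostname as used elsewhere in this README):
```bash
# Look for active SW/HW thermal slowdown flags under "Clocks Throttle Reasons"
ssh ubuntu@hyperstack1.wg1 'nvidia-smi -q -d PERFORMANCE'
```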
## Troubleshooting
| Problem | Fix |
|---------|-----|
| OOM on startup with `--max-model-len 262144` | Reduce to `131072` or `65536` |
| Prefix cache hit rate stays at 0% | Normal when prompts vary heavily turn-to-turn |
| vLLM container won't start (CUDA mismatch) | Check `nvidia-smi`; vLLM requires CUDA ≥ 12.x and driver ≥ 535 |
| Still OOM after reducing context | Lower `gpu_memory_utilization` to `0.85` or use a smaller model |
## VRAM sizing guide
Rule of thumb for a single A100 80 GB at 92% utilization (~75 GiB usable):
| Model size (params) | AWQ 4-bit VRAM | Max context (remaining for KV) |
|---|---|---|
| 7–8B | ~5 GiB | 262k+ (plenty of KV headroom) |
| 14B | ~9 GiB | 262k+ (plenty of KV headroom) |
| 30–32B | ~18 GiB | 262k (~57 GiB for KV cache) |
| 70–80B (MoE, 3B active) | ~45 GiB | 262k (~27 GiB for KV cache) |
| 70B (dense) | ~38 GiB | 131k (~37 GiB for KV cache) |
| 120B+ | won't fit | use multi-GPU or smaller quant |
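As a rough check of these numbers: 4-bit weights take about 0.5 bytes per parameter, so a dense 32B model is roughly 16 GiB of weights plus runtime overhead, in line with the ~18 GiB row; whatever remains of the ~75 GiB budget is available for KV cache.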
Supported quantization formats:
- **AWQ** (recommended): fast Marlin kernels, good quality
- **GPTQ**: similar to AWQ, widely available
- **FP8**: 8-bit, needs Hopper+ GPUs (H100/H200)
- **BF16/FP16**: full precision, needs more VRAM
Search Hugging Face for vLLM-compatible quantized models by adding `awq` to the search term, e.g.
`https://huggingface.co/models?search=qwen3+awq`
## Performance characteristics
Measured on A100 80 GB PCIe (single GPU) with Qwen3-Coder-Next AWQ 4-bit:
| Metric | vLLM (AWQ 4-bit) | Ollama (Q4_K_M) |
|--------|-------------------|-----------------|
| Prefill throughput | 5,000–11,000 tok/s | ~1,000 tok/s (est.) |
| Decode throughput | 40–99 tok/s | ~40 tok/s |
| Per-turn latency | ~10–15 s | ~28 s (32k ctx) |
| Context window | 262k (full, no truncation) | 32k (was truncating) |
| VRAM usage | 75 GiB (more KV cache) | 52–61 GiB |
## Photo enhancement (ComfyUI)
A separate VM setup (`hyperstack-vm-photo.toml`) runs [ComfyUI](https://github.com/comfyanonymous/ComfyUI)
on an L40 GPU for Photolemur-style automatic photo enhancement. No prompts needed — drop photos in,
get enhanced photos out.
### How it works
The pipeline runs Real-ESRGAN x4plus in "enhance in place" mode:
upscale 4× (noise reduction, sharpening, colour correction) → scale back to the original resolution.
Output is saved as JPEG at quality 92, so file sizes stay close to the originals.
### Quickstart
```sh
# Provision the L40 VM (~$1/hr, ~8 min first-time setup including model download)
ruby hyperstack.rb --config hyperstack-vm-photo.toml create
# Check connectivity
ruby photo-enhance.rb --test
# Enhance all photos in a directory (outputs _enhanced.jpg alongside originals)
ruby photo-enhance.rb --indir ~/Pictures/my-album
# Watch mode: process new arrivals automatically
ruby photo-enhance.rb --indir ~/Pictures/my-album --watch
# Destroy VM when done
ruby hyperstack.rb --config hyperstack-vm-photo.toml delete
```
### Configuration (`hyperstack-vm-photo.toml`)
| Key | Default | Description |
|-----|---------|-------------|
| `[vm].flavor_name` | `n3-L40x1` | Hyperstack GPU flavor (L40 48 GB, ~$1/hr) |
| `[network].wireguard_server_ip` | `192.168.3.4` | WireGuard IP (after VM1=.1, VM2=.3) |
| `[comfyui].port` | `8188` | ComfyUI REST API port (WireGuard subnet only) |
| `[comfyui].models_dir` | `/ephemeral/comfyui/models` | Model weights (ephemeral NVMe) |
| `[comfyui].models` | `["RealESRGAN_x4plus"]` | Pre-downloaded models |
### Custom workflows
The workflow JSON lives at `workflows/photo-enhance.json`. The `NODE_INPUT_IMAGE` placeholder
is substituted at runtime by `photo-enhance.rb` with the uploaded filename.
Swap in any ComfyUI-compatible workflow (e.g. add SUPIR for deeper restoration) by editing the JSON
or passing `--workflow path/to/other.json`.
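For example (the workflow filename below is hypothetical):
```bash
# Run the same directory through a custom workflow instead of the default
ruby photo-enhance.rb --indir ~/Pictures/my-album --workflow workflows/supir-restore.json
```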
### Performance (L40 48 GB)
| Operation | Time per photo |
|-----------|---------------|
| Real-ESRGAN enhance + scale back | ~50–60 s |
| Upload + download overhead | ~3 s |