| author | Paul Buetow <paul@buetow.org> | 2026-03-21 12:46:30 +0200 |
|---|---|---|
| committer | Paul Buetow <paul@buetow.org> | 2026-03-21 12:46:30 +0200 |
| commit | dd621fefb33ee006f8d2855caa9f88a268717a9a (patch) | |
| tree | 45fec50ae97950cd2eb31bfeea41782919c865fa | |
| parent | d54c42d6c6a9d559b912c7f5330397e236ea407b (diff) | |
Consolidate vllm-setup.txt into README.md and remove the file
Merged all still-relevant content from vllm-setup.txt into README.md:
- Why vLLM over Ollama section
- Full monitoring commands with engine metrics table
- Troubleshooting table
- VRAM sizing guide
- Performance characteristics table
Dropped LiteLLM, Anthropic API, Claude Code, and OpenCode sections
which are no longer applicable. Removed the vllm-setup.txt file.
| -rw-r--r-- | README.md | 75 |
| -rwxr-xr-x | hyperstack.rb | 54 |
| -rw-r--r-- | pi/agent/extensions/loop-scheduler/README.md | 2 |
| -rw-r--r-- | vllm-setup.txt | 322 |
4 files changed, 128 insertions, 325 deletions
@@ -183,6 +183,13 @@ set `HYPERSTACK_OPERATOR_CIDR` to override that detection when needed. SSH host keys are pinned per state file in `<state>.known_hosts`. `delete` and `--replace` clear that trust file for intentional reprovisioning; unexpected host key changes now fail closed. +## Why vLLM instead of Ollama + +- **FlashAttention v2**: ~1.5–2× faster prefill for long prompts +- **Block-level prefix caching**: partial KV cache reuse even when the prompt changes mid-sequence (Ollama requires an exact prefix match from token 0) +- **Chunked prefill**: can interleave prefill and decode +- **Marlin kernels** for AWQ MoE quantization + ## Monitoring vLLM ```bash @@ -192,8 +199,23 @@ ssh ubuntu@<vm-ip> 'docker logs -f vllm_nemotron_super 2>&1 | grep "Engine 000"' # GPU stats (every 5 s) ssh ubuntu@<vm-ip> 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5' +# Last-minute stats (one-shot, no follow) +ssh ubuntu@<vm-ip> 'docker logs --since 1m vllm_nemotron_super 2>&1 | grep "Engine 000"' + +# Request-level monitoring +ssh ubuntu@<vm-ip> 'docker logs -f vllm_nemotron_super 2>&1 | grep "POST"' ``` +Engine metrics key fields: + +| Field | Meaning | +|-------|---------| +| Avg prompt throughput | Prefill speed (tokens/s) — higher is faster | +| Avg generation throughput | Decode speed (tokens/s) — ~40–99 on A100 PCIe | +| GPU KV cache usage | % of KV cache memory in use (proportional to active context vs max capacity) | +| Prefix cache hit rate | % of prompt tokens served from cache | +| Running / Waiting | Active and queued request counts | + Healthy baseline (A100 80GB PCIe): | Metric | Expected | @@ -201,5 +223,56 @@ Healthy baseline (A100 80GB PCIe): | Prefill throughput | 5,000–11,000 tok/s | | Decode throughput | 40–99 tok/s | | KV cache usage | 2–5% for typical sessions | +| Temperature | 44–60°C under load, <45°C idle | +| Power | 70 W idle, 230–240 W under load, 300 W max | + +Warning signs: + +- **Waiting > 0 for extended periods** — requests queuing, model overloaded +- **KV cache usage near 100%** — context too long, reduce `--max-model-len` +- **Decode throughput < 20 tok/s sustained** — possible thermal throttling +- **Prefill throughput < 2,000 tok/s** — check for CPU offload or driver issues + +## Troubleshooting + +| Problem | Fix | +|---------|-----| +| OOM on startup with `--max-model-len 262144` | Reduce to `131072` or `65536` | +| Prefix cache hit rate stays at 0% | Normal when prompts vary heavily turn-to-turn | +| vLLM container won't start (CUDA mismatch) | Check `nvidia-smi`; vLLM requires CUDA ≥ 12.x and driver ≥ 535 | +| Still OOM after reducing context | Lower `gpu_memory_utilization` to `0.85` or use a smaller model | + +## VRAM sizing guide + +Rule of thumb for a single A100 80 GB at 92% utilization (~75 GiB usable): + +| Model size (params) | AWQ 4-bit VRAM | Max context (remaining for KV) | +|---|---|---| +| 7–8B | ~5 GiB | 262k+ (plenty of KV headroom) | +| 14B | ~9 GiB | 262k+ (plenty of KV headroom) | +| 30–32B | ~18 GiB | 262k (~57 GiB for KV cache) | +| 70–80B (MoE, 3B active) | ~45 GiB | 262k (~27 GiB for KV cache) | +| 70B (dense) | ~38 GiB | 131k (~37 GiB for KV cache) | +| 120B+ | won't fit | use multi-GPU or smaller quant | + +Supported quantization formats: + +- **AWQ** (recommended): fast Marlin kernels, good quality +- **GPTQ**: similar to AWQ, widely available +- **FP8**: 8-bit, needs Hopper+ GPUs (H100/H200) +- **BF16/FP16**: full precision, needs more VRAM + +Search HuggingFace for vLLM-compatible 
quantized models: +`https://huggingface.co/models?search=<model-name>+awq` + +## Performance characteristics + +Measured on A100 80 GB PCIe (single GPU) with Qwen3-Coder-Next AWQ 4-bit: -See `vllm-setup.txt` for detailed vLLM setup notes, VRAM sizing guide, and troubleshooting. +| Metric | vLLM (AWQ 4-bit) | Ollama (Q4_K_M) | +|--------|-------------------|-----------------| +| Prefill throughput | 5,000–11,000 tok/s | ~1,000 tok/s (est.) | +| Decode throughput | 40–99 tok/s | ~40 tok/s | +| Per-turn latency | ~10–15 s | ~28 s (32k ctx) | +| Context window | 262k (full, no truncation) | 32k (was truncating) | +| VRAM usage | 75 GiB (more KV cache) | 52–61 GiB | diff --git a/hyperstack.rb b/hyperstack.rb index a3af491..f7bfe69 100755 --- a/hyperstack.rb +++ b/hyperstack.rb @@ -718,6 +718,10 @@ module HyperstackVM request(:post, "/core/virtual-machines/#{vm_id}/sg-rules", payload) end + def delete_vm_rule(vm_id, rule_id) + request(:delete, "/core/virtual-machines/#{vm_id}/sg-rules/#{rule_id}") + end + private def request(method, path, payload = nil) @@ -1225,6 +1229,21 @@ module HyperstackVM script.join("\n") end + def litellm_decommission_script + script = [] + script << 'set -euo pipefail' + script << 'sudo systemctl stop litellm 2>/dev/null || true' + script << 'sudo systemctl disable litellm 2>/dev/null || true' + script << 'sudo rm -f /etc/systemd/system/litellm.service' + script << 'sudo systemctl daemon-reload' + script << 'sudo rm -f /ephemeral/litellm-config.yaml' + script << 'sudo rm -rf /ephemeral/litellm-env' + script << 'sudo rm -f /ephemeral/litellm.log' + script << "sudo ufw --force delete allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port 4000 proto tcp >/dev/null 2>&1 || true" + script << 'echo litellm-decommission-ok' + script.join("\n") + end + private def normalized_model_list(models) @@ -1289,6 +1308,12 @@ module HyperstackVM raise Error, "vLLM install failed: #{output.strip}" unless status.success? end + def decommission_litellm(host) + info "Removing deprecated LiteLLM service from #{host} if present..." + output, status = @ssh_stream_runner.call(host, @scripts.litellm_decommission_script) + raise Error, "LiteLLM decommission failed: #{output.strip}" unless status.success? + end + def setup_vllm_stack(host, preset_config: nil) install_vllm(host, preset_config: preset_config) end @@ -1508,6 +1533,8 @@ module HyperstackVM host = state['public_ip'] raise Error, 'No public IP in state file.' if host.nil? || host.empty? + @provisioner.decommission_litellm(host) + # Stop the old container only when it has a different name from the new one. if old_container != new_container @provisioner.stop_vllm_container(host, old_container) @@ -1574,6 +1601,7 @@ module HyperstackVM @state_store.save(state) wait_for_ssh(state['public_ip']) + @provisioner.decommission_litellm(state['public_ip']) if @config.guest_bootstrap_enabled? && state['bootstrapped_at'].nil? @provisioner.bootstrap_guest(state['public_ip']) state['bootstrapped_at'] = Time.now.utc.iso8601 @@ -1730,13 +1758,27 @@ module HyperstackVM end def ensure_security_rules(vm) - existing = Array(vm['security_rules']).map { |rule| normalize_rule(rule) } + existing_rules = Array(vm['security_rules']) + existing = existing_rules.map { |rule| normalize_rule(rule) } desired = desired_security_rules.map { |rule| normalize_rule(rule) } (desired - existing).each do |rule| info "Adding Hyperstack firewall rule #{rule['protocol']} #{rule['remote_ip_prefix']} #{rule['port_range_min']}..." 
@client.create_vm_rule(vm['id'], rule) end + + legacy_litellm_rules(existing_rules).each do |rule| + rule_id = rule['id'] || rule['rule_id'] + unless rule_id + warn 'Found legacy Hyperstack firewall rule for port 4000, but the API payload has no rule id; remove it manually from the Hyperstack console.' + next + end + + info "Removing legacy Hyperstack firewall rule #{rule['protocol']} #{rule['remote_ip_prefix']} #{rule['port_range_min']}..." + @client.delete_vm_rule(vm['id'], rule_id) + rescue Error => e + warn "Failed to remove legacy Hyperstack firewall rule #{rule_id}: #{e.message}" + end end def ollama_setup_needed?(state) @@ -1998,6 +2040,16 @@ module HyperstackVM desired_security_rules(include_vllm: state_vllm_enabled?(state), include_ollama: state_ollama_enabled?(state)) end + def legacy_litellm_rules(rules) + Array(rules).select do |rule| + normalized = normalize_rule(rule) + normalized['protocol'] == 'tcp' && + normalized['port_range_min'] == 4000 && + normalized['port_range_max'] == 4000 && + normalized['remote_ip_prefix'] == @config.wireguard_subnet + end + end + def state_vllm_enabled?(state) recorded = state&.dig('services', 'vllm_enabled') return recorded unless recorded.nil? diff --git a/pi/agent/extensions/loop-scheduler/README.md b/pi/agent/extensions/loop-scheduler/README.md index 78a6635..65ab10f 100644 --- a/pi/agent/extensions/loop-scheduler/README.md +++ b/pi/agent/extensions/loop-scheduler/README.md @@ -2,7 +2,7 @@ Session-scoped recurring prompts for Pi. -This extension adds a Claude-Code-style `/loop` command for interactive Pi +This extension adds a recurring `/loop` command for interactive Pi sessions. It schedules a prompt to be re-sent on an interval while the current Pi process stays open. diff --git a/vllm-setup.txt b/vllm-setup.txt deleted file mode 100644 index 9ff424e..0000000 --- a/vllm-setup.txt +++ /dev/null @@ -1,322 +0,0 @@ -# vLLM Setup for Hyperstack VM -# -# This document describes the full deployment of qwen3-coder-next (AWQ 4-bit) -# via vLLM exposed directly on the OpenAI-compatible API. 
-# -# Architecture: -# -# Pi (earth) Hyperstack VM (A100 80GB) -# ┌─────────────┐ ┌──────────────────────────────┐ -# │ pi │── OpenAI API ──────> │ vLLM engine (:11434) │ -# │ │ /v1/chat/completions│ FlashAttention v2 │ -# └─────────────┘ via WireGuard wg1 │ prefix caching │ -# │ bullpoint/Qwen3-Coder- │ -# │ Next-AWQ-4bit (45GB) │ -# └──────────────────────────────┘ -# -# Why vLLM instead of Ollama: -# - FlashAttention v2: ~1.5-2x faster prefill for long prompts -# - Block-level prefix caching: partial KV cache reuse even when prompt -# changes mid-sequence (Ollama requires exact prefix match from token 0) -# - Chunked prefill: can interleave prefill and decode -# - Marlin kernels for AWQ MoE quantization -# -# Model details: -# - Name: bullpoint/Qwen3-Coder-Next-AWQ-4bit (HuggingFace) -# - Architecture: MoE, 80B total params, 3B active per token -# - 512 experts, 10 activated + 1 shared per token -# - Hybrid attention: Gated DeltaNet + Gated Attention (48 layers) -# - Quantization: AWQ 4-bit, group size 32 -# - Disk size: ~45GB (vs ~151GB at BF16) -# - VRAM usage: ~45GB weights + ~27GB KV cache at 92% utilization -# - Context: 262,144 tokens (256k native) -# - vLLM requirement: >= 0.15.0 -# -# Hardware requirements: -# - Minimum: 1x A100 80GB (PCIe or SXM) -# - VRAM breakdown at gpu_memory_utilization=0.92: -# Model weights: ~45 GiB -# KV cache: ~27 GiB (298k tokens capacity, 4.49x concurrency at 262k) -# CUDA graphs: ~3 GiB -# Total: ~75 GiB / 80 GiB -# -# Ports: -# 11434/tcp - vLLM OpenAI-compatible API (reuses Ollama port for firewall compat) -# Restricted to 192.168.3.0/24 (WireGuard wg1 subnet) - -# =========================================================================== -# STEP 1: Prerequisites -# =========================================================================== -# - VM with NVIDIA GPU, CUDA drivers, and Docker with nvidia-container-toolkit -# - WireGuard wg1 tunnel already configured (see wg1-setup.sh) -# - Ollama stopped and disabled if previously running: -# -# sudo systemctl stop ollama -# sudo systemctl disable ollama - -# =========================================================================== -# STEP 2: Storage setup -# =========================================================================== -# HuggingFace model cache on ephemeral storage (fast NVMe, survives reboots -# on some providers but not guaranteed — model will re-download if lost). -# -# sudo mkdir -p /ephemeral/hug -# sudo chmod -R 0777 /ephemeral/hug - -# =========================================================================== -# STEP 3: vLLM Docker container -# =========================================================================== -# Pull and run vLLM. The model downloads on first start (~45GB, ~2.5 min). -# After download, model loading takes ~65s and CUDA graph capture ~35s. -# Total cold start: ~4-5 minutes. 
-# -# docker pull vllm/vllm-openai:latest -# -# docker run -d \ -# --gpus all \ -# --ipc=host \ -# --network host \ -# --name vllm_qwen3 \ -# --restart always \ -# -v /ephemeral/hug:/root/.cache/huggingface \ -# vllm/vllm-openai:latest \ -# --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \ -# --tensor-parallel-size 1 \ -# --enable-auto-tool-choice \ -# --tool-call-parser qwen3_coder \ -# --enable-prefix-caching \ -# --gpu-memory-utilization 0.92 \ -# --max-model-len 262144 \ -# --host 0.0.0.0 \ -# --port 11434 -# -# Flags explained: -# --tensor-parallel-size 1 Single GPU (use 2/4 for multi-GPU setups) -# --enable-auto-tool-choice Enables function/tool calling -# --tool-call-parser qwen3_coder Parser for qwen3-coder tool format -# --enable-prefix-caching Block-level KV cache reuse across requests -# --gpu-memory-utilization 0.92 Use 92% of VRAM (rest for OS/overhead) -# --max-model-len 262144 Full 256k context window -# --port 11434 Reuse Ollama port for firewall compatibility -# -# Verify startup (wait for "Application startup complete"): -# docker logs -f vllm_qwen3 2>&1 | grep -E "startup complete|Error" -# -# Verify model loaded: -# curl -s http://localhost:11434/v1/models | python3 -m json.tool -# -# Quick inference test: -# curl -s http://localhost:11434/v1/chat/completions \ -# -H "Content-Type: application/json" \ -# -H "Authorization: Bearer EMPTY" \ -# -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit", -# "messages":[{"role":"user","content":"Hello"}], -# "max_tokens":50}' -# -# Monitor performance (prefix cache hit rate, throughput): -# docker logs -f vllm_qwen3 2>&1 | grep "Engine 000" - -# =========================================================================== -# STEP 4: Firewall rules -# =========================================================================== -# Allow access from WireGuard subnet only: -# -# sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp \ -# comment 'vLLM via wg1' -# =========================================================================== -# STEP 5: Client configuration (on earth / local machine) -# =========================================================================== -# -# Launch Pi or any OpenAI-compatible client directly against vLLM: -# -# OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \ -# OPENAI_API_KEY=EMPTY \ -# pi - -# =========================================================================== -# STEP 6: Monitoring & troubleshooting -# =========================================================================== -# -# --- Live engine stats --- -# vLLM logs engine metrics every 10 seconds. 
Key fields: -# - Avg prompt throughput: prefill speed (tokens/s), higher = faster -# - Avg generation throughput: decode speed (tokens/s), ~40-99 on A100 PCIe -# - GPU KV cache usage: % of KV cache memory in use (proportional to -# active context length vs max capacity) -# - Prefix cache hit rate: % of prompt tokens served from cache -# - Running/Waiting: active and queued request counts -# -# Follow live (all stats): -# docker logs -f vllm_qwen3 2>&1 | grep "Engine 000" -# -# Example output: -# Engine 000: Avg prompt throughput: 5555.2 tokens/s, -# Avg generation throughput: 49.4 tokens/s, -# Running: 1 reqs, Waiting: 0 reqs, -# GPU KV cache usage: 4.6%, -# Prefix cache hit rate: 0.0% -# -# --- Request-level monitoring --- -# See individual HTTP requests (method, status, duration): -# docker logs -f vllm_qwen3 2>&1 | grep "POST" -# -# Example output: -# 127.0.0.1:41864 - "POST /v1/chat/completions HTTP/1.1" 200 OK -# -# --- One-liner: last minute stats --- -# Useful for periodic checks without following the log: -# docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000" -# -# --- GPU hardware stats --- -# Snapshot: -# nvidia-smi -# -# Continuous (every 5 seconds): -# nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used \ -# --format=csv -l 5 -# -# --- Interpreting the stats --- -# -# Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit): -# Prefill throughput: 5,000-11,000 tok/s (bursts higher during batch prefill) -# Decode throughput: 40-99 tok/s (varies with output length per sample) -# KV cache usage: 0-5% for short conversations, grows with context -# (100% = 298k tokens, at which point requests queue) -# Prefix cache hit: depends on prompt reuse; higher is better -# Temperature: 44-60C under load, <45C idle -# Power: 70W idle, 230-240W under load, 300W max -# -# Warning signs: -# - Waiting > 0 for extended periods → requests queuing, model overloaded -# - KV cache usage near 100% → context too long, reduce --max-model-len -# - Decode throughput < 20 tok/s sustained → possible thermal throttling -# - Prefill throughput < 2,000 tok/s → check for CPU offload or driver issues -# -# Common issues: -# -# 1. OOM on startup with --max-model-len 262144 -# → Reduce to 131072 or 65536 -# -# 2. Prefix cache hit rate stays at 0% -# → Normal when prompts vary heavily turn-to-turn -# -# 3. vLLM container won't start (CUDA version mismatch) -# → Check driver version: nvidia-smi -# → vLLM requires CUDA >= 12.x and driver >= 535 - -# =========================================================================== -# STEP 7: Loading / switching models -# =========================================================================== -# -# vLLM serves one model per container. To switch models, stop the current -# container and start a new one with different --model. -# -# --- Stop current model --- -# docker stop vllm_qwen3 -# docker rm vllm_qwen3 -# -# --- Run a different model --- -# Replace --model, --name, and adjust --max-model-len and --tool-call-parser -# as needed. The HuggingFace model downloads automatically on first start. 
-# -# Example: qwen3-coder:30b (smaller, faster, fits easily on A100 80GB) -# -# docker run -d \ -# --gpus all \ -# --ipc=host \ -# --network host \ -# --name vllm_qwen3_30b \ -# --restart always \ -# -v /ephemeral/hug:/root/.cache/huggingface \ -# vllm/vllm-openai:latest \ -# --model Qwen/Qwen3-Coder-30B-AWQ \ -# --tensor-parallel-size 1 \ -# --enable-auto-tool-choice \ -# --tool-call-parser qwen3_coder \ -# --enable-prefix-caching \ -# --gpu-memory-utilization 0.92 \ -# --max-model-len 131072 \ -# --host 0.0.0.0 \ -# --port 11434 -# -# Example: full-precision model on multi-GPU (e.g. 4x H100) -# -# docker run -d \ -# --gpus all \ -# --ipc=host \ -# --network host \ -# --name vllm_qwen3_fp16 \ -# --restart always \ -# -v /ephemeral/hug:/root/.cache/huggingface \ -# vllm/vllm-openai:latest \ -# --model Qwen/Qwen3-Coder-Next \ -# --tensor-parallel-size 4 \ -# --enable-auto-tool-choice \ -# --tool-call-parser qwen3_coder \ -# --enable-prefix-caching \ -# --gpu-memory-utilization 0.90 \ -# --max-model-len 262144 \ -# --host 0.0.0.0 \ -# --port 11434 -# -# --- Finding models --- -# Search HuggingFace for vLLM-compatible quantized models: -# https://huggingface.co/models?search=<model-name>+awq -# https://huggingface.co/models?search=<model-name>+gptq -# -# Supported quantization formats in vLLM: -# - AWQ (recommended): fast Marlin kernels, good quality -# - GPTQ: similar to AWQ, widely available -# - FP8: 8-bit, needs Hopper+ GPUs (H100/H200) -# - BF16/FP16: full precision, needs more VRAM -# -# --- VRAM sizing guide --- -# Rule of thumb for single A100 80GB at 92% utilization (~75 GiB usable): -# -# Model size (params) | AWQ 4-bit VRAM | Max context (remaining for KV) -# ---------------------|----------------|------------------------------- -# 7-8B | ~5 GiB | 262k+ (plenty of KV headroom) -# 14B | ~9 GiB | 262k+ (plenty of KV headroom) -# 30-32B | ~18 GiB | 262k (~57 GiB for KV cache) -# 70-80B (MoE, 3B act) | ~45 GiB | 262k (~27 GiB for KV cache) -# 70B (dense) | ~38 GiB | 131k (~37 GiB for KV cache) -# 120B+ | won't fit | use multi-GPU or smaller quant -# -# If vLLM OOMs on startup, reduce --max-model-len first (halving it roughly -# halves KV cache memory). If still OOM, reduce --gpu-memory-utilization -# to 0.85 or try a smaller model. -# -# --- Verifying the new model --- -# Check loaded model: -# curl -s http://localhost:11434/v1/models | python3 -m json.tool -# -# Test inference: -# curl -s http://localhost:11434/v1/chat/completions \ -# -H "Content-Type: application/json" \ -# -H "Authorization: Bearer EMPTY" \ -# -d '{"model":"<model-name>", -# "messages":[{"role":"user","content":"Hello"}], -# "max_tokens":50}' -# -# =========================================================================== -# Performance characteristics (A100 80GB PCIe, single GPU) -# =========================================================================== -# -# Measured on 2026-03-16 with bullpoint/Qwen3-Coder-Next-AWQ-4bit: -# -# vLLM prefill throughput: 5,000-11,000 tok/s (FlashAttention v2) -# vLLM decode throughput: 40-99 tok/s (memory-bandwidth limited) -# Per-turn latency: ~10-15s (small prompts, early conversation) -# KV cache usage: 2-5% for typical coding sessions -# Prefix cache hit rate: workload-dependent -# -# Comparison with Ollama on same hardware (A100 80GB PCIe): -# -# | Ollama (Q4_K_M) | vLLM (AWQ 4-bit) -# -----------------------|-----------------------|---------------------- -# Prefill throughput | ~1,000 tok/s (est.) 
| 5,000-11,000 tok/s -# Decode throughput | ~40 tok/s | 40-99 tok/s -# Per-turn latency | ~28s (32k ctx) | ~10-15s -# Context window | 32k (was truncating) | 262k (full, no truncation) -# Prefix cache | workload-dependent | workload-dependent -# VRAM usage | 52-61 GiB | 75 GiB (more KV cache) |
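As a follow-up check after deploying this change, the snippet below is a minimal sketch for confirming the LiteLLM decommission: vLLM should still answer on its OpenAI-compatible API while the legacy LiteLLM port stays closed. It assumes the wg1 peer address and vLLM port from the removed setup notes (192.168.3.1:11434) and the port 4000 used by the old LiteLLM service; adjust both to your deployment.

```bash
# Assumes the wg1 peer address and vLLM port from the removed vllm-setup.txt
# (192.168.3.1:11434) and the legacy LiteLLM port 4000; substitute your own values.
# Requires curl, python3, and netcat on the client machine.

# vLLM should still respond on the OpenAI-compatible API.
curl -s http://192.168.3.1:11434/v1/models | python3 -m json.tool

# The legacy LiteLLM port should no longer accept connections once the
# decommission script and the Hyperstack rule cleanup have run.
if nc -z -w 3 192.168.3.1 4000; then
  echo "port 4000 still open (legacy LiteLLM rule may remain)"
else
  echo "port 4000 closed as expected"
fi
```

If port 4000 still answers, the legacy Hyperstack rule likely survived; note that `legacy_litellm_rules` only matches rules whose remote prefix equals the configured WireGuard subnet, so rules added with a different prefix must be removed manually.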
