author    Paul Buetow <paul@buetow.org>  2026-03-21 12:46:30 +0200
committer Paul Buetow <paul@buetow.org>  2026-03-21 12:46:30 +0200
commit    dd621fefb33ee006f8d2855caa9f88a268717a9a (patch)
tree      45fec50ae97950cd2eb31bfeea41782919c865fa
parent    d54c42d6c6a9d559b912c7f5330397e236ea407b (diff)
Consolidate vllm-setup.txt into README.md and remove the file
Merged all still-relevant content from vllm-setup.txt into README.md:

- Why vLLM over Ollama section
- Full monitoring commands with engine metrics table
- Troubleshooting table
- VRAM sizing guide
- Performance characteristics table

Dropped LiteLLM, Anthropic API, Claude Code, and OpenCode sections which are no longer applicable. Removes the vllm-setup.txt file.
-rw-r--r--   README.md                                      75
-rwxr-xr-x   hyperstack.rb                                  54
-rw-r--r--   pi/agent/extensions/loop-scheduler/README.md    2
-rw-r--r--   vllm-setup.txt                                322
4 files changed, 128 insertions, 325 deletions
diff --git a/README.md b/README.md
index 93ddc58..690490b 100644
--- a/README.md
+++ b/README.md
@@ -183,6 +183,13 @@ set `HYPERSTACK_OPERATOR_CIDR` to override that detection when needed.
SSH host keys are pinned per state file in `<state>.known_hosts`. `delete` and `--replace`
clear that trust file for intentional reprovisioning; unexpected host key changes now fail closed.
+## Why vLLM instead of Ollama
+
+- **FlashAttention v2**: ~1.5–2× faster prefill for long prompts
+- **Block-level prefix caching**: partial KV cache reuse even when the prompt changes mid-sequence (Ollama requires an exact prefix match from token 0)
+- **Chunked prefill**: can interleave prefill and decode
+- **Marlin kernels** for AWQ MoE quantization
+
## Monitoring vLLM
```bash
@@ -192,8 +199,23 @@ ssh ubuntu@<vm-ip> 'docker logs -f vllm_nemotron_super 2>&1 | grep "Engine 000"'
# GPU stats (every 5 s)
ssh ubuntu@<vm-ip> 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5'
+# Last-minute stats (one-shot, no follow)
+ssh ubuntu@<vm-ip> 'docker logs --since 1m vllm_nemotron_super 2>&1 | grep "Engine 000"'
+
+# Request-level monitoring
+ssh ubuntu@<vm-ip> 'docker logs -f vllm_nemotron_super 2>&1 | grep "POST"'
```
+Key fields in the engine metrics, which vLLM logs every 10 seconds:
+
+| Field | Meaning |
+|-------|---------|
+| Avg prompt throughput | Prefill speed (tokens/s) — higher is faster |
+| Avg generation throughput | Decode speed (tokens/s) — ~40–99 on A100 PCIe |
+| GPU KV cache usage | % of KV cache memory in use (proportional to active context vs max capacity) |
+| Prefix cache hit rate | % of prompt tokens served from cache |
+| Running / Waiting | Active and queued request counts |
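+
+To pull a single field without tailing the log, a one-liner over the last minute of stats works (a sketch; the exact wording of the engine log line may differ across vLLM versions):
+
+```bash
+# Latest prefix-cache hit rate from the most recent engine stats line
+ssh ubuntu@<vm-ip> 'docker logs --since 1m vllm_nemotron_super 2>&1 | grep "Engine 000" | tail -1 | grep -o "Prefix cache hit rate: [0-9.]*%"'
+```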
+
Healthy baseline (A100 80GB PCIe):
| Metric | Expected |
@@ -201,5 +223,56 @@ Healthy baseline (A100 80GB PCIe):
| Prefill throughput | 5,000–11,000 tok/s |
| Decode throughput | 40–99 tok/s |
| KV cache usage | 2–5% for typical sessions |
+| Temperature | 44–60°C under load, <45°C idle |
+| Power | 70 W idle, 230–240 W under load, 300 W max |
+
+Warning signs:
+
+- **Waiting > 0 for extended periods** — requests queuing, model overloaded
+- **KV cache usage near 100%** — context too long, reduce `--max-model-len`
+- **Decode throughput < 20 tok/s sustained** — possible thermal throttling
+- **Prefill throughput < 2,000 tok/s** — check for CPU offload or driver issues
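+
+The first two signs can be checked in one shot (a sketch; it assumes the `Waiting: N reqs` and `GPU KV cache usage: X%` wording of the periodic engine log line):
+
+```bash
+# Flag queued requests or a nearly full KV cache in the last minute of logs
+ssh ubuntu@<vm-ip> 'docker logs --since 1m vllm_nemotron_super 2>&1 | grep "Engine 000" | tail -1' \
+  | grep -E "Waiting: [1-9][0-9]* reqs|KV cache usage: 9[0-9]\.[0-9]%" \
+  && echo "WARNING: requests queuing or KV cache nearly full"
+```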
+
+## Troubleshooting
+
+| Problem | Fix |
+|---------|-----|
+| OOM on startup with `--max-model-len 262144` | Reduce to `131072` or `65536` |
+| Prefix cache hit rate stays at 0% | Normal when prompts vary heavily turn-to-turn |
+| vLLM container won't start (CUDA mismatch) | Check `nvidia-smi`; vLLM requires CUDA ≥ 12.x and driver ≥ 535 |
+| Still OOM after reducing context | Lower `gpu_memory_utilization` to `0.85` or use a smaller model |
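+
+For reference, the context-length and memory flags from the table land on the vLLM launch command roughly like this (a sketch based on the container invocation from the removed vllm-setup.txt; actual provisioning is handled by `hyperstack.rb`):
+
+```bash
+# Reduced-memory variant of the vLLM container start (illustrative values)
+docker run -d --gpus all --ipc=host --network host \
+  --name vllm_qwen3 --restart always \
+  -v /ephemeral/hug:/root/.cache/huggingface \
+  vllm/vllm-openai:latest \
+  --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
+  --enable-prefix-caching \
+  --gpu-memory-utilization 0.85 \
+  --max-model-len 131072 \
+  --host 0.0.0.0 --port 11434
+```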
+
+## VRAM sizing guide
+
+Rule of thumb for a single A100 80 GB at 92% utilization (~75 GiB usable):
+
+| Model size (params) | AWQ 4-bit VRAM | Max context (remaining for KV) |
+|---|---|---|
+| 7–8B | ~5 GiB | 262k+ (plenty of KV headroom) |
+| 14B | ~9 GiB | 262k+ (plenty of KV headroom) |
+| 30–32B | ~18 GiB | 262k (~57 GiB for KV cache) |
+| 70–80B (MoE, 3B active) | ~45 GiB | 262k (~27 GiB for KV cache) |
+| 70B (dense) | ~38 GiB | 131k (~37 GiB for KV cache) |
+| 120B+ | won't fit | use multi-GPU or smaller quant |
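+
+The table is simple subtraction: whatever usable VRAM remains after the quantized weights is the KV-cache budget, and halving `--max-model-len` roughly halves the KV memory required. A quick sketch using the 70–80B MoE row:
+
+```bash
+# KV-cache budget estimate (values taken from the sizing table above)
+USABLE_GIB=75    # ~92% of an 80 GiB A100
+WEIGHTS_GIB=45   # AWQ 4-bit weights, 70-80B MoE row
+echo "KV cache budget: $((USABLE_GIB - WEIGHTS_GIB)) GiB"  # ~3 GiB of this goes to CUDA graphs, leaving ~27 GiB
+```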
+
+Supported quantization formats:
+
+- **AWQ** (recommended): fast Marlin kernels, good quality
+- **GPTQ**: similar to AWQ, widely available
+- **FP8**: 8-bit, needs Hopper+ GPUs (H100/H200)
+- **BF16/FP16**: full precision, needs more VRAM
+
+Search HuggingFace for vLLM-compatible quantized models:
+`https://huggingface.co/models?search=<model-name>+awq`
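+
+The same search is scriptable through the HuggingFace models API (a sketch; it assumes `jq` is available and uses the API's `search` and `limit` query parameters):
+
+```bash
+# List candidate AWQ quantizations for a model family
+curl -s 'https://huggingface.co/api/models?search=qwen3+awq&limit=10' | jq -r '.[].id'
+```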
+
+## Performance characteristics
+
+Measured on A100 80 GB PCIe (single GPU) with Qwen3-Coder-Next AWQ 4-bit:
-See `vllm-setup.txt` for detailed vLLM setup notes, VRAM sizing guide, and troubleshooting.
+| Metric | vLLM (AWQ 4-bit) | Ollama (Q4_K_M) |
+|--------|-------------------|-----------------|
+| Prefill throughput | 5,000–11,000 tok/s | ~1,000 tok/s (est.) |
+| Decode throughput | 40–99 tok/s | ~40 tok/s |
+| Per-turn latency | ~10–15 s | ~28 s (32k ctx) |
+| Context window | 262k (full, no truncation) | 32k (was truncating) |
+| VRAM usage | 75 GiB (more KV cache) | 52–61 GiB |
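+
+Per-turn latency is easy to spot-check end to end from the operator side (a sketch; it assumes the WireGuard address `192.168.3.1`, port `11434`, and the model name from the original setup, so substitute whatever `/v1/models` reports):
+
+```bash
+# Rough end-to-end latency for one small completion
+time curl -s http://192.168.3.1:11434/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -H 'Authorization: Bearer EMPTY' \
+  -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit","messages":[{"role":"user","content":"Hello"}],"max_tokens":50}' \
+  >/dev/null
+```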
diff --git a/hyperstack.rb b/hyperstack.rb
index a3af491..f7bfe69 100755
--- a/hyperstack.rb
+++ b/hyperstack.rb
@@ -718,6 +718,10 @@ module HyperstackVM
request(:post, "/core/virtual-machines/#{vm_id}/sg-rules", payload)
end
+ def delete_vm_rule(vm_id, rule_id)
+ request(:delete, "/core/virtual-machines/#{vm_id}/sg-rules/#{rule_id}")
+ end
+
private
def request(method, path, payload = nil)
@@ -1225,6 +1229,21 @@ module HyperstackVM
script.join("\n")
end
+ def litellm_decommission_script
+ script = []
+ script << 'set -euo pipefail'
+ script << 'sudo systemctl stop litellm 2>/dev/null || true'
+ script << 'sudo systemctl disable litellm 2>/dev/null || true'
+ script << 'sudo rm -f /etc/systemd/system/litellm.service'
+ script << 'sudo systemctl daemon-reload'
+ script << 'sudo rm -f /ephemeral/litellm-config.yaml'
+ script << 'sudo rm -rf /ephemeral/litellm-env'
+ script << 'sudo rm -f /ephemeral/litellm.log'
+ script << "sudo ufw --force delete allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port 4000 proto tcp >/dev/null 2>&1 || true"
+ script << 'echo litellm-decommission-ok'
+ script.join("\n")
+ end
+
private
def normalized_model_list(models)
@@ -1289,6 +1308,12 @@ module HyperstackVM
raise Error, "vLLM install failed: #{output.strip}" unless status.success?
end
+ def decommission_litellm(host)
+ info "Removing deprecated LiteLLM service from #{host} if present..."
+ output, status = @ssh_stream_runner.call(host, @scripts.litellm_decommission_script)
+ raise Error, "LiteLLM decommission failed: #{output.strip}" unless status.success?
+ end
+
def setup_vllm_stack(host, preset_config: nil)
install_vllm(host, preset_config: preset_config)
end
@@ -1508,6 +1533,8 @@ module HyperstackVM
host = state['public_ip']
raise Error, 'No public IP in state file.' if host.nil? || host.empty?
+ @provisioner.decommission_litellm(host)
+
# Stop the old container only when it has a different name from the new one.
if old_container != new_container
@provisioner.stop_vllm_container(host, old_container)
@@ -1574,6 +1601,7 @@ module HyperstackVM
@state_store.save(state)
wait_for_ssh(state['public_ip'])
+ @provisioner.decommission_litellm(state['public_ip'])
if @config.guest_bootstrap_enabled? && state['bootstrapped_at'].nil?
@provisioner.bootstrap_guest(state['public_ip'])
state['bootstrapped_at'] = Time.now.utc.iso8601
@@ -1730,13 +1758,27 @@ module HyperstackVM
end
def ensure_security_rules(vm)
- existing = Array(vm['security_rules']).map { |rule| normalize_rule(rule) }
+ existing_rules = Array(vm['security_rules'])
+ existing = existing_rules.map { |rule| normalize_rule(rule) }
desired = desired_security_rules.map { |rule| normalize_rule(rule) }
(desired - existing).each do |rule|
info "Adding Hyperstack firewall rule #{rule['protocol']} #{rule['remote_ip_prefix']} #{rule['port_range_min']}..."
@client.create_vm_rule(vm['id'], rule)
end
+
+ legacy_litellm_rules(existing_rules).each do |rule|
+ rule_id = rule['id'] || rule['rule_id']
+ unless rule_id
+ warn 'Found legacy Hyperstack firewall rule for port 4000, but the API payload has no rule id; remove it manually from the Hyperstack console.'
+ next
+ end
+
+ info "Removing legacy Hyperstack firewall rule #{rule['protocol']} #{rule['remote_ip_prefix']} #{rule['port_range_min']}..."
+ @client.delete_vm_rule(vm['id'], rule_id)
+ rescue Error => e
+ warn "Failed to remove legacy Hyperstack firewall rule #{rule_id}: #{e.message}"
+ end
end
def ollama_setup_needed?(state)
@@ -1998,6 +2040,16 @@ module HyperstackVM
desired_security_rules(include_vllm: state_vllm_enabled?(state), include_ollama: state_ollama_enabled?(state))
end
+ def legacy_litellm_rules(rules)
+ Array(rules).select do |rule|
+ normalized = normalize_rule(rule)
+ normalized['protocol'] == 'tcp' &&
+ normalized['port_range_min'] == 4000 &&
+ normalized['port_range_max'] == 4000 &&
+ normalized['remote_ip_prefix'] == @config.wireguard_subnet
+ end
+ end
+
def state_vllm_enabled?(state)
recorded = state&.dig('services', 'vllm_enabled')
return recorded unless recorded.nil?
diff --git a/pi/agent/extensions/loop-scheduler/README.md b/pi/agent/extensions/loop-scheduler/README.md
index 78a6635..65ab10f 100644
--- a/pi/agent/extensions/loop-scheduler/README.md
+++ b/pi/agent/extensions/loop-scheduler/README.md
@@ -2,7 +2,7 @@
Session-scoped recurring prompts for Pi.
-This extension adds a Claude-Code-style `/loop` command for interactive Pi
+This extension adds a recurring `/loop` command for interactive Pi
sessions. It schedules a prompt to be re-sent on an interval while the current
Pi process stays open.
diff --git a/vllm-setup.txt b/vllm-setup.txt
deleted file mode 100644
index 9ff424e..0000000
--- a/vllm-setup.txt
+++ /dev/null
@@ -1,322 +0,0 @@
-# vLLM Setup for Hyperstack VM
-#
-# This document describes the full deployment of qwen3-coder-next (AWQ 4-bit)
-# via vLLM exposed directly on the OpenAI-compatible API.
-#
-# Architecture:
-#
-# Pi (earth) Hyperstack VM (A100 80GB)
-# ┌─────────────┐ ┌──────────────────────────────┐
-# │ pi │── OpenAI API ──────> │ vLLM engine (:11434) │
-# │ │ /v1/chat/completions│ FlashAttention v2 │
-# └─────────────┘ via WireGuard wg1 │ prefix caching │
-# │ bullpoint/Qwen3-Coder- │
-# │ Next-AWQ-4bit (45GB) │
-# └──────────────────────────────┘
-#
-# Why vLLM instead of Ollama:
-# - FlashAttention v2: ~1.5-2x faster prefill for long prompts
-# - Block-level prefix caching: partial KV cache reuse even when prompt
-# changes mid-sequence (Ollama requires exact prefix match from token 0)
-# - Chunked prefill: can interleave prefill and decode
-# - Marlin kernels for AWQ MoE quantization
-#
-# Model details:
-# - Name: bullpoint/Qwen3-Coder-Next-AWQ-4bit (HuggingFace)
-# - Architecture: MoE, 80B total params, 3B active per token
-# - 512 experts, 10 activated + 1 shared per token
-# - Hybrid attention: Gated DeltaNet + Gated Attention (48 layers)
-# - Quantization: AWQ 4-bit, group size 32
-# - Disk size: ~45GB (vs ~151GB at BF16)
-# - VRAM usage: ~45GB weights + ~27GB KV cache at 92% utilization
-# - Context: 262,144 tokens (256k native)
-# - vLLM requirement: >= 0.15.0
-#
-# Hardware requirements:
-# - Minimum: 1x A100 80GB (PCIe or SXM)
-# - VRAM breakdown at gpu_memory_utilization=0.92:
-# Model weights: ~45 GiB
-# KV cache: ~27 GiB (298k tokens capacity, 4.49x concurrency at 262k)
-# CUDA graphs: ~3 GiB
-# Total: ~75 GiB / 80 GiB
-#
-# Ports:
-# 11434/tcp - vLLM OpenAI-compatible API (reuses Ollama port for firewall compat)
-# Restricted to 192.168.3.0/24 (WireGuard wg1 subnet)
-
-# ===========================================================================
-# STEP 1: Prerequisites
-# ===========================================================================
-# - VM with NVIDIA GPU, CUDA drivers, and Docker with nvidia-container-toolkit
-# - WireGuard wg1 tunnel already configured (see wg1-setup.sh)
-# - Ollama stopped and disabled if previously running:
-#
-# sudo systemctl stop ollama
-# sudo systemctl disable ollama
-
-# ===========================================================================
-# STEP 2: Storage setup
-# ===========================================================================
-# HuggingFace model cache on ephemeral storage (fast NVMe, survives reboots
-# on some providers but not guaranteed — model will re-download if lost).
-#
-# sudo mkdir -p /ephemeral/hug
-# sudo chmod -R 0777 /ephemeral/hug
-
-# ===========================================================================
-# STEP 3: vLLM Docker container
-# ===========================================================================
-# Pull and run vLLM. The model downloads on first start (~45GB, ~2.5 min).
-# After download, model loading takes ~65s and CUDA graph capture ~35s.
-# Total cold start: ~4-5 minutes.
-#
-# docker pull vllm/vllm-openai:latest
-#
-# docker run -d \
-# --gpus all \
-# --ipc=host \
-# --network host \
-# --name vllm_qwen3 \
-# --restart always \
-# -v /ephemeral/hug:/root/.cache/huggingface \
-# vllm/vllm-openai:latest \
-# --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \
-# --tensor-parallel-size 1 \
-# --enable-auto-tool-choice \
-# --tool-call-parser qwen3_coder \
-# --enable-prefix-caching \
-# --gpu-memory-utilization 0.92 \
-# --max-model-len 262144 \
-# --host 0.0.0.0 \
-# --port 11434
-#
-# Flags explained:
-# --tensor-parallel-size 1 Single GPU (use 2/4 for multi-GPU setups)
-# --enable-auto-tool-choice Enables function/tool calling
-# --tool-call-parser qwen3_coder Parser for qwen3-coder tool format
-# --enable-prefix-caching Block-level KV cache reuse across requests
-# --gpu-memory-utilization 0.92 Use 92% of VRAM (rest for OS/overhead)
-# --max-model-len 262144 Full 256k context window
-# --port 11434 Reuse Ollama port for firewall compatibility
-#
-# Verify startup (wait for "Application startup complete"):
-# docker logs -f vllm_qwen3 2>&1 | grep -E "startup complete|Error"
-#
-# Verify model loaded:
-# curl -s http://localhost:11434/v1/models | python3 -m json.tool
-#
-# Quick inference test:
-# curl -s http://localhost:11434/v1/chat/completions \
-# -H "Content-Type: application/json" \
-# -H "Authorization: Bearer EMPTY" \
-# -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
-# "messages":[{"role":"user","content":"Hello"}],
-# "max_tokens":50}'
-#
-# Monitor performance (prefix cache hit rate, throughput):
-# docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
-
-# ===========================================================================
-# STEP 4: Firewall rules
-# ===========================================================================
-# Allow access from WireGuard subnet only:
-#
-# sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp \
-# comment 'vLLM via wg1'
-# ===========================================================================
-# STEP 5: Client configuration (on earth / local machine)
-# ===========================================================================
-#
-# Launch Pi or any OpenAI-compatible client directly against vLLM:
-#
-# OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
-# OPENAI_API_KEY=EMPTY \
-# pi
-
-# ===========================================================================
-# STEP 6: Monitoring & troubleshooting
-# ===========================================================================
-#
-# --- Live engine stats ---
-# vLLM logs engine metrics every 10 seconds. Key fields:
-# - Avg prompt throughput: prefill speed (tokens/s), higher = faster
-# - Avg generation throughput: decode speed (tokens/s), ~40-99 on A100 PCIe
-# - GPU KV cache usage: % of KV cache memory in use (proportional to
-# active context length vs max capacity)
-# - Prefix cache hit rate: % of prompt tokens served from cache
-# - Running/Waiting: active and queued request counts
-#
-# Follow live (all stats):
-# docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
-#
-# Example output:
-# Engine 000: Avg prompt throughput: 5555.2 tokens/s,
-# Avg generation throughput: 49.4 tokens/s,
-# Running: 1 reqs, Waiting: 0 reqs,
-# GPU KV cache usage: 4.6%,
-# Prefix cache hit rate: 0.0%
-#
-# --- Request-level monitoring ---
-# See individual HTTP requests (method, status, duration):
-# docker logs -f vllm_qwen3 2>&1 | grep "POST"
-#
-# Example output:
-# 127.0.0.1:41864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
-#
-# --- One-liner: last minute stats ---
-# Useful for periodic checks without following the log:
-# docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"
-#
-# --- GPU hardware stats ---
-# Snapshot:
-# nvidia-smi
-#
-# Continuous (every 5 seconds):
-# nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used \
-# --format=csv -l 5
-#
-# --- Interpreting the stats ---
-#
-# Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit):
-# Prefill throughput: 5,000-11,000 tok/s (bursts higher during batch prefill)
-# Decode throughput: 40-99 tok/s (varies with output length per sample)
-# KV cache usage: 0-5% for short conversations, grows with context
-# (100% = 298k tokens, at which point requests queue)
-# Prefix cache hit: depends on prompt reuse; higher is better
-# Temperature: 44-60C under load, <45C idle
-# Power: 70W idle, 230-240W under load, 300W max
-#
-# Warning signs:
-# - Waiting > 0 for extended periods → requests queuing, model overloaded
-# - KV cache usage near 100% → context too long, reduce --max-model-len
-# - Decode throughput < 20 tok/s sustained → possible thermal throttling
-# - Prefill throughput < 2,000 tok/s → check for CPU offload or driver issues
-#
-# Common issues:
-#
-# 1. OOM on startup with --max-model-len 262144
-# → Reduce to 131072 or 65536
-#
-# 2. Prefix cache hit rate stays at 0%
-# → Normal when prompts vary heavily turn-to-turn
-#
-# 3. vLLM container won't start (CUDA version mismatch)
-# → Check driver version: nvidia-smi
-# → vLLM requires CUDA >= 12.x and driver >= 535
-
-# ===========================================================================
-# STEP 7: Loading / switching models
-# ===========================================================================
-#
-# vLLM serves one model per container. To switch models, stop the current
-# container and start a new one with different --model.
-#
-# --- Stop current model ---
-# docker stop vllm_qwen3
-# docker rm vllm_qwen3
-#
-# --- Run a different model ---
-# Replace --model, --name, and adjust --max-model-len and --tool-call-parser
-# as needed. The HuggingFace model downloads automatically on first start.
-#
-# Example: qwen3-coder:30b (smaller, faster, fits easily on A100 80GB)
-#
-# docker run -d \
-# --gpus all \
-# --ipc=host \
-# --network host \
-# --name vllm_qwen3_30b \
-# --restart always \
-# -v /ephemeral/hug:/root/.cache/huggingface \
-# vllm/vllm-openai:latest \
-# --model Qwen/Qwen3-Coder-30B-AWQ \
-# --tensor-parallel-size 1 \
-# --enable-auto-tool-choice \
-# --tool-call-parser qwen3_coder \
-# --enable-prefix-caching \
-# --gpu-memory-utilization 0.92 \
-# --max-model-len 131072 \
-# --host 0.0.0.0 \
-# --port 11434
-#
-# Example: full-precision model on multi-GPU (e.g. 4x H100)
-#
-# docker run -d \
-# --gpus all \
-# --ipc=host \
-# --network host \
-# --name vllm_qwen3_fp16 \
-# --restart always \
-# -v /ephemeral/hug:/root/.cache/huggingface \
-# vllm/vllm-openai:latest \
-# --model Qwen/Qwen3-Coder-Next \
-# --tensor-parallel-size 4 \
-# --enable-auto-tool-choice \
-# --tool-call-parser qwen3_coder \
-# --enable-prefix-caching \
-# --gpu-memory-utilization 0.90 \
-# --max-model-len 262144 \
-# --host 0.0.0.0 \
-# --port 11434
-#
-# --- Finding models ---
-# Search HuggingFace for vLLM-compatible quantized models:
-# https://huggingface.co/models?search=<model-name>+awq
-# https://huggingface.co/models?search=<model-name>+gptq
-#
-# Supported quantization formats in vLLM:
-# - AWQ (recommended): fast Marlin kernels, good quality
-# - GPTQ: similar to AWQ, widely available
-# - FP8: 8-bit, needs Hopper+ GPUs (H100/H200)
-# - BF16/FP16: full precision, needs more VRAM
-#
-# --- VRAM sizing guide ---
-# Rule of thumb for single A100 80GB at 92% utilization (~75 GiB usable):
-#
-# Model size (params) | AWQ 4-bit VRAM | Max context (remaining for KV)
-# ---------------------|----------------|-------------------------------
-# 7-8B | ~5 GiB | 262k+ (plenty of KV headroom)
-# 14B | ~9 GiB | 262k+ (plenty of KV headroom)
-# 30-32B | ~18 GiB | 262k (~57 GiB for KV cache)
-# 70-80B (MoE, 3B act) | ~45 GiB | 262k (~27 GiB for KV cache)
-# 70B (dense) | ~38 GiB | 131k (~37 GiB for KV cache)
-# 120B+ | won't fit | use multi-GPU or smaller quant
-#
-# If vLLM OOMs on startup, reduce --max-model-len first (halving it roughly
-# halves KV cache memory). If still OOM, reduce --gpu-memory-utilization
-# to 0.85 or try a smaller model.
-#
-# --- Verifying the new model ---
-# Check loaded model:
-# curl -s http://localhost:11434/v1/models | python3 -m json.tool
-#
-# Test inference:
-# curl -s http://localhost:11434/v1/chat/completions \
-# -H "Content-Type: application/json" \
-# -H "Authorization: Bearer EMPTY" \
-# -d '{"model":"<model-name>",
-# "messages":[{"role":"user","content":"Hello"}],
-# "max_tokens":50}'
-#
-# ===========================================================================
-# Performance characteristics (A100 80GB PCIe, single GPU)
-# ===========================================================================
-#
-# Measured on 2026-03-16 with bullpoint/Qwen3-Coder-Next-AWQ-4bit:
-#
-# vLLM prefill throughput: 5,000-11,000 tok/s (FlashAttention v2)
-# vLLM decode throughput: 40-99 tok/s (memory-bandwidth limited)
-# Per-turn latency: ~10-15s (small prompts, early conversation)
-# KV cache usage: 2-5% for typical coding sessions
-# Prefix cache hit rate: workload-dependent
-#
-# Comparison with Ollama on same hardware (A100 80GB PCIe):
-#
-# | Ollama (Q4_K_M) | vLLM (AWQ 4-bit)
-# -----------------------|-----------------------|----------------------
-# Prefill throughput | ~1,000 tok/s (est.) | 5,000-11,000 tok/s
-# Decode throughput | ~40 tok/s | 40-99 tok/s
-# Per-turn latency | ~28s (32k ctx) | ~10-15s
-# Context window | 32k (was truncating) | 262k (full, no truncation)
-# Prefix cache | workload-dependent | workload-dependent
-# VRAM usage | 52-61 GiB | 75 GiB (more KV cache)