| author | Paul Buetow <paul@buetow.org> | 2026-03-21 12:46:30 +0200 |
|---|---|---|
| committer | Paul Buetow <paul@buetow.org> | 2026-03-21 12:46:30 +0200 |
| commit | dd621fefb33ee006f8d2855caa9f88a268717a9a (patch) | |
| tree | 45fec50ae97950cd2eb31bfeea41782919c865fa | |
| parent | d54c42d6c6a9d559b912c7f5330397e236ea407b (diff) | |
Consolidate vllm-setup.txt into README.md and remove the file
Merged all still-relevant content from vllm-setup.txt into README.md:
- Why vLLM over Ollama section
- Full monitoring commands with engine metrics table
- Troubleshooting table
- VRAM sizing guide
- Performance characteristics table
Dropped LiteLLM, Anthropic API, Claude Code, and OpenCode sections
which are no longer applicable. Removed the vllm-setup.txt file.
| -rw-r--r-- | README.md | 75 |
| -rwxr-xr-x | hyperstack.rb | 54 |
| -rw-r--r-- | pi/agent/extensions/loop-scheduler/README.md | 2 |
| -rw-r--r-- | vllm-setup.txt | 322 |
4 files changed, 128 insertions, 325 deletions
@@ -183,6 +183,13 @@ set `HYPERSTACK_OPERATOR_CIDR` to override that detection when needed. SSH host keys are pinned per state file in `<state>.known_hosts`. `delete` and `--replace` clear that trust file for intentional reprovisioning; unexpected host key changes now fail closed. +## Why vLLM instead of Ollama + +- **FlashAttention v2**: ~1.5–2× faster prefill for long prompts +- **Block-level prefix caching**: partial KV cache reuse even when the prompt changes mid-sequence (Ollama requires an exact prefix match from token 0) +- **Chunked prefill**: can interleave prefill and decode +- **Marlin kernels** for AWQ MoE quantization + ## Monitoring vLLM ```bash @@ -192,8 +199,23 @@ ssh ubuntu@<vm-ip> 'docker logs -f vllm_nemotron_super 2>&1 | grep "Engine 000"' # GPU stats (every 5 s) ssh ubuntu@<vm-ip> 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5' +# Last-minute stats (one-shot, no follow) +ssh ubuntu@<vm-ip> 'docker logs --since 1m vllm_nemotron_super 2>&1 | grep "Engine 000"' + +# Request-level monitoring +ssh ubuntu@<vm-ip> 'docker logs -f vllm_nemotron_super 2>&1 | grep "POST"' ``` +Engine metrics key fields: + +| Field | Meaning | +|-------|---------| +| Avg prompt throughput | Prefill speed (tokens/s) — higher is faster | +| Avg generation throughput | Decode speed (tokens/s) — ~40–99 on A100 PCIe | +| GPU KV cache usage | % of KV cache memory in use (proportional to active context vs max capacity) | +| Prefix cache hit rate | % of prompt tokens served from cache | +| Running / Waiting | Active and queued request counts | + Healthy baseline (A100 80GB PCIe): | Metric | Expected | @@ -201,5 +223,56 @@ Healthy baseline (A100 80GB PCIe): | Prefill throughput | 5,000–11,000 tok/s | | Decode throughput | 40–99 tok/s | | KV cache usage | 2–5% for typical sessions | +| Temperature | 44–60°C under load, <45°C idle | +| Power | 70 W idle, 230–240 W under load, 300 W max | + +Warning signs: + +- **Waiting > 0 for extended periods** — requests queuing, model overloaded +- **KV cache usage near 100%** — context too long, reduce `--max-model-len` +- **Decode throughput < 20 tok/s sustained** — possible thermal throttling +- **Prefill throughput < 2,000 tok/s** — check for CPU offload or driver issues + +## Troubleshooting + +| Problem | Fix | +|---------|-----| +| OOM on startup with `--max-model-len 262144` | Reduce to `131072` or `65536` | +| Prefix cache hit rate stays at 0% | Normal when prompts vary heavily turn-to-turn | +| vLLM container won't start (CUDA mismatch) | Check `nvidia-smi`; vLLM requires CUDA ≥ 12.x and driver ≥ 535 | +| Still OOM after reducing context | Lower `gpu_memory_utilization` to `0.85` or use a smaller model | + +## VRAM sizing guide + +Rule of thumb for a single A100 80 GB at 92% utilization (~75 GiB usable): + +| Model size (params) | AWQ 4-bit VRAM | Max context (remaining for KV) | +|---|---|---| +| 7–8B | ~5 GiB | 262k+ (plenty of KV headroom) | +| 14B | ~9 GiB | 262k+ (plenty of KV headroom) | +| 30–32B | ~18 GiB | 262k (~57 GiB for KV cache) | +| 70–80B (MoE, 3B active) | ~45 GiB | 262k (~27 GiB for KV cache) | +| 70B (dense) | ~38 GiB | 131k (~37 GiB for KV cache) | +| 120B+ | won't fit | use multi-GPU or smaller quant | + +Supported quantization formats: + +- **AWQ** (recommended): fast Marlin kernels, good quality +- **GPTQ**: similar to AWQ, widely available +- **FP8**: 8-bit, needs Hopper+ GPUs (H100/H200) +- **BF16/FP16**: full precision, needs more VRAM + +Search HuggingFace for vLLM-compatible 
quantized models: +`https://huggingface.co/models?search=<model-name>+awq` + +## Performance characteristics + +Measured on A100 80 GB PCIe (single GPU) with Qwen3-Coder-Next AWQ 4-bit: -See `vllm-setup.txt` for detailed vLLM setup notes, VRAM sizing guide, and troubleshooting. +| Metric | vLLM (AWQ 4-bit) | Ollama (Q4_K_M) | +|--------|-------------------|-----------------| +| Prefill throughput | 5,000–11,000 tok/s | ~1,000 tok/s (est.) | +| Decode throughput | 40–99 tok/s | ~40 tok/s | +| Per-turn latency | ~10–15 s | ~28 s (32k ctx) | +| Context window | 262k (full, no truncation) | 32k (was truncating) | +| VRAM usage | 75 GiB (more KV cache) | 52–61 GiB | diff --git a/hyperstack.rb b/hyperstack.rb index a3af491..f7bfe69 100755 --- a/hyperstack.rb +++ b/hyperstack.rb @@ -718,6 +718,10 @@ module HyperstackVM request(:post, "/core/virtual-machines/#{vm_id}/sg-rules", payload) end + def delete_vm_rule(vm_id, rule_id) + request(:delete, "/core/virtual-machines/#{vm_id}/sg-rules/#{rule_id}") + end + private def request(method, path, payload = nil) @@ -1225,6 +1229,21 @@ module HyperstackVM script.join("\n") end + def litellm_decommission_script + script = [] + script << 'set -euo pipefail' + script << 'sudo systemctl stop litellm 2>/dev/null || true' + script << 'sudo systemctl disable litellm 2>/dev/null || true' + script << 'sudo rm -f /etc/systemd/system/litellm.service' + script << 'sudo systemctl daemon-reload' + script << 'sudo rm -f /ephemeral/litellm-config.yaml' + script << 'sudo rm -rf /ephemeral/litellm-env' + script << 'sudo rm -f /ephemeral/litellm.log' + script << "sudo ufw --force delete allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port 4000 proto tcp >/dev/null 2>&1 || true" + script << 'echo litellm-decommission-ok' + script.join("\n") + end + private def normalized_model_list(models) @@ -1289,6 +1308,12 @@ module HyperstackVM raise Error, "vLLM install failed: #{output.strip}" unless status.success? end + def decommission_litellm(host) + info "Removing deprecated LiteLLM service from #{host} if present..." + output, status = @ssh_stream_runner.call(host, @scripts.litellm_decommission_script) + raise Error, "LiteLLM decommission failed: #{output.strip}" unless status.success? + end + def setup_vllm_stack(host, preset_config: nil) install_vllm(host, preset_config: preset_config) end @@ -1508,6 +1533,8 @@ module HyperstackVM host = state['public_ip'] raise Error, 'No public IP in state file.' if host.nil? || host.empty? + @provisioner.decommission_litellm(host) + # Stop the old container only when it has a different name from the new one. if old_container != new_container @provisioner.stop_vllm_container(host, old_container) @@ -1574,6 +1601,7 @@ module HyperstackVM @state_store.save(state) wait_for_ssh(state['public_ip']) + @provisioner.decommission_litellm(state['public_ip']) if @config.guest_bootstrap_enabled? && state['bootstrapped_at'].nil? @provisioner.bootstrap_guest(state['public_ip']) state['bootstrapped_at'] = Time.now.utc.iso8601 @@ -1730,13 +1758,27 @@ module HyperstackVM end def ensure_security_rules(vm) - existing = Array(vm['security_rules']).map { |rule| normalize_rule(rule) } + existing_rules = Array(vm['security_rules']) + existing = existing_rules.map { |rule| normalize_rule(rule) } desired = desired_security_rules.map { |rule| normalize_rule(rule) } (desired - existing).each do |rule| info "Adding Hyperstack firewall rule #{rule['protocol']} #{rule['remote_ip_prefix']} #{rule['port_range_min']}..." 
@client.create_vm_rule(vm['id'], rule) end + + legacy_litellm_rules(existing_rules).each do |rule| + rule_id = rule['id'] || rule['rule_id'] + unless rule_id + warn 'Found legacy Hyperstack firewall rule for port 4000, but the API payload has no rule id; remove it manually from the Hyperstack console.' + next + end + + info "Removing legacy Hyperstack firewall rule #{rule['protocol']} #{rule['remote_ip_prefix']} #{rule['port_range_min']}..." + @client.delete_vm_rule(vm['id'], rule_id) + rescue Error => e + warn "Failed to remove legacy Hyperstack firewall rule #{rule_id}: #{e.message}" + end end def ollama_setup_needed?(state) @@ -1998,6 +2040,16 @@ module HyperstackVM desired_security_rules(include_vllm: state_vllm_enabled?(state), include_ollama: state_ollama_enabled?(state)) end + def legacy_litellm_rules(rules) + Array(rules).select do |rule| + normalized = normalize_rule(rule) + normalized['protocol'] == 'tcp' && + normalized['port_range_min'] == 4000 && + normalized['port_range_max'] == 4000 && + normalized['remote_ip_prefix'] == @config.wireguard_subnet + end + end + def state_vllm_enabled?(state) recorded = state&.dig('services', 'vllm_enabled') return recorded unless recorded.nil? diff --git a/pi/agent/extensions/loop-scheduler/README.md b/pi/agent/extensions/loop-scheduler/README.md index 78a6635..65ab10f 100644 --- a/pi/agent/extensions/loop-scheduler/README.md +++ b/pi/agent/extensions/loop-scheduler/README.md @@ -2,7 +2,7 @@ Session-scoped recurring prompts for Pi. -This extension adds a Claude-Code-style `/loop` command for interactive Pi +This extension adds a recurring `/loop` command for interactive Pi sessions. It schedules a prompt to be re-sent on an interval while the current Pi process stays open. diff --git a/vllm-setup.txt b/vllm-setup.txt deleted file mode 100644 index 9ff424e..0000000 --- a/vllm-setup.txt +++ /dev/null @@ -1,322 +0,0 @@ -# vLLM Setup for Hyperstack VM -# -# This document describes the full deployment of qwen3-coder-next (AWQ 4-bit) -# via vLLM exposed directly on the OpenAI-compatible API. 
-# -# Architecture: -# -# Pi (earth) Hyperstack VM (A100 80GB) -# ┌─────────────┐ ┌──────────────────────────────┐ -# │ pi │── OpenAI API ──────> │ vLLM engine (:11434) │ -# │ │ /v1/chat/completions│ FlashAttention v2 │ -# └─────────────┘ via WireGuard wg1 │ prefix caching │ -# │ bullpoint/Qwen3-Coder- │ -# │ Next-AWQ-4bit (45GB) │ -# └──────────────────────────────┘ -# -# Why vLLM instead of Ollama: -# - FlashAttention v2: ~1.5-2x faster prefill for long prompts -# - Block-level prefix caching: partial KV cache reuse even when prompt -# changes mid-sequence (Ollama requires exact prefix match from token 0) -# - Chunked prefill: can interleave prefill and decode -# - Marlin kernels for AWQ MoE quantization -# -# Model details: -# - Name: bullpoint/Qwen3-Coder-Next-AWQ-4bit (HuggingFace) -# - Architecture: MoE, 80B total params, 3B active per token -# - 512 experts, 10 activated + 1 shared per token -# - Hybrid attention: Gated DeltaNet + Gated Attention (48 layers) -# - Quantization: AWQ 4-bit, group size 32 -# - Disk size: ~45GB (vs ~151GB at BF16) -# - VRAM usage: ~45GB weights + ~27GB KV cache at 92% utilization -# - Context: 262,144 tokens (256k native) -# - vLLM requirement: >= 0.15.0 -# -# Hardware requirements: -# - Minimum: 1x A100 80GB (PCIe or SXM) -# - VRAM breakdown at gpu_memory_utilization=0.92: -# Model weights: ~45 GiB -# KV cache: ~27 GiB (298k tokens capacity, 4.49x concurrency at 262k) -# CUDA graphs: ~3 GiB -# Total: ~75 GiB / 80 GiB -# -# Ports: -# 11434/tcp - vLLM OpenAI-compatible API (reuses Ollama port for firewall compat) -# Restricted to 192.168.3.0/24 (WireGuard wg1 subnet) - -# =========================================================================== -# STEP 1: Prerequisites -# =========================================================================== -# - VM with NVIDIA GPU, CUDA drivers, and Docker with nvidia-container-toolkit -# - WireGuard wg1 tunnel already configured (see wg1-setup.sh) -# - Ollama stopped and disabled if previously running: -# -# sudo systemctl stop ollama -# sudo systemctl disable ollama - -# =========================================================================== -# STEP 2: Storage setup -# =========================================================================== -# HuggingFace model cache on ephemeral storage (fast NVMe, survives reboots -# on some providers but not guaranteed — model will re-download if lost). -# -# sudo mkdir -p /ephemeral/hug -# sudo chmod -R 0777 /ephemeral/hug - -# =========================================================================== -# STEP 3: vLLM Docker container -# =========================================================================== -# Pull and run vLLM. The model downloads on first start (~45GB, ~2.5 min). -# After download, model loading takes ~65s and CUDA graph capture ~35s. -# Total cold start: ~4-5 minutes. 
-# -# docker pull vllm/vllm-openai:latest -# -# docker run -d \ -# --gpus all \ -# --ipc=host \ -# --network host \ -# --name vllm_qwen3 \ -# --restart always \ -# -v /ephemeral/hug:/root/.cache/huggingface \ -# vllm/vllm-openai:latest \ -# --model bullpoint/Qwen3-Coder-Next-AWQ-4bit \ -# --tensor-parallel-size 1 \ -# --enable-auto-tool-choice \ -# --tool-call-parser qwen3_coder \ -# --enable-prefix-caching \ -# --gpu-memory-utilization 0.92 \ -# --max-model-len 262144 \ -# --host 0.0.0.0 \ -# --port 11434 -# -# Flags explained: -# --tensor-parallel-size 1 Single GPU (use 2/4 for multi-GPU setups) -# --enable-auto-tool-choice Enables function/tool calling -# --tool-call-parser qwen3_coder Parser for qwen3-coder tool format -# --enable-prefix-caching Block-level KV cache reuse across requests -# --gpu-memory-utilization 0.92 Use 92% of VRAM (rest for OS/overhead) -# --max-model-len 262144 Full 256k context window -# --port 11434 Reuse Ollama port for firewall compatibility -# -# Verify startup (wait for "Application startup complete"): -# docker logs -f vllm_qwen3 2>&1 | grep -E "startup complete|Error" -# -# Verify model loaded: -# curl -s http://localhost:11434/v1/models | python3 -m json.tool -# -# Quick inference test: -# curl -s http://localhost:11434/v1/chat/completions \ -# -H "Content-Type: application/json" \ -# -H "Authorization: Bearer EMPTY" \ -# -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit", -# "messages":[{"role":"user","content":"Hello"}], -# "max_tokens":50}' -# -# Monitor performance (prefix cache hit rate, throughput): -# docker logs -f vllm_qwen3 2>&1 | grep "Engine 000" - -# =========================================================================== -# STEP 4: Firewall rules -# =========================================================================== -# Allow access from WireGuard subnet only: -# -# sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp \ -# comment 'vLLM via wg1' -# =========================================================================== -# STEP 5: Client configuration (on earth / local machine) -# =========================================================================== -# -# Launch Pi or any OpenAI-compatible client directly against vLLM: -# -# OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \ -# OPENAI_API_KEY=EMPTY \ -# pi - -# =========================================================================== -# STEP 6: Monitoring & troubleshooting -# =========================================================================== -# -# --- Live engine stats --- -# vLLM logs engine metrics every 10 seconds. 
Key fields: -# - Avg prompt throughput: prefill speed (tokens/s), higher = faster -# - Avg generation throughput: decode speed (tokens/s), ~40-99 on A100 PCIe -# - GPU KV cache usage: % of KV cache memory in use (proportional to -# active context length vs max capacity) -# - Prefix cache hit rate: % of prompt tokens served from cache -# - Running/Waiting: active and queued request counts -# -# Follow live (all stats): -# docker logs -f vllm_qwen3 2>&1 | grep "Engine 000" -# -# Example output: -# Engine 000: Avg prompt throughput: 5555.2 tokens/s, -# Avg generation throughput: 49.4 tokens/s, -# Running: 1 reqs, Waiting: 0 reqs, -# GPU KV cache usage: 4.6%, -# Prefix cache hit rate: 0.0% -# -# --- Request-level monitoring --- -# See individual HTTP requests (method, status, duration): -# docker logs -f vllm_qwen3 2>&1 | grep "POST" -# -# Example output: -# 127.0.0.1:41864 - "POST /v1/chat/completions HTTP/1.1" 200 OK -# -# --- One-liner: last minute stats --- -# Useful for periodic checks without following the log: -# docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000" -# -# --- GPU hardware stats --- -# Snapshot: -# nvidia-smi -# -# Continuous (every 5 seconds): -# nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used \ -# --format=csv -l 5 -# -# --- Interpreting the stats --- -# -# Healthy baseline (A100 80GB PCIe, qwen3-coder-next AWQ 4-bit): -# Prefill throughput: 5,000-11,000 tok/s (bursts higher during batch prefill) -# Decode throughput: 40-99 tok/s (varies with output length per sample) -# KV cache usage: 0-5% for short conversations, grows with context -# (100% = 298k tokens, at which point requests queue) -# Prefix cache hit: depends on prompt reuse; higher is better -# Temperature: 44-60C under load, <45C idle -# Power: 70W idle, 230-240W under load, 300W max -# -# Warning signs: -# - Waiting > 0 for extended periods → requests queuing, model overloaded -# - KV cache usage near 100% → context too long, reduce --max-model-len -# - Decode throughput < 20 tok/s sustained → possible thermal throttling -# - Prefill throughput < 2,000 tok/s → check for CPU offload or driver issues -# -# Common issues: -# -# 1. OOM on startup with --max-model-len 262144 -# → Reduce to 131072 or 65536 -# -# 2. Prefix cache hit rate stays at 0% -# → Normal when prompts vary heavily turn-to-turn -# -# 3. vLLM container won't start (CUDA version mismatch) -# → Check driver version: nvidia-smi -# → vLLM requires CUDA >= 12.x and driver >= 535 - -# =========================================================================== -# STEP 7: Loading / switching models -# =========================================================================== -# -# vLLM serves one model per container. To switch models, stop the current -# container and start a new one with different --model. -# -# --- Stop current model --- -# docker stop vllm_qwen3 -# docker rm vllm_qwen3 -# -# --- Run a different model --- -# Replace --model, --name, and adjust --max-model-len and --tool-call-parser -# as needed. The HuggingFace model downloads automatically on first start. 
-# -# Example: qwen3-coder:30b (smaller, faster, fits easily on A100 80GB) -# -# docker run -d \ -# --gpus all \ -# --ipc=host \ -# --network host \ -# --name vllm_qwen3_30b \ -# --restart always \ -# -v /ephemeral/hug:/root/.cache/huggingface \ -# vllm/vllm-openai:latest \ -# --model Qwen/Qwen3-Coder-30B-AWQ \ -# --tensor-parallel-size 1 \ -# --enable-auto-tool-choice \ -# --tool-call-parser qwen3_coder \ -# --enable-prefix-caching \ -# --gpu-memory-utilization 0.92 \ -# --max-model-len 131072 \ -# --host 0.0.0.0 \ -# --port 11434 -# -# Example: full-precision model on multi-GPU (e.g. 4x H100) -# -# docker run -d \ -# --gpus all \ -# --ipc=host \ -# --network host \ -# --name vllm_qwen3_fp16 \ -# --restart always \ -# -v /ephemeral/hug:/root/.cache/huggingface \ -# vllm/vllm-openai:latest \ -# --model Qwen/Qwen3-Coder-Next \ -# --tensor-parallel-size 4 \ -# --enable-auto-tool-choice \ -# --tool-call-parser qwen3_coder \ -# --enable-prefix-caching \ -# --gpu-memory-utilization 0.90 \ -# --max-model-len 262144 \ -# --host 0.0.0.0 \ -# --port 11434 -# -# --- Finding models --- -# Search HuggingFace for vLLM-compatible quantized models: -# https://huggingface.co/models?search=<model-name>+awq -# https://huggingface.co/models?search=<model-name>+gptq -# -# Supported quantization formats in vLLM: -# - AWQ (recommended): fast Marlin kernels, good quality -# - GPTQ: similar to AWQ, widely available -# - FP8: 8-bit, needs Hopper+ GPUs (H100/H200) -# - BF16/FP16: full precision, needs more VRAM -# -# --- VRAM sizing guide --- -# Rule of thumb for single A100 80GB at 92% utilization (~75 GiB usable): -# -# Model size (params) | AWQ 4-bit VRAM | Max context (remaining for KV) -# ---------------------|----------------|------------------------------- -# 7-8B | ~5 GiB | 262k+ (plenty of KV headroom) -# 14B | ~9 GiB | 262k+ (plenty of KV headroom) -# 30-32B | ~18 GiB | 262k (~57 GiB for KV cache) -# 70-80B (MoE, 3B act) | ~45 GiB | 262k (~27 GiB for KV cache) -# 70B (dense) | ~38 GiB | 131k (~37 GiB for KV cache) -# 120B+ | won't fit | use multi-GPU or smaller quant -# -# If vLLM OOMs on startup, reduce --max-model-len first (halving it roughly -# halves KV cache memory). If still OOM, reduce --gpu-memory-utilization -# to 0.85 or try a smaller model. -# -# --- Verifying the new model --- -# Check loaded model: -# curl -s http://localhost:11434/v1/models | python3 -m json.tool -# -# Test inference: -# curl -s http://localhost:11434/v1/chat/completions \ -# -H "Content-Type: application/json" \ -# -H "Authorization: Bearer EMPTY" \ -# -d '{"model":"<model-name>", -# "messages":[{"role":"user","content":"Hello"}], -# "max_tokens":50}' -# -# =========================================================================== -# Performance characteristics (A100 80GB PCIe, single GPU) -# =========================================================================== -# -# Measured on 2026-03-16 with bullpoint/Qwen3-Coder-Next-AWQ-4bit: -# -# vLLM prefill throughput: 5,000-11,000 tok/s (FlashAttention v2) -# vLLM decode throughput: 40-99 tok/s (memory-bandwidth limited) -# Per-turn latency: ~10-15s (small prompts, early conversation) -# KV cache usage: 2-5% for typical coding sessions -# Prefix cache hit rate: workload-dependent -# -# Comparison with Ollama on same hardware (A100 80GB PCIe): -# -# | Ollama (Q4_K_M) | vLLM (AWQ 4-bit) -# -----------------------|-----------------------|---------------------- -# Prefill throughput | ~1,000 tok/s (est.) 
| 5,000-11,000 tok/s -# Decode throughput | ~40 tok/s | 40-99 tok/s -# Per-turn latency | ~28s (32k ctx) | ~10-15s -# Context window | 32k (was truncating) | 262k (full, no truncation) -# Prefix cache | workload-dependent | workload-dependent -# VRAM usage | 52-61 GiB | 75 GiB (more KV cache) |
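As a follow-up check after deploying this change, the snippet below is a minimal sketch for confirming the LiteLLM decommission: vLLM should still answer on its OpenAI-compatible API while the legacy LiteLLM port stays closed. It assumes the wg1 peer address and vLLM port from the removed setup notes (192.168.3.1:11434) and the port 4000 used by the old LiteLLM service; adjust both to your deployment.

```bash
# Assumes the wg1 peer address and vLLM port from the removed vllm-setup.txt
# (192.168.3.1:11434) and the legacy LiteLLM port 4000; substitute your own values.
# Requires curl, python3, and netcat on the client machine.

# vLLM should still respond on the OpenAI-compatible API.
curl -s http://192.168.3.1:11434/v1/models | python3 -m json.tool

# The legacy LiteLLM port should no longer accept connections once the
# decommission script and the Hyperstack rule cleanup have run.
if nc -z -w 3 192.168.3.1 4000; then
  echo "port 4000 still open (legacy LiteLLM rule may remain)"
else
  echo "port 4000 closed as expected"
fi
```

If port 4000 still answers, the legacy Hyperstack rule likely survived; note that `legacy_litellm_rules` only matches rules whose remote prefix equals the configured WireGuard subnet, so rules added with a different prefix must be removed manually.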
