author	Paul Buetow <paul@buetow.org>	2026-03-25 21:28:36 +0200
committer	Paul Buetow <paul@buetow.org>	2026-03-25 21:28:36 +0200
commit	3a795bda8684e55e2a754f47d1470eacc0667814 (patch)
tree	3ee527f2251a9e4df6ffb42fc5344f7ba1bed75d
parent	60db19bc70b4dcd1486030129c95cfff8e91b11d (diff)
ops: add AGENTS.md with startup runbook and known issues
Documents operational gotchas encountered during VM lifecycle:

- A100 first / H100 fallback procedure (manual for now)
- Docker image pull EOF failures and idempotent resume via create
- WireGuard 'already exists' conflict between setup script and systemd
- WireGuard public key mismatch after VM recreate (live fix commands)
- vLLM startup phase timing and log signals
- State file inspection and Hyperstack firewall IP drift

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
-rw-r--r--	AGENTS.md	266
1 files changed, 266 insertions, 0 deletions
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000..0631fde
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,266 @@
+# AGENTS.md — Operational runbook for hypr
+
+This file documents known startup issues, workarounds, and diagnostic procedures
+for the Hyperstack VM + WireGuard + vLLM setup. See README.md for architecture and
+configuration reference.
+
+---
+
+## GPU flavor availability: A100 first, H100 fallback
+
+The A100 80GB (`n3-A100x1`) is cheaper but sometimes sold out.
+The H100 80GB (`n3-H100x1`) is the fallback.
+
+**Manual fallback procedure:**
+
+1. Edit the TOML and try A100 first:
+ ```toml
+ # hyperstack-vm2.toml
+ flavor_name = "n3-A100x1"
+ ```
+2. Run `ruby hyperstack.rb --config hyperstack-vm2.toml create`.
+3. If the API returns a flavor-not-available error, switch to H100:
+ ```toml
+ flavor_name = "n3-H100x1"
+ ```
+4. Re-run create. The state file is only written after the VM is successfully created,
+ so a failed create leaves nothing to clean up.
+
+Both GPUs have 80 GB VRAM and run all presets identically.
+The TOML comment above `flavor_name` tracks the current choice.
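+
+The manual fallback above can be scripted. A minimal sketch, where `run_create`
+is a hypothetical callable that rewrites `flavor_name` in the TOML, invokes
+`create`, and returns True on success (it is not part of the current tooling):
+
+```python
+def create_with_fallback(flavors, run_create):
+    """Try each GPU flavor in order; return the one that worked, else None."""
+    for flavor in flavors:
+        if run_create(flavor):
+            return flavor
+    return None
+
+# Usage: create_with_fallback(["n3-A100x1", "n3-H100x1"], run_create)
+```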
+
+---
+
+## Docker image pull failures (transient EOF)
+
+The `vllm/vllm-openai:latest` image is ~20 GB. Docker Hub occasionally drops the
+connection mid-layer with:
+
+```
+failed to extract layer ... EOF
+```
+
+The provisioner retries twice automatically. If all attempts fail, just re-run create:
+
+```bash
+ruby hyperstack.rb --config hyperstack-vm2.toml create
+```
+
+The VM already exists and is tracked in the state file; `create` resumes from where it
+left off (skips VM creation, goes straight to vLLM setup). Docker will retry the pull
+from scratch and usually succeeds on the next attempt.
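+
+If you'd rather not re-run by hand, a bounded retry loop around `create` works.
+A sketch; the `retry` helper is illustrative, not part of the tooling:
+
+```python
+import subprocess
+import time
+
+def retry(times, delay, fn):
+    """Call fn() until it returns True, at most `times` times."""
+    for attempt in range(1, times + 1):
+        if fn():
+            return True
+        if attempt < times:
+            time.sleep(delay)
+    return False
+
+def run_create():
+    cmd = ["ruby", "hyperstack.rb", "--config", "hyperstack-vm2.toml", "create"]
+    return subprocess.run(cmd).returncode == 0
+
+# Usage: retry(3, 30, run_create)
+```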
+
+---
+
+## WireGuard tunnel issues on first VM start
+
+After `create`, the WireGuard tunnel is set up by `wg1-setup.sh`. Several things can
+go wrong on the first attempt.
+
+### Symptom: `wg1 already exists`
+
+The systemd service fails with:
+
+```
+wg-quick: `wg1' already exists
+```
+
+This means the setup script brought the interface up manually, and systemd then
+tried to bring it up again and failed. The interface is actually running, but
+systemd has marked the service as failed.
+
+**Fix:**
+
+```bash
+# Check the interface is actually up
+sudo wg show wg1
+
+# If the peer is listed correctly, just reload systemd state
+sudo systemctl reset-failed wg-quick@wg1
+
+# If the peer is wrong or missing, drop and restart
+sudo ip link delete wg1
+sudo systemctl start wg-quick@wg1
+```
+
+### Symptom: tunnel up but no handshake (0 bytes received)
+
+```bash
+sudo wg show wg1 latest-handshakes
+# shows timestamp 0 for a peer
+```
+
+The most common cause after recreating a VM is a **stale public key** in the local
+`/etc/wireguard/wg1.conf`. When a VM is deleted and recreated, it generates fresh
+WireGuard keys. The setup script writes the new key, but if the script ran with
+errors (e.g. WireGuard retry failures during `create`), the local conf may still
+contain the old VM's public key.
+
+**Diagnose:**
+
+```bash
+# Get the VM's actual current public key
+ssh ubuntu@<vm-public-ip> 'sudo wg show wg1 public-key'
+
+# Compare to what's in the local conf
+grep PublicKey /etc/wireguard/wg1.conf
+```
+
+**Fix a key mismatch:**
+
+```bash
+# Replace the stale key in the local conf (substitute correct values)
+STALE_KEY="<old-key-from-conf>"
+NEW_KEY="<actual-key-from-vm>"
+VM_IP="<vm-public-ip>"
+VM_WG_IP="192.168.3.3" # .1 for VM1, .3 for VM2
+
+sudo sed -i "s|PublicKey = ${STALE_KEY}|PublicKey = ${NEW_KEY}|" /etc/wireguard/wg1.conf
+
+# Apply the new peer live without restarting the interface
+sudo wg set wg1 peer ${NEW_KEY} endpoint ${VM_IP}:56710 \
+ allowed-ips ${VM_WG_IP}/32 persistent-keepalive 25
+
+# Remove the stale peer entry
+sudo wg set wg1 peer ${STALE_KEY} remove
+
+# Verify handshake within ~5 s
+sleep 5 && sudo wg show wg1 latest-handshakes
+```
+
+### Verify the tunnel end-to-end
+
+After fixing any of the above:
+
+```bash
+# 1. Confirm handshake timestamp is recent (non-zero, within last 30 s)
+sudo wg show wg1 latest-handshakes
+
+# 2. Ping through the tunnel
+ping -c 3 192.168.3.3 # VM2; use 192.168.3.1 for VM1
+
+# 3. Confirm vLLM is reachable over the tunnel
+curl -s http://192.168.3.3:11434/v1/models | python3 -c \
+ "import sys,json; print([m['id'] for m in json.load(sys.stdin)['data']])"
+
+# 4. Full automated test
+ruby hyperstack.rb --config hyperstack-vm2.toml test
+```
+
+Note: `curl` to the public IP will time out — port 11434 is firewalled to
+the WireGuard subnet (`192.168.3.0/24`) only. Always use the WireGuard IP.
+
+---
+
+## vLLM container startup sequence
+
+After the Docker container starts, the model goes through several phases before
+inference is ready. On an A100 with a warm HuggingFace cache:
+
+| Phase | Duration | Log signal |
+|-------|----------|------------|
+| Docker pull (first time) | ~2–3 min | Layer progress bars |
+| Model download from HuggingFace (first time) | ~3–5 min | `Downloading...` |
+| Weight loading | ~47 s | `Loading safetensors checkpoint shards: 100%` |
+| torch.compile + CUDA graph capture | ~1–2 min | `torch.compile took X s` |
+| **Ready** | — | `Application startup complete.` |
+
+**Monitor startup:**
+
+```bash
+ssh ubuntu@<vm-public-ip> 'sudo docker logs -f vllm_qwen3 2>&1' \
+ | grep -E "startup complete|Error|Loading|Downloading"
+```
+
+After `Application startup complete.`, the model responds immediately.
+If the container crashes before that line, check for CUDA errors:
+
+```bash
+ssh ubuntu@<vm-public-ip> 'sudo docker logs vllm_qwen3 2>&1 | grep -i "error\|cuda"'
+```
+
+A `CUDA error: operation not permitted` on the first engine process (pid visible in
+logs) is harmless if a second engine process starts successfully right after — vLLM
+retries internally.
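+
+If you'd rather poll the API than watch logs, a standard-library sketch
+(response shape as in the tunnel verification step earlier; the endpoint is
+reachable only over the WireGuard IP):
+
+```python
+import json
+import time
+import urllib.request
+
+def model_ids(body):
+    # Parse a /v1/models response body and return the model ids.
+    return [m["id"] for m in json.loads(body)["data"]]
+
+def wait_ready(url, tries=60, delay=10.0):
+    # Poll until vLLM answers; ~10 minutes total with the defaults.
+    for _ in range(tries):
+        try:
+            with urllib.request.urlopen(url, timeout=5) as resp:
+                return model_ids(resp.read())
+        except OSError:
+            time.sleep(delay)
+    raise TimeoutError("vLLM not ready at " + url)
+
+# Usage: wait_ready("http://192.168.3.3:11434/v1/models")
+```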
+
+---
+
+## Resuming a failed `create`
+
+If `create` exits non-zero partway through (e.g. WireGuard retries exhausted, Docker
+EOF), the VM is still running and the state file tracks it. Simply re-run:
+
+```bash
+ruby hyperstack.rb --config hyperstack-vm2.toml create
+```
+
+The script checks `vllm_setup_at` and `bootstrapped_at` in the state file and skips
+already-completed steps. Typical resume flow:
+
+- VM already exists → skips VM creation
+- `bootstrapped_at` set → skips guest bootstrap
+- `vllm_setup_at` nil → runs vLLM Docker setup
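+
+That decision logic can be sketched against the state file directly
+(field names as above; `pending_steps` is illustrative, not part of the script):
+
+```python
+import json
+
+def pending_steps(state):
+    # Return the provisioning steps a re-run of `create` would still execute.
+    steps = []
+    if not state.get("vm_id"):
+        steps.append("create_vm")
+    if not state.get("bootstrapped_at"):
+        steps.append("bootstrap")
+    if not state.get("vllm_setup_at"):
+        steps.append("vllm_setup")
+    return steps
+
+# Usage:
+# with open(".hyperstack-vm2-state.json") as f:
+#     print(pending_steps(json.load(f)))
+```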
+
+If you want to force a full reprovision from scratch:
+
+```bash
+ruby hyperstack.rb --config hyperstack-vm2.toml create --replace
+```
+
+This deletes the existing VM, clears the state file, and starts over.
+
+---
+
+## WireGuard firewall rules (Hyperstack security group)
+
+Port 56710/udp and port 22/tcp are locked to `allowed_wireguard_cidrs` and
+`allowed_ssh_cidrs` respectively. These default to `["auto"]`, which resolves your
+current public egress IPv4 at creation time.
+
+If your IP changes after creation (e.g. ISP reassignment), the Hyperstack firewall
+silently drops your handshake packets. Note that `nc -zu` is not conclusive here:
+it reports success for UDP unless an ICMP rejection comes back, so the port can
+look "reachable" while WireGuard still shows 0 bytes received and timestamp 0.
+
+**Check what IP is in the Hyperstack rule:**
+
+```bash
+python3 -c "
+import json
+d = json.load(open('.hyperstack-vm2-state.json'))
+for r in d.get('security_rules', []):
+ if r.get('port_range_min') in (22, 56710):
+ print(r['protocol'], r['port_range_min'], r['remote_ip_prefix'])
+"
+```
+
+**Check your current IPv4:**
+
+```bash
+curl -s -4 https://ifconfig.me
+```
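+
+The two checks can be combined. A sketch that assumes "auto"-resolved rules are
+stored as `/32` prefixes (the `drifted_rules` helper is illustrative only):
+
+```python
+import json
+
+def drifted_rules(state, current_ip):
+    # Return (protocol, port, prefix) for SSH/WireGuard rules that no
+    # longer match the current egress IP. Assumes "auto" rules are /32.
+    want = current_ip + "/32"
+    stale = []
+    for rule in state.get("security_rules", []):
+        if rule.get("port_range_min") in (22, 56710) \
+                and rule.get("remote_ip_prefix") != want:
+            stale.append((rule["protocol"], rule["port_range_min"],
+                          rule["remote_ip_prefix"]))
+    return stale
+
+# Usage (current IP fetched as in the curl command above):
+# with open(".hyperstack-vm2-state.json") as f:
+#     print(drifted_rules(json.load(f), "203.0.113.7"))
+```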
+
+If they differ, update the Hyperstack security group via the web console or re-run
+`create --replace` so the rules are refreshed with the new IP.
+
+---
+
+## Checking the state file
+
+The JSON state file (`.hyperstack-vm2-state.json` for VM2) is the source of truth
+for provisioning state. Key fields:
+
+```bash
+python3 -c "
+import json
+d = json.load(open('.hyperstack-vm2-state.json'))
+print('vm_id: ', d.get('vm_id'))
+print('public_ip: ', d.get('public_ip'))
+print('bootstrapped_at:', d.get('bootstrapped_at'))
+print('vllm_setup_at: ', d.get('vllm_setup_at'))
+print('vllm_model: ', d.get('vllm_model'))
+"
+```
+
+If `vllm_setup_at` is `None` but the container is running, the provisioner did not
+mark setup as complete (likely a transient error at the end of `create`). Re-running
+`create` will redo the vLLM step.