author     Paul Buetow <paul@buetow.org>  2026-03-21 10:49:35 +0200
committer  Paul Buetow <paul@buetow.org>  2026-03-21 10:49:35 +0200
commit     ea0f9f7f51b32f0c392f75aa0cc3231211f54757 (patch)
tree       378d01dbc87dc0ef9f4fbd6ec7788e0a62f66876
parent     4baa087445a11b856139f55adab262fa97384033 (diff)
Remove LiteLLM and Claude Code repo references (task 301)
-rw-r--r--  README.md                            46
-rw-r--r--  hyperstack-vm.toml                   14
-rw-r--r--  hyperstack-vm1.toml                  12
-rw-r--r--  hyperstack-vm2.toml                  12
-rwxr-xr-x  hyperstack.rb                       173
-rw-r--r--  pi/agent/extensions/btw/README.md     2
-rw-r--r--  vllm-setup.txt                      203
7 files changed, 41 insertions, 421 deletions
diff --git a/README.md b/README.md
index cba656c..93ddc58 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
# hyperstack
-Automates Hyperstack GPU VM lifecycle: create, bootstrap, WireGuard tunnel, vLLM inference, LiteLLM proxy.
+Automates Hyperstack GPU VM lifecycle: create, bootstrap, WireGuard tunnel, and vLLM inference.
Runs two A100 VMs concurrently — each serving a different model — with [Pi](https://pi.dev) coding agents connected to each.
## Architecture
@@ -18,9 +18,6 @@ Runs two A100 VMs concurrently — each serving a different model — with [Pi](
│ vLLM (:11434) │ │ vLLM (:11434) │
│ Nemotron-3-Super 120B │ │ Qwen3-Coder-Next 80B (MoE) │
│ (hybrid Mamba+MoE, AWQ-4b) │ │ (AWQ-4bit) │
- │ │ │ │
- │ LiteLLM (:4000) │ │ LiteLLM (:4000) │
- │ Anthropic API → OpenAI │ │ Anthropic API → OpenAI │
└──────────────────────────────┘ └──────────────────────────────────┘
▲ ▲
│ OpenAI /v1/chat/completions │ OpenAI /v1/chat/completions
@@ -33,7 +30,7 @@ Runs two A100 VMs concurrently — each serving a different model — with [Pi](
```
Both VMs share a single WireGuard interface (`wg1`) on the local machine.
-Each VM runs one vLLM model and a LiteLLM proxy for Anthropic-API translation.
+Each VM runs one vLLM model exposed directly to Pi over the OpenAI-compatible API.
## Prerequisites
@@ -49,7 +46,7 @@ Each VM runs one vLLM model and a LiteLLM proxy for Anthropic-API translation.
## Quickstart (two-VM setup)
```bash
-# Deploy both VMs in parallel, set up WireGuard + vLLM + LiteLLM (~10 min)
+# Deploy both VMs in parallel, set up WireGuard + vLLM (~10 min)
ruby hyperstack.rb create-both
# Verify both VMs are working
@@ -144,33 +141,6 @@ Available presets (both VMs share the same set):
| `qwen3-32b` | Qwen3-32B (AWQ) | ~18 GB | 32K |
| `devstral` | Devstral-Small-2507 (AWQ-4bit) | ~15 GB | 32K |
-## Using Claude Code with vLLM
-
-WireGuard (`wg1`) must be active before connecting.
-
-```bash
-ANTHROPIC_BASE_URL=http://hyperstack1.wg1:4000 \
-ANTHROPIC_API_KEY=sk-litellm-master \
-claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
-```
-
-If you see an **"Auth conflict"** warning, clear the saved claude.ai session first:
-
-```bash
-claude /logout
-```
-
-**Available model aliases** — all map to the same vLLM model on that VM:
-
-| Alias | Use case |
-|-------|----------|
-| `claude-opus-4-6-20260604` | Recommended (most future-proof) |
-| `claude-opus-4-20250514` | |
-| `claude-sonnet-4-20250514` | |
-| `claude-haiku-3-5-20241022` | |
-
-Add new Anthropic model IDs to `vllm.litellm_claude_model_names` in the TOML as they are released.
-
## CLI reference
```
@@ -182,13 +152,13 @@ Commands:
delete Destroy the tracked VM
delete-both Destroy both VM1 and VM2
status Show VM and WireGuard status
- test Run end-to-end inference tests (vLLM + LiteLLM)
+ test Run end-to-end inference tests (vLLM)
model switch <preset> Hot-switch the running vLLM model
create / create-both options:
--replace Delete existing tracked VM before creating
--dry-run Print the plan without making changes
- --vllm / --no-vllm Override config: enable/disable vLLM+LiteLLM setup
+ --vllm / --no-vllm Override config: enable/disable vLLM setup
--ollama / --no-ollama Override config: enable/disable Ollama setup
```
@@ -200,7 +170,7 @@ Key sections:
| Section | Purpose |
|---------|---------|
| `[vm]` | Flavor, image, environment name |
-| `[vllm]` | Model, container settings, LiteLLM key and Claude aliases |
+| `[vllm]` | Model, container settings, and vLLM runtime options |
| `[vllm.presets.*]` | Named model presets for hot-switching |
| `[ollama]` | Ollama settings (disabled by default; set `install = true` to use instead) |
| `[network]` | Ports, WireGuard subnet, allowed CIDRs |
@@ -222,8 +192,6 @@ ssh ubuntu@<vm-ip> 'docker logs -f vllm_nemotron_super 2>&1 | grep "Engine 000"'
# GPU stats (every 5 s)
ssh ubuntu@<vm-ip> 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5'
-# LiteLLM proxy log
-ssh ubuntu@<vm-ip> 'sudo journalctl -fu litellm'
```
Healthy baseline (A100 80GB PCIe):
@@ -234,4 +202,4 @@ Healthy baseline (A100 80GB PCIe):
| Decode throughput | 40–99 tok/s |
| KV cache usage | 2–5% for typical sessions |
-See `vllm-setup.txt` for detailed vLLM and LiteLLM setup notes, VRAM sizing guide, and troubleshooting.
+See `vllm-setup.txt` for detailed vLLM setup notes, VRAM sizing guide, and troubleshooting.
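With LiteLLM removed, clients hit vLLM's OpenAI-compatible endpoint directly over WireGuard. A minimal smoke test from the local machine, as a sketch: the `hyperstack1.wg1` hostname and the model id are illustrative — use the hostname of the VM you target and the id reported by `/v1/models`.

```bash
# List the model currently served by the VM over the WireGuard tunnel
# (port 11434 is the shared Ollama/vLLM inference port).
curl -s http://hyperstack1.wg1:11434/v1/models

# One OpenAI-style chat completion; vLLM needs no real key now that the
# LiteLLM master key is gone, so EMPTY is enough.
curl -s http://hyperstack1.wg1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
       "messages":[{"role":"user","content":"Say hello in five words."}],
       "max_tokens":50}'
```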
diff --git a/hyperstack-vm.toml b/hyperstack-vm.toml
index e82c97f..28de975 100644
--- a/hyperstack-vm.toml
+++ b/hyperstack-vm.toml
@@ -37,8 +37,6 @@ allowed_ssh_cidrs = ["auto"]
allowed_wireguard_cidrs = ["auto"]
# Port 11434 is shared by both Ollama and vLLM for firewall compatibility.
ollama_port = 11434
-# Port 4000: LiteLLM Anthropic-API proxy (used with vLLM).
-litellm_port = 4000
[bootstrap]
enable_guest_bootstrap = true
@@ -56,7 +54,7 @@ num_parallel = 1
context_length = 32768
pull_models = ["qwen3-coder-next", "qwen3-coder:30b", "gpt-oss:20b", "gpt-oss:120b", "nemotron-3-super"]
-# vLLM serves one model via Docker; LiteLLM translates Anthropic API → OpenAI.
+# vLLM serves one model via Docker, exposed on the OpenAI-compatible API.
# Use --vllm / --no-vllm CLI flags to override install at runtime.
[vllm]
install = true
@@ -68,14 +66,6 @@ max_model_len = 262144
gpu_memory_utilization = 0.92
tensor_parallel_size = 1
tool_call_parser = "qwen3_coder"
-# LiteLLM maps each entry to the vLLM model; add new Anthropic model IDs here.
-litellm_master_key = "sk-litellm-master"
-litellm_claude_model_names = [
- "claude-sonnet-4-20250514",
- "claude-opus-4-20250514",
- "claude-opus-4-6-20260604",
- "claude-haiku-3-5-20241022"
-]
# Named model presets for 'ruby hyperstack.rb model switch <name>'.
# Each preset overrides the matching [vllm] field; unset fields fall back to [vllm] defaults.
@@ -127,7 +117,7 @@ tool_call_parser = ""
# OpenAI GPT-OSS 120B — powerful MoE (5.1B active / 117B total, MXFP4), ~65 GB on A100.
# Hard architecture limit: max_position_embeddings=131072 in model config.json.
# 131072 is the absolute ceiling — exceeding it causes NaN or CUDA OOB errors.
-# For sessions approaching this limit, start a fresh opencode conversation.
+# For sessions approaching this limit, start a fresh Pi conversation.
# tool_call_parser = "" disables --enable-auto-tool-choice (same reason as gpt-oss-20b).
[vllm.presets.gpt-oss-120b]
model = "openai/gpt-oss-120b"
diff --git a/hyperstack-vm1.toml b/hyperstack-vm1.toml
index 1b116bd..6109472 100644
--- a/hyperstack-vm1.toml
+++ b/hyperstack-vm1.toml
@@ -41,8 +41,6 @@ allowed_ssh_cidrs = ["auto"]
allowed_wireguard_cidrs = ["auto"]
# Port 11434 is shared by both Ollama and vLLM for firewall compatibility.
ollama_port = 11434
-# Port 4000: LiteLLM Anthropic-API proxy (used with vLLM).
-litellm_port = 4000
[bootstrap]
enable_guest_bootstrap = true
@@ -60,7 +58,7 @@ num_parallel = 1
context_length = 32768
pull_models = ["nemotron-3-super"]
-# vLLM serves one model via Docker; LiteLLM translates Anthropic API → OpenAI.
+# vLLM serves one model via Docker, exposed on the OpenAI-compatible API.
# VM1 defaults to nemotron-3-super; use 'model switch' to load any other preset.
[vllm]
install = true
@@ -75,14 +73,6 @@ tensor_parallel_size = 1
tool_call_parser = "qwen3_xml"
trust_remote_code = true
extra_vllm_args = ["--reasoning-parser", "nemotron_v3"]
-# LiteLLM maps each entry to the vLLM model; add new Anthropic model IDs here.
-litellm_master_key = "sk-litellm-master"
-litellm_claude_model_names = [
- "claude-sonnet-4-20250514",
- "claude-opus-4-20250514",
- "claude-opus-4-6-20260604",
- "claude-haiku-3-5-20241022"
-]
# Named model presets for 'ruby hyperstack.rb --config hyperstack-vm1.toml model switch <name>'.
# Each preset overrides the matching [vllm] field; unset fields fall back to [vllm] defaults.
diff --git a/hyperstack-vm2.toml b/hyperstack-vm2.toml
index e8e9b00..202a340 100644
--- a/hyperstack-vm2.toml
+++ b/hyperstack-vm2.toml
@@ -41,8 +41,6 @@ allowed_ssh_cidrs = ["auto"]
allowed_wireguard_cidrs = ["auto"]
# Port 11434 is shared by both Ollama and vLLM for firewall compatibility.
ollama_port = 11434
-# Port 4000: LiteLLM Anthropic-API proxy (used with vLLM).
-litellm_port = 4000
[bootstrap]
enable_guest_bootstrap = true
@@ -60,7 +58,7 @@ num_parallel = 1
context_length = 32768
pull_models = ["qwen3-coder-next"]
-# vLLM serves one model via Docker; LiteLLM translates Anthropic API → OpenAI.
+# vLLM serves one model via Docker, exposed on the OpenAI-compatible API.
# VM2 defaults to qwen3-coder-next; use 'model switch' to load any other preset.
[vllm]
install = true
@@ -72,14 +70,6 @@ max_model_len = 262144
gpu_memory_utilization = 0.92
tensor_parallel_size = 1
tool_call_parser = "qwen3_coder"
-# LiteLLM maps each entry to the vLLM model; add new Anthropic model IDs here.
-litellm_master_key = "sk-litellm-master"
-litellm_claude_model_names = [
- "claude-sonnet-4-20250514",
- "claude-opus-4-20250514",
- "claude-opus-4-6-20260604",
- "claude-haiku-3-5-20241022"
-]
# Named model presets for 'ruby hyperstack.rb --config hyperstack-vm2.toml model switch <name>'.
# Each preset overrides the matching [vllm] field; unset fields fall back to [vllm] defaults.
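The two VM configs differ only in their defaults (VM1: nemotron-3-super, VM2: qwen3-coder-next); presets are hot-switched per config file. A hedged example using a preset name from the shared preset table:

```bash
# Hot-switch VM2 to another named preset; fields the preset leaves unset
# fall back to the [vllm] defaults in hyperstack-vm2.toml.
ruby hyperstack.rb --config hyperstack-vm2.toml model switch qwen3-32b
```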
diff --git a/hyperstack.rb b/hyperstack.rb
index 7cd817d..a3af491 100755
--- a/hyperstack.rb
+++ b/hyperstack.rb
@@ -87,7 +87,6 @@ module HyperstackVM
# Set to a different address (e.g. 192.168.3.3) for a second VM sharing the same wg1 tunnel.
'wireguard_server_ip' => nil,
'ollama_port' => 11_434,
- 'litellm_port' => 4_000,
'allowed_ssh_cidrs' => ['auto'],
'allowed_wireguard_cidrs' => ['auto']
},
@@ -114,14 +113,7 @@ module HyperstackVM
'max_model_len' => 262_144,
'gpu_memory_utilization' => 0.92,
'tensor_parallel_size' => 1,
- 'tool_call_parser' => 'qwen3_coder',
- 'litellm_claude_model_names' => %w[
- claude-sonnet-4-20250514
- claude-opus-4-20250514
- claude-opus-4-6-20260604
- claude-haiku-3-5-20241022
- ],
- 'litellm_master_key' => 'sk-litellm-master'
+ 'tool_call_parser' => 'qwen3_coder'
},
'wireguard' => {
'auto_setup' => true,
@@ -338,10 +330,6 @@ module HyperstackVM
Integer(fetch('network', 'ollama_port'))
end
- def litellm_port
- Integer(fetch('network', 'litellm_port'))
- end
-
# Returns the server-side WireGuard IP for this VM.
# Uses the explicitly configured address when set; otherwise derives it as subnet_base + 1.
# Example: 192.168.3.0/24 → 192.168.3.1 (default VM1); VM2 sets wireguard_server_ip=192.168.3.3.
@@ -453,14 +441,6 @@ module HyperstackVM
fetch('vllm', 'tool_call_parser')
end
- def litellm_claude_model_names
- Array(fetch('vllm', 'litellm_claude_model_names')).map(&:to_s)
- end
-
- def litellm_master_key
- fetch('vllm', 'litellm_master_key')
- end
-
# Whether to pass --trust-remote-code to vLLM for the default model.
# Required for architectures not yet in the vLLM upstream registry (e.g. nemotron_h).
def vllm_trust_remote_code
@@ -530,7 +510,6 @@ module HyperstackVM
end
rules << firewall_rule('tcp', ollama_port, wireguard_subnet) if include_ollama || include_vllm
- rules << firewall_rule('tcp', litellm_port, wireguard_subnet) if include_vllm
rules.uniq
end
@@ -1081,8 +1060,6 @@ module HyperstackVM
script << "sudo ufw allow #{@config.wireguard_udp_port}/udp comment 'WireGuard #{@config.local_interface_name}' >/dev/null 2>&1 || true"
# Port 11434 is shared by Ollama and vLLM; open for both regardless of which is installed.
script << "sudo ufw allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port #{@config.ollama_port} proto tcp comment 'Inference API (Ollama/vLLM) via #{@config.local_interface_name}' >/dev/null 2>&1 || true"
- # Port 4000: LiteLLM proxy (Anthropic API -> vLLM); open alongside the inference port.
- script << "sudo ufw allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port #{@config.litellm_port} proto tcp comment 'LiteLLM proxy via #{@config.local_interface_name}' >/dev/null 2>&1 || true"
end
if @config.configure_ollama_host?
@@ -1248,71 +1225,6 @@ module HyperstackVM
script.join("\n")
end
- def litellm_install_script(model_override: nil)
- port = @config.litellm_port
- model = model_override || @config.vllm_model
-
- script = []
- script << 'set -euo pipefail'
- script << 'sudo apt-get install -y python3.12-venv'
- script << 'sudo mkdir -p /ephemeral/litellm-env'
- script << 'sudo chown ubuntu:ubuntu /ephemeral/litellm-env'
- script << 'python3 -m venv /ephemeral/litellm-env'
- script << '/ephemeral/litellm-env/bin/pip install --quiet "litellm[proxy]"'
- script << "sudo tee /ephemeral/litellm-config.yaml > /dev/null << 'LITELLM_YAML'"
- script << 'model_list:'
- script.concat(litellm_model_entries(model))
- script << ''
- script << 'litellm_settings:'
- script << ' drop_params: true'
- script << ''
- script << 'general_settings:'
- script << " master_key: \"#{@config.litellm_master_key}\""
- script << 'LITELLM_YAML'
- script << "sudo tee /etc/systemd/system/litellm.service > /dev/null << 'LITELLM_UNIT'"
- script << '[Unit]'
- script << 'Description=LiteLLM Proxy'
- script << 'After=network.target docker.service'
- script << 'Requires=docker.service'
- script << ''
- script << '[Service]'
- script << 'Type=simple'
- script << 'User=ubuntu'
- script << "ExecStart=/ephemeral/litellm-env/bin/litellm --config /ephemeral/litellm-config.yaml --host 0.0.0.0 --port #{port}"
- script << 'Restart=always'
- script << 'RestartSec=5'
- script << ''
- script << '[Install]'
- script << 'WantedBy=multi-user.target'
- script << 'LITELLM_UNIT'
- script << 'sudo systemctl daemon-reload'
- script << 'sudo systemctl enable --now litellm'
- script << 'sleep 5'
- script << 'systemctl is-active --quiet litellm'
- script << 'echo litellm-install-ok'
- script.join("\n")
- end
-
- def litellm_reload_script(model)
- script = []
- script << 'set -euo pipefail'
- script << "sudo tee /ephemeral/litellm-config.yaml > /dev/null << 'LITELLM_YAML'"
- script << 'model_list:'
- script.concat(litellm_model_entries(model))
- script << ''
- script << 'litellm_settings:'
- script << ' drop_params: true'
- script << ''
- script << 'general_settings:'
- script << " master_key: \"#{@config.litellm_master_key}\""
- script << 'LITELLM_YAML'
- script << 'sudo systemctl restart litellm'
- script << 'sleep 3'
- script << 'systemctl is-active --quiet litellm'
- script << 'echo litellm-reload-ok'
- script.join("\n")
- end
-
private
def normalized_model_list(models)
@@ -1324,19 +1236,6 @@ module HyperstackVM
end
end
- def litellm_model_entries(model)
- vllm_port = @config.ollama_port
-
- @config.litellm_claude_model_names.flat_map do |name|
- [
- " - model_name: \"#{name}\"",
- ' litellm_params:',
- " model: \"hosted_vllm/#{model}\"",
- " api_base: \"http://localhost:#{vllm_port}/v1\"",
- ' api_key: "EMPTY"'
- ]
- end
- end
end
class RemoteProvisioner
@@ -1390,22 +1289,8 @@ module HyperstackVM
raise Error, "vLLM install failed: #{output.strip}" unless status.success?
end
- def install_litellm(host, model:)
- info "Setting up LiteLLM Anthropic-API proxy on #{host}..."
- output, status = @ssh_stream_runner.call(host, @scripts.litellm_install_script(model_override: model))
- raise Error, "LiteLLM install failed: #{output.strip}" unless status.success?
- end
-
- def reload_litellm(host, model)
- info "Reloading LiteLLM proxy config for #{model}..."
- output, status = @ssh_stream_runner.call(host, @scripts.litellm_reload_script(model))
- raise Error, "LiteLLM reload failed: #{output.strip}" unless status.success?
- end
-
def setup_vllm_stack(host, preset_config: nil)
install_vllm(host, preset_config: preset_config)
- model = preset_config&.dig('model') || @config.vllm_model
- install_litellm(host, model: model)
end
private
@@ -1598,7 +1483,7 @@ module HyperstackVM
end
# Switches the running VM to a different named model preset.
- # Stops the old container, starts the new one, and hot-reloads LiteLLM config.
+ # Stops the old container, then starts the new vLLM container in its place.
def switch_model(preset_name:, dry_run: false)
preset = @config.vllm_preset(preset_name) # raises if unknown
state = @state_store.load
@@ -1633,10 +1518,6 @@ module HyperstackVM
# surprise multi-GB download if the upstream image was updated.
@provisioner.install_vllm(host, preset_config: preset, pull_image: false)
- # Hot-reload LiteLLM: rewrite config for the new model and restart the service.
- # Skips venv/apt install since those are already in place.
- @provisioner.reload_litellm(host, preset['model'])
-
state['vllm_model'] = preset['model']
state['vllm_container_name'] = new_container
state['vllm_preset'] = preset_name
@@ -1650,7 +1531,7 @@ module HyperstackVM
info "Run 'ruby hyperstack.rb test' to verify."
end
- # Runs end-to-end inference tests against vLLM and LiteLLM over WireGuard.
+  # Runs end-to-end inference tests against the active services over WireGuard.
# Requires wg1 to be active and the VM to be fully provisioned.
def test
state = @state_store.load
@@ -1663,7 +1544,6 @@ module HyperstackVM
if vllm_enabled
test_vllm(wg_ip)
- test_litellm(wg_ip)
end
info " Ollama test: connect via SSH and run 'ollama list' to verify models." if ollama_enabled
@@ -1731,7 +1611,7 @@ module HyperstackVM
@state_store.save(state)
end
- # Set up vLLM (Docker container) + LiteLLM (Anthropic-API proxy) after
+  # Set up vLLM (Docker container) after
# the tunnel is up so that model-download progress is visible locally.
if vllm_setup_needed?(state)
preset_cfg = effective_vllm_preset_config
@@ -1755,9 +1635,8 @@ module HyperstackVM
return unless effective_vllm?
wg_ip = @config.wireguard_gateway_hostname
- info "Run 'ruby hyperstack.rb test' to verify vLLM and LiteLLM."
+ info "Run 'ruby hyperstack.rb test' to verify vLLM."
info " vLLM: http://#{wg_ip}:#{@config.ollama_port}/v1/models"
- info " LiteLLM: http://#{wg_ip}:#{@config.litellm_port}/v1/messages"
end
def build_create_payload(vm_name, resolved)
@@ -2138,9 +2017,9 @@ module HyperstackVM
end
def service_mode_summary(vllm_enabled:, ollama_enabled:)
- return 'vLLM+LiteLLM enabled, Ollama enabled' if vllm_enabled && ollama_enabled
- return 'vLLM+LiteLLM enabled, Ollama disabled' if vllm_enabled
- return 'Ollama enabled, vLLM+LiteLLM disabled' if ollama_enabled
+ return 'vLLM enabled, Ollama enabled' if vllm_enabled && ollama_enabled
+ return 'vLLM enabled, Ollama disabled' if vllm_enabled
+ return 'Ollama enabled, vLLM disabled' if ollama_enabled
'All inference services disabled'
end
@@ -2204,8 +2083,6 @@ module HyperstackVM
preset_note = @effective_vllm_preset ? " (preset: #{@effective_vllm_preset})" : ''
info "vLLM will be installed: #{vllm_m}#{preset_note}"
info " Container: #{vllm_cname}, port #{@config.ollama_port}, max_model_len #{vllm_maxlen}"
- info "LiteLLM proxy will be installed on port #{@config.litellm_port}"
- info " Claude model aliases: #{@config.litellm_claude_model_names.join(', ')}"
end
if @config.wireguard_auto_setup?
info "WireGuard auto-setup script: #{@config.wireguard_setup_script} <vm_public_ip>"
@@ -2233,7 +2110,6 @@ module HyperstackVM
end
if vllm_setup_needed?(state)
info "vLLM would be installed: #{@config.vllm_model}"
- info "LiteLLM proxy would be installed on port #{@config.litellm_port}"
end
if wireguard_setup_needed?(state)
info "WireGuard auto-setup script would run: #{@config.wireguard_setup_script} #{state['public_ip'] || '<pending-public-ip>'}"
@@ -2325,35 +2201,6 @@ module HyperstackVM
raise Error, "Cannot reach vLLM at #{wg_ip}:#{port} — is WireGuard (wg1) active? (#{e.message})"
end
- # Tests the LiteLLM proxy using the Anthropic Messages API format,
- # which is what Claude Code sends when pointed at a custom base URL.
- def test_litellm(wg_ip)
- port = @config.litellm_port
- model = @config.litellm_claude_model_names.first
- key = @config.litellm_master_key
-
- info " Testing LiteLLM proxy at http://#{wg_ip}:#{port}/v1/messages..."
- uri = URI("http://#{wg_ip}:#{port}/v1/messages")
- req = Net::HTTP::Post.new(uri)
- req['Content-Type'] = 'application/json'
- req['x-api-key'] = key
- req['anthropic-version'] = '2023-06-01'
- req.body = JSON.generate(
- 'model' => model,
- # 500 tokens: reasoning models (e.g. gpt-oss) consume tokens on chain-of-thought
- # before producing content; 50 is too small and yields an empty content field.
- 'max_tokens' => 500,
- 'messages' => [{ 'role' => 'user', 'content' => 'Say hello in five words.' }]
- )
- resp = Net::HTTP.start(uri.host, uri.port, open_timeout: 10, read_timeout: 120) { |h| h.request(req) }
- raise Error, "LiteLLM returned HTTP #{resp.code}: #{resp.body}" unless resp.code == '200'
-
- text = JSON.parse(resp.body).fetch('content', []).find { |b| b['type'] == 'text' }&.dig('text').to_s.strip
- info " LiteLLM response: #{text}"
- rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH, SocketError => e
- raise Error, "Cannot reach LiteLLM at #{wg_ip}:#{port} — is WireGuard (wg1) active? (#{e.message})"
- end
-
# Sends a single OpenAI chat completion request and returns the reply text.
def vllm_chat(host, port, model, prompt)
uri = URI("http://#{host}:#{port}/v1/chat/completions")
@@ -2547,8 +2394,8 @@ module HyperstackVM
OptionParser.new do |o|
o.on('--replace', 'Delete the tracked VM before creating a new one') { opts[:replace] = true }
o.on('--dry-run', 'Print the create plan without creating a VM') { opts[:dry_run] = true }
- o.on('--vllm', 'Enable vLLM+LiteLLM setup (overrides config)') { opts[:install_vllm] = true }
- o.on('--no-vllm', 'Disable vLLM+LiteLLM setup (overrides config)') { opts[:install_vllm] = false }
+ o.on('--vllm', 'Enable vLLM setup (overrides config)') { opts[:install_vllm] = true }
+ o.on('--no-vllm', 'Disable vLLM setup (overrides config)') { opts[:install_vllm] = false }
o.on('--ollama', 'Enable Ollama setup (overrides config)') { opts[:install_ollama] = true }
o.on('--no-ollama', 'Disable Ollama setup (overrides config)') { opts[:install_ollama] = false }
o.on('--model PRESET', 'Use a named vLLM preset at create time') { |v| opts[:vllm_preset] = v } if include_model_preset
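The create-time flags now toggle the vLLM stack alone. A short usage sketch — the flag names come from the OptionParser block above; the combinations shown are illustrative:

```bash
# Print the create plan for both VMs without making any changes.
ruby hyperstack.rb create-both --dry-run

# Recreate the tracked VM without vLLM, enabling Ollama instead.
ruby hyperstack.rb create --replace --no-vllm --ollama
```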
diff --git a/pi/agent/extensions/btw/README.md b/pi/agent/extensions/btw/README.md
index cf39e1c..61092ae 100644
--- a/pi/agent/extensions/btw/README.md
+++ b/pi/agent/extensions/btw/README.md
@@ -2,7 +2,7 @@
Ephemeral side questions for Pi.
-This extension adds `/btw`, modeled after Claude Code's side-question flow:
+This extension adds `/btw`, a lightweight side-question flow:
- it uses the current branch conversation as context
- it asks a separate one-shot question with the current model
diff --git a/vllm-setup.txt b/vllm-setup.txt
index 9ea44a7..cb64432 100644
--- a/vllm-setup.txt
+++ b/vllm-setup.txt
@@ -1,22 +1,16 @@
-# vLLM + LiteLLM + Claude Code Setup for Hyperstack VM
+# vLLM Setup for Hyperstack VM
#
# This document describes the full deployment of qwen3-coder-next (AWQ 4-bit)
-# via vLLM with a LiteLLM proxy for Claude Code compatibility.
+# via vLLM exposed directly on the OpenAI-compatible API.
#
# Architecture:
#
-# Claude Code (earth) Hyperstack VM (A100 80GB)
+# Pi (earth) Hyperstack VM (A100 80GB)
# ┌─────────────┐ ┌──────────────────────────────┐
-# │ claude CLI │── Anthropic API ──> │ LiteLLM proxy (:4000) │
-# │ │ /v1/messages │ translates Anthropic → │
-# │ │ via WireGuard wg1 │ OpenAI chat completions │
-# └─────────────┘ │ │ │
-# │ ▼ │
-# OpenCode (earth) │ vLLM engine (:11434) │
-# ┌─────────────┐ │ /v1/chat/completions │
-# │ opencode │── OpenAI API ──────> │ FlashAttention v2 │
-# │ │ /v1/chat/completions│ prefix caching │
-# └─────────────┘ │ bullpoint/Qwen3-Coder- │
+# │ pi │── OpenAI API ──────> │ vLLM engine (:11434) │
+# │ │ /v1/chat/completions│ FlashAttention v2 │
+# └─────────────┘ via WireGuard wg1 │ prefix caching │
+# │ bullpoint/Qwen3-Coder- │
# │ Next-AWQ-4bit (45GB) │
# └──────────────────────────────┘
#
@@ -27,12 +21,6 @@
# - Chunked prefill: can interleave prefill and decode
# - Marlin kernels for AWQ MoE quantization
#
-# Why LiteLLM:
-# - Claude Code speaks Anthropic Messages API (/v1/messages) only
-# - vLLM speaks OpenAI Chat Completions API (/v1/chat/completions) only
-# - LiteLLM translates between them, mapping Claude model names to the
-# actual vLLM model
-#
# Model details:
# - Name: bullpoint/Qwen3-Coder-Next-AWQ-4bit (HuggingFace)
# - Architecture: MoE, 80B total params, 3B active per token
@@ -54,8 +42,7 @@
#
# Ports:
# 11434/tcp - vLLM OpenAI-compatible API (reuses Ollama port for firewall compat)
-# 4000/tcp - LiteLLM Anthropic-compatible proxy
-# Both restricted to 192.168.3.0/24 (WireGuard wg1 subnet)
+# Restricted to 192.168.3.0/24 (WireGuard wg1 subnet)
# ===========================================================================
# STEP 1: Prerequisites
@@ -130,132 +117,21 @@
# docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
# ===========================================================================
-# STEP 4: LiteLLM proxy (Anthropic API translation for Claude Code)
-# ===========================================================================
-# Install in a Python venv (Ubuntu 24.04 requires this):
-#
-# sudo apt-get install -y python3.12-venv
-# sudo mkdir -p /ephemeral/litellm-env
-# sudo chown ubuntu:ubuntu /ephemeral/litellm-env
-# python3 -m venv /ephemeral/litellm-env
-# /ephemeral/litellm-env/bin/pip install "litellm[proxy]"
-#
-# Write config file:
-#
-# sudo tee /ephemeral/litellm-config.yaml > /dev/null << "YAML"
-# model_list:
-# - model_name: "claude-sonnet-4-20250514"
-# litellm_params:
-# model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-# api_base: "http://localhost:11434/v1"
-# api_key: "EMPTY"
-# - model_name: "claude-opus-4-20250514"
-# litellm_params:
-# model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-# api_base: "http://localhost:11434/v1"
-# api_key: "EMPTY"
-# - model_name: "claude-opus-4-6-20260604"
-# litellm_params:
-# model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-# api_base: "http://localhost:11434/v1"
-# api_key: "EMPTY"
-# - model_name: "claude-haiku-3-5-20241022"
-# litellm_params:
-# model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-# api_base: "http://localhost:11434/v1"
-# api_key: "EMPTY"
-#
-# litellm_settings:
-# drop_params: true
-#
-# general_settings:
-# master_key: "sk-litellm-master"
-# YAML
-#
-# Config notes:
-# - model_name values must match what Claude Code sends (Claude model IDs)
-# - "hosted_vllm/" prefix forces LiteLLM to use /v1/chat/completions
-# (not /v1/responses which vLLM doesn't fully support for complex messages)
-# - drop_params: true — silently drops Claude-specific parameters like
-# context_management that vLLM doesn't understand
-# - master_key is the API key clients must send
-# - Add new model_name entries when Anthropic releases new model IDs
-#
-# Start LiteLLM:
-#
-# nohup /ephemeral/litellm-env/bin/litellm \
-# --config /ephemeral/litellm-config.yaml \
-# --host 0.0.0.0 \
-# --port 4000 \
-# > /ephemeral/litellm.log 2>&1 &
-#
-# Verify:
-# curl -s http://localhost:4000/v1/messages \
-# -H "Content-Type: application/json" \
-# -H "x-api-key: sk-litellm-master" \
-# -H "anthropic-version: 2023-06-01" \
-# -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
-# "messages":[{"role":"user","content":"Hello"}]}'
-#
-# For production, create a systemd service instead of nohup:
-#
-# sudo tee /etc/systemd/system/litellm.service > /dev/null << "UNIT"
-# [Unit]
-# Description=LiteLLM Proxy
-# After=network.target docker.service
-# Requires=docker.service
-#
-# [Service]
-# Type=simple
-# User=ubuntu
-# ExecStart=/ephemeral/litellm-env/bin/litellm \
-# --config /ephemeral/litellm-config.yaml \
-# --host 0.0.0.0 --port 4000
-# Restart=always
-# RestartSec=5
-#
-# [Install]
-# WantedBy=multi-user.target
-# UNIT
-#
-# sudo systemctl daemon-reload
-# sudo systemctl enable --now litellm
-
-# ===========================================================================
-# STEP 5: Firewall rules
+# STEP 4: Firewall rules
# ===========================================================================
# Allow access from WireGuard subnet only:
#
# sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp \
# comment 'vLLM via wg1'
-# sudo ufw allow from 192.168.3.0/24 to any port 4000 proto tcp \
-# comment 'LiteLLM proxy via wg1'
-
# ===========================================================================
-# STEP 6: Client configuration (on earth / local machine)
+# STEP 5: Client configuration (on earth / local machine)
# ===========================================================================
#
-# --- Claude Code ---
-# Launch with environment variables pointing at LiteLLM proxy:
-#
-# ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
-# ANTHROPIC_API_KEY=sk-litellm-master \
-# claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
-#
-# Fish shell alias (add to ~/.config/fish/config.fish):
-#
-# alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
-# ANTHROPIC_API_KEY=sk-litellm-master \
-# claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
-#
-# --- OpenCode ---
-# Connects directly to vLLM (no LiteLLM needed, speaks OpenAI natively):
+# Launch Pi or any OpenAI-compatible client directly against vLLM:
#
# OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
# OPENAI_API_KEY=EMPTY \
-# opencode
-#
-# Model name in OpenCode config: bullpoint/Qwen3-Coder-Next-AWQ-4bit
+# pi
# ===========================================================================
-# STEP 7: Monitoring & troubleshooting
+# STEP 6: Monitoring & troubleshooting
@@ -267,8 +143,7 @@
# - Avg generation throughput: decode speed (tokens/s), ~40-99 on A100 PCIe
# - GPU KV cache usage: % of KV cache memory in use (proportional to
# active context length vs max capacity)
-# - Prefix cache hit rate: % of prompt tokens served from cache (0% for
-# Claude Code, higher for OpenCode)
+# - Prefix cache hit rate: % of prompt tokens served from cache
# - Running/Waiting: active and queued request counts
#
# Follow live (all stats):
@@ -292,9 +167,6 @@
# Useful for periodic checks without following the log:
# docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"
#
-# --- LiteLLM proxy log ---
-# tail -f /ephemeral/litellm.log
-#
# --- GPU hardware stats ---
# Snapshot:
# nvidia-smi
@@ -310,8 +182,7 @@
# Decode throughput: 40-99 tok/s (varies with output length per sample)
# KV cache usage: 0-5% for short conversations, grows with context
# (100% = 298k tokens, at which point requests queue)
-# Prefix cache hit: 0% for Claude Code (expected, it mutates prompt prefix)
-# >50% for OpenCode after a few turns
+# Prefix cache hit: depends on prompt reuse; higher is better
# Temperature: 44-60C under load, <45C idle
# Power: 70W idle, 230-240W under load, 300W max
#
@@ -326,24 +197,10 @@
# 1. OOM on startup with --max-model-len 262144
# → Reduce to 131072 or 65536
#
-# 2. "model does not exist" from vLLM
-# → Model name in LiteLLM config must exactly match HuggingFace repo name
-#
-# 3. LiteLLM returns UnsupportedParamsError
-# → Ensure drop_params: true is in litellm_settings
-#
-# 4. LiteLLM routes to /v1/responses instead of /v1/chat/completions
-# → Use "hosted_vllm/" prefix in model field, not "openai/"
-#
-# 5. Claude Code "Auth conflict" warning
-# → Run `claude /logout` first to clear the claude.ai session token,
-# then re-launch with ANTHROPIC_API_KEY=sk-litellm-master
-#
-# 6. Prefix cache hit rate stays at 0%
-# → Normal for Claude Code (it mutates the prompt prefix each turn)
-# → OpenCode should show increasing cache hit rates after a few turns
+# 2. Prefix cache hit rate stays at 0%
+# → Normal when prompts vary heavily turn-to-turn
#
-# 7. vLLM container won't start (CUDA version mismatch)
+# 3. vLLM container won't start (CUDA version mismatch)
# → Check driver version: nvidia-smi
# → vLLM requires CUDA >= 12.x and driver >= 535
@@ -402,19 +259,6 @@
# --host 0.0.0.0 \
# --port 11434
#
-# --- Update LiteLLM config to match ---
-# After switching models, update the model field in litellm-config.yaml
-# to match the new HuggingFace model name:
-#
-# model: "hosted_vllm/<new-model-name>"
-#
-# Then restart LiteLLM:
-# pkill -f litellm
-# nohup /ephemeral/litellm-env/bin/litellm \
-# --config /ephemeral/litellm-config.yaml \
-# --host 0.0.0.0 --port 4000 \
-# > /ephemeral/litellm.log 2>&1 &
-#
# --- Finding models ---
# Search HuggingFace for vLLM-compatible quantized models:
# https://huggingface.co/models?search=<model-name>+awq
@@ -454,14 +298,6 @@
# "messages":[{"role":"user","content":"Hello"}],
# "max_tokens":50}'
#
-# Test via LiteLLM (Anthropic API):
-# curl -s http://localhost:4000/v1/messages \
-# -H "Content-Type: application/json" \
-# -H "x-api-key: sk-litellm-master" \
-# -H "anthropic-version: 2023-06-01" \
-# -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
-# "messages":[{"role":"user","content":"Hello"}]}'
-
# ===========================================================================
# Performance characteristics (A100 80GB PCIe, single GPU)
# ===========================================================================
@@ -472,7 +308,7 @@
# vLLM decode throughput: 40-99 tok/s (memory-bandwidth limited)
# Per-turn latency: ~10-15s (small prompts, early conversation)
# KV cache usage: 2-5% for typical coding sessions
-# Prefix cache hit rate: 0% (Claude Code), expected >50% (OpenCode)
+# Prefix cache hit rate: workload-dependent
#
# Comparison with Ollama on same hardware (A100 80GB PCIe):
#
@@ -482,6 +318,5 @@
# Decode throughput | ~40 tok/s | 40-99 tok/s
# Per-turn latency | ~28s (32k ctx) | ~10-15s
# Context window | 32k (was truncating) | 262k (full, no truncation)
-# Prefix cache (Claude) | 0% always | 0% always
-# Prefix cache (OpenCode)| 85-95% when warm | expected similar or better
+# Prefix cache | workload-dependent | workload-dependent
# VRAM usage | 52-61 GiB | 75 GiB (more KV cache)