| Field | Value | Date |
|---|---|---|
| author | Paul Buetow <paul@buetow.org> | 2026-03-21 10:49:35 +0200 |
| committer | Paul Buetow <paul@buetow.org> | 2026-03-21 10:49:35 +0200 |
| commit | ea0f9f7f51b32f0c392f75aa0cc3231211f54757 (patch) | |
| tree | 378d01dbc87dc0ef9f4fbd6ec7788e0a62f66876 | |
| parent | 4baa087445a11b856139f55adab262fa97384033 (diff) | |
Remove LiteLLM and Claude Code repo references (task 301)
| Mode | File | Lines changed |
|---|---|---|
| -rw-r--r-- | README.md | 46 |
| -rw-r--r-- | hyperstack-vm.toml | 14 |
| -rw-r--r-- | hyperstack-vm1.toml | 12 |
| -rw-r--r-- | hyperstack-vm2.toml | 12 |
| -rwxr-xr-x | hyperstack.rb | 173 |
| -rw-r--r-- | pi/agent/extensions/btw/README.md | 2 |
| -rw-r--r-- | vllm-setup.txt | 203 |
7 files changed, 41 insertions, 421 deletions
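The change drops the LiteLLM Anthropic-translation layer, so clients now talk to vLLM's OpenAI-compatible endpoint directly. A minimal smoke test of that direct path — a sketch, assuming the WireGuard tunnel `wg1` is up, VM1 answers on the default gateway address `192.168.3.1` and port `11434` used in the configs below, and `jq` is installed locally for readability:

```bash
# List the models vLLM is serving (should show the single loaded model).
curl -s http://192.168.3.1:11434/v1/models | jq -r '.data[].id'

# One-shot chat completion against the same endpoint (no proxy in between).
curl -s http://192.168.3.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
       "messages":[{"role":"user","content":"Say hello in five words."}],
       "max_tokens":50}' | jq -r '.choices[0].message.content'
```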
@@ -1,6 +1,6 @@
 # hyperstack
 
-Automates Hyperstack GPU VM lifecycle: create, bootstrap, WireGuard tunnel, vLLM inference, LiteLLM proxy.
+Automates Hyperstack GPU VM lifecycle: create, bootstrap, WireGuard tunnel, and vLLM inference.
 Runs two A100 VMs concurrently — each serving a different model — with [Pi](https://pi.dev) coding agents connected to each.
 
 ## Architecture
@@ -18,9 +18,6 @@ Runs two A100 VMs concurrently — each serving a different model — with [Pi](
   │ vLLM (:11434)                │   │ vLLM (:11434)                    │
   │ Nemotron-3-Super 120B        │   │ Qwen3-Coder-Next 80B (MoE)       │
   │ (hybrid Mamba+MoE, AWQ-4b)   │   │ (AWQ-4bit)                       │
-  │                              │   │                                  │
-  │ LiteLLM (:4000)              │   │ LiteLLM (:4000)                  │
-  │ Anthropic API → OpenAI       │   │ Anthropic API → OpenAI           │
   └──────────────────────────────┘   └──────────────────────────────────┘
                ▲                                     ▲
                │ OpenAI /v1/chat/completions         │ OpenAI /v1/chat/completions
@@ -33,7 +30,7 @@ Runs two A100 VMs concurrently — each serving a different model — with [Pi](
 ```
 
 Both VMs share a single WireGuard interface (`wg1`) on the local machine.
-Each VM runs one vLLM model and a LiteLLM proxy for Anthropic-API translation.
+Each VM runs one vLLM model exposed directly to Pi over the OpenAI-compatible API.
 
 ## Prerequisites
@@ -49,7 +46,7 @@ Each VM runs one vLLM model and a LiteLLM proxy for Anthropic-API translation.
 ## Quickstart (two-VM setup)
 
 ```bash
-# Deploy both VMs in parallel, set up WireGuard + vLLM + LiteLLM (~10 min)
+# Deploy both VMs in parallel, set up WireGuard + vLLM (~10 min)
 ruby hyperstack.rb create-both
 
 # Verify both VMs are working
@@ -144,33 +141,6 @@ Available presets (both VMs share the same set):
 | `qwen3-32b` | Qwen3-32B (AWQ) | ~18 GB | 32K |
 | `devstral` | Devstral-Small-2507 (AWQ-4bit) | ~15 GB | 32K |
 
-## Using Claude Code with vLLM
-
-WireGuard (`wg1`) must be active before connecting.
-
-```bash
-ANTHROPIC_BASE_URL=http://hyperstack1.wg1:4000 \
-ANTHROPIC_API_KEY=sk-litellm-master \
-claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
-```
-
-If you see an **"Auth conflict"** warning, clear the saved claude.ai session first:
-
-```bash
-claude /logout
-```
-
-**Available model aliases** — all map to the same vLLM model on that VM:
-
-| Alias | Use case |
-|-------|----------|
-| `claude-opus-4-6-20260604` | Recommended (most future-proof) |
-| `claude-opus-4-20250514` | |
-| `claude-sonnet-4-20250514` | |
-| `claude-haiku-3-5-20241022` | |
-
-Add new Anthropic model IDs to `vllm.litellm_claude_model_names` in the TOML as they are released.
-
 ## CLI reference
 
 ```
@@ -182,13 +152,13 @@ Commands:
   delete                   Destroy the tracked VM
   delete-both              Destroy both VM1 and VM2
   status                   Show VM and WireGuard status
-  test                     Run end-to-end inference tests (vLLM + LiteLLM)
+  test                     Run end-to-end inference tests (vLLM)
   model switch <preset>    Hot-switch the running vLLM model
 
 create / create-both options:
   --replace                Delete existing tracked VM before creating
   --dry-run                Print the plan without making changes
-  --vllm / --no-vllm       Override config: enable/disable vLLM+LiteLLM setup
+  --vllm / --no-vllm       Override config: enable/disable vLLM setup
   --ollama / --no-ollama   Override config: enable/disable Ollama setup
 ```
@@ -200,7 +170,7 @@ Key sections:
 | Section | Purpose |
 |---------|---------|
 | `[vm]` | Flavor, image, environment name |
-| `[vllm]` | Model, container settings, LiteLLM key and Claude aliases |
+| `[vllm]` | Model, container settings, and vLLM runtime options |
 | `[vllm.presets.*]` | Named model presets for hot-switching |
 | `[ollama]` | Ollama settings (disabled by default; set `install = true` to use instead) |
 | `[network]` | Ports, WireGuard subnet, allowed CIDRs |
@@ -222,8 +192,6 @@ ssh ubuntu@<vm-ip> 'docker logs -f vllm_nemotron_super 2>&1 | grep "Engine 000"'
 
 # GPU stats (every 5 s)
 ssh ubuntu@<vm-ip> 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5'
-
-# LiteLLM proxy log
-ssh ubuntu@<vm-ip> 'sudo journalctl -fu litellm'
 ```
 
 Healthy baseline (A100 80GB PCIe):
@@ -234,4 +202,4 @@ Healthy baseline (A100 80GB PCIe):
 | Decode throughput | 40–99 tok/s |
 | KV cache usage | 2–5% for typical sessions |
 
-See `vllm-setup.txt` for detailed vLLM and LiteLLM setup notes, VRAM sizing guide, and troubleshooting.
+See `vllm-setup.txt` for detailed vLLM setup notes, VRAM sizing guide, and troubleshooting.
diff --git a/hyperstack-vm.toml b/hyperstack-vm.toml
index e82c97f..28de975 100644
--- a/hyperstack-vm.toml
+++ b/hyperstack-vm.toml
@@ -37,8 +37,6 @@ allowed_ssh_cidrs = ["auto"]
 allowed_wireguard_cidrs = ["auto"]
 # Port 11434 is shared by both Ollama and vLLM for firewall compatibility.
 ollama_port = 11434
-# Port 4000: LiteLLM Anthropic-API proxy (used with vLLM).
-litellm_port = 4000
 
 [bootstrap]
 enable_guest_bootstrap = true
@@ -56,7 +54,7 @@ num_parallel = 1
 context_length = 32768
 pull_models = ["qwen3-coder-next", "qwen3-coder:30b", "gpt-oss:20b", "gpt-oss:120b", "nemotron-3-super"]
 
-# vLLM serves one model via Docker; LiteLLM translates Anthropic API → OpenAI.
+# vLLM serves one model via Docker on the OpenAI-compatible API.
 # Use --vllm / --no-vllm CLI flags to override install at runtime.
 [vllm]
 install = true
@@ -68,14 +66,6 @@ max_model_len = 262144
 gpu_memory_utilization = 0.92
 tensor_parallel_size = 1
 tool_call_parser = "qwen3_coder"
-# LiteLLM maps each entry to the vLLM model; add new Anthropic model IDs here.
-litellm_master_key = "sk-litellm-master"
-litellm_claude_model_names = [
-  "claude-sonnet-4-20250514",
-  "claude-opus-4-20250514",
-  "claude-opus-4-6-20260604",
-  "claude-haiku-3-5-20241022"
-]
 
 # Named model presets for 'ruby hyperstack.rb model switch <name>'.
 # Each preset overrides the matching [vllm] field; unset fields fall back to [vllm] defaults.
@@ -127,7 +117,7 @@ tool_call_parser = ""
 # OpenAI GPT-OSS 120B — powerful MoE (5.1B active / 117B total, MXFP4), ~65 GB on A100.
 # Hard architecture limit: max_position_embeddings=131072 in model config.json.
 # 131072 is the absolute ceiling — exceeding it causes NaN or CUDA OOB errors.
-# For sessions approaching this limit, start a fresh opencode conversation.
+# For sessions approaching this limit, start a fresh Pi conversation.
 # tool_call_parser = "" disables --enable-auto-tool-choice (same reason as gpt-oss-20b).
 [vllm.presets.gpt-oss-120b]
 model = "openai/gpt-oss-120b"
diff --git a/hyperstack-vm1.toml b/hyperstack-vm1.toml
index 1b116bd..6109472 100644
--- a/hyperstack-vm1.toml
+++ b/hyperstack-vm1.toml
@@ -41,8 +41,6 @@ allowed_ssh_cidrs = ["auto"]
 allowed_wireguard_cidrs = ["auto"]
 # Port 11434 is shared by both Ollama and vLLM for firewall compatibility.
 ollama_port = 11434
-# Port 4000: LiteLLM Anthropic-API proxy (used with vLLM).
-litellm_port = 4000
 
 [bootstrap]
 enable_guest_bootstrap = true
@@ -60,7 +58,7 @@ num_parallel = 1
 context_length = 32768
 pull_models = ["nemotron-3-super"]
 
-# vLLM serves one model via Docker; LiteLLM translates Anthropic API → OpenAI.
+# vLLM serves one model via Docker on the OpenAI-compatible API.
 # VM1 defaults to nemotron-3-super; use 'model switch' to load any other preset.
 [vllm]
 install = true
@@ -75,14 +73,6 @@ tensor_parallel_size = 1
 tool_call_parser = "qwen3_xml"
 trust_remote_code = true
 extra_vllm_args = ["--reasoning-parser", "nemotron_v3"]
-# LiteLLM maps each entry to the vLLM model; add new Anthropic model IDs here.
-litellm_master_key = "sk-litellm-master"
-litellm_claude_model_names = [
-  "claude-sonnet-4-20250514",
-  "claude-opus-4-20250514",
-  "claude-opus-4-6-20260604",
-  "claude-haiku-3-5-20241022"
-]
 
 # Named model presets for 'ruby hyperstack.rb --config hyperstack-vm1.toml model switch <name>'.
 # Each preset overrides the matching [vllm] field; unset fields fall back to [vllm] defaults.
diff --git a/hyperstack-vm2.toml b/hyperstack-vm2.toml
index e8e9b00..202a340 100644
--- a/hyperstack-vm2.toml
+++ b/hyperstack-vm2.toml
@@ -41,8 +41,6 @@ allowed_ssh_cidrs = ["auto"]
 allowed_wireguard_cidrs = ["auto"]
 # Port 11434 is shared by both Ollama and vLLM for firewall compatibility.
 ollama_port = 11434
-# Port 4000: LiteLLM Anthropic-API proxy (used with vLLM).
-litellm_port = 4000
 
 [bootstrap]
 enable_guest_bootstrap = true
@@ -60,7 +58,7 @@ num_parallel = 1
 context_length = 32768
 pull_models = ["qwen3-coder-next"]
 
-# vLLM serves one model via Docker; LiteLLM translates Anthropic API → OpenAI.
+# vLLM serves one model via Docker on the OpenAI-compatible API.
 # VM2 defaults to qwen3-coder-next; use 'model switch' to load any other preset.
 [vllm]
 install = true
@@ -72,14 +70,6 @@ max_model_len = 262144
 gpu_memory_utilization = 0.92
 tensor_parallel_size = 1
 tool_call_parser = "qwen3_coder"
-# LiteLLM maps each entry to the vLLM model; add new Anthropic model IDs here.
-litellm_master_key = "sk-litellm-master"
-litellm_claude_model_names = [
-  "claude-sonnet-4-20250514",
-  "claude-opus-4-20250514",
-  "claude-opus-4-6-20260604",
-  "claude-haiku-3-5-20241022"
-]
 
 # Named model presets for 'ruby hyperstack.rb --config hyperstack-vm2.toml model switch <name>'.
 # Each preset overrides the matching [vllm] field; unset fields fall back to [vllm] defaults.
diff --git a/hyperstack.rb b/hyperstack.rb
index 7cd817d..a3af491 100755
--- a/hyperstack.rb
+++ b/hyperstack.rb
@@ -87,7 +87,6 @@ module HyperstackVM
       # Set to a different address (e.g. 192.168.3.3) for a second VM sharing the same wg1 tunnel.
       'wireguard_server_ip' => nil,
       'ollama_port' => 11_434,
-      'litellm_port' => 4_000,
       'allowed_ssh_cidrs' => ['auto'],
       'allowed_wireguard_cidrs' => ['auto']
     },
@@ -114,14 +113,7 @@ module HyperstackVM
       'max_model_len' => 262_144,
       'gpu_memory_utilization' => 0.92,
       'tensor_parallel_size' => 1,
-      'tool_call_parser' => 'qwen3_coder',
-      'litellm_claude_model_names' => %w[
-        claude-sonnet-4-20250514
-        claude-opus-4-20250514
-        claude-opus-4-6-20260604
-        claude-haiku-3-5-20241022
-      ],
-      'litellm_master_key' => 'sk-litellm-master'
+      'tool_call_parser' => 'qwen3_coder'
     },
     'wireguard' => {
       'auto_setup' => true,
@@ -338,10 +330,6 @@ module HyperstackVM
       Integer(fetch('network', 'ollama_port'))
     end
 
-    def litellm_port
-      Integer(fetch('network', 'litellm_port'))
-    end
-
     # Returns the server-side WireGuard IP for this VM.
     # Uses the explicitly configured address when set; otherwise derives it as subnet_base + 1.
     # Example: 192.168.3.0/24 → 192.168.3.1 (default VM1); VM2 sets wireguard_server_ip=192.168.3.3.
@@ -453,14 +441,6 @@ module HyperstackVM
       fetch('vllm', 'tool_call_parser')
     end
 
-    def litellm_claude_model_names
-      Array(fetch('vllm', 'litellm_claude_model_names')).map(&:to_s)
-    end
-
-    def litellm_master_key
-      fetch('vllm', 'litellm_master_key')
-    end
-
     # Whether to pass --trust-remote-code to vLLM for the default model.
     # Required for architectures not yet in the vLLM upstream registry (e.g. nemotron_h).
     def vllm_trust_remote_code
@@ -530,7 +510,6 @@ module HyperstackVM
       end
 
       rules << firewall_rule('tcp', ollama_port, wireguard_subnet) if include_ollama || include_vllm
-      rules << firewall_rule('tcp', litellm_port, wireguard_subnet) if include_vllm
       rules.uniq
     end
@@ -1081,8 +1060,6 @@ module HyperstackVM
       script << "sudo ufw allow #{@config.wireguard_udp_port}/udp comment 'WireGuard #{@config.local_interface_name}' >/dev/null 2>&1 || true"
       # Port 11434 is shared by Ollama and vLLM; open for both regardless of which is installed.
       script << "sudo ufw allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port #{@config.ollama_port} proto tcp comment 'Inference API (Ollama/vLLM) via #{@config.local_interface_name}' >/dev/null 2>&1 || true"
-      # Port 4000: LiteLLM proxy (Anthropic API -> vLLM); open alongside the inference port.
-      script << "sudo ufw allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port #{@config.litellm_port} proto tcp comment 'LiteLLM proxy via #{@config.local_interface_name}' >/dev/null 2>&1 || true"
     end
 
     if @config.configure_ollama_host?
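After the hunk above, the guest firewall only opens the shared inference port to the WireGuard subnet; the old 4000/tcp LiteLLM rule is gone. A quick way to confirm on a provisioned VM — a sketch, assuming the defaults from the TOML files in this commit (subnet 192.168.3.0/24, port 11434) and the `ubuntu@<vm-ip>` SSH access used elsewhere in the README:

```bash
# Expect a ufw rule for 11434/tcp from 192.168.3.0/24 and no 4000/tcp entry.
ssh ubuntu@<vm-ip> "sudo ufw status | grep -E '11434|4000'"
```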
@@ -1248,71 +1225,6 @@ module HyperstackVM
       script.join("\n")
     end
 
-    def litellm_install_script(model_override: nil)
-      port = @config.litellm_port
-      model = model_override || @config.vllm_model
-
-      script = []
-      script << 'set -euo pipefail'
-      script << 'sudo apt-get install -y python3.12-venv'
-      script << 'sudo mkdir -p /ephemeral/litellm-env'
-      script << 'sudo chown ubuntu:ubuntu /ephemeral/litellm-env'
-      script << 'python3 -m venv /ephemeral/litellm-env'
-      script << '/ephemeral/litellm-env/bin/pip install --quiet "litellm[proxy]"'
-      script << "sudo tee /ephemeral/litellm-config.yaml > /dev/null << 'LITELLM_YAML'"
-      script << 'model_list:'
-      script.concat(litellm_model_entries(model))
-      script << ''
-      script << 'litellm_settings:'
-      script << '  drop_params: true'
-      script << ''
-      script << 'general_settings:'
-      script << "  master_key: \"#{@config.litellm_master_key}\""
-      script << 'LITELLM_YAML'
-      script << "sudo tee /etc/systemd/system/litellm.service > /dev/null << 'LITELLM_UNIT'"
-      script << '[Unit]'
-      script << 'Description=LiteLLM Proxy'
-      script << 'After=network.target docker.service'
-      script << 'Requires=docker.service'
-      script << ''
-      script << '[Service]'
-      script << 'Type=simple'
-      script << 'User=ubuntu'
-      script << "ExecStart=/ephemeral/litellm-env/bin/litellm --config /ephemeral/litellm-config.yaml --host 0.0.0.0 --port #{port}"
-      script << 'Restart=always'
-      script << 'RestartSec=5'
-      script << ''
-      script << '[Install]'
-      script << 'WantedBy=multi-user.target'
-      script << 'LITELLM_UNIT'
-      script << 'sudo systemctl daemon-reload'
-      script << 'sudo systemctl enable --now litellm'
-      script << 'sleep 5'
-      script << 'systemctl is-active --quiet litellm'
-      script << 'echo litellm-install-ok'
-      script.join("\n")
-    end
-
-    def litellm_reload_script(model)
-      script = []
-      script << 'set -euo pipefail'
-      script << "sudo tee /ephemeral/litellm-config.yaml > /dev/null << 'LITELLM_YAML'"
-      script << 'model_list:'
-      script.concat(litellm_model_entries(model))
-      script << ''
-      script << 'litellm_settings:'
-      script << '  drop_params: true'
-      script << ''
-      script << 'general_settings:'
-      script << "  master_key: \"#{@config.litellm_master_key}\""
-      script << 'LITELLM_YAML'
-      script << 'sudo systemctl restart litellm'
-      script << 'sleep 3'
-      script << 'systemctl is-active --quiet litellm'
-      script << 'echo litellm-reload-ok'
-      script.join("\n")
-    end
-
     private
 
     def normalized_model_list(models)
@@ -1324,19 +1236,6 @@ module HyperstackVM
         end
       end
 
-    def litellm_model_entries(model)
-      vllm_port = @config.ollama_port
-
-      @config.litellm_claude_model_names.flat_map do |name|
-        [
-          "  - model_name: \"#{name}\"",
-          '    litellm_params:',
-          "      model: \"hosted_vllm/#{model}\"",
-          "      api_base: \"http://localhost:#{vllm_port}/v1\"",
-          '      api_key: "EMPTY"'
-        ]
-      end
-    end
   end
 
   class RemoteProvisioner
@@ -1390,22 +1289,8 @@ module HyperstackVM
       raise Error, "vLLM install failed: #{output.strip}" unless status.success?
     end
 
-    def install_litellm(host, model:)
-      info "Setting up LiteLLM Anthropic-API proxy on #{host}..."
-      output, status = @ssh_stream_runner.call(host, @scripts.litellm_install_script(model_override: model))
-      raise Error, "LiteLLM install failed: #{output.strip}" unless status.success?
-    end
-
-    def reload_litellm(host, model)
-      info "Reloading LiteLLM proxy config for #{model}..."
-      output, status = @ssh_stream_runner.call(host, @scripts.litellm_reload_script(model))
-      raise Error, "LiteLLM reload failed: #{output.strip}" unless status.success?
-    end
-
     def setup_vllm_stack(host, preset_config: nil)
       install_vllm(host, preset_config: preset_config)
-      model = preset_config&.dig('model') || @config.vllm_model
-      install_litellm(host, model: model)
     end
 
     private
@@ -1598,7 +1483,7 @@ module HyperstackVM
     end
 
     # Switches the running VM to a different named model preset.
-    # Stops the old container, starts the new one, and hot-reloads LiteLLM config.
+    # Stops the old container, then starts the new vLLM container in its place.
     def switch_model(preset_name:, dry_run: false)
       preset = @config.vllm_preset(preset_name) # raises if unknown
       state = @state_store.load
@@ -1633,10 +1518,6 @@ module HyperstackVM
       # surprise multi-GB download if the upstream image was updated.
       @provisioner.install_vllm(host, preset_config: preset, pull_image: false)
 
-      # Hot-reload LiteLLM: rewrite config for the new model and restart the service.
-      # Skips venv/apt install since those are already in place.
-      @provisioner.reload_litellm(host, preset['model'])
-
       state['vllm_model'] = preset['model']
       state['vllm_container_name'] = new_container
       state['vllm_preset'] = preset_name
@@ -1650,7 +1531,7 @@ module HyperstackVM
       info "Run 'ruby hyperstack.rb test' to verify."
     end
 
-    # Runs end-to-end inference tests against vLLM and LiteLLM over WireGuard.
+    # Runs end-to-end inference tests against the active inference services over WireGuard.
     # Requires wg1 to be active and the VM to be fully provisioned.
     def test
       state = @state_store.load
@@ -1663,7 +1544,6 @@ module HyperstackVM
       if vllm_enabled
         test_vllm(wg_ip)
-        test_litellm(wg_ip)
       end
 
       info "  Ollama test: connect via SSH and run 'ollama list' to verify models." if ollama_enabled
@@ -1731,7 +1611,7 @@ module HyperstackVM
       @state_store.save(state)
     end
 
-    # Set up vLLM (Docker container) + LiteLLM (Anthropic-API proxy) after
+    # Set up vLLM after
     # the tunnel is up so that model-download progress is visible locally.
     if vllm_setup_needed?(state)
       preset_cfg = effective_vllm_preset_config
@@ -1755,9 +1635,8 @@ module HyperstackVM
     return unless effective_vllm?
 
     wg_ip = @config.wireguard_gateway_hostname
-    info "Run 'ruby hyperstack.rb test' to verify vLLM and LiteLLM."
+    info "Run 'ruby hyperstack.rb test' to verify vLLM."
     info "  vLLM: http://#{wg_ip}:#{@config.ollama_port}/v1/models"
-    info "  LiteLLM: http://#{wg_ip}:#{@config.litellm_port}/v1/messages"
   end
 
   def build_create_payload(vm_name, resolved)
@@ -2138,9 +2017,9 @@ module HyperstackVM
   end
 
   def service_mode_summary(vllm_enabled:, ollama_enabled:)
-    return 'vLLM+LiteLLM enabled, Ollama enabled' if vllm_enabled && ollama_enabled
-    return 'vLLM+LiteLLM enabled, Ollama disabled' if vllm_enabled
-    return 'Ollama enabled, vLLM+LiteLLM disabled' if ollama_enabled
+    return 'vLLM enabled, Ollama enabled' if vllm_enabled && ollama_enabled
+    return 'vLLM enabled, Ollama disabled' if vllm_enabled
+    return 'Ollama enabled, vLLM disabled' if ollama_enabled
 
     'All inference services disabled'
   end
@@ -2204,8 +2083,6 @@ module HyperstackVM
     preset_note = @effective_vllm_preset ? " (preset: #{@effective_vllm_preset})" : ''
     info "vLLM will be installed: #{vllm_m}#{preset_note}"
     info "  Container: #{vllm_cname}, port #{@config.ollama_port}, max_model_len #{vllm_maxlen}"
-    info "LiteLLM proxy will be installed on port #{@config.litellm_port}"
-    info "  Claude model aliases: #{@config.litellm_claude_model_names.join(', ')}"
   end
 
   if @config.wireguard_auto_setup?
info "WireGuard auto-setup script: #{@config.wireguard_setup_script} <vm_public_ip>" @@ -2233,7 +2110,6 @@ module HyperstackVM end if vllm_setup_needed?(state) info "vLLM would be installed: #{@config.vllm_model}" - info "LiteLLM proxy would be installed on port #{@config.litellm_port}" end if wireguard_setup_needed?(state) info "WireGuard auto-setup script would run: #{@config.wireguard_setup_script} #{state['public_ip'] || '<pending-public-ip>'}" @@ -2325,35 +2201,6 @@ module HyperstackVM raise Error, "Cannot reach vLLM at #{wg_ip}:#{port} — is WireGuard (wg1) active? (#{e.message})" end - # Tests the LiteLLM proxy using the Anthropic Messages API format, - # which is what Claude Code sends when pointed at a custom base URL. - def test_litellm(wg_ip) - port = @config.litellm_port - model = @config.litellm_claude_model_names.first - key = @config.litellm_master_key - - info " Testing LiteLLM proxy at http://#{wg_ip}:#{port}/v1/messages..." - uri = URI("http://#{wg_ip}:#{port}/v1/messages") - req = Net::HTTP::Post.new(uri) - req['Content-Type'] = 'application/json' - req['x-api-key'] = key - req['anthropic-version'] = '2023-06-01' - req.body = JSON.generate( - 'model' => model, - # 500 tokens: reasoning models (e.g. gpt-oss) consume tokens on chain-of-thought - # before producing content; 50 is too small and yields an empty content field. - 'max_tokens' => 500, - 'messages' => [{ 'role' => 'user', 'content' => 'Say hello in five words.' }] - ) - resp = Net::HTTP.start(uri.host, uri.port, open_timeout: 10, read_timeout: 120) { |h| h.request(req) } - raise Error, "LiteLLM returned HTTP #{resp.code}: #{resp.body}" unless resp.code == '200' - - text = JSON.parse(resp.body).fetch('content', []).find { |b| b['type'] == 'text' }&.dig('text').to_s.strip - info " LiteLLM response: #{text}" - rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH, SocketError => e - raise Error, "Cannot reach LiteLLM at #{wg_ip}:#{port} — is WireGuard (wg1) active? (#{e.message})" - end - # Sends a single OpenAI chat completion request and returns the reply text. def vllm_chat(host, port, model, prompt) uri = URI("http://#{host}:#{port}/v1/chat/completions") @@ -2547,8 +2394,8 @@ module HyperstackVM OptionParser.new do |o| o.on('--replace', 'Delete the tracked VM before creating a new one') { opts[:replace] = true } o.on('--dry-run', 'Print the create plan without creating a VM') { opts[:dry_run] = true } - o.on('--vllm', 'Enable vLLM+LiteLLM setup (overrides config)') { opts[:install_vllm] = true } - o.on('--no-vllm', 'Disable vLLM+LiteLLM setup (overrides config)') { opts[:install_vllm] = false } + o.on('--vllm', 'Enable vLLM setup (overrides config)') { opts[:install_vllm] = true } + o.on('--no-vllm', 'Disable vLLM setup (overrides config)') { opts[:install_vllm] = false } o.on('--ollama', 'Enable Ollama setup (overrides config)') { opts[:install_ollama] = true } o.on('--no-ollama', 'Disable Ollama setup (overrides config)') { opts[:install_ollama] = false } o.on('--model PRESET', 'Use a named vLLM preset at create time') { |v| opts[:vllm_preset] = v } if include_model_preset diff --git a/pi/agent/extensions/btw/README.md b/pi/agent/extensions/btw/README.md index cf39e1c..61092ae 100644 --- a/pi/agent/extensions/btw/README.md +++ b/pi/agent/extensions/btw/README.md @@ -2,7 +2,7 @@ Ephemeral side questions for Pi. 
-This extension adds `/btw`, modeled after Claude Code's side-question flow:
+This extension adds `/btw`, modeled after Pi's side-question flow:
 
 - it uses the current branch conversation as context
 - it asks a separate one-shot question with the current model
diff --git a/vllm-setup.txt b/vllm-setup.txt
index 9ea44a7..cb64432 100644
--- a/vllm-setup.txt
+++ b/vllm-setup.txt
@@ -1,22 +1,16 @@
-# vLLM + LiteLLM + Claude Code Setup for Hyperstack VM
+# vLLM Setup for Hyperstack VM
 #
 # This document describes the full deployment of qwen3-coder-next (AWQ 4-bit)
-# via vLLM with a LiteLLM proxy for Claude Code compatibility.
+# via vLLM exposed directly on the OpenAI-compatible API.
 #
 # Architecture:
 #
-# Claude Code (earth)                     Hyperstack VM (A100 80GB)
+# Pi (earth)                              Hyperstack VM (A100 80GB)
 # ┌─────────────┐                         ┌──────────────────────────────┐
-# │ claude CLI  │── Anthropic API ──>     │ LiteLLM proxy (:4000)        │
-# │             │   /v1/messages          │ translates Anthropic →       │
-# │             │   via WireGuard wg1     │ OpenAI chat completions      │
-# └─────────────┘                         │        │                     │
-#                                         │        ▼                     │
-# OpenCode (earth)                        │ vLLM engine (:11434)         │
-# ┌─────────────┐                         │ /v1/chat/completions         │
-# │ opencode    │── OpenAI API ──────>    │ FlashAttention v2            │
-# │             │   /v1/chat/completions  │ prefix caching               │
-# └─────────────┘                         │ bullpoint/Qwen3-Coder-       │
+# │ pi          │── OpenAI API ──────>    │ vLLM engine (:11434)         │
+# │             │   /v1/chat/completions  │ FlashAttention v2            │
+# └─────────────┘   via WireGuard wg1     │ prefix caching               │
+#                                         │ bullpoint/Qwen3-Coder-       │
 #                                         │ Next-AWQ-4bit (45GB)         │
 #                                         └──────────────────────────────┘
 #
@@ -27,12 +21,6 @@
 # - Chunked prefill: can interleave prefill and decode
 # - Marlin kernels for AWQ MoE quantization
 #
-# Why LiteLLM:
-# - Claude Code speaks Anthropic Messages API (/v1/messages) only
-# - vLLM speaks OpenAI Chat Completions API (/v1/chat/completions) only
-# - LiteLLM translates between them, mapping Claude model names to the
-#   actual vLLM model
-#
 # Model details:
 # - Name: bullpoint/Qwen3-Coder-Next-AWQ-4bit (HuggingFace)
 # - Architecture: MoE, 80B total params, 3B active per token
@@ -54,8 +42,7 @@
 #
 # Ports:
 #   11434/tcp - vLLM OpenAI-compatible API (reuses Ollama port for firewall compat)
-#   4000/tcp  - LiteLLM Anthropic-compatible proxy
-#   Both restricted to 192.168.3.0/24 (WireGuard wg1 subnet)
+#   Restricted to 192.168.3.0/24 (WireGuard wg1 subnet)
 
 # ===========================================================================
 # STEP 1: Prerequisites
@@ -130,132 +117,21 @@
 #   docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
 
 # ===========================================================================
-# STEP 4: LiteLLM proxy (Anthropic API translation for Claude Code)
-# ===========================================================================
-# Install in a Python venv (Ubuntu 24.04 requires this):
-#
-#   sudo apt-get install -y python3.12-venv
-#   sudo mkdir -p /ephemeral/litellm-env
-#   sudo chown ubuntu:ubuntu /ephemeral/litellm-env
-#   python3 -m venv /ephemeral/litellm-env
-#   /ephemeral/litellm-env/bin/pip install "litellm[proxy]"
-#
-# Write config file:
-#
-#   sudo tee /ephemeral/litellm-config.yaml > /dev/null << "YAML"
-#   model_list:
-#     - model_name: "claude-sonnet-4-20250514"
-#       litellm_params:
-#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#         api_base: "http://localhost:11434/v1"
-#         api_key: "EMPTY"
-#     - model_name: "claude-opus-4-20250514"
-#       litellm_params:
-#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#         api_base: "http://localhost:11434/v1"
-#         api_key: "EMPTY"
-#     - model_name: "claude-opus-4-6-20260604"
-#       litellm_params:
-#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#         api_base: "http://localhost:11434/v1"
-#         api_key: "EMPTY"
-#     - model_name: "claude-haiku-3-5-20241022"
-#       litellm_params:
-#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#         api_base: "http://localhost:11434/v1"
-#         api_key: "EMPTY"
-#
-#   litellm_settings:
-#     drop_params: true
-#
-#   general_settings:
-#     master_key: "sk-litellm-master"
-#   YAML
-#
-# Config notes:
-# - model_name values must match what Claude Code sends (Claude model IDs)
-# - "hosted_vllm/" prefix forces LiteLLM to use /v1/chat/completions
-#   (not /v1/responses which vLLM doesn't fully support for complex messages)
-# - drop_params: true — silently drops Claude-specific parameters like
-#   context_management that vLLM doesn't understand
-# - master_key is the API key clients must send
-# - Add new model_name entries when Anthropic releases new model IDs
-#
-# Start LiteLLM:
-#
-#   nohup /ephemeral/litellm-env/bin/litellm \
-#     --config /ephemeral/litellm-config.yaml \
-#     --host 0.0.0.0 \
-#     --port 4000 \
-#     > /ephemeral/litellm.log 2>&1 &
-#
-# Verify:
-#   curl -s http://localhost:4000/v1/messages \
-#     -H "Content-Type: application/json" \
-#     -H "x-api-key: sk-litellm-master" \
-#     -H "anthropic-version: 2023-06-01" \
-#     -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
-#          "messages":[{"role":"user","content":"Hello"}]}'
-#
-# For production, create a systemd service instead of nohup:
-#
-#   sudo tee /etc/systemd/system/litellm.service > /dev/null << "UNIT"
-#   [Unit]
-#   Description=LiteLLM Proxy
-#   After=network.target docker.service
-#   Requires=docker.service
-#
-#   [Service]
-#   Type=simple
-#   User=ubuntu
-#   ExecStart=/ephemeral/litellm-env/bin/litellm \
-#     --config /ephemeral/litellm-config.yaml \
-#     --host 0.0.0.0 --port 4000
-#   Restart=always
-#   RestartSec=5
-#
-#   [Install]
-#   WantedBy=multi-user.target
-#   UNIT
-#
-#   sudo systemctl daemon-reload
-#   sudo systemctl enable --now litellm
-
 # ===========================================================================
-# STEP 5: Firewall rules
+# STEP 4: Firewall rules
 # ===========================================================================
 # Allow access from WireGuard subnet only:
 #
 #   sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp \
 #     comment 'vLLM via wg1'
-#   sudo ufw allow from 192.168.3.0/24 to any port 4000 proto tcp \
-#     comment 'LiteLLM proxy via wg1'
-
 # ===========================================================================
-# STEP 6: Client configuration (on earth / local machine)
+# STEP 5: Client configuration (on earth / local machine)
 # ===========================================================================
 #
-# --- Claude Code ---
-# Launch with environment variables pointing at LiteLLM proxy:
-#
-#   ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
-#   ANTHROPIC_API_KEY=sk-litellm-master \
-#   claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
-#
-# Fish shell alias (add to ~/.config/fish/config.fish):
-#
-#   alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
-#     ANTHROPIC_API_KEY=sk-litellm-master \
-#     claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
-#
-# --- OpenCode ---
-# Connects directly to vLLM (no LiteLLM needed, speaks OpenAI natively):
+# Launch Pi or any OpenAI-compatible client directly against vLLM:
 #
 #   OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
 #   OPENAI_API_KEY=EMPTY \
-#   opencode
-
-# Model name in OpenCode config: bullpoint/Qwen3-Coder-Next-AWQ-4bit
+#   pi
 #
 # ===========================================================================
 # STEP 7: Monitoring & troubleshooting
@@ -267,8 +143,7 @@
 # - Avg generation throughput: decode speed (tokens/s), ~40-99 on A100 PCIe
 # - GPU KV cache usage: % of KV cache memory in use (proportional to
 #   active context length vs max capacity)
-# - Prefix cache hit rate: % of prompt tokens served from cache (0% for
-#   Claude Code, higher for OpenCode)
+# - Prefix cache hit rate: % of prompt tokens served from cache
 # - Running/Waiting: active and queued request counts
 #
 # Follow live (all stats):
@@ -292,9 +167,6 @@
 # Useful for periodic checks without following the log:
 #   docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"
 #
-# --- LiteLLM proxy log ---
-#   tail -f /ephemeral/litellm.log
-#
 # --- GPU hardware stats ---
 # Snapshot:
 #   nvidia-smi
@@ -310,8 +182,7 @@
 # Decode throughput:  40-99 tok/s (varies with output length per sample)
 # KV cache usage:     0-5% for short conversations, grows with context
 #                     (100% = 298k tokens, at which point requests queue)
-# Prefix cache hit:   0% for Claude Code (expected, it mutates prompt prefix)
-#                     >50% for OpenCode after a few turns
+# Prefix cache hit:   depends on prompt reuse; higher is better
 # Temperature:        44-60C under load, <45C idle
 # Power:              70W idle, 230-240W under load, 300W max
 #
@@ -326,24 +197,10 @@
 # 1. OOM on startup with --max-model-len 262144
 #    → Reduce to 131072 or 65536
 #
-# 2. "model does not exist" from vLLM
-#    → Model name in LiteLLM config must exactly match HuggingFace repo name
-#
-# 3. LiteLLM returns UnsupportedParamsError
-#    → Ensure drop_params: true is in litellm_settings
-#
-# 4. LiteLLM routes to /v1/responses instead of /v1/chat/completions
-#    → Use "hosted_vllm/" prefix in model field, not "openai/"
-#
-# 5. Claude Code "Auth conflict" warning
-#    → Run `claude /logout` first to clear the claude.ai session token,
-#      then re-launch with ANTHROPIC_API_KEY=sk-litellm-master
-#
-# 6. Prefix cache hit rate stays at 0%
-#    → Normal for Claude Code (it mutates the prompt prefix each turn)
-#    → OpenCode should show increasing cache hit rates after a few turns
+# 2. Prefix cache hit rate stays at 0%
+#    → Normal when prompts vary heavily turn-to-turn
 #
-# 7. vLLM container won't start (CUDA version mismatch)
+# 3. vLLM container won't start (CUDA version mismatch)
 #    → Check driver version: nvidia-smi
 #    → vLLM requires CUDA >= 12.x and driver >= 535
@@ -402,19 +259,6 @@
 #     --host 0.0.0.0 \
 #     --port 11434
 #
-# --- Update LiteLLM config to match ---
-# After switching models, update the model field in litellm-config.yaml
-# to match the new HuggingFace model name:
-#
-#   model: "hosted_vllm/<new-model-name>"
-#
-# Then restart LiteLLM:
-#   pkill -f litellm
-#   nohup /ephemeral/litellm-env/bin/litellm \
-#     --config /ephemeral/litellm-config.yaml \
-#     --host 0.0.0.0 --port 4000 \
-#     > /ephemeral/litellm.log 2>&1 &
-#
 # --- Finding models ---
 # Search HuggingFace for vLLM-compatible quantized models:
 #   https://huggingface.co/models?search=<model-name>+awq
@@ -454,14 +298,6 @@
 #        "messages":[{"role":"user","content":"Hello"}],
 #        "max_tokens":50}'
 #
-# Test via LiteLLM (Anthropic API):
-#   curl -s http://localhost:4000/v1/messages \
-#     -H "Content-Type: application/json" \
-#     -H "x-api-key: sk-litellm-master" \
-#     -H "anthropic-version: 2023-06-01" \
-#     -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
-#          "messages":[{"role":"user","content":"Hello"}]}'
-
 # ===========================================================================
 # Performance characteristics (A100 80GB PCIe, single GPU)
 # ===========================================================================
@@ -472,7 +308,7 @@
 # vLLM decode throughput:  40-99 tok/s (memory-bandwidth limited)
 # Per-turn latency:        ~10-15s (small prompts, early conversation)
 # KV cache usage:          2-5% for typical coding sessions
-# Prefix cache hit rate:   0% (Claude Code), expected >50% (OpenCode)
+# Prefix cache hit rate:   workload-dependent
 #
 # Comparison with Ollama on same hardware (A100 80GB PCIe):
 #
@@ -482,6 +318,5 @@
 # Decode throughput      | ~40 tok/s            | 40-99 tok/s
 # Per-turn latency       | ~28s (32k ctx)       | ~10-15s
 # Context window         | 32k (was truncating) | 262k (full, no truncation)
-# Prefix cache (Claude)  | 0% always            | 0% always
-# Prefix cache (OpenCode)| 85-95% when warm     | expected similar or better
+# Prefix cache           | workload-dependent   | workload-dependent
 # VRAM usage             | 52-61 GiB            | 75 GiB (more KV cache)
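The client configuration in STEP 5 only shows VM1's gateway address. With the two-VM layout from the README, a second agent can target VM2 over the same tunnel; a sketch reusing the invocation above — the 192.168.3.3 address follows the `wireguard_server_ip` comments in hyperstack.rb, so treat it as an assumption if your subnet differs:

```bash
# One OpenAI-compatible client per VM over the shared wg1 tunnel.
OPENAI_BASE_URL=http://192.168.3.1:11434/v1 OPENAI_API_KEY=EMPTY pi   # VM1: Nemotron-3-Super
OPENAI_BASE_URL=http://192.168.3.3:11434/v1 OPENAI_API_KEY=EMPTY pi   # VM2: Qwen3-Coder-Next
```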
