| Field | Value | Date |
|---|---|---|
| author | Paul Buetow <paul@buetow.org> | 2026-03-21 10:49:35 +0200 |
| committer | Paul Buetow <paul@buetow.org> | 2026-03-21 10:49:35 +0200 |
| commit | ea0f9f7f51b32f0c392f75aa0cc3231211f54757 (patch) | |
| tree | 378d01dbc87dc0ef9f4fbd6ec7788e0a62f66876 | |
| parent | 4baa087445a11b856139f55adab262fa97384033 (diff) | |
Remove LiteLLM and Claude Code repo references (task 301)
| Mode | File | Lines changed |
|---|---|---|
| -rw-r--r-- | README.md | 46 |
| -rw-r--r-- | hyperstack-vm.toml | 14 |
| -rw-r--r-- | hyperstack-vm1.toml | 12 |
| -rw-r--r-- | hyperstack-vm2.toml | 12 |
| -rwxr-xr-x | hyperstack.rb | 173 |
| -rw-r--r-- | pi/agent/extensions/btw/README.md | 2 |
| -rw-r--r-- | vllm-setup.txt | 203 |
7 files changed, 41 insertions, 421 deletions
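The change drops the LiteLLM Anthropic-translation layer, so clients now talk to vLLM's OpenAI-compatible endpoint directly. A minimal smoke test of that direct path — a sketch, assuming the WireGuard tunnel `wg1` is up, VM1 answers on the default gateway address `192.168.3.1` and port `11434` used in the configs below, and `jq` is installed locally for readability:

```bash
# List the models vLLM is serving (should show the single loaded model).
curl -s http://192.168.3.1:11434/v1/models | jq -r '.data[].id'

# One-shot chat completion against the same endpoint (no proxy in between).
curl -s http://192.168.3.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"bullpoint/Qwen3-Coder-Next-AWQ-4bit",
       "messages":[{"role":"user","content":"Say hello in five words."}],
       "max_tokens":50}' | jq -r '.choices[0].message.content'
```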
@@ -1,6 +1,6 @@
 # hyperstack
 
-Automates Hyperstack GPU VM lifecycle: create, bootstrap, WireGuard tunnel, vLLM inference, LiteLLM proxy.
+Automates Hyperstack GPU VM lifecycle: create, bootstrap, WireGuard tunnel, and vLLM inference.
 Runs two A100 VMs concurrently — each serving a different model — with [Pi](https://pi.dev) coding agents connected to each.
 
 ## Architecture
@@ -18,9 +18,6 @@ Runs two A100 VMs concurrently — each serving a different model — with [Pi](
   │ vLLM (:11434)                │   │ vLLM (:11434)                    │
   │ Nemotron-3-Super 120B        │   │ Qwen3-Coder-Next 80B (MoE)       │
   │ (hybrid Mamba+MoE, AWQ-4b)   │   │ (AWQ-4bit)                       │
-  │                              │   │                                  │
-  │ LiteLLM (:4000)              │   │ LiteLLM (:4000)                  │
-  │ Anthropic API → OpenAI       │   │ Anthropic API → OpenAI           │
   └──────────────────────────────┘   └──────────────────────────────────┘
                ▲                                     ▲
                │ OpenAI /v1/chat/completions         │ OpenAI /v1/chat/completions
@@ -33,7 +30,7 @@ Runs two A100 VMs concurrently — each serving a different model — with [Pi](
 ```
 
 Both VMs share a single WireGuard interface (`wg1`) on the local machine.
-Each VM runs one vLLM model and a LiteLLM proxy for Anthropic-API translation.
+Each VM runs one vLLM model exposed directly to Pi over the OpenAI-compatible API.
 
 ## Prerequisites
@@ -49,7 +46,7 @@ Each VM runs one vLLM model and a LiteLLM proxy for Anthropic-API translation.
 ## Quickstart (two-VM setup)
 
 ```bash
-# Deploy both VMs in parallel, set up WireGuard + vLLM + LiteLLM (~10 min)
+# Deploy both VMs in parallel, set up WireGuard + vLLM (~10 min)
 ruby hyperstack.rb create-both
 
 # Verify both VMs are working
@@ -144,33 +141,6 @@ Available presets (both VMs share the same set):
 | `qwen3-32b` | Qwen3-32B (AWQ) | ~18 GB | 32K |
 | `devstral` | Devstral-Small-2507 (AWQ-4bit) | ~15 GB | 32K |
 
-## Using Claude Code with vLLM
-
-WireGuard (`wg1`) must be active before connecting.
-
-```bash
-ANTHROPIC_BASE_URL=http://hyperstack1.wg1:4000 \
-ANTHROPIC_API_KEY=sk-litellm-master \
-claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
-```
-
-If you see an **"Auth conflict"** warning, clear the saved claude.ai session first:
-
-```bash
-claude /logout
-```
-
-**Available model aliases** — all map to the same vLLM model on that VM:
-
-| Alias | Use case |
-|-------|----------|
-| `claude-opus-4-6-20260604` | Recommended (most future-proof) |
-| `claude-opus-4-20250514` | |
-| `claude-sonnet-4-20250514` | |
-| `claude-haiku-3-5-20241022` | |
-
-Add new Anthropic model IDs to `vllm.litellm_claude_model_names` in the TOML as they are released.
-
 ## CLI reference
 
 ```
@@ -182,13 +152,13 @@ Commands:
   delete                   Destroy the tracked VM
   delete-both              Destroy both VM1 and VM2
   status                   Show VM and WireGuard status
-  test                     Run end-to-end inference tests (vLLM + LiteLLM)
+  test                     Run end-to-end inference tests (vLLM)
   model switch <preset>    Hot-switch the running vLLM model
 
 create / create-both options:
   --replace                Delete existing tracked VM before creating
   --dry-run                Print the plan without making changes
-  --vllm / --no-vllm       Override config: enable/disable vLLM+LiteLLM setup
+  --vllm / --no-vllm       Override config: enable/disable vLLM setup
   --ollama / --no-ollama   Override config: enable/disable Ollama setup
 ```
@@ -200,7 +170,7 @@ Key sections:
 | Section | Purpose |
 |---------|---------|
 | `[vm]` | Flavor, image, environment name |
-| `[vllm]` | Model, container settings, LiteLLM key and Claude aliases |
+| `[vllm]` | Model, container settings, and vLLM runtime options |
 | `[vllm.presets.*]` | Named model presets for hot-switching |
 | `[ollama]` | Ollama settings (disabled by default; set `install = true` to use instead) |
 | `[network]` | Ports, WireGuard subnet, allowed CIDRs |
@@ -222,8 +192,6 @@ ssh ubuntu@<vm-ip> 'docker logs -f vllm_nemotron_super 2>&1 | grep "Engine 000"'
 
 # GPU stats (every 5 s)
 ssh ubuntu@<vm-ip> 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,power.draw,memory.used --format=csv -l 5'
-
-# LiteLLM proxy log
-ssh ubuntu@<vm-ip> 'sudo journalctl -fu litellm'
 ```
 
 Healthy baseline (A100 80GB PCIe):
@@ -234,4 +202,4 @@ Healthy baseline (A100 80GB PCIe):
 | Decode throughput | 40–99 tok/s |
 | KV cache usage | 2–5% for typical sessions |
 
-See `vllm-setup.txt` for detailed vLLM and LiteLLM setup notes, VRAM sizing guide, and troubleshooting.
+See `vllm-setup.txt` for detailed vLLM setup notes, VRAM sizing guide, and troubleshooting.
diff --git a/hyperstack-vm.toml b/hyperstack-vm.toml
index e82c97f..28de975 100644
--- a/hyperstack-vm.toml
+++ b/hyperstack-vm.toml
@@ -37,8 +37,6 @@ allowed_ssh_cidrs = ["auto"]
 allowed_wireguard_cidrs = ["auto"]
 # Port 11434 is shared by both Ollama and vLLM for firewall compatibility.
 ollama_port = 11434
-# Port 4000: LiteLLM Anthropic-API proxy (used with vLLM).
-litellm_port = 4000
 
 [bootstrap]
 enable_guest_bootstrap = true
@@ -56,7 +54,7 @@ num_parallel = 1
 context_length = 32768
 pull_models = ["qwen3-coder-next", "qwen3-coder:30b", "gpt-oss:20b", "gpt-oss:120b", "nemotron-3-super"]
 
-# vLLM serves one model via Docker; LiteLLM translates Anthropic API → OpenAI.
+# vLLM serves one model via Docker on the OpenAI-compatible API.
 # Use --vllm / --no-vllm CLI flags to override install at runtime.
 [vllm]
 install = true
@@ -68,14 +66,6 @@ max_model_len = 262144
 gpu_memory_utilization = 0.92
 tensor_parallel_size = 1
 tool_call_parser = "qwen3_coder"
-# LiteLLM maps each entry to the vLLM model; add new Anthropic model IDs here.
-litellm_master_key = "sk-litellm-master"
-litellm_claude_model_names = [
-  "claude-sonnet-4-20250514",
-  "claude-opus-4-20250514",
-  "claude-opus-4-6-20260604",
-  "claude-haiku-3-5-20241022"
-]
 
 # Named model presets for 'ruby hyperstack.rb model switch <name>'.
 # Each preset overrides the matching [vllm] field; unset fields fall back to [vllm] defaults.
@@ -127,7 +117,7 @@ tool_call_parser = ""
 # OpenAI GPT-OSS 120B — powerful MoE (5.1B active / 117B total, MXFP4), ~65 GB on A100.
 # Hard architecture limit: max_position_embeddings=131072 in model config.json.
 # 131072 is the absolute ceiling — exceeding it causes NaN or CUDA OOB errors.
-# For sessions approaching this limit, start a fresh opencode conversation.
+# For sessions approaching this limit, start a fresh Pi conversation.
 # tool_call_parser = "" disables --enable-auto-tool-choice (same reason as gpt-oss-20b).
 [vllm.presets.gpt-oss-120b]
 model = "openai/gpt-oss-120b"
diff --git a/hyperstack-vm1.toml b/hyperstack-vm1.toml
index 1b116bd..6109472 100644
--- a/hyperstack-vm1.toml
+++ b/hyperstack-vm1.toml
@@ -41,8 +41,6 @@ allowed_ssh_cidrs = ["auto"]
 allowed_wireguard_cidrs = ["auto"]
 # Port 11434 is shared by both Ollama and vLLM for firewall compatibility.
 ollama_port = 11434
-# Port 4000: LiteLLM Anthropic-API proxy (used with vLLM).
-litellm_port = 4000
 
 [bootstrap]
 enable_guest_bootstrap = true
@@ -60,7 +58,7 @@ num_parallel = 1
 context_length = 32768
 pull_models = ["nemotron-3-super"]
 
-# vLLM serves one model via Docker; LiteLLM translates Anthropic API → OpenAI.
+# vLLM serves one model via Docker on the OpenAI-compatible API.
 # VM1 defaults to nemotron-3-super; use 'model switch' to load any other preset.
 [vllm]
 install = true
@@ -75,14 +73,6 @@ tensor_parallel_size = 1
 tool_call_parser = "qwen3_xml"
 trust_remote_code = true
 extra_vllm_args = ["--reasoning-parser", "nemotron_v3"]
-# LiteLLM maps each entry to the vLLM model; add new Anthropic model IDs here.
-litellm_master_key = "sk-litellm-master"
-litellm_claude_model_names = [
-  "claude-sonnet-4-20250514",
-  "claude-opus-4-20250514",
-  "claude-opus-4-6-20260604",
-  "claude-haiku-3-5-20241022"
-]
 
 # Named model presets for 'ruby hyperstack.rb --config hyperstack-vm1.toml model switch <name>'.
 # Each preset overrides the matching [vllm] field; unset fields fall back to [vllm] defaults.
diff --git a/hyperstack-vm2.toml b/hyperstack-vm2.toml
index e8e9b00..202a340 100644
--- a/hyperstack-vm2.toml
+++ b/hyperstack-vm2.toml
@@ -41,8 +41,6 @@ allowed_ssh_cidrs = ["auto"]
 allowed_wireguard_cidrs = ["auto"]
 # Port 11434 is shared by both Ollama and vLLM for firewall compatibility.
 ollama_port = 11434
-# Port 4000: LiteLLM Anthropic-API proxy (used with vLLM).
-litellm_port = 4000
 
 [bootstrap]
 enable_guest_bootstrap = true
@@ -60,7 +58,7 @@ num_parallel = 1
 context_length = 32768
 pull_models = ["qwen3-coder-next"]
 
-# vLLM serves one model via Docker; LiteLLM translates Anthropic API → OpenAI.
+# vLLM serves one model via Docker on the OpenAI-compatible API.
 # VM2 defaults to qwen3-coder-next; use 'model switch' to load any other preset.
 [vllm]
 install = true
@@ -72,14 +70,6 @@ max_model_len = 262144
 gpu_memory_utilization = 0.92
 tensor_parallel_size = 1
 tool_call_parser = "qwen3_coder"
-# LiteLLM maps each entry to the vLLM model; add new Anthropic model IDs here.
-litellm_master_key = "sk-litellm-master"
-litellm_claude_model_names = [
-  "claude-sonnet-4-20250514",
-  "claude-opus-4-20250514",
-  "claude-opus-4-6-20260604",
-  "claude-haiku-3-5-20241022"
-]
 
 # Named model presets for 'ruby hyperstack.rb --config hyperstack-vm2.toml model switch <name>'.
 # Each preset overrides the matching [vllm] field; unset fields fall back to [vllm] defaults.
diff --git a/hyperstack.rb b/hyperstack.rb
index 7cd817d..a3af491 100755
--- a/hyperstack.rb
+++ b/hyperstack.rb
@@ -87,7 +87,6 @@ module HyperstackVM
       # Set to a different address (e.g. 192.168.3.3) for a second VM sharing the same wg1 tunnel.
       'wireguard_server_ip' => nil,
       'ollama_port' => 11_434,
-      'litellm_port' => 4_000,
       'allowed_ssh_cidrs' => ['auto'],
       'allowed_wireguard_cidrs' => ['auto']
     },
@@ -114,14 +113,7 @@ module HyperstackVM
       'max_model_len' => 262_144,
       'gpu_memory_utilization' => 0.92,
       'tensor_parallel_size' => 1,
-      'tool_call_parser' => 'qwen3_coder',
-      'litellm_claude_model_names' => %w[
-        claude-sonnet-4-20250514
-        claude-opus-4-20250514
-        claude-opus-4-6-20260604
-        claude-haiku-3-5-20241022
-      ],
-      'litellm_master_key' => 'sk-litellm-master'
+      'tool_call_parser' => 'qwen3_coder'
     },
     'wireguard' => {
       'auto_setup' => true,
@@ -338,10 +330,6 @@ module HyperstackVM
       Integer(fetch('network', 'ollama_port'))
     end
 
-    def litellm_port
-      Integer(fetch('network', 'litellm_port'))
-    end
-
     # Returns the server-side WireGuard IP for this VM.
     # Uses the explicitly configured address when set; otherwise derives it as subnet_base + 1.
     # Example: 192.168.3.0/24 → 192.168.3.1 (default VM1); VM2 sets wireguard_server_ip=192.168.3.3.
@@ -453,14 +441,6 @@ module HyperstackVM
       fetch('vllm', 'tool_call_parser')
     end
 
-    def litellm_claude_model_names
-      Array(fetch('vllm', 'litellm_claude_model_names')).map(&:to_s)
-    end
-
-    def litellm_master_key
-      fetch('vllm', 'litellm_master_key')
-    end
-
     # Whether to pass --trust-remote-code to vLLM for the default model.
     # Required for architectures not yet in the vLLM upstream registry (e.g. nemotron_h).
     def vllm_trust_remote_code
@@ -530,7 +510,6 @@ module HyperstackVM
       end
 
       rules << firewall_rule('tcp', ollama_port, wireguard_subnet) if include_ollama || include_vllm
-      rules << firewall_rule('tcp', litellm_port, wireguard_subnet) if include_vllm
       rules.uniq
     end
@@ -1081,8 +1060,6 @@ module HyperstackVM
       script << "sudo ufw allow #{@config.wireguard_udp_port}/udp comment 'WireGuard #{@config.local_interface_name}' >/dev/null 2>&1 || true"
       # Port 11434 is shared by Ollama and vLLM; open for both regardless of which is installed.
       script << "sudo ufw allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port #{@config.ollama_port} proto tcp comment 'Inference API (Ollama/vLLM) via #{@config.local_interface_name}' >/dev/null 2>&1 || true"
-      # Port 4000: LiteLLM proxy (Anthropic API -> vLLM); open alongside the inference port.
-      script << "sudo ufw allow from #{Shellwords.escape(@config.wireguard_subnet)} to any port #{@config.litellm_port} proto tcp comment 'LiteLLM proxy via #{@config.local_interface_name}' >/dev/null 2>&1 || true"
     end
 
     if @config.configure_ollama_host?
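After the hunk above, the guest firewall only opens the shared inference port to the WireGuard subnet; the old 4000/tcp LiteLLM rule is gone. A quick way to confirm on a provisioned VM — a sketch, assuming the defaults from the TOML files in this commit (subnet 192.168.3.0/24, port 11434) and the `ubuntu@<vm-ip>` SSH access used elsewhere in the README:

```bash
# Expect a ufw rule for 11434/tcp from 192.168.3.0/24 and no 4000/tcp entry.
ssh ubuntu@<vm-ip> "sudo ufw status | grep -E '11434|4000'"
```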
@@ -1248,71 +1225,6 @@ module HyperstackVM
       script.join("\n")
     end
 
-    def litellm_install_script(model_override: nil)
-      port = @config.litellm_port
-      model = model_override || @config.vllm_model
-
-      script = []
-      script << 'set -euo pipefail'
-      script << 'sudo apt-get install -y python3.12-venv'
-      script << 'sudo mkdir -p /ephemeral/litellm-env'
-      script << 'sudo chown ubuntu:ubuntu /ephemeral/litellm-env'
-      script << 'python3 -m venv /ephemeral/litellm-env'
-      script << '/ephemeral/litellm-env/bin/pip install --quiet "litellm[proxy]"'
-      script << "sudo tee /ephemeral/litellm-config.yaml > /dev/null << 'LITELLM_YAML'"
-      script << 'model_list:'
-      script.concat(litellm_model_entries(model))
-      script << ''
-      script << 'litellm_settings:'
-      script << '  drop_params: true'
-      script << ''
-      script << 'general_settings:'
-      script << "  master_key: \"#{@config.litellm_master_key}\""
-      script << 'LITELLM_YAML'
-      script << "sudo tee /etc/systemd/system/litellm.service > /dev/null << 'LITELLM_UNIT'"
-      script << '[Unit]'
-      script << 'Description=LiteLLM Proxy'
-      script << 'After=network.target docker.service'
-      script << 'Requires=docker.service'
-      script << ''
-      script << '[Service]'
-      script << 'Type=simple'
-      script << 'User=ubuntu'
-      script << "ExecStart=/ephemeral/litellm-env/bin/litellm --config /ephemeral/litellm-config.yaml --host 0.0.0.0 --port #{port}"
-      script << 'Restart=always'
-      script << 'RestartSec=5'
-      script << ''
-      script << '[Install]'
-      script << 'WantedBy=multi-user.target'
-      script << 'LITELLM_UNIT'
-      script << 'sudo systemctl daemon-reload'
-      script << 'sudo systemctl enable --now litellm'
-      script << 'sleep 5'
-      script << 'systemctl is-active --quiet litellm'
-      script << 'echo litellm-install-ok'
-      script.join("\n")
-    end
-
-    def litellm_reload_script(model)
-      script = []
-      script << 'set -euo pipefail'
-      script << "sudo tee /ephemeral/litellm-config.yaml > /dev/null << 'LITELLM_YAML'"
-      script << 'model_list:'
-      script.concat(litellm_model_entries(model))
-      script << ''
-      script << 'litellm_settings:'
-      script << '  drop_params: true'
-      script << ''
-      script << 'general_settings:'
-      script << "  master_key: \"#{@config.litellm_master_key}\""
-      script << 'LITELLM_YAML'
-      script << 'sudo systemctl restart litellm'
-      script << 'sleep 3'
-      script << 'systemctl is-active --quiet litellm'
-      script << 'echo litellm-reload-ok'
-      script.join("\n")
-    end
-
     private
 
     def normalized_model_list(models)
@@ -1324,19 +1236,6 @@ module HyperstackVM
         end
       end
 
-    def litellm_model_entries(model)
-      vllm_port = @config.ollama_port
-
-      @config.litellm_claude_model_names.flat_map do |name|
-        [
-          "  - model_name: \"#{name}\"",
-          '    litellm_params:',
-          "      model: \"hosted_vllm/#{model}\"",
-          "      api_base: \"http://localhost:#{vllm_port}/v1\"",
-          '      api_key: "EMPTY"'
-        ]
-      end
-    end
   end
 
   class RemoteProvisioner
@@ -1390,22 +1289,8 @@ module HyperstackVM
       raise Error, "vLLM install failed: #{output.strip}" unless status.success?
     end
 
-    def install_litellm(host, model:)
-      info "Setting up LiteLLM Anthropic-API proxy on #{host}..."
-      output, status = @ssh_stream_runner.call(host, @scripts.litellm_install_script(model_override: model))
-      raise Error, "LiteLLM install failed: #{output.strip}" unless status.success?
-    end
-
-    def reload_litellm(host, model)
-      info "Reloading LiteLLM proxy config for #{model}..."
-      output, status = @ssh_stream_runner.call(host, @scripts.litellm_reload_script(model))
-      raise Error, "LiteLLM reload failed: #{output.strip}" unless status.success?
-    end
-
     def setup_vllm_stack(host, preset_config: nil)
       install_vllm(host, preset_config: preset_config)
-      model = preset_config&.dig('model') || @config.vllm_model
-      install_litellm(host, model: model)
     end
 
     private
@@ -1598,7 +1483,7 @@ module HyperstackVM
     end
 
     # Switches the running VM to a different named model preset.
-    # Stops the old container, starts the new one, and hot-reloads LiteLLM config.
+    # Stops the old container, then starts the new vLLM container in its place.
     def switch_model(preset_name:, dry_run: false)
       preset = @config.vllm_preset(preset_name) # raises if unknown
       state = @state_store.load
@@ -1633,10 +1518,6 @@ module HyperstackVM
       # surprise multi-GB download if the upstream image was updated.
       @provisioner.install_vllm(host, preset_config: preset, pull_image: false)
 
-      # Hot-reload LiteLLM: rewrite config for the new model and restart the service.
-      # Skips venv/apt install since those are already in place.
-      @provisioner.reload_litellm(host, preset['model'])
-
       state['vllm_model'] = preset['model']
       state['vllm_container_name'] = new_container
       state['vllm_preset'] = preset_name
@@ -1650,7 +1531,7 @@ module HyperstackVM
       info "Run 'ruby hyperstack.rb test' to verify."
     end
 
-    # Runs end-to-end inference tests against vLLM and LiteLLM over WireGuard.
+    # Runs end-to-end inference tests against the active inference services over WireGuard.
     # Requires wg1 to be active and the VM to be fully provisioned.
     def test
       state = @state_store.load
@@ -1663,7 +1544,6 @@ module HyperstackVM
       if vllm_enabled
         test_vllm(wg_ip)
-        test_litellm(wg_ip)
       end
 
       info "  Ollama test: connect via SSH and run 'ollama list' to verify models." if ollama_enabled
@@ -1731,7 +1611,7 @@ module HyperstackVM
       @state_store.save(state)
     end
 
-    # Set up vLLM (Docker container) + LiteLLM (Anthropic-API proxy) after
+    # Set up vLLM after
     # the tunnel is up so that model-download progress is visible locally.
     if vllm_setup_needed?(state)
       preset_cfg = effective_vllm_preset_config
@@ -1755,9 +1635,8 @@ module HyperstackVM
     return unless effective_vllm?
 
     wg_ip = @config.wireguard_gateway_hostname
-    info "Run 'ruby hyperstack.rb test' to verify vLLM and LiteLLM."
+    info "Run 'ruby hyperstack.rb test' to verify vLLM."
     info "  vLLM: http://#{wg_ip}:#{@config.ollama_port}/v1/models"
-    info "  LiteLLM: http://#{wg_ip}:#{@config.litellm_port}/v1/messages"
   end
 
   def build_create_payload(vm_name, resolved)
@@ -2138,9 +2017,9 @@ module HyperstackVM
   end
 
   def service_mode_summary(vllm_enabled:, ollama_enabled:)
-    return 'vLLM+LiteLLM enabled, Ollama enabled' if vllm_enabled && ollama_enabled
-    return 'vLLM+LiteLLM enabled, Ollama disabled' if vllm_enabled
-    return 'Ollama enabled, vLLM+LiteLLM disabled' if ollama_enabled
+    return 'vLLM enabled, Ollama enabled' if vllm_enabled && ollama_enabled
+    return 'vLLM enabled, Ollama disabled' if vllm_enabled
+    return 'Ollama enabled, vLLM disabled' if ollama_enabled
 
     'All inference services disabled'
   end
@@ -2204,8 +2083,6 @@ module HyperstackVM
     preset_note = @effective_vllm_preset ? " (preset: #{@effective_vllm_preset})" : ''
     info "vLLM will be installed: #{vllm_m}#{preset_note}"
     info "  Container: #{vllm_cname}, port #{@config.ollama_port}, max_model_len #{vllm_maxlen}"
-    info "LiteLLM proxy will be installed on port #{@config.litellm_port}"
-    info "  Claude model aliases: #{@config.litellm_claude_model_names.join(', ')}"
   end
 
   if @config.wireguard_auto_setup?
info "WireGuard auto-setup script: #{@config.wireguard_setup_script} <vm_public_ip>" @@ -2233,7 +2110,6 @@ module HyperstackVM end if vllm_setup_needed?(state) info "vLLM would be installed: #{@config.vllm_model}" - info "LiteLLM proxy would be installed on port #{@config.litellm_port}" end if wireguard_setup_needed?(state) info "WireGuard auto-setup script would run: #{@config.wireguard_setup_script} #{state['public_ip'] || '<pending-public-ip>'}" @@ -2325,35 +2201,6 @@ module HyperstackVM raise Error, "Cannot reach vLLM at #{wg_ip}:#{port} — is WireGuard (wg1) active? (#{e.message})" end - # Tests the LiteLLM proxy using the Anthropic Messages API format, - # which is what Claude Code sends when pointed at a custom base URL. - def test_litellm(wg_ip) - port = @config.litellm_port - model = @config.litellm_claude_model_names.first - key = @config.litellm_master_key - - info " Testing LiteLLM proxy at http://#{wg_ip}:#{port}/v1/messages..." - uri = URI("http://#{wg_ip}:#{port}/v1/messages") - req = Net::HTTP::Post.new(uri) - req['Content-Type'] = 'application/json' - req['x-api-key'] = key - req['anthropic-version'] = '2023-06-01' - req.body = JSON.generate( - 'model' => model, - # 500 tokens: reasoning models (e.g. gpt-oss) consume tokens on chain-of-thought - # before producing content; 50 is too small and yields an empty content field. - 'max_tokens' => 500, - 'messages' => [{ 'role' => 'user', 'content' => 'Say hello in five words.' }] - ) - resp = Net::HTTP.start(uri.host, uri.port, open_timeout: 10, read_timeout: 120) { |h| h.request(req) } - raise Error, "LiteLLM returned HTTP #{resp.code}: #{resp.body}" unless resp.code == '200' - - text = JSON.parse(resp.body).fetch('content', []).find { |b| b['type'] == 'text' }&.dig('text').to_s.strip - info " LiteLLM response: #{text}" - rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH, SocketError => e - raise Error, "Cannot reach LiteLLM at #{wg_ip}:#{port} — is WireGuard (wg1) active? (#{e.message})" - end - # Sends a single OpenAI chat completion request and returns the reply text. def vllm_chat(host, port, model, prompt) uri = URI("http://#{host}:#{port}/v1/chat/completions") @@ -2547,8 +2394,8 @@ module HyperstackVM OptionParser.new do |o| o.on('--replace', 'Delete the tracked VM before creating a new one') { opts[:replace] = true } o.on('--dry-run', 'Print the create plan without creating a VM') { opts[:dry_run] = true } - o.on('--vllm', 'Enable vLLM+LiteLLM setup (overrides config)') { opts[:install_vllm] = true } - o.on('--no-vllm', 'Disable vLLM+LiteLLM setup (overrides config)') { opts[:install_vllm] = false } + o.on('--vllm', 'Enable vLLM setup (overrides config)') { opts[:install_vllm] = true } + o.on('--no-vllm', 'Disable vLLM setup (overrides config)') { opts[:install_vllm] = false } o.on('--ollama', 'Enable Ollama setup (overrides config)') { opts[:install_ollama] = true } o.on('--no-ollama', 'Disable Ollama setup (overrides config)') { opts[:install_ollama] = false } o.on('--model PRESET', 'Use a named vLLM preset at create time') { |v| opts[:vllm_preset] = v } if include_model_preset diff --git a/pi/agent/extensions/btw/README.md b/pi/agent/extensions/btw/README.md index cf39e1c..61092ae 100644 --- a/pi/agent/extensions/btw/README.md +++ b/pi/agent/extensions/btw/README.md @@ -2,7 +2,7 @@ Ephemeral side questions for Pi. 
-This extension adds `/btw`, modeled after Claude Code's side-question flow:
+This extension adds `/btw`, modeled after Pi's side-question flow:
 
 - it uses the current branch conversation as context
 - it asks a separate one-shot question with the current model
diff --git a/vllm-setup.txt b/vllm-setup.txt
index 9ea44a7..cb64432 100644
--- a/vllm-setup.txt
+++ b/vllm-setup.txt
@@ -1,22 +1,16 @@
-# vLLM + LiteLLM + Claude Code Setup for Hyperstack VM
+# vLLM Setup for Hyperstack VM
 #
 # This document describes the full deployment of qwen3-coder-next (AWQ 4-bit)
-# via vLLM with a LiteLLM proxy for Claude Code compatibility.
+# via vLLM exposed directly on the OpenAI-compatible API.
 #
 # Architecture:
 #
-# Claude Code (earth)                     Hyperstack VM (A100 80GB)
+# Pi (earth)                              Hyperstack VM (A100 80GB)
 # ┌─────────────┐                         ┌──────────────────────────────┐
-# │ claude CLI  │── Anthropic API ──>     │ LiteLLM proxy (:4000)        │
-# │             │   /v1/messages          │ translates Anthropic →       │
-# │             │   via WireGuard wg1     │ OpenAI chat completions      │
-# └─────────────┘                         │        │                     │
-#                                         │        ▼                     │
-# OpenCode (earth)                        │ vLLM engine (:11434)         │
-# ┌─────────────┐                         │ /v1/chat/completions         │
-# │ opencode    │── OpenAI API ──────>    │ FlashAttention v2            │
-# │             │   /v1/chat/completions  │ prefix caching               │
-# └─────────────┘                         │ bullpoint/Qwen3-Coder-       │
+# │ pi          │── OpenAI API ──────>    │ vLLM engine (:11434)         │
+# │             │   /v1/chat/completions  │ FlashAttention v2            │
+# └─────────────┘   via WireGuard wg1     │ prefix caching               │
+#                                         │ bullpoint/Qwen3-Coder-       │
 #                                         │ Next-AWQ-4bit (45GB)         │
 #                                         └──────────────────────────────┘
 #
@@ -27,12 +21,6 @@
 # - Chunked prefill: can interleave prefill and decode
 # - Marlin kernels for AWQ MoE quantization
 #
-# Why LiteLLM:
-# - Claude Code speaks Anthropic Messages API (/v1/messages) only
-# - vLLM speaks OpenAI Chat Completions API (/v1/chat/completions) only
-# - LiteLLM translates between them, mapping Claude model names to the
-#   actual vLLM model
-#
 # Model details:
 # - Name: bullpoint/Qwen3-Coder-Next-AWQ-4bit (HuggingFace)
 # - Architecture: MoE, 80B total params, 3B active per token
@@ -54,8 +42,7 @@
 #
 # Ports:
 #   11434/tcp - vLLM OpenAI-compatible API (reuses Ollama port for firewall compat)
-#   4000/tcp  - LiteLLM Anthropic-compatible proxy
-#   Both restricted to 192.168.3.0/24 (WireGuard wg1 subnet)
+#   Restricted to 192.168.3.0/24 (WireGuard wg1 subnet)
 
 # ===========================================================================
 # STEP 1: Prerequisites
@@ -130,132 +117,21 @@
 #   docker logs -f vllm_qwen3 2>&1 | grep "Engine 000"
 
 # ===========================================================================
-# STEP 4: LiteLLM proxy (Anthropic API translation for Claude Code)
-# ===========================================================================
-# Install in a Python venv (Ubuntu 24.04 requires this):
-#
-#   sudo apt-get install -y python3.12-venv
-#   sudo mkdir -p /ephemeral/litellm-env
-#   sudo chown ubuntu:ubuntu /ephemeral/litellm-env
-#   python3 -m venv /ephemeral/litellm-env
-#   /ephemeral/litellm-env/bin/pip install "litellm[proxy]"
-#
-# Write config file:
-#
-#   sudo tee /ephemeral/litellm-config.yaml > /dev/null << "YAML"
-#   model_list:
-#     - model_name: "claude-sonnet-4-20250514"
-#       litellm_params:
-#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#         api_base: "http://localhost:11434/v1"
-#         api_key: "EMPTY"
-#     - model_name: "claude-opus-4-20250514"
-#       litellm_params:
-#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#         api_base: "http://localhost:11434/v1"
-#         api_key: "EMPTY"
-#     - model_name: "claude-opus-4-6-20260604"
-#       litellm_params:
-#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#         api_base: "http://localhost:11434/v1"
-#         api_key: "EMPTY"
-#     - model_name: "claude-haiku-3-5-20241022"
-#       litellm_params:
-#         model: "hosted_vllm/bullpoint/Qwen3-Coder-Next-AWQ-4bit"
-#         api_base: "http://localhost:11434/v1"
-#         api_key: "EMPTY"
-#
-#   litellm_settings:
-#     drop_params: true
-#
-#   general_settings:
-#     master_key: "sk-litellm-master"
-#   YAML
-#
-# Config notes:
-# - model_name values must match what Claude Code sends (Claude model IDs)
-# - "hosted_vllm/" prefix forces LiteLLM to use /v1/chat/completions
-#   (not /v1/responses which vLLM doesn't fully support for complex messages)
-# - drop_params: true — silently drops Claude-specific parameters like
-#   context_management that vLLM doesn't understand
-# - master_key is the API key clients must send
-# - Add new model_name entries when Anthropic releases new model IDs
-#
-# Start LiteLLM:
-#
-#   nohup /ephemeral/litellm-env/bin/litellm \
-#     --config /ephemeral/litellm-config.yaml \
-#     --host 0.0.0.0 \
-#     --port 4000 \
-#     > /ephemeral/litellm.log 2>&1 &
-#
-# Verify:
-#   curl -s http://localhost:4000/v1/messages \
-#     -H "Content-Type: application/json" \
-#     -H "x-api-key: sk-litellm-master" \
-#     -H "anthropic-version: 2023-06-01" \
-#     -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
-#          "messages":[{"role":"user","content":"Hello"}]}'
-#
-# For production, create a systemd service instead of nohup:
-#
-#   sudo tee /etc/systemd/system/litellm.service > /dev/null << "UNIT"
-#   [Unit]
-#   Description=LiteLLM Proxy
-#   After=network.target docker.service
-#   Requires=docker.service
-#
-#   [Service]
-#   Type=simple
-#   User=ubuntu
-#   ExecStart=/ephemeral/litellm-env/bin/litellm \
-#     --config /ephemeral/litellm-config.yaml \
-#     --host 0.0.0.0 --port 4000
-#   Restart=always
-#   RestartSec=5
-#
-#   [Install]
-#   WantedBy=multi-user.target
-#   UNIT
-#
-#   sudo systemctl daemon-reload
-#   sudo systemctl enable --now litellm
-
 # ===========================================================================
-# STEP 5: Firewall rules
+# STEP 4: Firewall rules
 # ===========================================================================
 # Allow access from WireGuard subnet only:
 #
 #   sudo ufw allow from 192.168.3.0/24 to any port 11434 proto tcp \
 #     comment 'vLLM via wg1'
-#   sudo ufw allow from 192.168.3.0/24 to any port 4000 proto tcp \
-#     comment 'LiteLLM proxy via wg1'
-
 # ===========================================================================
-# STEP 6: Client configuration (on earth / local machine)
+# STEP 5: Client configuration (on earth / local machine)
 # ===========================================================================
 #
-# --- Claude Code ---
-# Launch with environment variables pointing at LiteLLM proxy:
-#
-#   ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
-#   ANTHROPIC_API_KEY=sk-litellm-master \
-#   claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions
-#
-# Fish shell alias (add to ~/.config/fish/config.fish):
-#
-#   alias claude-local='ANTHROPIC_BASE_URL=http://192.168.3.1:4000 \
-#     ANTHROPIC_API_KEY=sk-litellm-master \
-#     claude --model claude-opus-4-6-20260604 --dangerously-skip-permissions'
-#
-# --- OpenCode ---
-# Connects directly to vLLM (no LiteLLM needed, speaks OpenAI natively):
+# Launch Pi or any OpenAI-compatible client directly against vLLM:
 #
 #   OPENAI_BASE_URL=http://192.168.3.1:11434/v1 \
 #   OPENAI_API_KEY=EMPTY \
-#   opencode
-
-# Model name in OpenCode config: bullpoint/Qwen3-Coder-Next-AWQ-4bit
+#   pi
 #
 # ===========================================================================
 # STEP 7: Monitoring & troubleshooting
@@ -267,8 +143,7 @@
 # - Avg generation throughput: decode speed (tokens/s), ~40-99 on A100 PCIe
 # - GPU KV cache usage: % of KV cache memory in use (proportional to
 #   active context length vs max capacity)
-# - Prefix cache hit rate: % of prompt tokens served from cache (0% for
-#   Claude Code, higher for OpenCode)
+# - Prefix cache hit rate: % of prompt tokens served from cache
 # - Running/Waiting: active and queued request counts
 #
 # Follow live (all stats):
@@ -292,9 +167,6 @@
 # Useful for periodic checks without following the log:
 #   docker logs --since 1m vllm_qwen3 2>&1 | grep "Engine 000"
 #
-# --- LiteLLM proxy log ---
-#   tail -f /ephemeral/litellm.log
-#
 # --- GPU hardware stats ---
 # Snapshot:
 #   nvidia-smi
@@ -310,8 +182,7 @@
 # Decode throughput:  40-99 tok/s (varies with output length per sample)
 # KV cache usage:     0-5% for short conversations, grows with context
 #                     (100% = 298k tokens, at which point requests queue)
-# Prefix cache hit:   0% for Claude Code (expected, it mutates prompt prefix)
-#                     >50% for OpenCode after a few turns
+# Prefix cache hit:   depends on prompt reuse; higher is better
 # Temperature:        44-60C under load, <45C idle
 # Power:              70W idle, 230-240W under load, 300W max
 #
@@ -326,24 +197,10 @@
 # 1. OOM on startup with --max-model-len 262144
 #    → Reduce to 131072 or 65536
 #
-# 2. "model does not exist" from vLLM
-#    → Model name in LiteLLM config must exactly match HuggingFace repo name
-#
-# 3. LiteLLM returns UnsupportedParamsError
-#    → Ensure drop_params: true is in litellm_settings
-#
-# 4. LiteLLM routes to /v1/responses instead of /v1/chat/completions
-#    → Use "hosted_vllm/" prefix in model field, not "openai/"
-#
-# 5. Claude Code "Auth conflict" warning
-#    → Run `claude /logout` first to clear the claude.ai session token,
-#      then re-launch with ANTHROPIC_API_KEY=sk-litellm-master
-#
-# 6. Prefix cache hit rate stays at 0%
-#    → Normal for Claude Code (it mutates the prompt prefix each turn)
-#    → OpenCode should show increasing cache hit rates after a few turns
+# 2. Prefix cache hit rate stays at 0%
+#    → Normal when prompts vary heavily turn-to-turn
 #
-# 7. vLLM container won't start (CUDA version mismatch)
+# 3. vLLM container won't start (CUDA version mismatch)
 #    → Check driver version: nvidia-smi
 #    → vLLM requires CUDA >= 12.x and driver >= 535
@@ -402,19 +259,6 @@
 #     --host 0.0.0.0 \
 #     --port 11434
 #
-# --- Update LiteLLM config to match ---
-# After switching models, update the model field in litellm-config.yaml
-# to match the new HuggingFace model name:
-#
-#   model: "hosted_vllm/<new-model-name>"
-#
-# Then restart LiteLLM:
-#   pkill -f litellm
-#   nohup /ephemeral/litellm-env/bin/litellm \
-#     --config /ephemeral/litellm-config.yaml \
-#     --host 0.0.0.0 --port 4000 \
-#     > /ephemeral/litellm.log 2>&1 &
-#
 # --- Finding models ---
 # Search HuggingFace for vLLM-compatible quantized models:
 #   https://huggingface.co/models?search=<model-name>+awq
@@ -454,14 +298,6 @@
 #        "messages":[{"role":"user","content":"Hello"}],
 #        "max_tokens":50}'
 #
-# Test via LiteLLM (Anthropic API):
-#   curl -s http://localhost:4000/v1/messages \
-#     -H "Content-Type: application/json" \
-#     -H "x-api-key: sk-litellm-master" \
-#     -H "anthropic-version: 2023-06-01" \
-#     -d '{"model":"claude-opus-4-6-20260604","max_tokens":50,
-#          "messages":[{"role":"user","content":"Hello"}]}'
-
 # ===========================================================================
 # Performance characteristics (A100 80GB PCIe, single GPU)
 # ===========================================================================
@@ -472,7 +308,7 @@
 # vLLM decode throughput:  40-99 tok/s (memory-bandwidth limited)
 # Per-turn latency:        ~10-15s (small prompts, early conversation)
 # KV cache usage:          2-5% for typical coding sessions
-# Prefix cache hit rate:   0% (Claude Code), expected >50% (OpenCode)
+# Prefix cache hit rate:   workload-dependent
 #
 # Comparison with Ollama on same hardware (A100 80GB PCIe):
 #
@@ -482,6 +318,5 @@
 # Decode throughput      | ~40 tok/s            | 40-99 tok/s
 # Per-turn latency       | ~28s (32k ctx)       | ~10-15s
 # Context window         | 32k (was truncating) | 262k (full, no truncation)
-# Prefix cache (Claude)  | 0% always            | 0% always
-# Prefix cache (OpenCode)| 85-95% when warm     | expected similar or better
+# Prefix cache           | workload-dependent   | workload-dependent
 # VRAM usage             | 52-61 GiB            | 75 GiB (more KV cache)
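The client configuration in STEP 5 only shows VM1's gateway address. With the two-VM layout from the README, a second agent can target VM2 over the same tunnel; a sketch reusing the invocation above — the 192.168.3.3 address follows the `wireguard_server_ip` comments in hyperstack.rb, so treat it as an assumption if your subnet differs:

```bash
# One OpenAI-compatible client per VM over the shared wg1 tunnel.
OPENAI_BASE_URL=http://192.168.3.1:11434/v1 OPENAI_API_KEY=EMPTY pi   # VM1: Nemotron-3-Super
OPENAI_BASE_URL=http://192.168.3.3:11434/v1 OPENAI_API_KEY=EMPTY pi   # VM2: Qwen3-Coder-Next
```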
