| field | value |
|---|---|
| author | Paul Buetow &lt;paul@buetow.org&gt;, 2026-03-22 08:34:28 +0200 |
| committer | Paul Buetow &lt;paul@buetow.org&gt;, 2026-03-22 08:34:28 +0200 |
| commit | 0e5dbef6b36b6e72fb9739b8de88cfdf2dbdf1ae |
| tree | d8412139dd231d07dab1dcf9a6860e0af9b5ab2c /pi |
| parent | f5c2125d1c1cbf3adde917747aba61cbc3a0f228 |
Upgrade VM1 to H100x2 with 1M context for Nemotron-3-Super
Switch VM1 from n3-H100x1 to n3-H100x2 to run Nemotron-3-Super with a
1M-token context window via tensor parallelism. The dual-GPU setup
(160 GB total VRAM) provides enough KV cache headroom to override the
model's config.json limit of 262144 tokens.
Key changes:
- flavor_name: n3-H100x1 → n3-H100x2
- tensor_parallel_size: 1 → 2
- max_model_len: 131072 → 1048576 (with VLLM_ALLOW_LONG_MAX_MODEL_LEN=1)
- gpu_memory_utilization: 0.92 → 0.85 (headroom for Mamba cache + sampler warmup)
- Remove --enforce-eager: no longer needed with dual-GPU VRAM budget
- Disable prefix caching: on NemotronH it forces Mamba "all" cache mode
which pre-allocates states for all max_num_seqs and OOMs before the
sampler warmup pass; per-request allocation is cheaper at startup
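Taken together, the changes above shape the container invocation roughly as follows. This is a minimal Ruby sketch, not the actual hyperstack.rb code: `build_docker_cmd`, the preset hash shape, and the image name are hypothetical illustrations; only the flag values come from the list above.

```ruby
# Hypothetical sketch: assemble the docker command from a preset hash.
# Env vars must appear before the image name so Docker treats them as
# `docker run` options rather than arguments to the container entrypoint.
def build_docker_cmd(preset)
  cmd = %w[docker run --gpus all]
  preset[:extra_docker_env].each { |kv| cmd += ["-e", kv] }
  cmd << preset[:image]
  cmd += ["--tensor-parallel-size", preset[:tensor_parallel_size].to_s,
          "--max-model-len", preset[:max_model_len].to_s,
          "--gpu-memory-utilization", preset[:gpu_memory_utilization].to_s]
  # Conditional flag: omitted entirely for NemotronH (see rationale above).
  cmd << "--enable-prefix-caching" if preset[:enable_prefix_caching]
  cmd
end

preset = {
  image: "vllm/vllm-openai:latest",   # assumed image name
  extra_docker_env: ["VLLM_ALLOW_LONG_MAX_MODEL_LEN=1",
                     "PYTORCH_ALLOC_CONF=expandable_segments:True"],
  tensor_parallel_size: 2,
  max_model_len: 1_048_576,
  gpu_memory_utilization: 0.85,
  enable_prefix_caching: false,       # disabled for NemotronH
}
cmd = build_docker_cmd(preset)
```

Note the ordering constraint: both `-e` pairs land before the image name, and `--enforce-eager` no longer appears anywhere in the argument list.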
Add two new vllm config fields to hyperstack.rb:
- extra_docker_env: passes -e KEY=VALUE flags to Docker before the image
name (used for VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 and
PYTORCH_ALLOC_CONF=expandable_segments:True)
- enable_prefix_caching: makes --enable-prefix-caching conditional
(default true for backward compat; false for NemotronH)
Both fields are supported in [vllm] defaults and [vllm.presets.*]
overrides with the same fallback semantics as existing fields.
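The fallback semantics could be sketched like this (a hypothetical illustration, not the real hyperstack.rb implementation; `resolve` and the hash shapes are invented for clarity): a `[vllm.presets.*]` key wins when present, otherwise the `[vllm]` section value applies, otherwise a built-in default such as `true` for `enable_prefix_caching`.

```ruby
# Hypothetical sketch of the [vllm] / [vllm.presets.*] fallback lookup:
# preset override first, then the [vllm] section default, then a
# built-in default (true for enable_prefix_caching, for back compat).
def resolve(config, preset_name, key, builtin_default)
  preset = config.dig("vllm", "presets", preset_name) || {}
  return preset[key] if preset.key?(key)   # explicit preset override
  vllm = config["vllm"] || {}
  return vllm[key] if vllm.key?(key)       # [vllm] section default
  builtin_default                          # hard-coded fallback
end

config = {
  "vllm" => {
    "extra_docker_env" => [],
    "presets" => {
      "nemotron" => { "enable_prefix_caching" => false },
      "default"  => {},
    },
  },
}

resolve(config, "nemotron", "enable_prefix_caching", true)  # => false
resolve(config, "default",  "enable_prefix_caching", true)  # => true
```

Using `key?` rather than a truthiness check matters here: an explicit `false` in a preset must override a `true` default, which a `||`-chain would silently swallow.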
Update pi/agent/models.json: Nemotron vm1 entry renamed to
"Nemotron 3 Super 120B 1M [vm1]" with contextWindow 1048576.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Diffstat (limited to 'pi')
| -rw-r--r-- | pi/agent/models.json | 4 |
1 file changed, 2 insertions(+), 2 deletions(-)
```diff
diff --git a/pi/agent/models.json b/pi/agent/models.json
index 39aa450..76e37ab 100644
--- a/pi/agent/models.json
+++ b/pi/agent/models.json
@@ -108,11 +108,11 @@
       "models": [
         {
           "id": "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit",
-          "name": "Nemotron 3 Super 120B [vm1]",
+          "name": "Nemotron 3 Super 120B 1M [vm1]",
           "reasoning": false,
           "input": ["text"],
           "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
-          "contextWindow": 262144,
+          "contextWindow": 1048576,
           "maxTokens": 8192
         },
         {
```
