Diffstat (limited to 'docs/operations')

 docs/operations/cleanup.md          | 48
 docs/operations/kubernetes.md       | 51
 docs/operations/macos-setup.md      | 91
 docs/operations/setup-clickhouse.md | 43
 docs/operations/setup-prometheus.md | 82
 docs/operations/troubleshooting.md  | 43

6 files changed, 358 insertions, 0 deletions
diff --git a/docs/operations/cleanup.md b/docs/operations/cleanup.md
new file mode 100644
index 0000000..7835b21
--- /dev/null
+++ b/docs/operations/cleanup.md
@@ -0,0 +1,48 @@

# Cleanup

## Benchmark data in Prometheus

To remove benchmark metrics from Prometheus, use the provided script:

```bash
# Port-forward to Prometheus if needed
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &

./scripts/cleanup-benchmark-data.sh
```

The script deletes all `epimetheus_benchmark_*` series via the Admin API and runs `clean_tombstones`.

**Manual deletion:**

```bash
# Delete a specific metric
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=epimetheus_benchmark_cpu_usage'

# Clean tombstones
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
```

The Admin API must be enabled on Prometheus (see [Setup: Prometheus](setup-prometheus.md)).

## Other cleanup

**Stop port-forwards:**

```bash
pkill -f "port-forward.*9091"
pkill -f "port-forward.*9090"
pkill -f "port-forward.*3000"
```

**Remove test metrics from Pushgateway:**

```bash
curl -X DELETE http://localhost:9091/metrics/job/example_metrics_pusher
```

**Uninstall Pushgateway (Helm):**

```bash
helm uninstall pushgateway -n monitoring
```

diff --git a/docs/operations/kubernetes.md b/docs/operations/kubernetes.md
new file mode 100644
index 0000000..20b8b07
--- /dev/null
+++ b/docs/operations/kubernetes.md
@@ -0,0 +1,51 @@

# Kubernetes

Common tasks when running Epimetheus against Prometheus, Pushgateway, and Grafana in Kubernetes.
## Port-forwards

To run Epimetheus on your laptop against cluster services:

```bash
# Pushgateway (realtime mode)
kubectl port-forward -n monitoring svc/pushgateway 9091:9091 &

# Prometheus (historic/watch, queries)
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &

# Grafana (dashboards)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
```

Then use `http://localhost:9091`, `http://localhost:9090`, and `http://localhost:3000` in Epimetheus flags and in the browser. Adjust service names and namespaces to match your cluster (e.g. `prometheus-kube-prometheus-prometheus` for kube-prometheus-stack).

## Deploying Pushgateway

Example using the official Helm chart:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install pushgateway prometheus-community/prometheus-pushgateway -n monitoring --create-namespace
```

Alternatively use your own chart (e.g. from the [conf repository](https://codeberg.org/snonux/conf) at `f3s/pushgateway/helm-chart`).

## Deploying the Epimetheus Grafana dashboard

**ConfigMap (recommended):** If you have a manifest that creates a ConfigMap with the dashboard JSON and the Grafana label for auto-discovery:

```bash
kubectl apply -f ../prometheus/epimetheus-dashboard.yaml
```

**Script:** From the repo, with Grafana reachable (e.g. after port-forward):

```bash
./scripts/deploy-dashboard.sh
# Or with credentials:
GRAFANA_URL="http://localhost:3000" GRAFANA_USER="admin" GRAFANA_PASSWORD="yourpassword" ./scripts/deploy-dashboard.sh
```

## Namespace and service names

Replace `monitoring` and the Prometheus/Pushgateway/Grafana service names with whatever your Helm release or manifests use. Epimetheus only needs the URLs; it does not need to run inside the cluster.
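Once the forwards are running, a quick way to confirm all three endpoints answer locally. This is a sketch: the `probe` helper is a hypothetical name, while the health paths (`/-/ready` for Pushgateway, `/-/healthy` for Prometheus, `/api/health` for Grafana) are the standard ones:

```shell
# Probe each locally forwarded endpoint; prints "up" or "down" per URL.
probe() {
  if curl -fsS --max-time 3 "$1" >/dev/null 2>&1; then
    echo "up   $1"
  else
    echo "down $1"
  fi
}

probe http://localhost:9091/-/ready     # Pushgateway
probe http://localhost:9090/-/healthy   # Prometheus
probe http://localhost:3000/api/health  # Grafana
```

Any `down` line usually means the corresponding port-forward died and needs restarting.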
diff --git a/docs/operations/macos-setup.md b/docs/operations/macos-setup.md
new file mode 100644
index 0000000..8ed47c9
--- /dev/null
+++ b/docs/operations/macos-setup.md
@@ -0,0 +1,91 @@

# macOS Setup

## Basic installation

```bash
brew install prometheus
brew install grafana
go install github.com/prometheus/pushgateway@latest
brew services start grafana
brew services start prometheus
~/go/bin/pushgateway &
```

Log in to Grafana at http://localhost:3000 (default admin:admin; you will be prompted to change the password). Add http://localhost:9090 as a Prometheus data source.

## Enable Remote Write receiver (required for watch/historic/backfill/auto)

Watch mode, historic mode, backfill mode, and auto mode with old data require the Prometheus Remote Write receiver.

### Option 1: Permanent configuration

Edit the Prometheus arguments file (Homebrew example):

```bash
nano /opt/homebrew/etc/prometheus.args
```

Add at the end:

```
--web.enable-remote-write-receiver
--web.enable-admin-api
```

Example full file:

```
--config.file /opt/homebrew/etc/prometheus.yml
--web.listen-address=127.0.0.1:9090
--storage.tsdb.path /opt/homebrew/var/prometheus
--web.enable-remote-write-receiver
--web.enable-admin-api
```

Restart Prometheus:

```bash
brew services restart prometheus
```

Verify:

```bash
curl http://localhost:9090/-/healthy
curl -X POST http://localhost:9090/api/v1/write  # expect 400, not 404
```

### Option 2: Temporary (testing only)

```bash
brew services stop prometheus
prometheus --web.enable-remote-write-receiver
```

Keep that terminal open; use another for Epimetheus. This stops when you close the terminal.
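The 400-vs-404 check above can be made scriptable. A sketch (the `write_receiver_status` helper is a hypothetical name; the interpretation follows the verify step: 400 means the endpoint exists and rejected the empty body, 404 means the receiver flag is missing):

```shell
# Classify the HTTP status of an empty POST to /api/v1/write.
write_receiver_status() {
  case "$1" in
    400) echo "enabled (endpoint exists, empty body rejected)" ;;
    404) echo "NOT enabled (add --web.enable-remote-write-receiver)" ;;
    000) echo "unreachable (is Prometheus running?)" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# curl prints 000 via -w when the connection itself fails.
code=$(curl -s -o /dev/null -w '%{http_code}' -X POST http://localhost:9090/api/v1/write || true)
write_receiver_status "$code"
```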
## Clearing old metrics (optional)

If the Admin API is enabled:

```bash
# Delete metrics by name pattern
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~"blockstore_.*"}'
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
sleep 2
```

## Verify watch mode

```bash
cat > /tmp/test.csv << EOF
status,count,method
200,100,GET
404,50,POST
EOF

./epimetheus -mode=watch -file=/tmp/test.csv -metric-name=test \
  -prometheus=http://localhost:9090/api/v1/write
```

You should see a success message. In Prometheus (http://localhost:9090), query `{__name__=~"test_.*"}`.

diff --git a/docs/operations/setup-clickhouse.md b/docs/operations/setup-clickhouse.md
new file mode 100644
index 0000000..acc8247
--- /dev/null
+++ b/docs/operations/setup-clickhouse.md
@@ -0,0 +1,43 @@

# Setup: ClickHouse

ClickHouse is only used in **watch mode**. Epimetheus creates the metrics table automatically if it does not exist.

## Running ClickHouse

- **Linux (systemd):** `sudo systemctl start clickhouse-server`
- **Docker:** Use the official [ClickHouse image](https://hub.docker.com/r/clickhouse/clickhouse-server) and expose the HTTP interface (default port 8123).
- **Kubernetes:** Deploy ClickHouse and expose a Service; use the HTTP URL (e.g. `http://clickhouse.monitoring.svc.cluster.local:8123`) as `-clickhouse`.

Default HTTP port is **8123**. Epimetheus uses the HTTP interface, not the native protocol.

## Table Creation

You do not need to create the table manually. On first ingest, Epimetheus runs:

```sql
CREATE TABLE IF NOT EXISTS epimetheus_metrics (
    metric String,
    labels Map(String, String),
    value Float64,
    timestamp DateTime64(3)
) ENGINE = MergeTree()
ORDER BY (metric, timestamp)
```

To use a different table name, set `-clickhouse-table`.
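For reference, a hand-written row matching that schema (the metric name and label values here are hypothetical), which you could send through the HTTP interface to sanity-check the table:

```sql
INSERT INTO epimetheus_metrics (metric, labels, value, timestamp)
VALUES ('test_cpu_usage', map('host', 'node1', 'mode', 'user'), 42.5, now64(3));
```

`map(...)` builds the `Map(String, String)` labels column and `now64(3)` matches the millisecond-precision `DateTime64(3)` timestamp.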
## Verification

After running watch mode with `-clickhouse` set, verify ingestion:

```bash
./scripts/verify-clickhouse.sh
```

With custom URL or table:

```bash
./scripts/verify-clickhouse.sh http://localhost:8123 epimetheus_metrics
```

The script checks connectivity (`/ping`), row count, distinct metrics, sample rows, and rows per metric. If the table is empty or missing, it prints a reminder command to run Epimetheus in watch mode with `-clickhouse`. See [ClickHouse backend](../backends/clickhouse.md) for usage.

diff --git a/docs/operations/setup-prometheus.md b/docs/operations/setup-prometheus.md
new file mode 100644
index 0000000..294ce20
--- /dev/null
+++ b/docs/operations/setup-prometheus.md
@@ -0,0 +1,82 @@

# Setup: Prometheus

To use historic mode, backfill mode, auto mode with old data, or watch mode with `-prometheus`, you must enable the Prometheus Remote Write receiver. Without it, Epimetheus can only push realtime data via Pushgateway.

## 1. Enable Remote Write Receiver and Admin API

Example configuration (Prometheus 3.x style). Adjust paths and stack to match your environment (e.g. [conf repository](https://codeberg.org/snonux/conf) at `f3s/prometheus/persistence-values.yaml`):

```yaml
prometheus:
  prometheusSpec:
    additionalArgs:
      - name: web.enable-remote-write-receiver
        value: ""
      - name: web.enable-admin-api
        value: ""

    enableFeatures:
      - exemplar-storage
      - otlp-write-receiver

    tsdb:
      outOfOrderTimeWindow: 744h  # 31 days for backfilling
```

This provides:

- **Remote Write API** at `/api/v1/write` for ingesting metrics with custom timestamps.
- **Admin API** at `/api/v1/admin/tsdb/*` for deleting series and cleaning tombstones.
- **Out-of-order ingestion** so older points can be written for existing series (within the time window).

After changing config, upgrade Prometheus (e.g. `helm upgrade` or your usual apply).
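The `744h` window above is simply 31 days expressed in hours. To derive a value for a different backfill horizon:

```shell
# outOfOrderTimeWindow takes a duration; compute one from a day count.
days=31
echo "${days} days -> $((days * 24))h"   # 31 days -> 744h
```

Pick a window that comfortably covers the oldest data you intend to backfill, but no larger than necessary (see the trade-off note in the Retention section).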
### Verify

```bash
# Remote Write receiver
kubectl get pod -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 \
  -o jsonpath='{.spec.containers[0].args}' | grep -o "web.enable-remote-write-receiver"

# Out-of-order window
kubectl get prometheus -n monitoring prometheus-kube-prometheus-prometheus \
  -o jsonpath='{.spec.tsdb.outOfOrderTimeWindow}'

# Admin API
kubectl get pod -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 \
  -o jsonpath='{.spec.containers[0].args}' | grep -o "web.enable-admin-api"
```

**Note:** In Prometheus 3.x use `additionalArgs` for `web.enable-remote-write-receiver`; the older `enableFeatures: [remote-write-receiver]` is deprecated.

## 2. Scrape Config for Pushgateway

For realtime mode, Prometheus must scrape Pushgateway. Example:

```yaml
# additional-scrape-configs.yaml
- job_name: 'pushgateway'
  honor_labels: true
  static_configs:
    - targets:
        - 'pushgateway.monitoring.svc.cluster.local:9091'
```

Apply as a Secret (example):

```bash
kubectl create secret generic additional-scrape-configs \
  --from-file=additional-scrape-configs.yaml \
  --dry-run=client -o yaml -n monitoring | kubectl apply -f -
```

## 3. Retention

Check retention so you know how far back Epimetheus can write:

```bash
kubectl get prometheus -n monitoring prometheus-kube-prometheus-prometheus \
  -o jsonpath='{.spec.retention}'
```

For very old data, increase retention or use a dedicated dev/test Prometheus. Enabling out-of-order ingestion and a large `outOfOrderTimeWindow` has memory and I/O trade-offs; see [Prometheus backend](../backends/prometheus.md) and keep production config conservative.
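As a rule of thumb, the oldest point worth backfilling is roughly now minus the retention period. A sketch of the arithmetic, assuming retention is expressed in days:

```shell
# Oldest epoch second still inside a day-based retention window.
retention_days=15
now=$(date +%s)
cutoff=$(( now - retention_days * 86400 ))
echo "samples older than epoch ${cutoff} will fall outside retention"
```

Samples written below that cutoff are accepted (if the out-of-order window allows) but are eligible for deletion at the next retention cycle.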
diff --git a/docs/operations/troubleshooting.md b/docs/operations/troubleshooting.md
new file mode 100644
index 0000000..9446508
--- /dev/null
+++ b/docs/operations/troubleshooting.md
@@ -0,0 +1,43 @@

# Troubleshooting

## Binary can't connect to Pushgateway

- Confirm a port-forward or route to Pushgateway is running, e.g. `ps aux | grep "port-forward.*9091"`.
- Restart the port-forward: `kubectl port-forward -n monitoring svc/pushgateway 9091:9091`.
- Ensure `-pushgateway` points at the forwarded URL (e.g. `http://localhost:9091`).

## Metrics not appearing in Prometheus

- **Pushgateway:** `curl http://localhost:9091/metrics | grep "prometheus_pusher_test"` (or your job/metric name). If empty, Epimetheus may not be pushing or the job name may differ.
- **Scrape:** In the Prometheus UI (e.g. http://localhost:9090/targets), check that the Pushgateway job exists and is up.
- **Logs:** `kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus` (or your Prometheus pod) for scrape/remote-write errors.

## "Remote write receiver not enabled" error

Prometheus must be started with the Remote Write receiver enabled. Verify:

```bash
kubectl logs -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 | grep "remote-write-receiver"
```

The feature should appear in the enabled-features list. If not, add `web.enable-remote-write-receiver` (see [Setup: Prometheus](setup-prometheus.md)) and restart Prometheus.

## "Out of order sample" error

You are writing a sample older than existing data for the same series.

- Use different labels for historic data (e.g. `job="historic_data"`), or
- Enable out-of-order ingestion on Prometheus and set `tsdb.outOfOrderTimeWindow` (see [Setup: Prometheus](setup-prometheus.md)), or
- Run backfills from oldest to newest.

## Dashboard not appearing in Grafana

- Check the dashboard ConfigMap exists: `kubectl get configmap -n monitoring | grep epimetheus`.
- Ensure the ConfigMap has the label Grafana uses for dashboard discovery (e.g. `grafana_dashboard: "1"`): `kubectl get configmap epimetheus-dashboard -n monitoring -o yaml | grep "grafana_dashboard"`.
- Restart Grafana to reload dashboards: `kubectl rollout restart deployment/prometheus-grafana -n monitoring` (adjust deployment name to your setup).

## ClickHouse connection failed

- Ensure ClickHouse is listening on HTTP (default port 8123): `curl -sS http://localhost:8123/ping`.
- If using Kubernetes, check the Service and port-forwards. Use the same URL as `-clickhouse`.
- See [Setup: ClickHouse](setup-clickhouse.md) and [ClickHouse backend](../backends/clickhouse.md).
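The ping check can be wrapped into a small helper. A sketch (the `ch_ok` name and `CLICKHOUSE_URL` variable are hypothetical; `/ping` answering `Ok.` is ClickHouse's standard HTTP liveness response):

```shell
# Succeeds only when the ClickHouse HTTP interface answers /ping with "Ok.".
ch_ok() {
  [ "$(curl -fsS --max-time 3 "$1/ping" 2>/dev/null)" = "Ok." ]
}

if ch_ok "${CLICKHOUSE_URL:-http://localhost:8123}"; then
  echo "ClickHouse reachable"
else
  echo "ClickHouse unreachable; check the service or port-forward"
fi
```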