 gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.gmi     | 1086
 gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.gmi.tpl | 1050
 gemfeed/atom.xml                                              | 1435
 3 files changed, 3418 insertions, 153 deletions
diff --git a/gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.gmi b/gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.gmi
index 44e6c79a..221f80cd 100644
--- a/gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.gmi
+++ b/gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.gmi
@@ -1,6 +1,6 @@
# f3s: Kubernetes with FreeBSD - Part 8: Observability
-> Published at 2025-12-06T23:58:24+02:00
+> Published at 2025-12-06T23:58:24+02:00, last updated Mon 09 Mar 09:33:08 EET 2026
This is the 8th blog post about the f3s series for my self-hosting demands in a home lab. f3s? The "f" stands for FreeBSD, and the "3s" stands for k3s, the Kubernetes distribution I use on FreeBSD-based physical machines.
@@ -45,6 +45,42 @@ This is the 8th blog post about the f3s series for my self-hosting demands in a
* ⇢ ⇢ ⇢ Installing Node Exporter on OpenBSD
* ⇢ ⇢ ⇢ Adding OpenBSD hosts to Prometheus
* ⇢ ⇢ ⇢ OpenBSD memory metrics compatibility
+* ⇢ ⇢ Enabling etcd metrics in k3s
+* ⇢ ⇢ ⇢ Configuring Prometheus to scrape etcd
+* ⇢ ⇢ ⇢ Verifying etcd metrics
+* ⇢ ⇢ ⇢ Complete persistence-values.yaml
+* ⇢ ⇢ ZFS Monitoring for FreeBSD Servers
+* ⇢ ⇢ ⇢ Node Exporter ZFS Collector
+* ⇢ ⇢ ⇢ Verifying ZFS Metrics
+* ⇢ ⇢ ⇢ ZFS Recording Rules
+* ⇢ ⇢ ⇢ Grafana Dashboards
+* ⇢ ⇢ ⇢ Deployment
+* ⇢ ⇢ ⇢ Verifying ZFS Metrics in Prometheus
+* ⇢ ⇢ ⇢ Key Metrics to Monitor
+* ⇢ ⇢ ⇢ ZFS Pool and Dataset Metrics via Textfile Collector
+* ⇢ ⇢ Distributed Tracing with Grafana Tempo
+* ⇢ ⇢ ⇢ Why Distributed Tracing?
+* ⇢ ⇢ ⇢ Deploying Grafana Tempo
+* ⇢ ⇢ ⇢# Configuration Strategy
+* ⇢ ⇢ ⇢# Tempo Deployment Files
+* ⇢ ⇢ ⇢# Installation
+* ⇢ ⇢ ⇢ Configuring Grafana Alloy for Trace Collection
+* ⇢ ⇢ ⇢# OTLP Receiver Configuration
+* ⇢ ⇢ ⇢# Upgrade Alloy
+* ⇢ ⇢ ⇢ Demo Tracing Application
+* ⇢ ⇢ ⇢# Application Architecture
+* ⇢ ⇢ ⇢ Visualizing Traces in Grafana
+* ⇢ ⇢ ⇢# Accessing Traces
+* ⇢ ⇢ ⇢# Service Graph Visualization
+* ⇢ ⇢ ⇢ Correlation Between Observability Signals
+* ⇢ ⇢ ⇢# Traces-to-Logs
+* ⇢ ⇢ ⇢# Traces-to-Metrics
+* ⇢ ⇢ ⇢# Logs-to-Traces
+* ⇢ ⇢ ⇢ Generating Traces for Testing
+* ⇢ ⇢ ⇢ Verifying the Complete Pipeline
+* ⇢ ⇢ ⇢ Practical Example: Viewing a Distributed Trace
+* ⇢ ⇢ ⇢ Storage and Retention
+* ⇢ ⇢ ⇢ Configuration Files
* ⇢ ⇢ Summary
## Introduction
@@ -550,7 +586,7 @@ This file is saved as `freebsd-recording-rules.yaml` and applied as part of the
Unlike memory metrics, disk I/O metrics (`node_disk_read_bytes_total`, `node_disk_written_bytes_total`, etc.) are not available on FreeBSD. The Linux diskstats collector that provides these metrics doesn't have a FreeBSD equivalent in the node_exporter.
-The disk I/O panels in the Node Exporter dashboards will show "No data" for FreeBSD hosts. FreeBSD does expose ZFS-specific metrics (`node_zfs_arcstats_*`) for ARC cache performance, and per-dataset I/O stats are available via `sysctl kstat.zfs`, but mapping these to the Linux-style metrics the dashboards expect is non-trivial. Creating custom ZFS-specific dashboards is left as an exercise for another day.
+The disk I/O panels in the Node Exporter dashboards will show "No data" for FreeBSD hosts. FreeBSD does expose ZFS-specific metrics (`node_zfs_arcstats_*`) for ARC cache performance, and per-dataset I/O stats are available via `sysctl kstat.zfs`, but mapping these to the Linux-style metrics the dashboards expect is non-trivial. Custom ZFS-specific dashboards are covered later in this post.
## Monitoring external OpenBSD hosts
@@ -662,17 +698,1057 @@ This file is saved as `openbsd-recording-rules.yaml` and applied alongside the F
After running `just upgrade`, the OpenBSD hosts appear in Prometheus targets and the Node Exporter dashboards.
+> Updated Mon 09 Mar: Added section about enabling etcd metrics
+
+## Enabling etcd metrics in k3s
+
+The etcd dashboard in Grafana initially showed no data because k3s uses an embedded etcd that doesn't expose metrics by default.
+
+On each control-plane node (r0, r1, r2), create /etc/rancher/k3s/config.yaml:
+
+```
+etcd-expose-metrics: true
+```
+
+Then restart k3s on each node:
+
+```
+systemctl restart k3s
+```
+
+After restarting, etcd metrics are available on port 2381:
+
+```
+curl http://127.0.0.1:2381/metrics | grep etcd
+```
+
+### Configuring Prometheus to scrape etcd
+
+In persistence-values.yaml, enable kubeEtcd with the node IP addresses:
+
+```
+kubeEtcd:
+ enabled: true
+ endpoints:
+ - 192.168.1.120
+ - 192.168.1.121
+ - 192.168.1.122
+ service:
+ enabled: true
+ port: 2381
+ targetPort: 2381
+```
+
+Apply the changes:
+
+```
+just upgrade
+```
+
+### Verifying etcd metrics
+
+After the changes, all etcd targets are being scraped:
+
+```
+kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 \
+ -c prometheus -- wget -qO- 'http://localhost:9090/api/v1/query?query=etcd_server_has_leader' | \
+ jq -r '.data.result[] | "\(.metric.instance): \(.value[1])"'
+```
+
+Output:
+
+```
+192.168.1.120:2381: 1
+192.168.1.121:2381: 1
+192.168.1.122:2381: 1
+```
+
+The etcd dashboard in Grafana now displays metrics including Raft proposals, leader elections, and peer round trip times.
+
+=> ./f3s-kubernetes-with-freebsd-part-8/grafana-etcd-dashboard.png Grafana etcd dashboard showing cluster health, RPC rate, disk sync duration, and peer round trip times
+
+### Complete persistence-values.yaml
+
+The complete updated persistence-values.yaml:
+
+```
+kubeEtcd:
+ enabled: true
+ endpoints:
+ - 192.168.1.120
+ - 192.168.1.121
+ - 192.168.1.122
+ service:
+ enabled: true
+ port: 2381
+ targetPort: 2381
+
+prometheus:
+ prometheusSpec:
+ additionalScrapeConfigsSecret:
+ enabled: true
+ name: additional-scrape-configs
+ key: additional-scrape-configs.yaml
+ storageSpec:
+ volumeClaimTemplate:
+ spec:
+ storageClassName: ""
+ accessModes: ["ReadWriteOnce"]
+ resources:
+ requests:
+ storage: 10Gi
+ selector:
+ matchLabels:
+ type: local
+ app: prometheus
+
+grafana:
+ persistence:
+ enabled: true
+ type: pvc
+ existingClaim: "grafana-data-pvc"
+
+ initChownData:
+ enabled: false
+
+ podSecurityContext:
+ fsGroup: 911
+ runAsUser: 911
+ runAsGroup: 911
+```
+
+> Updated Mon 09 Mar: Added section about ZFS monitoring for FreeBSD servers
+
+## ZFS Monitoring for FreeBSD Servers
+
+The FreeBSD servers (f0, f1, f2) that provide NFS storage to the k3s cluster have ZFS filesystems. Monitoring ZFS is crucial for understanding storage performance and cache efficiency.
+
+### Node Exporter ZFS Collector
+
+The node_exporter running on each FreeBSD server (v1.9.1) includes a built-in ZFS collector that reads ZFS statistics from sysctls and exposes them as metrics. The ZFS collector is enabled by default and provides:
+
+* ARC (Adaptive Replacement Cache) statistics
+* Cache hit/miss rates
+* Memory usage and allocation
+* MRU/MFU cache breakdown
+* Data vs metadata distribution
+
+### Verifying ZFS Metrics
+
+On any FreeBSD server, check that ZFS metrics are being exposed:
+
+```
+paul@f0:~ % curl -s http://localhost:9100/metrics | grep node_zfs_arcstats | wc -l
+ 69
+```
+
+The metrics are automatically scraped by Prometheus through the existing static configuration in additional-scrape-configs.yaml, which targets all FreeBSD servers on port 9100 with the `os: freebsd` label.
+
+### ZFS Recording Rules
+
+Created recording rules for easier dashboard consumption in zfs-recording-rules.yaml:
+
+```
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+ name: freebsd-zfs-rules
+ namespace: monitoring
+ labels:
+ release: prometheus
+spec:
+ groups:
+ - name: freebsd-zfs-arc
+ interval: 30s
+ rules:
+ - record: node_zfs_arc_hit_rate_percent
+ expr: |
+ 100 * (
+ rate(node_zfs_arcstats_hits_total{os="freebsd"}[5m]) /
+ (rate(node_zfs_arcstats_hits_total{os="freebsd"}[5m]) +
+ rate(node_zfs_arcstats_misses_total{os="freebsd"}[5m]))
+ )
+ labels:
+ os: freebsd
+ - record: node_zfs_arc_memory_usage_percent
+ expr: |
+ 100 * (
+ node_zfs_arcstats_size_bytes{os="freebsd"} /
+ node_zfs_arcstats_c_max_bytes{os="freebsd"}
+ )
+ labels:
+ os: freebsd
+ # Additional rules for metadata %, target %, MRU/MFU %, etc.
+```
+
+These recording rules calculate:
+
+* ARC hit rate percentage
+* ARC memory usage percentage (current vs maximum)
+* ARC target percentage (target vs maximum)
+* Metadata vs data percentages
+* MRU vs MFU cache percentages
+* Demand data and metadata hit rates
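The hit-rate rule is just a ratio of rates. The same arithmetic in plain Python, for intuition (illustrative only, not part of the deployed stack):

```python
def arc_hit_rate_percent(hits_per_sec: float, misses_per_sec: float) -> float:
    """Mirror of the PromQL expression:
    100 * rate(hits) / (rate(hits) + rate(misses))."""
    total = hits_per_sec + misses_per_sec
    if total == 0:
        return 0.0  # PromQL would return no sample for 0/0; we pick 0 here
    return 100.0 * hits_per_sec / total

print(arc_hit_rate_percent(950.0, 50.0))  # → 95.0
```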
+
+### Grafana Dashboards
+
+Created two comprehensive ZFS monitoring dashboards (zfs-dashboards.yaml):
+
+**Dashboard 1: FreeBSD ZFS (per-host detailed view)**
+
+Includes variables to select:
+
+* FreeBSD server (f0, f1, or f2)
+* ZFS pool (zdata, zroot, or all)
+
+Pool Overview Row:
+
+* Pool Capacity gauge (with thresholds: green <70%, yellow 70-85%, red >85%)
+* Pool Health status (ONLINE/DEGRADED/FAULTED with color coding)
+* Total Pool Size stat
+* Free Space stat
+* Pool Space Usage Over Time (stacked: used + free)
+* Pool Capacity Trend time series
+
+Dataset Statistics Row:
+
+* Table showing all datasets with columns: Pool, Dataset, Used, Available, Referenced
+* Automatically filters by selected pool
+
+ARC Cache Statistics Row:
+
+* ARC Hit Rate gauge (red <70%, yellow 70-90%, green >=90%)
+* ARC Size time series (current, target, max)
+* ARC Memory Usage percentage gauge
+* ARC Hits vs Misses rate
+* ARC Data vs Metadata stacked time series
+
+**Dashboard 2: FreeBSD ZFS Summary (cluster-wide overview)**
+
+Cluster-Wide Pool Statistics Row:
+
+* Total Storage Capacity across all servers
+* Total Used space
+* Total Free space
+* Average Pool Capacity gauge
+* Pool Health Status (worst case across cluster)
+* Total Pool Space Usage Over Time
+* Per-Pool Capacity time series (all pools on all hosts)
+
+Per-Host Pool Breakdown Row:
+
+* Bar gauge showing capacity by host and pool
+* Table with all pools: Host, Pool, Size, Used, Free, Capacity %, Health
+
+Cluster-Wide ARC Statistics Row:
+
+* Average ARC Hit Rate gauge across all hosts
+* ARC Hit Rate by Host time series
+* Total ARC Size Across Cluster
+* Total ARC Hits vs Misses (cluster-wide sum)
+* ARC Size by Host
+
+Dashboard Visualization:
+
+=> ./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-dashboard.png ZFS monitoring dashboard in Grafana showing pool capacity, health, and I/O throughput
+=> ./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-arc-stats.png ZFS ARC cache statistics showing hit rate, memory usage, and size trends
+=> ./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-datasets.png ZFS datasets table and ARC data vs metadata breakdown
+
+### Deployment
+
+Applied the resources to the cluster:
+
+```
+cd /home/paul/git/conf/f3s/prometheus
+kubectl apply -f zfs-recording-rules.yaml
+kubectl apply -f zfs-dashboards.yaml
+```
+
+Updated Justfile to include ZFS recording rules in install and upgrade targets:
+
+```
+install:
+ kubectl apply -f persistent-volumes.yaml
+ kubectl create secret generic additional-scrape-configs --from-file=additional-scrape-configs.yaml -n monitoring --dry-run=client -o yaml | kubectl apply -f -
+ helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring -f persistence-values.yaml
+ kubectl apply -f freebsd-recording-rules.yaml
+ kubectl apply -f openbsd-recording-rules.yaml
+ kubectl apply -f zfs-recording-rules.yaml
+ just -f grafana-ingress/Justfile install
+```
+
+### Verifying ZFS Metrics in Prometheus
+
+Check that ZFS metrics are being collected:
+
+```
+kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- \
+ wget -qO- 'http://localhost:9090/api/v1/query?query=node_zfs_arcstats_size_bytes'
+```
+
+Check recording rules are calculating correctly:
+
+```
+kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- \
+ wget -qO- 'http://localhost:9090/api/v1/query?query=node_zfs_arc_memory_usage_percent'
+```
+
+Example output shows memory usage percentage for each FreeBSD server:
+
+```
+"result":[
+ {"metric":{"instance":"192.168.2.130:9100","os":"freebsd"},"value":[...,"37.58"]},
+ {"metric":{"instance":"192.168.2.131:9100","os":"freebsd"},"value":[...,"12.85"]},
+ {"metric":{"instance":"192.168.2.132:9100","os":"freebsd"},"value":[...,"13.44"]}
+]
+```
+
+### Key Metrics to Monitor
+
+* ARC Hit Rate: Should typically be above 90% for optimal performance. Lower hit rates indicate the ARC cache is too small or workload has poor locality.
+* ARC Memory Usage: Shows how much of the maximum ARC size is being used. If consistently at or near maximum, the ARC is effectively utilizing available memory.
+* Data vs Metadata: Typically data should dominate, but workloads with many small files will show higher metadata percentages.
+* MRU vs MFU: Most Recently Used vs Most Frequently Used cache. The ratio depends on workload characteristics.
+* Pool Capacity: Monitor pool usage to ensure adequate free space. ZFS performance degrades when pools exceed 80% capacity.
+* Pool Health: Should always show ONLINE (green). DEGRADED (yellow) indicates a disk issue requiring attention. FAULTED (red) requires immediate action.
+* Dataset Usage: Track which datasets are consuming the most space to identify growth trends and plan capacity.
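The thresholds above can also drive alerts via another PrometheusRule, reusing the ARC recording rule from this post and the zfs_pool_* metrics from the textfile collector described in the next subsection. A sketch (the alert names, durations, and severities here are illustrative, not part of the deployed setup):

```
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: freebsd-zfs-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: freebsd-zfs-alerts
      rules:
        - alert: ZfsPoolCapacityHigh
          expr: zfs_pool_capacity_percent > 80
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "ZFS pool {{ $labels.pool }} on {{ $labels.instance }} above 80% capacity"
        - alert: ZfsPoolNotOnline
          expr: zfs_pool_health > 0
          for: 5m
          labels:
            severity: critical
        - alert: ZfsArcHitRateLow
          expr: node_zfs_arc_hit_rate_percent{os="freebsd"} < 70
          for: 1h
          labels:
            severity: warning
```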
+
+### ZFS Pool and Dataset Metrics via Textfile Collector
+
+To complement the ARC statistics from node_exporter's built-in ZFS collector, I added pool capacity and dataset metrics using the textfile collector feature.
+
+Created a script at `/usr/local/bin/zfs_pool_metrics.sh` on each FreeBSD server:
+
+```
+#!/bin/sh
+# ZFS Pool and Dataset Metrics Collector for Prometheus
+
+OUTPUT_FILE="/var/tmp/node_exporter/zfs_pools.prom.$$"
+FINAL_FILE="/var/tmp/node_exporter/zfs_pools.prom"
+
+mkdir -p /var/tmp/node_exporter
+
+{
+ # Pool metrics
+ echo "# HELP zfs_pool_size_bytes Total size of ZFS pool"
+ echo "# TYPE zfs_pool_size_bytes gauge"
+ echo "# HELP zfs_pool_allocated_bytes Allocated space in ZFS pool"
+ echo "# TYPE zfs_pool_allocated_bytes gauge"
+ echo "# HELP zfs_pool_free_bytes Free space in ZFS pool"
+ echo "# TYPE zfs_pool_free_bytes gauge"
+ echo "# HELP zfs_pool_capacity_percent Capacity percentage"
+ echo "# TYPE zfs_pool_capacity_percent gauge"
+ echo "# HELP zfs_pool_health Pool health (0=ONLINE, 1=DEGRADED, 2=FAULTED, 6=other)"
+ echo "# TYPE zfs_pool_health gauge"
+
+ # NB: $'\t' is a bashism; use printf to get a literal tab under /bin/sh
+ zpool list -Hp -o name,size,allocated,free,capacity,health | \
+ while IFS="$(printf '\t')" read -r name size alloc free cap health; do
+ case "$health" in
+ ONLINE) health_val=0 ;;
+ DEGRADED) health_val=1 ;;
+ FAULTED) health_val=2 ;;
+ *) health_val=6 ;;
+ esac
+ cap_num=$(echo "$cap" | sed 's/%//')
+
+ echo "zfs_pool_size_bytes{pool=\"$name\"} $size"
+ echo "zfs_pool_allocated_bytes{pool=\"$name\"} $alloc"
+ echo "zfs_pool_free_bytes{pool=\"$name\"} $free"
+ echo "zfs_pool_capacity_percent{pool=\"$name\"} $cap_num"
+ echo "zfs_pool_health{pool=\"$name\"} $health_val"
+ done
+
+ # Dataset metrics
+ echo "# HELP zfs_dataset_used_bytes Used space in dataset"
+ echo "# TYPE zfs_dataset_used_bytes gauge"
+ echo "# HELP zfs_dataset_available_bytes Available space"
+ echo "# TYPE zfs_dataset_available_bytes gauge"
+ echo "# HELP zfs_dataset_referenced_bytes Referenced space"
+ echo "# TYPE zfs_dataset_referenced_bytes gauge"
+
+ zfs list -Hp -t filesystem -o name,used,available,referenced | \
+ while IFS="$(printf '\t')" read -r name used avail ref; do
+ pool=$(echo "$name" | cut -d/ -f1)
+ echo "zfs_dataset_used_bytes{pool=\"$pool\",dataset=\"$name\"} $used"
+ echo "zfs_dataset_available_bytes{pool=\"$pool\",dataset=\"$name\"} $avail"
+ echo "zfs_dataset_referenced_bytes{pool=\"$pool\",dataset=\"$name\"} $ref"
+ done
+} > "$OUTPUT_FILE"
+
+mv "$OUTPUT_FILE" "$FINAL_FILE"
+```
+
+Deployed to all FreeBSD servers:
+
+```
+for host in f0 f1 f2; do
+ scp /tmp/zfs_pool_metrics.sh paul@$host:/tmp/
+ ssh paul@$host 'doas mv /tmp/zfs_pool_metrics.sh /usr/local/bin/ && \
+ doas chmod +x /usr/local/bin/zfs_pool_metrics.sh'
+done
+```
+
+Set up cron jobs to run every minute:
+
+```
+for host in f0 f1 f2; do
+ ssh paul@$host 'echo "* * * * * /usr/local/bin/zfs_pool_metrics.sh >/dev/null 2>&1" | \
+ doas crontab -'
+done
+```
+
+The textfile collector (already configured with --collector.textfile.directory=/var/tmp/node_exporter) automatically picks up the metrics.
+
+Verify metrics are being exposed:
+
+```
+paul@f0:~ % curl -s http://localhost:9100/metrics | grep "^zfs_pool" | head -5
+zfs_pool_allocated_bytes{pool="zdata"} 6.47622733824e+11
+zfs_pool_allocated_bytes{pool="zroot"} 5.3338578944e+10
+zfs_pool_capacity_percent{pool="zdata"} 64
+zfs_pool_capacity_percent{pool="zroot"} 10
+zfs_pool_free_bytes{pool="zdata"} 3.48809678848e+11
+```
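For quick ad-hoc checks without going through Prometheus, these exposition-format lines are easy to parse. A minimal sketch (the metric and label names match the script above; the parsing logic itself is illustrative):

```python
import re

# One line of Prometheus text exposition format:
#   metric_name{pool="value",...} 123.45
LINE_RE = re.compile(r'^(\w+)\{pool="([^"]+)"[^}]*\}\s+(\S+)$')

def parse_pool_metrics(text: str) -> dict:
    """Return {(metric, pool): value} for all zfs_pool_* samples."""
    out = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m and m.group(1).startswith("zfs_pool_"):
            out[(m.group(1), m.group(2))] = float(m.group(3))
    return out

sample = '''zfs_pool_capacity_percent{pool="zdata"} 64
zfs_pool_capacity_percent{pool="zroot"} 10'''
print(parse_pool_metrics(sample)[("zfs_pool_capacity_percent", "zdata")])  # → 64.0
```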
+
+> Updated Mon 09 Mar: Added section about distributed tracing with Grafana Tempo
+
+## Distributed Tracing with Grafana Tempo
+
+After implementing logs (Loki) and metrics (Prometheus), the final pillar of observability is distributed tracing. Grafana Tempo provides distributed tracing capabilities that help understand request flows across microservices.
+
+What does tracing with Tempo look like in Grafana? Have a look at my X-RAG blog post:
+
+=> ./2025-12-24-x-rag-observability-hackathon.gmi X-RAG Observability Hackathon
+
+### Why Distributed Tracing?
+
+In a microservices architecture, a single user request may traverse multiple services. Distributed tracing:
+
+* Tracks requests across service boundaries
+* Identifies performance bottlenecks
+* Visualizes service dependencies
+* Correlates with logs and metrics
+* Helps debug complex distributed systems
+
+### Deploying Grafana Tempo
+
+Tempo is deployed in monolithic mode, following the same pattern as Loki's SingleBinary deployment.
+
+#### Configuration Strategy
+
+**Deployment Mode:** Monolithic (all components in one process)
+* Simpler operation than microservices mode
+* Suitable for the cluster scale
+* Consistent with Loki deployment pattern
+
+**Storage:** Filesystem backend using hostPath
+* 10Gi storage at /data/nfs/k3svolumes/tempo/data
+* 7-day retention (168h)
+* Local storage is the only option for monolithic mode
+
+**OTLP Receivers:** Standard OpenTelemetry Protocol ports
+* gRPC: 4317
+* HTTP: 4318
+* Bind to 0.0.0.0 to avoid Tempo 2.7+ localhost-only binding issue
+
+#### Tempo Deployment Files
+
+Created in /home/paul/git/conf/f3s/tempo/:
+
+**values.yaml** - Helm chart configuration:
+
+```
+tempo:
+ retention: 168h
+ storage:
+ trace:
+ backend: local
+ local:
+ path: /var/tempo/traces
+ wal:
+ path: /var/tempo/wal
+ receivers:
+ otlp:
+ protocols:
+ grpc:
+ endpoint: 0.0.0.0:4317
+ http:
+ endpoint: 0.0.0.0:4318
+
+persistence:
+ enabled: true
+ size: 10Gi
+ storageClassName: ""
+
+resources:
+ limits:
+ cpu: 1000m
+ memory: 2Gi
+ requests:
+ cpu: 500m
+ memory: 1Gi
+```
+
+**persistent-volumes.yaml** - Storage configuration:
+
+```
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+ name: tempo-data-pv
+spec:
+ capacity:
+ storage: 10Gi
+ accessModes:
+ - ReadWriteOnce
+ persistentVolumeReclaimPolicy: Retain
+ hostPath:
+ path: /data/nfs/k3svolumes/tempo/data
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+ name: tempo-data-pvc
+ namespace: monitoring
+spec:
+ storageClassName: ""
+ accessModes:
+ - ReadWriteOnce
+ resources:
+ requests:
+ storage: 10Gi
+```
+
+**Grafana Datasource Provisioning**
+
+All Grafana datasources (Prometheus, Alertmanager, Loki, Tempo) are provisioned via a unified ConfigMap that is directly mounted to the Grafana pod. This approach ensures datasources are loaded on startup without requiring sidecar-based discovery.
+
+In /home/paul/git/conf/f3s/prometheus/grafana-datasources-all.yaml:
+
+```
+apiVersion: v1
+kind: ConfigMap
+metadata:
+ name: grafana-datasources-all
+ namespace: monitoring
+data:
+ datasources.yaml: |
+ apiVersion: 1
+ datasources:
+ - name: Prometheus
+ type: prometheus
+ uid: prometheus
+ url: http://prometheus-kube-prometheus-prometheus.monitoring:9090/
+ access: proxy
+ isDefault: true
+ - name: Alertmanager
+ type: alertmanager
+ uid: alertmanager
+ url: http://prometheus-kube-prometheus-alertmanager.monitoring:9093/
+ - name: Loki
+ type: loki
+ uid: loki
+ url: http://loki.monitoring.svc.cluster.local:3100
+ - name: Tempo
+ type: tempo
+ uid: tempo
+ url: http://tempo.monitoring.svc.cluster.local:3200
+ jsonData:
+ tracesToLogsV2:
+ datasourceUid: loki
+ spanStartTimeShift: -1h
+ spanEndTimeShift: 1h
+ tracesToMetrics:
+ datasourceUid: prometheus
+ serviceMap:
+ datasourceUid: prometheus
+ nodeGraph:
+ enabled: true
+```
+
+The kube-prometheus-stack Helm values (persistence-values.yaml) are configured to:
+* Disable sidecar-based datasource provisioning
+* Mount grafana-datasources-all ConfigMap directly to /etc/grafana/provisioning/datasources/
+
+This direct mounting approach is simpler and more reliable than sidecar-based discovery.
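A sketch of the corresponding Grafana settings in persistence-values.yaml (the `sidecar` and `extraConfigmapMounts` keys come from the Grafana Helm chart; the exact values here are illustrative):

```
grafana:
  sidecar:
    datasources:
      enabled: false
  extraConfigmapMounts:
    - name: grafana-datasources-all
      mountPath: /etc/grafana/provisioning/datasources
      configMap: grafana-datasources-all
      readOnly: true
```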
+
+#### Installation
+
+```
+cd /home/paul/git/conf/f3s/tempo
+just install
+```
+
+Verify Tempo is running:
+
+```
+kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo
+kubectl exec -n monitoring <tempo-pod> -- wget -qO- http://localhost:3200/ready
+```
+
+### Configuring Grafana Alloy for Trace Collection
+
+Updated /home/paul/git/conf/f3s/loki/alloy-values.yaml to add OTLP receivers for traces while maintaining existing log collection.
+
+#### OTLP Receiver Configuration
+
+Added to Alloy configuration after the log collection pipeline:
+
+```
+// OTLP receiver for traces via gRPC and HTTP
+otelcol.receiver.otlp "default" {
+ grpc {
+ endpoint = "0.0.0.0:4317"
+ }
+ http {
+ endpoint = "0.0.0.0:4318"
+ }
+ output {
+ traces = [otelcol.processor.batch.default.input]
+ }
+}
+
+// Batch processor for efficient trace forwarding
+otelcol.processor.batch "default" {
+ timeout = "5s"
+ send_batch_size = 100
+ send_batch_max_size = 200
+ output {
+ traces = [otelcol.exporter.otlp.tempo.input]
+ }
+}
+
+// OTLP exporter to send traces to Tempo
+otelcol.exporter.otlp "tempo" {
+ client {
+ endpoint = "tempo.monitoring.svc.cluster.local:4317"
+ tls {
+ insecure = true
+ }
+ compression = "gzip"
+ }
+}
+```
+
+The batch processor reduces network overhead by accumulating spans before forwarding to Tempo.
+
+#### Upgrade Alloy
+
+```
+cd /home/paul/git/conf/f3s/loki
+just upgrade
+```
+
+Verify OTLP receivers are listening:
+
+```
+kubectl logs -n monitoring -l app.kubernetes.io/name=alloy | grep -i "otlp.*receiver"
+kubectl exec -n monitoring <alloy-pod> -- netstat -ln | grep -E ':(4317|4318)'
+```
+
+### Demo Tracing Application
+
+Created a three-tier Python application to demonstrate distributed tracing in action.
+
+#### Application Architecture
+
+```
+User → Frontend (Flask:5000) → Middleware (Flask:5001) → Backend (Flask:5002)
+ ↓ ↓ ↓
+ Alloy (OTLP:4317) → Tempo → Grafana
+```
+
+Frontend Service:
+
+* Receives HTTP requests at /api/process
+* Forwards to middleware service
+* Creates parent span for the entire request
+
+Middleware Service:
+
+* Transforms data at /api/transform
+* Calls backend service
+* Creates child span linked to frontend
+
+Backend Service:
+
+* Returns data at /api/data
+* Simulates database query (100ms sleep)
+* Creates leaf span in the trace
+
+OpenTelemetry Instrumentation:
+
+All services use Python OpenTelemetry libraries:
+
+**Dependencies:**
+```
+flask==3.0.0
+requests==2.31.0
+opentelemetry-distro==0.49b0
+opentelemetry-exporter-otlp==1.28.0
+opentelemetry-instrumentation-flask==0.49b0
+opentelemetry-instrumentation-requests==0.49b0
+```
+
+**Auto-instrumentation pattern** (used in all services):
+
+```python
+from flask import Flask
+
+from opentelemetry import trace
+from opentelemetry.sdk.trace import TracerProvider
+from opentelemetry.sdk.trace.export import BatchSpanProcessor
+from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
+from opentelemetry.instrumentation.flask import FlaskInstrumentor
+from opentelemetry.instrumentation.requests import RequestsInstrumentor
+from opentelemetry.sdk.resources import Resource
+
+app = Flask(__name__)
+
+# Define service identity
+resource = Resource(attributes={
+ "service.name": "frontend",
+ "service.namespace": "tracing-demo",
+ "service.version": "1.0.0"
+})
+
+provider = TracerProvider(resource=resource)
+
+# Export to Alloy
+otlp_exporter = OTLPSpanExporter(
+ endpoint="http://alloy.monitoring.svc.cluster.local:4317",
+ insecure=True
+)
+
+processor = BatchSpanProcessor(otlp_exporter)
+provider.add_span_processor(processor)
+trace.set_tracer_provider(provider)
+
+# Auto-instrument Flask and requests
+FlaskInstrumentor().instrument_app(app)
+RequestsInstrumentor().instrument()
+```
+
+The auto-instrumentation:
+* Creates spans for HTTP requests
+* Propagates trace context via W3C Trace Context headers
+* Links parent and child spans across service boundaries
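The `traceparent` HTTP header that carries this context has a fixed four-field layout defined by the W3C Trace Context spec. A minimal sketch of pulling it apart (the example IDs below are arbitrary):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header: version-traceid-parentid-flags."""
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,      # currently always "00"
        "trace_id": trace_id,    # 16 bytes hex, shared by every span in the trace
        "parent_id": parent_id,  # 8 bytes hex, the span that made the call
        "sampled": bool(int(flags, 16) & 0x01),
    }

hdr = "00-4be1151c0bdcd5625ac7e02b98d95bd5-00f067aa0ba902b7-01"
print(parse_traceparent(hdr)["trace_id"])  # → 4be1151c0bdcd5625ac7e02b98d95bd5
```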
+
+Deployment:
+
+Created Helm chart in /home/paul/git/conf/f3s/tracing-demo/ with three separate deployments, services, and an ingress.
+
+Build and deploy:
+
+```
+cd /home/paul/git/conf/f3s/tracing-demo
+just build
+just import
+just install
+```
+
+Verify deployment:
+
+```
+kubectl get pods -n services | grep tracing-demo
+kubectl get ingress -n services tracing-demo-ingress
+```
+
+Access the application at:
+
+=> http://tracing-demo.f3s.buetow.org
+
+### Visualizing Traces in Grafana
+
+The Tempo datasource is provisioned through the directly mounted datasources ConfigMap described above, so it is available in Grafana on startup.
+
+#### Accessing Traces
+
+Navigate to Grafana → Explore → Select "Tempo" datasource
+
+**Search Interface:**
+* Search by Trace ID
+* Search by service name
+* Search by tags
+
+**TraceQL Queries:**
+
+Find all traces from demo app:
+```
+{ resource.service.namespace = "tracing-demo" }
+```
+
+Find slow requests (>200ms):
+```
+{ duration > 200ms }
+```
+
+Find traces from specific service:
+```
+{ resource.service.name = "frontend" }
+```
+
+Find errors:
+```
+{ status = error }
+```
+
+Complex query - demo-app traces that contain a span with a 5xx HTTP status code:
+```
+{ resource.service.namespace = "tracing-demo" } && { span.http.status_code >= 500 }
+```
+
+#### Service Graph Visualization
+
+The service graph shows visual connections between services:
+
+1. Navigate to Explore → Tempo
+2. Enable "Service Graph" view
+3. Shows: Frontend → Middleware → Backend with request rates
+
+The service graph uses Prometheus metrics generated from trace data.
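Generating those metrics requires Tempo's metrics generator to be enabled and pointed at a Prometheus remote-write endpoint. A sketch for the Tempo Helm values (the URL assumes the in-cluster Prometheus service, and Prometheus must accept remote writes, e.g. via `enableRemoteWriteReceiver` in its spec; treat both as assumptions to verify):

```
tempo:
  metricsGenerator:
    enabled: true
    remoteWriteUrl: "http://prometheus-kube-prometheus-prometheus.monitoring:9090/api/v1/write"
```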
+
+### Correlation Between Observability Signals
+
+Tempo integrates with Loki and Prometheus to provide unified observability.
+
+#### Traces-to-Logs
+
+Click on any span in a trace to see related logs:
+
+1. View trace in Grafana
+2. Click on a span
+3. Select "Logs for this span"
+4. Loki shows logs filtered by:
+ * Time range (span duration ± 1 hour)
+ * Service name
+ * Namespace
+ * Pod
+
+This helps correlate what the service was doing when the span was created.
+
+#### Traces-to-Metrics
+
+View Prometheus metrics for services in the trace:
+
+1. View trace in Grafana
+2. Select "Metrics" tab
+3. Shows metrics like:
+ * Request rate
+ * Error rate
+ * Duration percentiles
+
+#### Logs-to-Traces
+
+From logs, you can jump to related traces:
+
+1. In Loki, logs that contain trace IDs are automatically linked
+2. Click the trace ID to view the full trace
+3. See the complete request flow
+
+### Generating Traces for Testing
+
+Test the demo application:
+
+```
+curl http://tracing-demo.f3s.buetow.org/api/process
+```
+
+Load test (generates 50 traces):
+
+```
+cd /home/paul/git/conf/f3s/tracing-demo
+just load-test
+```
+
+Each request creates a distributed trace spanning all three services.
+
+### Verifying the Complete Pipeline
+
+Check the trace flow end-to-end:
+
+**1. Application generates traces:**
+```
+kubectl logs -n services -l app=tracing-demo-frontend | grep -i trace
+```
+
+**2. Alloy receives traces:**
+```
+kubectl logs -n monitoring -l app.kubernetes.io/name=alloy | grep -i otlp
+```
+
+**3. Tempo stores traces:**
+```
+kubectl logs -n monitoring -l app.kubernetes.io/name=tempo | grep -i trace
+```
+
+**4. Grafana displays traces:**
+Navigate to Explore → Tempo → Search for traces
+
+### Practical Example: Viewing a Distributed Trace
+
+Let's generate a trace and examine it in Grafana.
+
+**1. Generate a trace by calling the demo application:**
+
+```
+curl -H "Host: tracing-demo.f3s.buetow.org" http://r0/api/process
+```
+
+**Response (HTTP 200):**
+
+```json
+{
+ "middleware_response": {
+ "backend_data": {
+ "data": {
+ "id": 12345,
+ "query_time_ms": 100.0,
+ "timestamp": "2025-12-28T18:35:01.064538",
+ "value": "Sample data from backend service"
+ },
+ "service": "backend"
+ },
+ "middleware_processed": true,
+ "original_data": {
+ "source": "GET request"
+ },
+ "transformation_time_ms": 50
+ },
+ "request_data": {
+ "source": "GET request"
+ },
+ "service": "frontend",
+ "status": "success"
+}
+```
+
+**2. Find the trace in Tempo via API:**
+
+After a few seconds (for batch export), search for recent traces:
+
+```
+kubectl exec -n monitoring tempo-0 -- wget -qO- \
+ 'http://localhost:3200/api/search?tags=service.namespace%3Dtracing-demo&limit=5' 2>/dev/null | \
+ python3 -m json.tool
+```
+
+Returns traces including:
+
+```json
+{
+ "traceID": "4be1151c0bdcd5625ac7e02b98d95bd5",
+ "rootServiceName": "frontend",
+ "rootTraceName": "GET /api/process",
+ "durationMs": 221
+}
+```
+
+**3. Fetch complete trace details:**
+
+```
+kubectl exec -n monitoring tempo-0 -- wget -qO- \
+ 'http://localhost:3200/api/traces/4be1151c0bdcd5625ac7e02b98d95bd5' 2>/dev/null | \
+ python3 -m json.tool
+```
+
+**Trace structure (8 spans across 3 services):**
+
+```
+Trace ID: 4be1151c0bdcd5625ac7e02b98d95bd5
+Services: 3 (frontend, middleware, backend)
+
+Service: frontend
+ └─ GET /api/process 221.10ms (HTTP server span)
+ └─ frontend-process 216.23ms (custom business logic span)
+ └─ POST 209.97ms (HTTP client span to middleware)
+
+Service: middleware
+ └─ POST /api/transform 186.02ms (HTTP server span)
+ └─ middleware-transform 180.96ms (custom business logic span)
+ └─ GET 127.52ms (HTTP client span to backend)
+
+Service: backend
+ └─ GET /api/data 103.93ms (HTTP server span)
+ └─ backend-get-data 102.11ms (custom business logic span with 100ms sleep)
+```
+
+**4. View the trace in Grafana UI:**
+
+Navigate to: Grafana → Explore → Tempo datasource
+
+Search using TraceQL:
+```
+{ resource.service.namespace = "tracing-demo" }
+```
+
+Or directly open the trace by pasting the trace ID in the search box:
+```
+4be1151c0bdcd5625ac7e02b98d95bd5
+```
+
+**5. Trace visualization:**
+
+The trace waterfall view in Grafana shows the complete request flow with timing:
+
+=> ./f3s-kubernetes-with-freebsd-part-8/grafana-tempo-trace.png Distributed trace visualization in Grafana Tempo showing Frontend → Middleware → Backend spans
+
+For additional examples of Tempo trace visualization, see also:
+
+=> https://foo.zone/gemfeed/2025-12-24-x-rag-observability-hackathon.html X-RAG Observability Hackathon (more Grafana Tempo screenshots)
+
+The trace reveals the distributed request flow:
+
+* Frontend (221ms): Receives GET /api/process, executes business logic, calls middleware
+* Middleware (186ms): Receives POST /api/transform, transforms data, calls backend
+* Backend (104ms): Receives GET /api/data, simulates database query with 100ms sleep
+* Total request time: 221ms end-to-end
+* Span propagation: W3C Trace Context headers automatically link all spans
+
+**6. Service graph visualization:**
+
+The service graph is automatically generated from traces and shows service dependencies. For examples of service graph visualization in Grafana, see the screenshots in the X-RAG Observability Hackathon blog post.
+
+=> ./2025-12-24-x-rag-observability-hackathon.gmi X-RAG Observability Hackathon (includes service graph screenshots)
+
+This visualization helps identify:
+
+* Request rates between services
+* Average latency for each hop
+* Error rates (if any)
+* Service dependencies and communication patterns
+
+### Storage and Retention
+
+Monitor Tempo storage usage:
+
+```
+kubectl exec -n monitoring <tempo-pod> -- df -h /var/tempo
+```
+
+With 10Gi storage and 7-day retention, the system handles moderate trace volumes. If storage fills up:
+
+* Reduce retention to 72h (3 days)
+* Implement sampling in Alloy
+* Increase PV size
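
For the sampling option, a processor can be inserted between Alloy's OTLP receiver and the batch processor. A sketch using Alloy's probabilistic sampler (the percentage is an arbitrary example; the receiver's output block must be repointed at this component):

```
// Keep roughly 25% of traces before batching (example value)
otelcol.processor.probabilistic_sampler "default" {
  sampling_percentage = 25

  output {
    traces = [otelcol.processor.batch.default.input]
  }
}
```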
+
+### Configuration Files
+
+All configuration files are available on Codeberg:
+
+=> https://codeberg.org/snonux/conf/src/branch/master/f3s/tempo Tempo configuration
+=> https://codeberg.org/snonux/conf/src/branch/master/f3s/loki Alloy configuration (updated for traces)
+=> https://codeberg.org/snonux/conf/src/branch/master/f3s/tracing-demo Demo tracing application
+
## Summary
-With Prometheus, Grafana, Loki, and Alloy deployed, I now have complete visibility into the k3s cluster, the FreeBSD storage servers, and the OpenBSD edge relays:
+With Prometheus, Grafana, Loki, Alloy, and Tempo deployed, I now have complete visibility into the k3s cluster, the FreeBSD storage servers, and the OpenBSD edge relays:
-* metrics: Prometheus collects and stores time-series data from all components
+* Metrics: Prometheus collects and stores time-series data from all components, including etcd and ZFS
* Logs: Loki aggregates logs from all containers, searchable via Grafana
-* Visualisation: Grafana provides dashboards and exploration tools
+* Traces: Tempo provides distributed request tracing with service dependency mapping
+* Visualisation: Grafana provides dashboards and exploration tools with correlation between all three signals
* Alerting: Alertmanager can notify on conditions defined in Prometheus rules
This observability stack runs entirely on the home lab infrastructure, with data persisted to the NFS share. It's lightweight enough for a three-node cluster but provides the same capabilities as production-grade setups.
+=> https://codeberg.org/snonux/conf/src/branch/master/f3s/prometheus Prometheus configuration on Codeberg
+
Other *BSD-related posts:
=> ./2025-12-07-f3s-kubernetes-with-freebsd-part-8.gmi 2025-12-07 f3s: Kubernetes with FreeBSD - Part 8: Observability (You are currently reading this)
diff --git a/gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.gmi.tpl b/gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.gmi.tpl
index da66ed0a..dbeee59c 100644
--- a/gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.gmi.tpl
+++ b/gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.gmi.tpl
@@ -1,6 +1,6 @@
# f3s: Kubernetes with FreeBSD - Part 8: Observability
-> Published at 2025-12-06T23:58:24+02:00
+> Published at 2025-12-06T23:58:24+02:00, last updated Mon 09 Mar 09:33:08 EET 2026
This is the 8th blog post about the f3s series for my self-hosting demands in a home lab. f3s? The "f" stands for FreeBSD, and the "3s" stands for k3s, the Kubernetes distribution I use on FreeBSD-based physical machines.
@@ -513,7 +513,7 @@ This file is saved as `freebsd-recording-rules.yaml` and applied as part of the
Unlike memory metrics, disk I/O metrics (`node_disk_read_bytes_total`, `node_disk_written_bytes_total`, etc.) are not available on FreeBSD. The Linux diskstats collector that provides these metrics doesn't have a FreeBSD equivalent in the node_exporter.
-The disk I/O panels in the Node Exporter dashboards will show "No data" for FreeBSD hosts. FreeBSD does expose ZFS-specific metrics (`node_zfs_arcstats_*`) for ARC cache performance, and per-dataset I/O stats are available via `sysctl kstat.zfs`, but mapping these to the Linux-style metrics the dashboards expect is non-trivial. Creating custom ZFS-specific dashboards is left as an exercise for another day.
+The disk I/O panels in the Node Exporter dashboards will show "No data" for FreeBSD hosts. FreeBSD does expose ZFS-specific metrics (`node_zfs_arcstats_*`) for ARC cache performance, and per-dataset I/O stats are available via `sysctl kstat.zfs`, but mapping these to the Linux-style metrics the dashboards expect is non-trivial. Custom ZFS-specific dashboards are covered later in this post.
## Monitoring external OpenBSD hosts
@@ -625,17 +625,1057 @@ This file is saved as `openbsd-recording-rules.yaml` and applied alongside the F
After running `just upgrade`, the OpenBSD hosts appear in Prometheus targets and the Node Exporter dashboards.
+> Updated Mon 09 Mar: Added section about enabling etcd metrics
+
+## Enabling etcd metrics in k3s
+
+The etcd dashboard in Grafana initially showed no data because k3s uses an embedded etcd that doesn't expose metrics by default.
+
+On each control-plane node (r0, r1, r2), create /etc/rancher/k3s/config.yaml:
+
+```
+etcd-expose-metrics: true
+```
+
+Then restart k3s on each node:
+
+```
+systemctl restart k3s
+```
+
+After restarting, etcd metrics are available on port 2381:
+
+```
+curl http://127.0.0.1:2381/metrics | grep etcd
+```
+
+### Configuring Prometheus to scrape etcd
+
+In persistence-values.yaml, enable kubeEtcd with the node IP addresses:
+
+```
+kubeEtcd:
+ enabled: true
+ endpoints:
+ - 192.168.1.120
+ - 192.168.1.121
+ - 192.168.1.122
+ service:
+ enabled: true
+ port: 2381
+ targetPort: 2381
+```
+
+Apply the changes:
+
+```
+just upgrade
+```
+
+### Verifying etcd metrics
+
+After the changes, all etcd targets are being scraped:
+
+```
+kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 \
+ -c prometheus -- wget -qO- 'http://localhost:9090/api/v1/query?query=etcd_server_has_leader' | \
+ jq -r '.data.result[] | "\(.metric.instance): \(.value[1])"'
+```
+
+Output:
+
+```
+192.168.1.120:2381: 1
+192.168.1.121:2381: 1
+192.168.1.122:2381: 1
+```
+
+The etcd dashboard in Grafana now displays metrics including Raft proposals, leader elections, and peer round trip times.
+
+=> ./f3s-kubernetes-with-freebsd-part-8/grafana-etcd-dashboard.png Grafana etcd dashboard showing cluster health, RPC rate, disk sync duration, and peer round trip times
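
The dashboard's underlying signals can also be spot-checked directly in Grafana Explore with a few PromQL queries (these use the standard etcd metric names, which can vary slightly between etcd versions):

```
# Raft proposals committed per second
rate(etcd_server_proposals_committed_total[5m])

# Leader changes over the last hour (should stay at 0 in a stable cluster)
increase(etcd_server_leader_changes_seen_total[1h])

# 99th percentile peer round trip time
histogram_quantile(0.99,
  rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))
```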
+
+### Complete persistence-values.yaml
+
+The complete updated persistence-values.yaml:
+
+```
+kubeEtcd:
+ enabled: true
+ endpoints:
+ - 192.168.1.120
+ - 192.168.1.121
+ - 192.168.1.122
+ service:
+ enabled: true
+ port: 2381
+ targetPort: 2381
+
+prometheus:
+ prometheusSpec:
+ additionalScrapeConfigsSecret:
+ enabled: true
+ name: additional-scrape-configs
+ key: additional-scrape-configs.yaml
+ storageSpec:
+ volumeClaimTemplate:
+ spec:
+ storageClassName: ""
+ accessModes: ["ReadWriteOnce"]
+ resources:
+ requests:
+ storage: 10Gi
+ selector:
+ matchLabels:
+ type: local
+ app: prometheus
+
+grafana:
+ persistence:
+ enabled: true
+ type: pvc
+ existingClaim: "grafana-data-pvc"
+
+ initChownData:
+ enabled: false
+
+ podSecurityContext:
+ fsGroup: 911
+ runAsUser: 911
+ runAsGroup: 911
+```
+
+> Updated Mon 09 Mar: Added section about ZFS monitoring for FreeBSD servers
+
+## ZFS Monitoring for FreeBSD Servers
+
+The FreeBSD servers (f0, f1, f2) that provide NFS storage to the k3s cluster use ZFS filesystems. Monitoring ZFS is crucial for understanding storage performance and cache efficiency.
+
+### Node Exporter ZFS Collector
+
+The node_exporter running on each FreeBSD server (v1.9.1) includes a built-in ZFS collector that exposes metrics via sysctls. The ZFS collector is enabled by default and provides:
+
+* ARC (Adaptive Replacement Cache) statistics
+* Cache hit/miss rates
+* Memory usage and allocation
+* MRU/MFU cache breakdown
+* Data vs metadata distribution
+
+### Verifying ZFS Metrics
+
+On any FreeBSD server, check that ZFS metrics are being exposed:
+
+```
+paul@f0:~ % curl -s http://localhost:9100/metrics | grep node_zfs_arcstats | wc -l
+ 69
+```
+
+The metrics are automatically scraped by Prometheus through the existing static configuration in additional-scrape-configs.yaml, which targets all FreeBSD servers on port 9100 with the `os: freebsd` label.
+
+### ZFS Recording Rules
+
+Created recording rules for easier dashboard consumption in zfs-recording-rules.yaml:
+
+```
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+ name: freebsd-zfs-rules
+ namespace: monitoring
+ labels:
+ release: prometheus
+spec:
+ groups:
+ - name: freebsd-zfs-arc
+ interval: 30s
+ rules:
+ - record: node_zfs_arc_hit_rate_percent
+ expr: |
+ 100 * (
+ rate(node_zfs_arcstats_hits_total{os="freebsd"}[5m]) /
+ (rate(node_zfs_arcstats_hits_total{os="freebsd"}[5m]) +
+ rate(node_zfs_arcstats_misses_total{os="freebsd"}[5m]))
+ )
+ labels:
+ os: freebsd
+ - record: node_zfs_arc_memory_usage_percent
+ expr: |
+ 100 * (
+ node_zfs_arcstats_size_bytes{os="freebsd"} /
+ node_zfs_arcstats_c_max_bytes{os="freebsd"}
+ )
+ labels:
+ os: freebsd
+ # Additional rules for metadata %, target %, MRU/MFU %, etc.
+```
+
+These recording rules calculate:
+
+* ARC hit rate percentage
+* ARC memory usage percentage (current vs maximum)
+* ARC target percentage (target vs maximum)
+* Metadata vs data percentages
+* MRU vs MFU cache percentages
+* Demand data and metadata hit rates
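
The recording rules are plain counter arithmetic. As a sanity check, the hit-rate formula can be reproduced in a few lines of Python (the counter deltas below are made-up example values):

```python
def arc_hit_rate_percent(hits_delta, misses_delta):
    """ARC hit rate over an interval, mirroring the recording rule's formula."""
    total = hits_delta + misses_delta
    if total == 0:
        return 0.0  # no ARC activity; the PromQL rule would yield NaN here
    return 100.0 * hits_delta / total

# Made-up counter deltas over a 5m window:
print(arc_hit_rate_percent(9500, 500))  # 95.0
```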
+
+### Grafana Dashboards
+
+Created two comprehensive ZFS monitoring dashboards (zfs-dashboards.yaml):
+
+**Dashboard 1: FreeBSD ZFS (per-host detailed view)**
+
+Includes variables to select:
+
+* FreeBSD server (f0, f1, or f2)
+* ZFS pool (zdata, zroot, or all)
+
+Pool Overview Row:
+
+* Pool Capacity gauge (with thresholds: green <70%, yellow <85%, red >85%)
+* Pool Health status (ONLINE/DEGRADED/FAULTED with color coding)
+* Total Pool Size stat
+* Free Space stat
+* Pool Space Usage Over Time (stacked: used + free)
+* Pool Capacity Trend time series
+
+Dataset Statistics Row:
+
+* Table showing all datasets with columns: Pool, Dataset, Used, Available, Referenced
+* Automatically filters by selected pool
+
+ARC Cache Statistics Row:
+
+* ARC Hit Rate gauge (red <70%, yellow <90%, green >=90%)
+* ARC Size time series (current, target, max)
+* ARC Memory Usage percentage gauge
+* ARC Hits vs Misses rate
+* ARC Data vs Metadata stacked time series
+
+**Dashboard 2: FreeBSD ZFS Summary (cluster-wide overview)**
+
+Cluster-Wide Pool Statistics Row:
+
+* Total Storage Capacity across all servers
+* Total Used space
+* Total Free space
+* Average Pool Capacity gauge
+* Pool Health Status (worst case across cluster)
+* Total Pool Space Usage Over Time
+* Per-Pool Capacity time series (all pools on all hosts)
+
+Per-Host Pool Breakdown Row:
+
+* Bar gauge showing capacity by host and pool
+* Table with all pools: Host, Pool, Size, Used, Free, Capacity %, Health
+
+Cluster-Wide ARC Statistics Row:
+
+* Average ARC Hit Rate gauge across all hosts
+* ARC Hit Rate by Host time series
+* Total ARC Size Across Cluster
+* Total ARC Hits vs Misses (cluster-wide sum)
+* ARC Size by Host
+
+Dashboard Visualization:
+
+=> ./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-dashboard.png ZFS monitoring dashboard in Grafana showing pool capacity, health, and I/O throughput
+=> ./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-arc-stats.png ZFS ARC cache statistics showing hit rate, memory usage, and size trends
+=> ./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-datasets.png ZFS datasets table and ARC data vs metadata breakdown
+
+### Deployment
+
+Applied the resources to the cluster:
+
+```
+cd /home/paul/git/conf/f3s/prometheus
+kubectl apply -f zfs-recording-rules.yaml
+kubectl apply -f zfs-dashboards.yaml
+```
+
+Updated Justfile to include ZFS recording rules in install and upgrade targets:
+
+```
+install:
+ kubectl apply -f persistent-volumes.yaml
+ kubectl create secret generic additional-scrape-configs --from-file=additional-scrape-configs.yaml -n monitoring --dry-run=client -o yaml | kubectl apply -f -
+ helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring -f persistence-values.yaml
+ kubectl apply -f freebsd-recording-rules.yaml
+ kubectl apply -f openbsd-recording-rules.yaml
+ kubectl apply -f zfs-recording-rules.yaml
+ just -f grafana-ingress/Justfile install
+```
+
+### Verifying ZFS Metrics in Prometheus
+
+Check that ZFS metrics are being collected:
+
+```
+kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- \
+ wget -qO- 'http://localhost:9090/api/v1/query?query=node_zfs_arcstats_size_bytes'
+```
+
+Check recording rules are calculating correctly:
+
+```
+kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- \
+ wget -qO- 'http://localhost:9090/api/v1/query?query=node_zfs_arc_memory_usage_percent'
+```
+
+Example output shows memory usage percentage for each FreeBSD server:
+
+```
+"result":[
+ {"metric":{"instance":"192.168.2.130:9100","os":"freebsd"},"value":[...,"37.58"]},
+ {"metric":{"instance":"192.168.2.131:9100","os":"freebsd"},"value":[...,"12.85"]},
+ {"metric":{"instance":"192.168.2.132:9100","os":"freebsd"},"value":[...,"13.44"]}
+]
+```
+
+### Key Metrics to Monitor
+
+* ARC Hit Rate: Should typically be above 90% for optimal performance. Lower hit rates indicate the ARC cache is too small or workload has poor locality.
+* ARC Memory Usage: Shows how much of the maximum ARC size is being used. If consistently at or near maximum, the ARC is effectively utilizing available memory.
+* Data vs Metadata: Typically data should dominate, but workloads with many small files will show higher metadata percentages.
+* MRU vs MFU: Most Recently Used vs Most Frequently Used cache. The ratio depends on workload characteristics.
+* Pool Capacity: Monitor pool usage to ensure adequate free space. ZFS performance degrades when pools exceed 80% capacity.
+* Pool Health: Should always show ONLINE (green). DEGRADED (yellow) indicates a disk issue requiring attention. FAULTED (red) requires immediate action.
+* Dataset Usage: Track which datasets are consuming the most space to identify growth trends and plan capacity.
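
These thresholds can also be expressed as alerting rules on top of the recording rules from earlier. A sketch (hypothetical rule name and severity; tune the threshold to your workload):

```
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: freebsd-zfs-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
  - name: freebsd-zfs-alerts
    rules:
    - alert: ZfsArcHitRateLow
      expr: node_zfs_arc_hit_rate_percent < 70
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "ZFS ARC hit rate below 70% on {{ $labels.instance }}"
```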
+
+### ZFS Pool and Dataset Metrics via Textfile Collector
+
+To complement the ARC statistics from node_exporter's built-in ZFS collector, I added pool capacity and dataset metrics using the textfile collector feature.
+
+Created a script at `/usr/local/bin/zfs_pool_metrics.sh` on each FreeBSD server:
+
+```
+#!/bin/sh
+# ZFS Pool and Dataset Metrics Collector for Prometheus
+
+OUTPUT_FILE="/var/tmp/node_exporter/zfs_pools.prom.$$"
+FINAL_FILE="/var/tmp/node_exporter/zfs_pools.prom"
+
+mkdir -p /var/tmp/node_exporter
+
+{
+ # Pool metrics
+ echo "# HELP zfs_pool_size_bytes Total size of ZFS pool"
+ echo "# TYPE zfs_pool_size_bytes gauge"
+ echo "# HELP zfs_pool_allocated_bytes Allocated space in ZFS pool"
+ echo "# TYPE zfs_pool_allocated_bytes gauge"
+ echo "# HELP zfs_pool_free_bytes Free space in ZFS pool"
+ echo "# TYPE zfs_pool_free_bytes gauge"
+ echo "# HELP zfs_pool_capacity_percent Capacity percentage"
+ echo "# TYPE zfs_pool_capacity_percent gauge"
+ echo "# HELP zfs_pool_health Pool health (0=ONLINE, 1=DEGRADED, 2=FAULTED, 6=UNKNOWN)"
+ echo "# TYPE zfs_pool_health gauge"
+
+ zpool list -Hp -o name,size,allocated,free,capacity,health | \
+ while IFS=$'\t' read name size alloc free cap health; do
+ case "$health" in
+ ONLINE) health_val=0 ;;
+ DEGRADED) health_val=1 ;;
+ FAULTED) health_val=2 ;;
+ *) health_val=6 ;;
+ esac
+ cap_num=$(echo "$cap" | sed 's/%//')
+
+ echo "zfs_pool_size_bytes{pool=\"$name\"} $size"
+ echo "zfs_pool_allocated_bytes{pool=\"$name\"} $alloc"
+ echo "zfs_pool_free_bytes{pool=\"$name\"} $free"
+ echo "zfs_pool_capacity_percent{pool=\"$name\"} $cap_num"
+ echo "zfs_pool_health{pool=\"$name\"} $health_val"
+ done
+
+ # Dataset metrics
+ echo "# HELP zfs_dataset_used_bytes Used space in dataset"
+ echo "# TYPE zfs_dataset_used_bytes gauge"
+ echo "# HELP zfs_dataset_available_bytes Available space"
+ echo "# TYPE zfs_dataset_available_bytes gauge"
+ echo "# HELP zfs_dataset_referenced_bytes Referenced space"
+ echo "# TYPE zfs_dataset_referenced_bytes gauge"
+
+ zfs list -Hp -t filesystem -o name,used,available,referenced | \
+ while IFS=$'\t' read name used avail ref; do
+ pool=$(echo "$name" | cut -d/ -f1)
+ echo "zfs_dataset_used_bytes{pool=\"$pool\",dataset=\"$name\"} $used"
+ echo "zfs_dataset_available_bytes{pool=\"$pool\",dataset=\"$name\"} $avail"
+ echo "zfs_dataset_referenced_bytes{pool=\"$pool\",dataset=\"$name\"} $ref"
+ done
+} > "$OUTPUT_FILE"
+
+mv "$OUTPUT_FILE" "$FINAL_FILE"
+```
+
+Deployed to all FreeBSD servers:
+
+```
+for host in f0 f1 f2; do
+ scp /tmp/zfs_pool_metrics.sh paul@$host:/tmp/
+ ssh paul@$host 'doas mv /tmp/zfs_pool_metrics.sh /usr/local/bin/ && \
+ doas chmod +x /usr/local/bin/zfs_pool_metrics.sh'
+done
+```
+
+Set up cron jobs to run every minute:
+
+```
+for host in f0 f1 f2; do
+ ssh paul@$host 'echo "* * * * * /usr/local/bin/zfs_pool_metrics.sh >/dev/null 2>&1" | \
+ doas crontab -'
+done
+```
+
+The textfile collector (already configured with --collector.textfile.directory=/var/tmp/node_exporter) automatically picks up the metrics.
+
+Verify metrics are being exposed:
+
+```
+paul@f0:~ % curl -s http://localhost:9100/metrics | grep "^zfs_pool" | head -5
+zfs_pool_allocated_bytes{pool="zdata"} 6.47622733824e+11
+zfs_pool_allocated_bytes{pool="zroot"} 5.3338578944e+10
+zfs_pool_capacity_percent{pool="zdata"} 64
+zfs_pool_capacity_percent{pool="zroot"} 10
+zfs_pool_free_bytes{pool="zdata"} 3.48809678848e+11
+```
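
With the pool metrics now in Prometheus, capacity and health alerts become straightforward. A sketch (hypothetical rule name; the 80% threshold matches the capacity guidance above):

```
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: freebsd-zfs-pool-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
  - name: freebsd-zfs-pools
    rules:
    - alert: ZfsPoolCapacityHigh
      expr: zfs_pool_capacity_percent > 80
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Pool {{ $labels.pool }} on {{ $labels.instance }} is above 80% capacity"
    - alert: ZfsPoolNotOnline
      expr: zfs_pool_health > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pool {{ $labels.pool }} on {{ $labels.instance }} is not ONLINE"
```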
+
+> Updated Mon 09 Mar: Added section about distributed tracing with Grafana Tempo
+
+## Distributed Tracing with Grafana Tempo
+
+After implementing logs (Loki) and metrics (Prometheus), the final pillar of observability is distributed tracing. Grafana Tempo provides distributed tracing capabilities that help understand request flows across microservices.
+
+What does tracing with Tempo look like in Grafana? Have a look at my X-RAG blog post:
+
+=> ./2025-12-24-x-rag-observability-hackathon.gmi X-RAG Observability Hackathon
+
+### Why Distributed Tracing?
+
+In a microservices architecture, a single user request may traverse multiple services. Distributed tracing:
+
+* Tracks requests across service boundaries
+* Identifies performance bottlenecks
+* Visualizes service dependencies
+* Correlates with logs and metrics
+* Helps debug complex distributed systems
+
+### Deploying Grafana Tempo
+
+Tempo is deployed in monolithic mode, following the same pattern as Loki's SingleBinary deployment.
+
+#### Configuration Strategy
+
+**Deployment Mode:** Monolithic (all components in one process)
+* Simpler operation than microservices mode
+* Suitable for the cluster scale
+* Consistent with Loki deployment pattern
+
+**Storage:** Filesystem backend using hostPath
+* 10Gi storage at /data/nfs/k3svolumes/tempo/data
+* 7-day retention (168h)
+* Local storage is the only option for monolithic mode
+
+**OTLP Receivers:** Standard OpenTelemetry Protocol ports
+* gRPC: 4317
+* HTTP: 4318
+* Bind to 0.0.0.0 to avoid the Tempo 2.7+ localhost-only binding issue
+
+#### Tempo Deployment Files
+
+Created in /home/paul/git/conf/f3s/tempo/:
+
+**values.yaml** - Helm chart configuration:
+
+```
+tempo:
+ retention: 168h
+ storage:
+ trace:
+ backend: local
+ local:
+ path: /var/tempo/traces
+ wal:
+ path: /var/tempo/wal
+ receivers:
+ otlp:
+ protocols:
+ grpc:
+ endpoint: 0.0.0.0:4317
+ http:
+ endpoint: 0.0.0.0:4318
+
+persistence:
+ enabled: true
+ size: 10Gi
+ storageClassName: ""
+
+resources:
+ limits:
+ cpu: 1000m
+ memory: 2Gi
+ requests:
+ cpu: 500m
+ memory: 1Gi
+```
+
+**persistent-volumes.yaml** - Storage configuration:
+
+```
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+ name: tempo-data-pv
+spec:
+ capacity:
+ storage: 10Gi
+ accessModes:
+ - ReadWriteOnce
+ persistentVolumeReclaimPolicy: Retain
+ hostPath:
+ path: /data/nfs/k3svolumes/tempo/data
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+ name: tempo-data-pvc
+ namespace: monitoring
+spec:
+ storageClassName: ""
+ accessModes:
+ - ReadWriteOnce
+ resources:
+ requests:
+ storage: 10Gi
+```
+
+**Grafana Datasource Provisioning**
+
+All Grafana datasources (Prometheus, Alertmanager, Loki, Tempo) are provisioned via a unified ConfigMap that is directly mounted to the Grafana pod. This approach ensures datasources are loaded on startup without requiring sidecar-based discovery.
+
+In /home/paul/git/conf/f3s/prometheus/grafana-datasources-all.yaml:
+
+```
+apiVersion: v1
+kind: ConfigMap
+metadata:
+ name: grafana-datasources-all
+ namespace: monitoring
+data:
+ datasources.yaml: |
+ apiVersion: 1
+ datasources:
+ - name: Prometheus
+ type: prometheus
+ uid: prometheus
+ url: http://prometheus-kube-prometheus-prometheus.monitoring:9090/
+ access: proxy
+ isDefault: true
+ - name: Alertmanager
+ type: alertmanager
+ uid: alertmanager
+ url: http://prometheus-kube-prometheus-alertmanager.monitoring:9093/
+ - name: Loki
+ type: loki
+ uid: loki
+ url: http://loki.monitoring.svc.cluster.local:3100
+ - name: Tempo
+ type: tempo
+ uid: tempo
+ url: http://tempo.monitoring.svc.cluster.local:3200
+ jsonData:
+ tracesToLogsV2:
+ datasourceUid: loki
+ spanStartTimeShift: -1h
+ spanEndTimeShift: 1h
+ tracesToMetrics:
+ datasourceUid: prometheus
+ serviceMap:
+ datasourceUid: prometheus
+ nodeGraph:
+ enabled: true
+```
+
+The kube-prometheus-stack Helm values (persistence-values.yaml) are configured to:
+* Disable sidecar-based datasource provisioning
+* Mount grafana-datasources-all ConfigMap directly to /etc/grafana/provisioning/datasources/
+
+This direct mounting approach is simpler and more reliable than sidecar-based discovery.
+
+#### Installation
+
+```
+cd /home/paul/git/conf/f3s/tempo
+just install
+```
+
+Verify Tempo is running:
+
+```
+kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo
+kubectl exec -n monitoring <tempo-pod> -- wget -qO- http://localhost:3200/ready
+```
+
+### Configuring Grafana Alloy for Trace Collection
+
+Updated /home/paul/git/conf/f3s/loki/alloy-values.yaml to add OTLP receivers for traces while maintaining existing log collection.
+
+#### OTLP Receiver Configuration
+
+Added to Alloy configuration after the log collection pipeline:
+
+```
+// OTLP receiver for traces via gRPC and HTTP
+otelcol.receiver.otlp "default" {
+ grpc {
+ endpoint = "0.0.0.0:4317"
+ }
+ http {
+ endpoint = "0.0.0.0:4318"
+ }
+ output {
+ traces = [otelcol.processor.batch.default.input]
+ }
+}
+
+// Batch processor for efficient trace forwarding
+otelcol.processor.batch "default" {
+ timeout = "5s"
+ send_batch_size = 100
+ send_batch_max_size = 200
+ output {
+ traces = [otelcol.exporter.otlp.tempo.input]
+ }
+}
+
+// OTLP exporter to send traces to Tempo
+otelcol.exporter.otlp "tempo" {
+ client {
+ endpoint = "tempo.monitoring.svc.cluster.local:4317"
+ tls {
+ insecure = true
+ }
+ compression = "gzip"
+ }
+}
+```
+
+The batch processor reduces network overhead by accumulating spans before forwarding to Tempo.
+
+#### Upgrade Alloy
+
+```
+cd /home/paul/git/conf/f3s/loki
+just upgrade
+```
+
+Verify OTLP receivers are listening:
+
+```
+kubectl logs -n monitoring -l app.kubernetes.io/name=alloy | grep -i "otlp.*receiver"
+kubectl exec -n monitoring <alloy-pod> -- netstat -ln | grep -E ':(4317|4318)'
+```
+
+### Demo Tracing Application
+
+Created a three-tier Python application to demonstrate distributed tracing in action.
+
+#### Application Architecture
+
+```
+User → Frontend (Flask:5000) → Middleware (Flask:5001) → Backend (Flask:5002)
+ ↓ ↓ ↓
+ Alloy (OTLP:4317) → Tempo → Grafana
+```
+
+Frontend Service:
+
+* Receives HTTP requests at /api/process
+* Forwards to middleware service
+* Creates parent span for the entire request
+
+Middleware Service:
+
+* Transforms data at /api/transform
+* Calls backend service
+* Creates child span linked to frontend
+
+Backend Service:
+
+* Returns data at /api/data
+* Simulates database query (100ms sleep)
+* Creates leaf span in the trace
+
+OpenTelemetry Instrumentation:
+
+All services use Python OpenTelemetry libraries:
+
+**Dependencies:**
+```
+flask==3.0.0
+requests==2.31.0
+opentelemetry-distro==0.49b0
+opentelemetry-exporter-otlp==1.28.0
+opentelemetry-instrumentation-flask==0.49b0
+opentelemetry-instrumentation-requests==0.49b0
+```
+
+**Auto-instrumentation pattern** (used in all services):
+
+```python
+from flask import Flask
+from opentelemetry import trace
+from opentelemetry.sdk.trace import TracerProvider
+from opentelemetry.sdk.trace.export import BatchSpanProcessor
+from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
+from opentelemetry.instrumentation.flask import FlaskInstrumentor
+from opentelemetry.instrumentation.requests import RequestsInstrumentor
+from opentelemetry.sdk.resources import Resource
+
+app = Flask(__name__)
+
+# Define service identity
+resource = Resource(attributes={
+ "service.name": "frontend",
+ "service.namespace": "tracing-demo",
+ "service.version": "1.0.0"
+})
+
+provider = TracerProvider(resource=resource)
+
+# Export to Alloy
+otlp_exporter = OTLPSpanExporter(
+ endpoint="http://alloy.monitoring.svc.cluster.local:4317",
+ insecure=True
+)
+
+processor = BatchSpanProcessor(otlp_exporter)
+provider.add_span_processor(processor)
+trace.set_tracer_provider(provider)
+
+# Auto-instrument Flask and requests
+FlaskInstrumentor().instrument_app(app)
+RequestsInstrumentor().instrument()
+```
+
+The auto-instrumentation automatically:
+* Creates spans for HTTP requests
+* Propagates trace context via W3C Trace Context headers
+* Links parent and child spans across service boundaries
+
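Context propagation happens via the traceparent HTTP header. A minimal Python sketch of its layout (the format is version-traceid-spanid-flags; the span ID below is made up):

```python
# W3C Trace Context: traceparent = version-traceid-spanid-flags
def parse_traceparent(header):
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent header")
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        # Lowest flag bit marks the trace as sampled
        "sampled": int(flags, 16) & 0x01 == 1,
    }

ctx = parse_traceparent("00-4be1151c0bdcd5625ac7e02b98d95bd5-00f067aa0ba902b7-01")
print(ctx["trace_id"], ctx["sampled"])
```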
+Deployment:
+
+Created Helm chart in /home/paul/git/conf/f3s/tracing-demo/ with three separate deployments, services, and an ingress.
+
+Build and deploy:
+
+```
+cd /home/paul/git/conf/f3s/tracing-demo
+just build
+just import
+just install
+```
+
+Verify deployment:
+
+```
+kubectl get pods -n services | grep tracing-demo
+kubectl get ingress -n services tracing-demo-ingress
+```
+
+Access the application at:
+
+=> http://tracing-demo.f3s.buetow.org
+
+### Visualizing Traces in Grafana
+
+The Tempo datasource is provisioned through the directly mounted grafana-datasources-all ConfigMap described above.
+
+#### Accessing Traces
+
+Navigate to Grafana → Explore → Select "Tempo" datasource
+
+**Search Interface:**
+* Search by Trace ID
+* Search by service name
+* Search by tags
+
+**TraceQL Queries:**
+
+Find all traces from demo app:
+```
+{ resource.service.namespace = "tracing-demo" }
+```
+
+Find slow requests (>200ms):
+```
+{ duration > 200ms }
+```
+
+Find traces from specific service:
+```
+{ resource.service.name = "frontend" }
+```
+
+Find errors:
+```
+{ status = error }
+```
+
+Complex query - traces in the demo namespace that returned server errors:
+```
+{ resource.service.namespace = "tracing-demo" } && { span.http.status_code >= 500 }
+```
+
+#### Service Graph Visualization
+
+The service graph shows visual connections between services:
+
+1. Navigate to Explore → Tempo
+2. Enable "Service Graph" view
+3. Shows: Frontend → Middleware → Backend with request rates
+
+The service graph uses Prometheus metrics generated from trace data.
+
+### Correlation Between Observability Signals
+
+Tempo integrates with Loki and Prometheus to provide unified observability.
+
+#### Traces-to-Logs
+
+Click on any span in a trace to see related logs:
+
+1. View trace in Grafana
+2. Click on a span
+3. Select "Logs for this span"
+4. Loki shows logs filtered by:
+ * Time range (span duration ± 1 hour)
+ * Service name
+ * Namespace
+ * Pod
+
+This helps correlate what the service was doing when the span was created.
+
+#### Traces-to-Metrics
+
+View Prometheus metrics for services in the trace:
+
+1. View trace in Grafana
+2. Select "Metrics" tab
+3. Shows metrics like:
+ * Request rate
+ * Error rate
+ * Duration percentiles
+
+#### Logs-to-Traces
+
+From logs, you can jump to related traces:
+
+1. In Loki, logs that contain trace IDs are automatically linked
+2. Click the trace ID to view the full trace
+3. See the complete request flow
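
This link is configured on the Loki datasource via derived fields. A sketch of what such an entry could look like in the provisioning ConfigMap from earlier (the matcherRegex is a placeholder and must match how your applications actually log trace IDs):

```
- name: Loki
  type: loki
  uid: loki
  url: http://loki.monitoring.svc.cluster.local:3100
  jsonData:
    derivedFields:
      - name: TraceID
        datasourceUid: tempo
        # Placeholder pattern; adapt to your log format
        matcherRegex: 'trace_id=(\w+)'
        url: '$${__value.raw}'
```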
+
+### Generating Traces for Testing
+
+Test the demo application:
+
+```
+curl http://tracing-demo.f3s.buetow.org/api/process
+```
+
+Load test (generates 50 traces):
+
+```
+cd /home/paul/git/conf/f3s/tracing-demo
+just load-test
+```
+
+Each request creates a distributed trace spanning all three services.
+
+### Verifying the Complete Pipeline
+
+Check the trace flow end-to-end:
+
+**1. Application generates traces:**
+```
+kubectl logs -n services -l app=tracing-demo-frontend | grep -i trace
+```
+
+**2. Alloy receives traces:**
+```
+kubectl logs -n monitoring -l app.kubernetes.io/name=alloy | grep -i otlp
+```
+
+**3. Tempo stores traces:**
+```
+kubectl logs -n monitoring -l app.kubernetes.io/name=tempo | grep -i trace
+```
+
+**4. Grafana displays traces:**
+Navigate to Explore → Tempo → Search for traces
+
+### Practical Example: Viewing a Distributed Trace
+
+Let's generate a trace and examine it in Grafana.
+
+**1. Generate a trace by calling the demo application:**
+
+```
+curl -H "Host: tracing-demo.f3s.buetow.org" http://r0/api/process
+```
+
+**Response (HTTP 200):**
+
+```json
+{
+ "middleware_response": {
+ "backend_data": {
+ "data": {
+ "id": 12345,
+ "query_time_ms": 100.0,
+ "timestamp": "2025-12-28T18:35:01.064538",
+ "value": "Sample data from backend service"
+ },
+ "service": "backend"
+ },
+ "middleware_processed": true,
+ "original_data": {
+ "source": "GET request"
+ },
+ "transformation_time_ms": 50
+ },
+ "request_data": {
+ "source": "GET request"
+ },
+ "service": "frontend",
+ "status": "success"
+}
+```
+
+**2. Find the trace in Tempo via API:**
+
+After a few seconds (for batch export), search for recent traces:
+
+```
+kubectl exec -n monitoring tempo-0 -- wget -qO- \
+ 'http://localhost:3200/api/search?tags=service.namespace%3Dtracing-demo&limit=5' 2>/dev/null | \
+ python3 -m json.tool
+```
+
+Returns traces including:
+
+```json
+{
+ "traceID": "4be1151c0bdcd5625ac7e02b98d95bd5",
+ "rootServiceName": "frontend",
+ "rootTraceName": "GET /api/process",
+ "durationMs": 221
+}
+```
+
+**3. Fetch complete trace details:**
+
+```
+kubectl exec -n monitoring tempo-0 -- wget -qO- \
+ 'http://localhost:3200/api/traces/4be1151c0bdcd5625ac7e02b98d95bd5' 2>/dev/null | \
+ python3 -m json.tool
+```
+
+**Trace structure (8 spans across 3 services):**
+
+```
+Trace ID: 4be1151c0bdcd5625ac7e02b98d95bd5
+Services: 3 (frontend, middleware, backend)
+
+Service: frontend
+ └─ GET /api/process 221.10ms (HTTP server span)
+ └─ frontend-process 216.23ms (custom business logic span)
+ └─ POST 209.97ms (HTTP client span to middleware)
+
+Service: middleware
+ └─ POST /api/transform 186.02ms (HTTP server span)
+ └─ middleware-transform 180.96ms (custom business logic span)
+ └─ GET 127.52ms (HTTP client span to backend)
+
+Service: backend
+ └─ GET /api/data 103.93ms (HTTP server span)
+ └─ backend-get-data 102.11ms (custom business logic span with 100ms sleep)
+```
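
This tree can also be reconstructed programmatically. Below is a Python sketch that flattens a trace into per-service span durations, assuming the OTLP-JSON "batches" layout that Tempo's /api/traces endpoint returns (run here against a hand-made miniature trace, not real API output):

```python
def spans_with_services(trace):
    """Flatten an OTLP-JSON trace into (service, span name, duration ms) tuples."""
    rows = []
    for batch in trace.get("batches", []):
        attrs = batch.get("resource", {}).get("attributes", [])
        # The service name lives in the resource attributes of each batch
        service = next((a["value"]["stringValue"] for a in attrs
                        if a["key"] == "service.name"), "unknown")
        for scope in batch.get("scopeSpans", []):
            for span in scope.get("spans", []):
                dur_ms = (int(span["endTimeUnixNano"]) -
                          int(span["startTimeUnixNano"])) / 1e6
                rows.append((service, span["name"], dur_ms))
    return rows

# Miniature hand-made example trace (not real Tempo output):
trace = {"batches": [{
    "resource": {"attributes": [
        {"key": "service.name", "value": {"stringValue": "frontend"}}]},
    "scopeSpans": [{"spans": [{
        "name": "GET /api/process",
        "startTimeUnixNano": "0", "endTimeUnixNano": "221100000"}]}],
}]}

for service, name, dur in spans_with_services(trace):
    print(f"{service}: {name} {dur:.2f}ms")  # frontend: GET /api/process 221.10ms
```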
+
+**4. View the trace in Grafana UI:**
+
+Navigate to: Grafana → Explore → Tempo datasource
+
+Search using TraceQL:
+```
+{ resource.service.namespace = "tracing-demo" }
+```
+
+Or directly open the trace by pasting the trace ID in the search box:
+```
+4be1151c0bdcd5625ac7e02b98d95bd5
+```
+
+**5. Trace visualization:**
+
+The trace waterfall view in Grafana shows the complete request flow with timing:
+
+=> ./f3s-kubernetes-with-freebsd-part-8/grafana-tempo-trace.png Distributed trace visualization in Grafana Tempo showing Frontend → Middleware → Backend spans
+
+For additional examples of Tempo trace visualization, see also:
+
+=> https://foo.zone/gemfeed/2025-12-24-x-rag-observability-hackathon.html X-RAG Observability Hackathon (more Grafana Tempo screenshots)
+
+The trace reveals the distributed request flow:
+
+* Frontend (221ms): Receives GET /api/process, executes business logic, calls middleware
+* Middleware (186ms): Receives POST /api/transform, transforms data, calls backend
+* Backend (104ms): Receives GET /api/data, simulates database query with 100ms sleep
+* Total request time: 221ms end-to-end
+* Span propagation: W3C Trace Context headers automatically link all spans
+
+**6. Service graph visualization:**
+
+The service graph is automatically generated from traces and shows service dependencies. For examples of service graph visualization in Grafana, see the screenshots in the X-RAG Observability Hackathon blog post.
+
+=> ./2025-12-24-x-rag-observability-hackathon.gmi X-RAG Observability Hackathon (includes service graph screenshots)
+
+This visualization helps identify:
+
+* Request rates between services
+* Average latency for each hop
+* Error rates (if any)
+* Service dependencies and communication patterns
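The numbers behind the graph come from Tempo's metrics-generator, which derives service-graph metrics from incoming spans. Assuming the generator is enabled and remote-writing to Prometheus, queries along these lines surface the same information as the graph panels:

```
# Request rate between service pairs (per second, 5m window)
sum by (client, server) (rate(traces_service_graph_request_total[5m]))

# Average server-side latency per hop
sum by (server) (rate(traces_service_graph_request_server_seconds_sum[5m]))
  /
sum by (server) (rate(traces_service_graph_request_server_seconds_count[5m]))
```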
+
+### Storage and Retention
+
+Monitor Tempo storage usage:
+
+```
+kubectl exec -n monitoring <tempo-pod> -- df -h /var/tempo
+```
+
+With 10Gi storage and 7-day retention, the system handles moderate trace volumes. If storage fills up:
+
+* Reduce retention to 72h (3 days)
+* Implement sampling in Alloy
+* Increase PV size
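Reducing retention is a one-line change. A sketch of the relevant `tempo.yaml` fragment, assuming the single-binary configuration layout used here:

```
compactor:
  compaction:
    block_retention: 72h   # down from 168h (7 days)
```

The compactor deletes blocks older than `block_retention` on its next cycle, so disk usage drops gradually rather than immediately.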
+
+### Configuration Files
+
+All configuration files are available on Codeberg:
+
+=> https://codeberg.org/snonux/conf/src/branch/master/f3s/tempo Tempo configuration
+=> https://codeberg.org/snonux/conf/src/branch/master/f3s/loki Alloy configuration (updated for traces)
+=> https://codeberg.org/snonux/conf/src/branch/master/f3s/tracing-demo Demo tracing application
+
## Summary
-With Prometheus, Grafana, Loki, and Alloy deployed, I now have complete visibility into the k3s cluster, the FreeBSD storage servers, and the OpenBSD edge relays:
+With Prometheus, Grafana, Loki, Alloy, and Tempo deployed, I now have complete visibility into the k3s cluster, the FreeBSD storage servers, and the OpenBSD edge relays:
-* metrics: Prometheus collects and stores time-series data from all components
+* Metrics: Prometheus collects and stores time-series data from all components, including etcd and ZFS
* Logs: Loki aggregates logs from all containers, searchable via Grafana
-* Visualisation: Grafana provides dashboards and exploration tools
+* Traces: Tempo provides distributed request tracing with service dependency mapping
+* Visualisation: Grafana provides dashboards and exploration tools with correlation between all three signals
* Alerting: Alertmanager can notify on conditions defined in Prometheus rules
This observability stack runs entirely on the home lab infrastructure, with data persisted to the NFS share. It's lightweight enough for a three-node cluster but provides the same capabilities as production-grade setups.
+=> https://codeberg.org/snonux/conf/src/branch/master/f3s/prometheus Prometheus configuration on Codeberg
+
Other *BSD-related posts:
<< template::inline::rindex bsd
diff --git a/gemfeed/atom.xml b/gemfeed/atom.xml
index d4f9ade4..17afd024 100644
--- a/gemfeed/atom.xml
+++ b/gemfeed/atom.xml
@@ -1,6 +1,6 @@
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
- <updated>2026-03-03T09:08:49+02:00</updated>
+ <updated>2026-03-09T08:44:03+02:00</updated>
<title>foo.zone feed</title>
<subtitle>To be in the .zone!</subtitle>
<link href="gemini://foo.zone/gemfeed/atom.xml" rel="self" />
@@ -3521,7 +3521,7 @@ $ curl -s -G "http://localhost:3200/api/search" \
<title>f3s: Kubernetes with FreeBSD - Part 8: Observability</title>
<link href="gemini://foo.zone/gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.gmi" />
<id>gemini://foo.zone/gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.gmi</id>
- <updated>2025-12-06T23:58:24+02:00</updated>
+ <updated>2026-03-09T09:33:08+02:00</updated>
<author>
<name>Paul Buetow aka snonux</name>
<email>paul@dev.buetow.org</email>
@@ -3531,7 +3531,7 @@ $ curl -s -G "http://localhost:3200/api/search" \
<div xmlns="http://www.w3.org/1999/xhtml">
<h1 style='display: inline' id='f3s-kubernetes-with-freebsd---part-8-observability'>f3s: Kubernetes with FreeBSD - Part 8: Observability</h1><br />
<br />
-<span class='quote'>Published at 2025-12-06T23:58:24+02:00</span><br />
+<span class='quote'>Published at 2025-12-06T23:58:24+02:00, last updated Mon 09 Mar 09:33:08 EET 2026</span><br />
<br />
<span>This is the 8th blog post about the f3s series for my self-hosting demands in a home lab. f3s? The "f" stands for FreeBSD, and the "3s" stands for k3s, the Kubernetes distribution I use on FreeBSD-based physical machines.</span><br />
<br />
@@ -3577,6 +3577,46 @@ $ curl -s -G "http://localhost:3200/api/search" \
<li>⇢ ⇢ <a href='#installing-node-exporter-on-openbsd'>Installing Node Exporter on OpenBSD</a></li>
<li>⇢ ⇢ <a href='#adding-openbsd-hosts-to-prometheus'>Adding OpenBSD hosts to Prometheus</a></li>
<li>⇢ ⇢ <a href='#openbsd-memory-metrics-compatibility'>OpenBSD memory metrics compatibility</a></li>
+<li>⇢ <a href='#enabling-etcd-metrics-in-k3s'>Enabling etcd metrics in k3s</a></li>
+<li>⇢ ⇢ <a href='#configuring-prometheus-to-scrape-etcd'>Configuring Prometheus to scrape etcd</a></li>
+<li>⇢ ⇢ <a href='#verifying-etcd-metrics'>Verifying etcd metrics</a></li>
+<li>⇢ ⇢ <a href='#complete-persistence-valuesyaml'>Complete persistence-values.yaml</a></li>
+<li>⇢ <a href='#zfs-monitoring-for-freebsd-servers'>ZFS Monitoring for FreeBSD Servers</a></li>
+<li>⇢ ⇢ <a href='#node-exporter-zfs-collector'>Node Exporter ZFS Collector</a></li>
+<li>⇢ ⇢ <a href='#verifying-zfs-metrics'>Verifying ZFS Metrics</a></li>
+<li>⇢ ⇢ <a href='#zfs-recording-rules'>ZFS Recording Rules</a></li>
+<li>⇢ ⇢ <a href='#grafana-dashboards'>Grafana Dashboards</a></li>
+<li>⇢ ⇢ <a href='#deployment'>Deployment</a></li>
+<li>⇢ ⇢ <a href='#verifying-zfs-metrics-in-prometheus'>Verifying ZFS Metrics in Prometheus</a></li>
+<li>⇢ ⇢ <a href='#accessing-the-dashboards'>Accessing the Dashboards</a></li>
+<li>⇢ ⇢ <a href='#key-metrics-to-monitor'>Key Metrics to Monitor</a></li>
+<li>⇢ ⇢ <a href='#zfs-pool-and-dataset-metrics-via-textfile-collector'>ZFS Pool and Dataset Metrics via Textfile Collector</a></li>
+<li>⇢ <a href='#distributed-tracing-with-grafana-tempo'>Distributed Tracing with Grafana Tempo</a></li>
+<li>⇢ ⇢ <a href='#why-distributed-tracing'>Why Distributed Tracing?</a></li>
+<li>⇢ ⇢ <a href='#deploying-grafana-tempo'>Deploying Grafana Tempo</a></li>
+<li>⇢ <a href='#-configuration-strategy'>⇢# Configuration Strategy</a></li>
+<li>⇢ <a href='#-tempo-deployment-files'>⇢# Tempo Deployment Files</a></li>
+<li>⇢ <a href='#-installation'>⇢# Installation</a></li>
+<li>⇢ ⇢ <a href='#configuring-grafana-alloy-for-trace-collection'>Configuring Grafana Alloy for Trace Collection</a></li>
+<li>⇢ <a href='#-otlp-receiver-configuration'>⇢# OTLP Receiver Configuration</a></li>
+<li>⇢ <a href='#-upgrade-alloy'>⇢# Upgrade Alloy</a></li>
+<li>⇢ ⇢ <a href='#demo-tracing-application'>Demo Tracing Application</a></li>
+<li>⇢ <a href='#-application-architecture'>⇢# Application Architecture</a></li>
+<li>⇢ <a href='#-opentelemetry-instrumentation'>⇢# OpenTelemetry Instrumentation</a></li>
+<li>⇢ <a href='#-deployment'>⇢# Deployment</a></li>
+<li>⇢ ⇢ <a href='#visualizing-traces-in-grafana'>Visualizing Traces in Grafana</a></li>
+<li>⇢ <a href='#-accessing-traces'>⇢# Accessing Traces</a></li>
+<li>⇢ <a href='#-service-graph-visualization'>⇢# Service Graph Visualization</a></li>
+<li>⇢ ⇢ <a href='#correlation-between-observability-signals'>Correlation Between Observability Signals</a></li>
+<li>⇢ <a href='#-traces-to-logs'>⇢# Traces-to-Logs</a></li>
+<li>⇢ <a href='#-traces-to-metrics'>⇢# Traces-to-Metrics</a></li>
+<li>⇢ <a href='#-logs-to-traces'>⇢# Logs-to-Traces</a></li>
+<li>⇢ ⇢ <a href='#generating-traces-for-testing'>Generating Traces for Testing</a></li>
+<li>⇢ ⇢ <a href='#verifying-the-complete-pipeline'>Verifying the Complete Pipeline</a></li>
+<li>⇢ ⇢ <a href='#practical-example-viewing-a-distributed-trace'>Practical Example: Viewing a Distributed Trace</a></li>
+<li>⇢ ⇢ <a href='#storage-and-retention'>Storage and Retention</a></li>
+<li>⇢ ⇢ <a href='#complete-observability-stack'>Complete Observability Stack</a></li>
+<li>⇢ ⇢ <a href='#configuration-files'>Configuration Files</a></li>
<li>⇢ <a href='#summary'>Summary</a></li>
</ul><br />
<h2 style='display: inline' id='introduction'>Introduction</h2><br />
@@ -3605,10 +3645,10 @@ $ curl -s -G "http://localhost:3200/api/search" \
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>$ git clone https://codeberg.org/snonux/conf.git
-$ cd conf
-$ git checkout 15a86f3 <i><font color="silver"># Last commit before ArgoCD migration</font></i>
-$ cd f3s/prometheus/
+<pre><font color="#ff0000">$ git clone https</font><font color="#F3E651">:</font><font color="#ff0000">//codeberg</font><font color="#F3E651">.</font><font color="#ff0000">org/snonux/conf</font><font color="#F3E651">.</font><font color="#ff0000">git</font>
+<font color="#ff0000">$ cd conf</font>
+<font color="#ff0000">$ git checkout 15a86f3 </font><i><font color="#ababab"># Last commit before ArgoCD migration</font></i>
+<font color="#ff0000">$ cd f3s/prometheus</font><font color="#F3E651">/</font>
</pre>
<br />
<span>**Current master branch** contains the ArgoCD-managed versions with:</span><br />
@@ -3644,8 +3684,8 @@ $ cd f3s/prometheus/
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>$ kubectl create namespace monitoring
-namespace/monitoring created
+<pre><font color="#ff0000">$ kubectl create namespace monitoring</font>
+<font color="#ff0000">namespace/monitoring created</font>
</pre>
<br />
<h2 style='display: inline' id='installing-prometheus-and-grafana'>Installing Prometheus and Grafana</h2><br />
@@ -3660,8 +3700,8 @@ namespace/monitoring created
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
-$ helm repo update
+<pre><font color="#ff0000">$ helm repo add prometheus-community https</font><font color="#F3E651">:</font><font color="#ff0000">//prometheus-community</font><font color="#F3E651">.</font><font color="#ff0000">github</font><font color="#F3E651">.</font><font color="#ff0000">io/helm-charts</font>
+<font color="#ff0000">$ helm repo update</font>
</pre>
<br />
<span>Create the directories on the NFS server for persistent storage:</span><br />
@@ -3670,8 +3710,8 @@ $ helm repo update
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>[root@r0 ~]<i><font color="silver"># mkdir -p /data/nfs/k3svolumes/prometheus/data</font></i>
-[root@r0 ~]<i><font color="silver"># mkdir -p /data/nfs/k3svolumes/grafana/data</font></i>
+<pre><font color="#F3E651">[</font><font color="#ff0000">root@r0 </font><font color="#F3E651">~]</font><i><font color="#ababab"># mkdir -p /data/nfs/k3svolumes/prometheus/data</font></i>
+<font color="#F3E651">[</font><font color="#ff0000">root@r0 </font><font color="#F3E651">~]</font><i><font color="#ababab"># mkdir -p /data/nfs/k3svolumes/grafana/data</font></i>
</pre>
<br />
<h3 style='display: inline' id='deploying-with-the-justfile'>Deploying with the Justfile</h3><br />
@@ -3687,18 +3727,18 @@ http://www.gnu.org/software/src-highlite -->
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>$ cd conf/f3s/prometheus
-$ just install
-kubectl apply -f persistent-volumes.yaml
-persistentvolume/prometheus-data-pv created
-persistentvolume/grafana-data-pv created
-persistentvolumeclaim/grafana-data-pvc created
-helm install prometheus prometheus-community/kube-prometheus-stack \
- --namespace monitoring -f persistence-values.yaml
-NAME: prometheus
-LAST DEPLOYED: ...
-NAMESPACE: monitoring
-STATUS: deployed
+<pre><font color="#ff0000">$ cd conf/f3s/prometheus</font>
+<font color="#ff0000">$ just install</font>
+<font color="#ff0000">kubectl apply -f persistent-volumes</font><font color="#F3E651">.</font><font color="#ff0000">yaml</font>
+<font color="#ff0000">persistentvolume/prometheus-data-pv created</font>
+<font color="#ff0000">persistentvolume/grafana-data-pv created</font>
+<font color="#ff0000">persistentvolumeclaim/grafana-data-pvc created</font>
+<font color="#ff0000">helm install prometheus prometheus-community/kube-prometheus-stack </font><font color="#F3E651">\</font>
+<font color="#ff0000"> --namespace monitoring -f persistence-values</font><font color="#F3E651">.</font><font color="#ff0000">yaml</font>
+<font color="#ff0000">NAME</font><font color="#F3E651">:</font><font color="#ff0000"> prometheus</font>
+<font color="#ff0000">LAST DEPLOYED</font><font color="#F3E651">:</font><font color="#ff0000"> </font><font color="#F3E651">...</font>
+<font color="#ff0000">NAMESPACE</font><font color="#F3E651">:</font><font color="#ff0000"> monitoring</font>
+<font color="#ff0000">STATUS</font><font color="#F3E651">:</font><font color="#ff0000"> deployed</font>
</pre>
<br />
<span>The <span class='inlinecode'>persistence-values.yaml</span> configures Prometheus and Grafana to use the NFS-backed persistent volumes I mentioned earlier, ensuring data survives pod restarts. It also enables scraping of etcd and kube-controller-manager metrics:</span><br />
@@ -3737,11 +3777,11 @@ kubeControllerManager:
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>[root@r0 ~]<i><font color="silver"># cat &gt;&gt; /etc/rancher/k3s/config.yaml &lt;&lt; 'EOF'</font></i>
-kube-controller-manager-arg:
- - bind-address=<font color="#000000">0.0</font>.<font color="#000000">0.0</font>
-EOF
-[root@r0 ~]<i><font color="silver"># systemctl restart k3s</font></i>
+<pre><font color="#F3E651">[</font><font color="#ff0000">root@r0 </font><font color="#F3E651">~]</font><i><font color="#ababab"># cat &gt;&gt; /etc/rancher/k3s/config.yaml &lt;&lt; 'EOF'</font></i>
+<font color="#ff0000">kube-controller-manager-arg</font><font color="#F3E651">:</font>
+<font color="#ff0000"> - bind-address</font><font color="#F3E651">=</font><font color="#bb00ff">0.0</font><font color="#F3E651">.</font><font color="#bb00ff">0.0</font>
+<font color="#ff0000">EOF</font>
+<font color="#F3E651">[</font><font color="#ff0000">root@r0 </font><font color="#F3E651">~]</font><i><font color="#ababab"># systemctl restart k3s</font></i>
</pre>
<br />
<span>Repeat for <span class='inlinecode'>r1</span> and <span class='inlinecode'>r2</span>. After restarting all nodes, the controller-manager metrics endpoint will be accessible and Prometheus can scrape it.</span><br />
@@ -3760,9 +3800,9 @@ EOF
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>$ kubectl get svc -n monitoring prometheus-kube-prometheus-prometheus
-NAME TYPE CLUSTER-IP PORT(S)
-prometheus-kube-prometheus-prometheus ClusterIP <font color="#000000">10.43</font>.<font color="#000000">152.163</font> <font color="#000000">9090</font>/TCP,<font color="#000000">8080</font>/TCP
+<pre><font color="#ff0000">$ kubectl get svc -n monitoring prometheus-kube-prometheus-prometheus</font>
+<font color="#ff0000">NAME TYPE CLUSTER-IP PORT</font><font color="#F3E651">(</font><font color="#ff0000">S</font><font color="#F3E651">)</font>
+<font color="#ff0000">prometheus-kube-prometheus-prometheus ClusterIP </font><font color="#bb00ff">10.43</font><font color="#F3E651">.</font><font color="#bb00ff">152.163</font><font color="#ff0000"> </font><font color="#bb00ff">9090</font><font color="#ff0000">/TCP</font><font color="#F3E651">,</font><font color="#bb00ff">8080</font><font color="#ff0000">/TCP</font>
</pre>
<br />
<span>Grafana connects to Prometheus using the internal service URL <span class='inlinecode'>http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090</span>. The default Grafana credentials are <span class='inlinecode'>admin</span>/<span class='inlinecode'>prom-operator</span>, which should be changed immediately after first login.</span><br />
@@ -3785,7 +3825,7 @@ prometheus-kube-prometheus-prometheus ClusterIP <font color="#000000">10.43<
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>[root@r0 ~]<i><font color="silver"># mkdir -p /data/nfs/k3svolumes/loki/data</font></i>
+<pre><font color="#F3E651">[</font><font color="#ff0000">root@r0 </font><font color="#F3E651">~]</font><i><font color="#ababab"># mkdir -p /data/nfs/k3svolumes/loki/data</font></i>
</pre>
<br />
<h3 style='display: inline' id='deploying-loki-and-alloy'>Deploying Loki and Alloy</h3><br />
@@ -3800,24 +3840,24 @@ http://www.gnu.org/software/src-highlite -->
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>$ cd conf/f3s/loki
-$ just install
-helm repo add grafana https://grafana.github.io/helm-charts || <b><u><font color="#000000">true</font></u></b>
-helm repo update
-kubectl apply -f persistent-volumes.yaml
-persistentvolume/loki-data-pv created
-persistentvolumeclaim/loki-data-pvc created
-helm install loki grafana/loki --namespace monitoring -f values.yaml
-NAME: loki
-LAST DEPLOYED: ...
-NAMESPACE: monitoring
-STATUS: deployed
-...
-helm install alloy grafana/alloy --namespace monitoring -f alloy-values.yaml
-NAME: alloy
-LAST DEPLOYED: ...
-NAMESPACE: monitoring
-STATUS: deployed
+<pre><font color="#ff0000">$ cd conf/f3s/loki</font>
+<font color="#ff0000">$ just install</font>
+<font color="#ff0000">helm repo add grafana https</font><font color="#F3E651">:</font><font color="#ff0000">//grafana</font><font color="#F3E651">.</font><font color="#ff0000">github</font><font color="#F3E651">.</font><font color="#ff0000">io/helm-charts </font><font color="#F3E651">||</font><font color="#ff0000"> </font><b><font color="#ffffff">true</font></b>
+<font color="#ff0000">helm repo update</font>
+<font color="#ff0000">kubectl apply -f persistent-volumes</font><font color="#F3E651">.</font><font color="#ff0000">yaml</font>
+<font color="#ff0000">persistentvolume/loki-data-pv created</font>
+<font color="#ff0000">persistentvolumeclaim/loki-data-pvc created</font>
+<font color="#ff0000">helm install loki grafana/loki --namespace monitoring -f values</font><font color="#F3E651">.</font><font color="#ff0000">yaml</font>
+<font color="#ff0000">NAME</font><font color="#F3E651">:</font><font color="#ff0000"> loki</font>
+<font color="#ff0000">LAST DEPLOYED</font><font color="#F3E651">:</font><font color="#ff0000"> </font><font color="#F3E651">...</font>
+<font color="#ff0000">NAMESPACE</font><font color="#F3E651">:</font><font color="#ff0000"> monitoring</font>
+<font color="#ff0000">STATUS</font><font color="#F3E651">:</font><font color="#ff0000"> deployed</font>
+<font color="#F3E651">...</font>
+<font color="#ff0000">helm install alloy grafana/alloy --namespace monitoring -f alloy-values</font><font color="#F3E651">.</font><font color="#ff0000">yaml</font>
+<font color="#ff0000">NAME</font><font color="#F3E651">:</font><font color="#ff0000"> alloy</font>
+<font color="#ff0000">LAST DEPLOYED</font><font color="#F3E651">:</font><font color="#ff0000"> </font><font color="#F3E651">...</font>
+<font color="#ff0000">NAMESPACE</font><font color="#F3E651">:</font><font color="#ff0000"> monitoring</font>
+<font color="#ff0000">STATUS</font><font color="#F3E651">:</font><font color="#ff0000"> deployed</font>
</pre>
<br />
<span>Loki runs in single-binary mode with a single replica (<span class='inlinecode'>loki-0</span>), which is appropriate for a home lab cluster. This means there&#39;s only one Loki pod running at any time. If the node hosting Loki fails, Kubernetes will automatically reschedule the pod to another worker node—but there will be a brief downtime (typically under a minute) while this happens. For my home lab use case, this is perfectly acceptable.</span><br />
@@ -3832,44 +3872,44 @@ STATUS: deployed
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>discovery.kubernetes <font color="#808080">"pods"</font> {
- role = <font color="#808080">"pod"</font>
-}
+<pre><font color="#ff0000">discovery</font><font color="#F3E651">.</font><font color="#ff0000">kubernetes </font><font color="#bb00ff">"pods"</font><font color="#ff0000"> {</font>
+<font color="#ff0000"> role </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#bb00ff">"pod"</font>
+<font color="#ff0000">}</font>
-discovery.relabel <font color="#808080">"pods"</font> {
- targets = discovery.kubernetes.pods.targets
+<font color="#ff0000">discovery</font><font color="#F3E651">.</font><font color="#ff0000">relabel </font><font color="#bb00ff">"pods"</font><font color="#ff0000"> {</font>
+<font color="#ff0000"> targets </font><font color="#F3E651">=</font><font color="#ff0000"> discovery</font><font color="#F3E651">.</font><font color="#ff0000">kubernetes</font><font color="#F3E651">.</font><font color="#ff0000">pods</font><font color="#F3E651">.</font><font color="#ff0000">targets</font>
- rule {
- source_labels = [<font color="#808080">"__meta_kubernetes_namespace"</font>]
- target_label = <font color="#808080">"namespace"</font>
- }
+<font color="#ff0000"> rule {</font>
+<font color="#ff0000"> source_labels </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#F3E651">[</font><font color="#bb00ff">"__meta_kubernetes_namespace"</font><font color="#F3E651">]</font>
+<font color="#ff0000"> target_label </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#bb00ff">"namespace"</font>
+<font color="#ff0000"> }</font>
- rule {
- source_labels = [<font color="#808080">"__meta_kubernetes_pod_name"</font>]
- target_label = <font color="#808080">"pod"</font>
- }
+<font color="#ff0000"> rule {</font>
+<font color="#ff0000"> source_labels </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#F3E651">[</font><font color="#bb00ff">"__meta_kubernetes_pod_name"</font><font color="#F3E651">]</font>
+<font color="#ff0000"> target_label </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#bb00ff">"pod"</font>
+<font color="#ff0000"> }</font>
- rule {
- source_labels = [<font color="#808080">"__meta_kubernetes_pod_container_name"</font>]
- target_label = <font color="#808080">"container"</font>
- }
+<font color="#ff0000"> rule {</font>
+<font color="#ff0000"> source_labels </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#F3E651">[</font><font color="#bb00ff">"__meta_kubernetes_pod_container_name"</font><font color="#F3E651">]</font>
+<font color="#ff0000"> target_label </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#bb00ff">"container"</font>
+<font color="#ff0000"> }</font>
- rule {
- source_labels = [<font color="#808080">"__meta_kubernetes_pod_label_app"</font>]
- target_label = <font color="#808080">"app"</font>
- }
-}
+<font color="#ff0000"> rule {</font>
+<font color="#ff0000"> source_labels </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#F3E651">[</font><font color="#bb00ff">"__meta_kubernetes_pod_label_app"</font><font color="#F3E651">]</font>
+<font color="#ff0000"> target_label </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#bb00ff">"app"</font>
+<font color="#ff0000"> }</font>
+<font color="#ff0000">}</font>
-loki.<b><u><font color="#000000">source</font></u></b>.kubernetes <font color="#808080">"pods"</font> {
- targets = discovery.relabel.pods.output
- forward_to = [loki.write.default.receiver]
-}
+<font color="#ff0000">loki</font><font color="#F3E651">.</font><b><font color="#ffffff">source</font></b><font color="#F3E651">.</font><font color="#ff0000">kubernetes </font><font color="#bb00ff">"pods"</font><font color="#ff0000"> {</font>
+<font color="#ff0000"> targets </font><font color="#F3E651">=</font><font color="#ff0000"> discovery</font><font color="#F3E651">.</font><font color="#ff0000">relabel</font><font color="#F3E651">.</font><font color="#ff0000">pods</font><font color="#F3E651">.</font><font color="#ff0000">output</font>
+<font color="#ff0000"> forward_to </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#F3E651">[</font><font color="#ff0000">loki</font><font color="#F3E651">.</font><font color="#ff0000">write</font><font color="#F3E651">.</font><font color="#ff0000">default</font><font color="#F3E651">.</font><font color="#ff0000">receiver</font><font color="#F3E651">]</font>
+<font color="#ff0000">}</font>
-loki.write <font color="#808080">"default"</font> {
- endpoint {
- url = <font color="#808080">"http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push"</font>
- }
-}
+<font color="#ff0000">loki</font><font color="#F3E651">.</font><font color="#ff0000">write </font><font color="#bb00ff">"default"</font><font color="#ff0000"> {</font>
+<font color="#ff0000"> endpoint {</font>
+<font color="#ff0000"> url </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#bb00ff">"http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push"</font>
+<font color="#ff0000"> }</font>
+<font color="#ff0000">}</font>
</pre>
<br />
<span>This configuration automatically labels each log line with the namespace, pod name, container name, and app label, making it easy to filter logs in Grafana.</span><br />
@@ -3882,9 +3922,9 @@ loki.write <font color="#808080">"default"</font> {
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>$ kubectl get svc -n monitoring loki
-NAME TYPE CLUSTER-IP PORT(S)
-loki ClusterIP <font color="#000000">10.43</font>.<font color="#000000">64.60</font> <font color="#000000">3100</font>/TCP,<font color="#000000">9095</font>/TCP
+<pre><font color="#ff0000">$ kubectl get svc -n monitoring loki</font>
+<font color="#ff0000">NAME TYPE CLUSTER-IP PORT</font><font color="#F3E651">(</font><font color="#ff0000">S</font><font color="#F3E651">)</font>
+<font color="#ff0000">loki ClusterIP </font><font color="#bb00ff">10.43</font><font color="#F3E651">.</font><font color="#bb00ff">64.60</font><font color="#ff0000"> </font><font color="#bb00ff">3100</font><font color="#ff0000">/TCP</font><font color="#F3E651">,</font><font color="#bb00ff">9095</font><font color="#ff0000">/TCP</font>
</pre>
<br />
<span>To add Loki as a data source in Grafana:</span><br />
@@ -3908,20 +3948,20 @@ loki ClusterIP <font color="#000000">10.43</font>.<font color="#000000">64.6
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>$ kubectl get pods -n monitoring
-NAME READY STATUS RESTARTS AGE
-alertmanager-prometheus-kube-prometheus-alertmanager-<font color="#000000">0</font> <font color="#000000">2</font>/<font color="#000000">2</font> Running <font color="#000000">0</font> 42d
-alloy-g5fgj <font color="#000000">2</font>/<font color="#000000">2</font> Running <font color="#000000">0</font> 29m
-alloy-nfw8w <font color="#000000">2</font>/<font color="#000000">2</font> Running <font color="#000000">0</font> 29m
-alloy-tg9vj <font color="#000000">2</font>/<font color="#000000">2</font> Running <font color="#000000">0</font> 29m
-loki-<font color="#000000">0</font> <font color="#000000">2</font>/<font color="#000000">2</font> Running <font color="#000000">0</font> 25m
-prometheus-grafana-868f9dc7cf-lg2vl <font color="#000000">3</font>/<font color="#000000">3</font> Running <font color="#000000">0</font> 42d
-prometheus-kube-prometheus-operator-8d7bbc48c-p4sf4 <font color="#000000">1</font>/<font color="#000000">1</font> Running <font color="#000000">0</font> 42d
-prometheus-kube-state-metrics-7c5fb9d798-hh2fx <font color="#000000">1</font>/<font color="#000000">1</font> Running <font color="#000000">0</font> 42d
-prometheus-prometheus-kube-prometheus-prometheus-<font color="#000000">0</font> <font color="#000000">2</font>/<font color="#000000">2</font> Running <font color="#000000">0</font> 42d
-prometheus-prometheus-node-exporter-2nsg9 <font color="#000000">1</font>/<font color="#000000">1</font> Running <font color="#000000">0</font> 42d
-prometheus-prometheus-node-exporter-mqr<font color="#000000">25</font> <font color="#000000">1</font>/<font color="#000000">1</font> Running <font color="#000000">0</font> 42d
-prometheus-prometheus-node-exporter-wp4ds <font color="#000000">1</font>/<font color="#000000">1</font> Running <font color="#000000">0</font> 42d
+<pre><font color="#ff0000">$ kubectl get pods -n monitoring</font>
+<font color="#ff0000">NAME READY STATUS RESTARTS AGE</font>
+<font color="#ff0000">alertmanager-prometheus-kube-prometheus-alertmanager-</font><font color="#bb00ff">0</font><font color="#ff0000"> </font><font color="#bb00ff">2</font><font color="#F3E651">/</font><font color="#bb00ff">2</font><font color="#ff0000"> Running </font><font color="#bb00ff">0</font><font color="#ff0000"> 42d</font>
+<font color="#ff0000">alloy-g5fgj </font><font color="#bb00ff">2</font><font color="#F3E651">/</font><font color="#bb00ff">2</font><font color="#ff0000"> Running </font><font color="#bb00ff">0</font><font color="#ff0000"> 29m</font>
+<font color="#ff0000">alloy-nfw8w </font><font color="#bb00ff">2</font><font color="#F3E651">/</font><font color="#bb00ff">2</font><font color="#ff0000"> Running </font><font color="#bb00ff">0</font><font color="#ff0000"> 29m</font>
+<font color="#ff0000">alloy-tg9vj </font><font color="#bb00ff">2</font><font color="#F3E651">/</font><font color="#bb00ff">2</font><font color="#ff0000"> Running </font><font color="#bb00ff">0</font><font color="#ff0000"> 29m</font>
+<font color="#ff0000">loki-</font><font color="#bb00ff">0</font><font color="#ff0000"> </font><font color="#bb00ff">2</font><font color="#F3E651">/</font><font color="#bb00ff">2</font><font color="#ff0000"> Running </font><font color="#bb00ff">0</font><font color="#ff0000"> 25m</font>
+<font color="#ff0000">prometheus-grafana-868f9dc7cf-lg2vl </font><font color="#bb00ff">3</font><font color="#F3E651">/</font><font color="#bb00ff">3</font><font color="#ff0000"> Running </font><font color="#bb00ff">0</font><font color="#ff0000"> 42d</font>
+<font color="#ff0000">prometheus-kube-prometheus-operator-8d7bbc48c-p4sf4 </font><font color="#bb00ff">1</font><font color="#F3E651">/</font><font color="#bb00ff">1</font><font color="#ff0000"> Running </font><font color="#bb00ff">0</font><font color="#ff0000"> 42d</font>
+<font color="#ff0000">prometheus-kube-state-metrics-7c5fb9d798-hh2fx </font><font color="#bb00ff">1</font><font color="#F3E651">/</font><font color="#bb00ff">1</font><font color="#ff0000"> Running </font><font color="#bb00ff">0</font><font color="#ff0000"> 42d</font>
+<font color="#ff0000">prometheus-prometheus-kube-prometheus-prometheus-</font><font color="#bb00ff">0</font><font color="#ff0000"> </font><font color="#bb00ff">2</font><font color="#F3E651">/</font><font color="#bb00ff">2</font><font color="#ff0000"> Running </font><font color="#bb00ff">0</font><font color="#ff0000"> 42d</font>
+<font color="#ff0000">prometheus-prometheus-node-exporter-2nsg9 </font><font color="#bb00ff">1</font><font color="#F3E651">/</font><font color="#bb00ff">1</font><font color="#ff0000"> Running </font><font color="#bb00ff">0</font><font color="#ff0000"> 42d</font>
+<font color="#ff0000">prometheus-prometheus-node-exporter-mqr</font><font color="#bb00ff">25</font><font color="#ff0000"> </font><font color="#bb00ff">1</font><font color="#F3E651">/</font><font color="#bb00ff">1</font><font color="#ff0000"> Running </font><font color="#bb00ff">0</font><font color="#ff0000"> 42d</font>
+<font color="#ff0000">prometheus-prometheus-node-exporter-wp4ds </font><font color="#bb00ff">1</font><font color="#F3E651">/</font><font color="#bb00ff">1</font><font color="#ff0000"> Running </font><font color="#bb00ff">0</font><font color="#ff0000"> 42d</font>
</pre>
<br />
<span>And the services:</span><br />
@@ -3930,18 +3970,18 @@ prometheus-prometheus-node-exporter-wp4ds <font color="#000000">1
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>$ kubectl get svc -n monitoring
-NAME TYPE CLUSTER-IP PORT(S)
-alertmanager-operated ClusterIP None <font color="#000000">9093</font>/TCP,<font color="#000000">9094</font>/TCP
-alloy ClusterIP <font color="#000000">10.43</font>.<font color="#000000">74.14</font> <font color="#000000">12345</font>/TCP
-loki ClusterIP <font color="#000000">10.43</font>.<font color="#000000">64.60</font> <font color="#000000">3100</font>/TCP,<font color="#000000">9095</font>/TCP
-loki-headless ClusterIP None <font color="#000000">3100</font>/TCP
-prometheus-grafana ClusterIP <font color="#000000">10.43</font>.<font color="#000000">46.82</font> <font color="#000000">80</font>/TCP
-prometheus-kube-prometheus-alertmanager ClusterIP <font color="#000000">10.43</font>.<font color="#000000">208.43</font> <font color="#000000">9093</font>/TCP,<font color="#000000">8080</font>/TCP
-prometheus-kube-prometheus-operator ClusterIP <font color="#000000">10.43</font>.<font color="#000000">246.121</font> <font color="#000000">443</font>/TCP
-prometheus-kube-prometheus-prometheus ClusterIP <font color="#000000">10.43</font>.<font color="#000000">152.163</font> <font color="#000000">9090</font>/TCP,<font color="#000000">8080</font>/TCP
-prometheus-kube-state-metrics ClusterIP <font color="#000000">10.43</font>.<font color="#000000">64.26</font> <font color="#000000">8080</font>/TCP
-prometheus-prometheus-node-exporter ClusterIP <font color="#000000">10.43</font>.<font color="#000000">127.242</font> <font color="#000000">9100</font>/TCP
+<pre><font color="#ff0000">$ kubectl get svc -n monitoring</font>
+<font color="#ff0000">NAME TYPE CLUSTER-IP PORT</font><font color="#F3E651">(</font><font color="#ff0000">S</font><font color="#F3E651">)</font>
+<font color="#ff0000">alertmanager-operated ClusterIP None </font><font color="#bb00ff">9093</font><font color="#ff0000">/TCP</font><font color="#F3E651">,</font><font color="#bb00ff">9094</font><font color="#ff0000">/TCP</font>
+<font color="#ff0000">alloy ClusterIP </font><font color="#bb00ff">10.43</font><font color="#F3E651">.</font><font color="#bb00ff">74.14</font><font color="#ff0000"> </font><font color="#bb00ff">12345</font><font color="#ff0000">/TCP</font>
+<font color="#ff0000">loki ClusterIP </font><font color="#bb00ff">10.43</font><font color="#F3E651">.</font><font color="#bb00ff">64.60</font><font color="#ff0000"> </font><font color="#bb00ff">3100</font><font color="#ff0000">/TCP</font><font color="#F3E651">,</font><font color="#bb00ff">9095</font><font color="#ff0000">/TCP</font>
+<font color="#ff0000">loki-headless ClusterIP None </font><font color="#bb00ff">3100</font><font color="#ff0000">/TCP</font>
+<font color="#ff0000">prometheus-grafana ClusterIP </font><font color="#bb00ff">10.43</font><font color="#F3E651">.</font><font color="#bb00ff">46.82</font><font color="#ff0000"> </font><font color="#bb00ff">80</font><font color="#ff0000">/TCP</font>
+<font color="#ff0000">prometheus-kube-prometheus-alertmanager ClusterIP </font><font color="#bb00ff">10.43</font><font color="#F3E651">.</font><font color="#bb00ff">208.43</font><font color="#ff0000"> </font><font color="#bb00ff">9093</font><font color="#ff0000">/TCP</font><font color="#F3E651">,</font><font color="#bb00ff">8080</font><font color="#ff0000">/TCP</font>
+<font color="#ff0000">prometheus-kube-prometheus-operator ClusterIP </font><font color="#bb00ff">10.43</font><font color="#F3E651">.</font><font color="#bb00ff">246.121</font><font color="#ff0000"> </font><font color="#bb00ff">443</font><font color="#ff0000">/TCP</font>
+<font color="#ff0000">prometheus-kube-prometheus-prometheus ClusterIP </font><font color="#bb00ff">10.43</font><font color="#F3E651">.</font><font color="#bb00ff">152.163</font><font color="#ff0000"> </font><font color="#bb00ff">9090</font><font color="#ff0000">/TCP</font><font color="#F3E651">,</font><font color="#bb00ff">8080</font><font color="#ff0000">/TCP</font>
+<font color="#ff0000">prometheus-kube-state-metrics ClusterIP </font><font color="#bb00ff">10.43</font><font color="#F3E651">.</font><font color="#bb00ff">64.26</font><font color="#ff0000"> </font><font color="#bb00ff">8080</font><font color="#ff0000">/TCP</font>
+<font color="#ff0000">prometheus-prometheus-node-exporter ClusterIP </font><font color="#bb00ff">10.43</font><font color="#F3E651">.</font><font color="#bb00ff">127.242</font><font color="#ff0000"> </font><font color="#bb00ff">9100</font><font color="#ff0000">/TCP</font>
</pre>
<br />
<span>Let me break down what each pod does:</span><br />
@@ -4015,7 +4055,7 @@ prometheus-prometheus-node-exporter ClusterIP <font color="#000000">10.4
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>paul@f0:~ % doas pkg install -y node_exporter
+<pre><font color="#ff0000">paul@f0</font><font color="#F3E651">:~</font><font color="#ff0000"> </font><font color="#F3E651">%</font><font color="#ff0000"> doas pkg install -y node_exporter</font>
</pre>
<br />
<span>Enable the service to start at boot:</span><br />
@@ -4024,8 +4064,8 @@ http://www.gnu.org/software/src-highlite -->
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>paul@f0:~ % doas sysrc node_exporter_enable=YES
-node_exporter_enable: -&gt; YES
+<pre><font color="#ff0000">paul@f0</font><font color="#F3E651">:~</font><font color="#ff0000"> </font><font color="#F3E651">%</font><font color="#ff0000"> doas sysrc </font><font color="#ff0000">node_exporter_enable</font><font color="#F3E651">=</font><font color="#ff0000">YES</font>
+<font color="#ff0000">node_exporter_enable</font><font color="#F3E651">:</font><font color="#ff0000"> -</font><font color="#F3E651">&gt;</font><font color="#ff0000"> YES</font>
</pre>
<br />
<span>Configure node_exporter to listen on the WireGuard interface. This ensures metrics are only accessible through the secure tunnel, not the public network. Replace the IP with the host&#39;s WireGuard address:</span><br />
@@ -4034,8 +4074,8 @@ node_exporter_enable: -&gt; YES
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>paul@f0:~ % doas sysrc node_exporter_args=<font color="#808080">'--web.listen-address=192.168.2.130:9100'</font>
-node_exporter_args: -&gt; --web.listen-address=<font color="#000000">192.168</font>.<font color="#000000">2.130</font>:<font color="#000000">9100</font>
+<pre><font color="#ff0000">paul@f0</font><font color="#F3E651">:~</font><font color="#ff0000"> </font><font color="#F3E651">%</font><font color="#ff0000"> doas sysrc </font><font color="#ff0000">node_exporter_args</font><font color="#F3E651">=</font><font color="#bb00ff">'--web.listen-address=192.168.2.130:9100'</font>
+<font color="#ff0000">node_exporter_args</font><font color="#F3E651">:</font><font color="#ff0000"> -</font><font color="#F3E651">&gt;</font><font color="#ff0000"> --web</font><font color="#F3E651">.</font><font color="#ff0000">listen-address</font><font color="#F3E651">=</font><font color="#bb00ff">192.168</font><font color="#F3E651">.</font><font color="#bb00ff">2.130</font><font color="#F3E651">:</font><font color="#bb00ff">9100</font>
</pre>
<br />
<span>Start the service:</span><br />
@@ -4044,8 +4084,8 @@ node_exporter_args: -&gt; --web.listen-address=<font color="#000000">192.168</f
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>paul@f0:~ % doas service node_exporter start
-Starting node_exporter.
+<pre><font color="#ff0000">paul@f0</font><font color="#F3E651">:~</font><font color="#ff0000"> </font><font color="#F3E651">%</font><font color="#ff0000"> doas service node_exporter start</font>
+<font color="#ff0000">Starting node_exporter</font><font color="#F3E651">.</font>
</pre>
<br />
<span>Verify it&#39;s running:</span><br />
@@ -4054,10 +4094,10 @@ Starting node_exporter.
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>paul@f0:~ % curl -s http://<font color="#000000">192.168</font>.<font color="#000000">2.130</font>:<font color="#000000">9100</font>/metrics | head -<font color="#000000">3</font>
-<i><font color="silver"># HELP go_gc_duration_seconds A summary of the wall-time pause...</font></i>
-<i><font color="silver"># TYPE go_gc_duration_seconds summary</font></i>
-go_gc_duration_seconds{quantile=<font color="#808080">"0"</font>} <font color="#000000">0</font>
+<pre><font color="#ff0000">paul@f0</font><font color="#F3E651">:~</font><font color="#ff0000"> </font><font color="#F3E651">%</font><font color="#ff0000"> curl -s http</font><font color="#F3E651">://</font><font color="#bb00ff">192.168</font><font color="#F3E651">.</font><font color="#bb00ff">2.130</font><font color="#F3E651">:</font><font color="#bb00ff">9100</font><font color="#ff0000">/metrics </font><font color="#F3E651">|</font><font color="#ff0000"> head -</font><font color="#bb00ff">3</font>
+<i><font color="#ababab"># HELP go_gc_duration_seconds A summary of the wall-time pause...</font></i>
+<i><font color="#ababab"># TYPE go_gc_duration_seconds summary</font></i>
+<font color="#ff0000">go_gc_duration_seconds{</font><font color="#ff0000">quantile</font><font color="#F3E651">=</font><font color="#bb00ff">"0"</font><font color="#ff0000">} </font><font color="#bb00ff">0</font>
</pre>
<br />
<span>Repeat for the other FreeBSD hosts (<span class='inlinecode'>f1</span>, <span class='inlinecode'>f2</span>) with their respective WireGuard IPs.</span><br />
@@ -4085,9 +4125,9 @@ go_gc_duration_seconds{quantile=<font color="#808080">"0"</font>} <font color="#
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>$ kubectl create secret generic additional-scrape-configs \
- --from-file=additional-scrape-configs.yaml \
- -n monitoring
+<pre><font color="#ff0000">$ kubectl create secret generic additional-scrape-configs </font><font color="#F3E651">\</font>
+<font color="#ff0000"> --from-file</font><font color="#F3E651">=</font><font color="#ff0000">additional-scrape-configs</font><font color="#F3E651">.</font><font color="#ff0000">yaml </font><font color="#F3E651">\</font>
+<font color="#ff0000"> -n monitoring</font>
</pre>
<br />
<span>Update <span class='inlinecode'>persistence-values.yaml</span> to reference the secret:</span><br />
@@ -4107,7 +4147,7 @@ prometheus:
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>$ just upgrade
+<pre><font color="#ff0000">$ just upgrade</font>
</pre>
<br />
<span>After a minute or so, the FreeBSD hosts appear in the Prometheus targets and in the Node Exporter dashboards in Grafana.</span><br />
@@ -4155,7 +4195,7 @@ spec:
<br />
<span>Unlike memory metrics, disk I/O metrics (<span class='inlinecode'>node_disk_read_bytes_total</span>, <span class='inlinecode'>node_disk_written_bytes_total</span>, etc.) are not available on FreeBSD. The Linux diskstats collector that provides these metrics doesn&#39;t have a FreeBSD equivalent in the node_exporter.</span><br />
<br />
-<span>The disk I/O panels in the Node Exporter dashboards will show "No data" for FreeBSD hosts. FreeBSD does expose ZFS-specific metrics (<span class='inlinecode'>node_zfs_arcstats_*</span>) for ARC cache performance, and per-dataset I/O stats are available via <span class='inlinecode'>sysctl kstat.zfs</span>, but mapping these to the Linux-style metrics the dashboards expect is non-trivial. Creating custom ZFS-specific dashboards is left as an exercise for another day.</span><br />
+<span>The disk I/O panels in the Node Exporter dashboards will show "No data" for FreeBSD hosts. FreeBSD does expose ZFS-specific metrics (<span class='inlinecode'>node_zfs_arcstats_*</span>) for ARC cache performance, and per-dataset I/O stats are available via <span class='inlinecode'>sysctl kstat.zfs</span>, but mapping these to the Linux-style metrics the dashboards expect is non-trivial. Custom ZFS-specific dashboards are covered later in this post.</span><br />
<br />
<h2 style='display: inline' id='monitoring-external-openbsd-hosts'>Monitoring external OpenBSD hosts</h2><br />
<br />
@@ -4169,10 +4209,10 @@ spec:
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>blowfish:~ $ doas pkg_add node_exporter
-quirks-<font color="#000000">7.103</font> signed on <font color="#000000">2025</font>-<font color="#000000">10</font>-13T22:<font color="#000000">55</font>:16Z
-The following new rcscripts were installed: /etc/rc.d/node_exporter
-See rcctl(<font color="#000000">8</font>) <b><u><font color="#000000">for</font></u></b> details.
+<pre><font color="#ff0000">blowfish</font><font color="#F3E651">:~</font><font color="#ff0000"> $ doas pkg_add node_exporter</font>
+<font color="#ff0000">quirks-</font><font color="#bb00ff">7.103</font><font color="#ff0000"> signed on </font><font color="#bb00ff">2025</font><font color="#ff0000">-</font><font color="#bb00ff">10</font><font color="#ff0000">-13T22</font><font color="#F3E651">:</font><font color="#bb00ff">55</font><font color="#F3E651">:</font><font color="#ff0000">16Z</font>
+<font color="#ff0000">The following new rcscripts were installed</font><font color="#F3E651">:</font><font color="#ff0000"> /etc/rc</font><font color="#F3E651">.</font><font color="#ff0000">d/node_exporter</font>
+<font color="#ff0000">See rcctl</font><font color="#F3E651">(</font><font color="#bb00ff">8</font><font color="#F3E651">)</font><font color="#ff0000"> </font><b><font color="#ffffff">for</font></b><font color="#ff0000"> details</font><font color="#F3E651">.</font>
</pre>
<br />
<span>Enable the service to start at boot:</span><br />
@@ -4181,7 +4221,7 @@ See rcctl(<font color="#000000">8</font>) <b><u><font color="#000000">for</font>
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>blowfish:~ $ doas rcctl <b><u><font color="#000000">enable</font></u></b> node_exporter
+<pre><font color="#ff0000">blowfish</font><font color="#F3E651">:~</font><font color="#ff0000"> $ doas rcctl </font><b><font color="#ffffff">enable</font></b><font color="#ff0000"> node_exporter</font>
</pre>
<br />
<span>Configure node_exporter to listen on the WireGuard interface. This ensures metrics are only accessible through the secure tunnel, not the public network. Replace the IP with the host&#39;s WireGuard address:</span><br />
@@ -4190,7 +4230,7 @@ http://www.gnu.org/software/src-highlite -->
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>blowfish:~ $ doas rcctl <b><u><font color="#000000">set</font></u></b> node_exporter flags <font color="#808080">'--web.listen-address=192.168.2.110:9100'</font>
+<pre><font color="#ff0000">blowfish</font><font color="#F3E651">:~</font><font color="#ff0000"> $ doas rcctl </font><b><font color="#ffffff">set</font></b><font color="#ff0000"> node_exporter flags </font><font color="#bb00ff">'--web.listen-address=192.168.2.110:9100'</font>
</pre>
<br />
<span>Start the service:</span><br />
@@ -4199,8 +4239,8 @@ http://www.gnu.org/software/src-highlite -->
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>blowfish:~ $ doas rcctl start node_exporter
-node_exporter(ok)
+<pre><font color="#ff0000">blowfish</font><font color="#F3E651">:~</font><font color="#ff0000"> $ doas rcctl start node_exporter</font>
+<font color="#ff0000">node_exporter</font><font color="#F3E651">(</font><font color="#ff0000">ok</font><font color="#F3E651">)</font>
</pre>
<br />
<span>Verify it&#39;s running:</span><br />
@@ -4209,10 +4249,10 @@ node_exporter(ok)
by Lorenzo Bettini
http://www.lorenzobettini.it
http://www.gnu.org/software/src-highlite -->
-<pre>blowfish:~ $ curl -s http://<font color="#000000">192.168</font>.<font color="#000000">2.110</font>:<font color="#000000">9100</font>/metrics | head -<font color="#000000">3</font>
-<i><font color="silver"># HELP go_gc_duration_seconds A summary of the wall-time pause...</font></i>
-<i><font color="silver"># TYPE go_gc_duration_seconds summary</font></i>
-go_gc_duration_seconds{quantile=<font color="#808080">"0"</font>} <font color="#000000">0</font>
+<pre><font color="#ff0000">blowfish</font><font color="#F3E651">:~</font><font color="#ff0000"> $ curl -s http</font><font color="#F3E651">://</font><font color="#bb00ff">192.168</font><font color="#F3E651">.</font><font color="#bb00ff">2.110</font><font color="#F3E651">:</font><font color="#bb00ff">9100</font><font color="#ff0000">/metrics </font><font color="#F3E651">|</font><font color="#ff0000"> head -</font><font color="#bb00ff">3</font>
+<i><font color="#ababab"># HELP go_gc_duration_seconds A summary of the wall-time pause...</font></i>
+<i><font color="#ababab"># TYPE go_gc_duration_seconds summary</font></i>
+<font color="#ff0000">go_gc_duration_seconds{</font><font color="#ff0000">quantile</font><font color="#F3E651">=</font><font color="#bb00ff">"0"</font><font color="#ff0000">} </font><font color="#bb00ff">0</font>
</pre>
<br />
<span>Repeat for the other OpenBSD host (<span class='inlinecode'>fishfinger</span>) with its respective WireGuard IP (<span class='inlinecode'>192.168.2.111</span>).</span><br />
@@ -4282,18 +4322,1127 @@ spec:
<br />
<span>After running <span class='inlinecode'>just upgrade</span>, the OpenBSD hosts appear in Prometheus targets and the Node Exporter dashboards.</span><br />
<br />
+<span class='quote'>Updated Mon 09 Mar: Added section about enabling etcd metrics</span><br />
+<br />
+<h2 style='display: inline' id='enabling-etcd-metrics-in-k3s'>Enabling etcd metrics in k3s</h2><br />
+<br />
+<span>The etcd dashboard in Grafana initially showed no data because k3s uses an embedded etcd that doesn&#39;t expose metrics by default.</span><br />
+<br />
+<span>On each control-plane node (r0, r1, r2), create /etc/rancher/k3s/config.yaml (or add to it if it already exists):</span><br />
+<br />
+<pre>
+etcd-expose-metrics: true
+</pre>
+<br />
+<span>Then restart k3s on each node:</span><br />
+<br />
+<pre>
+systemctl restart k3s
+</pre>
+<br />
+<span>After restarting, etcd metrics are available on port 2381:</span><br />
+<br />
+<pre>
+curl http://127.0.0.1:2381/metrics | grep etcd
+</pre>
+<br />
+<h3 style='display: inline' id='configuring-prometheus-to-scrape-etcd'>Configuring Prometheus to scrape etcd</h3><br />
+<br />
+<span>In persistence-values.yaml, enable kubeEtcd with the node IP addresses:</span><br />
+<br />
+<pre>
+kubeEtcd:
+ enabled: true
+ endpoints:
+ - 192.168.1.120
+ - 192.168.1.121
+ - 192.168.1.122
+ service:
+ enabled: true
+ port: 2381
+ targetPort: 2381
+</pre>
+<br />
+<span>Apply the changes:</span><br />
+<br />
+<pre>
+just upgrade
+</pre>
+<br />
+<h3 style='display: inline' id='verifying-etcd-metrics'>Verifying etcd metrics</h3><br />
+<br />
+<span>After the changes, all etcd targets are being scraped:</span><br />
+<br />
+<pre>
+kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 \
+ -c prometheus -- wget -qO- &#39;http://localhost:9090/api/v1/query?query=etcd_server_has_leader&#39; | \
+ jq -r &#39;.data.result[] | "\(.metric.instance): \(.value[1])"&#39;
+</pre>
+<br />
+<span>Output:</span><br />
+<br />
+<pre>
+192.168.1.120:2381: 1
+192.168.1.121:2381: 1
+192.168.1.122:2381: 1
+</pre>
+<br />
+<span>The etcd dashboard in Grafana now displays metrics including Raft proposals, leader elections, and peer round-trip times.</span><br />
+<br />
+<a href='./f3s-kubernetes-with-freebsd-part-8/grafana-etcd-dashboard.png'><img alt='Grafana etcd dashboard showing cluster health, RPC rate, disk sync duration, and peer round trip times' title='Grafana etcd dashboard showing cluster health, RPC rate, disk sync duration, and peer round trip times' src='./f3s-kubernetes-with-freebsd-part-8/grafana-etcd-dashboard.png' /></a><br />
+<br />
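The panels on that dashboard are built on etcd's standard instrumentation. A few illustrative queries to try in Prometheus (the metric names are etcd's own; the rate windows are my choice):

```
# Raft proposals committed per second
rate(etcd_server_proposals_committed_total[5m])

# Leader elections seen over the last hour
increase(etcd_server_leader_changes_seen_total[1h])

# 99th percentile peer round-trip time
histogram_quantile(0.99,
  rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))
```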
+<h3 style='display: inline' id='complete-persistence-valuesyaml'>Complete persistence-values.yaml</h3><br />
+<br />
+<span>The complete updated persistence-values.yaml:</span><br />
+<br />
+<pre>
+kubeEtcd:
+ enabled: true
+ endpoints:
+ - 192.168.1.120
+ - 192.168.1.121
+ - 192.168.1.122
+ service:
+ enabled: true
+ port: 2381
+ targetPort: 2381
+
+prometheus:
+ prometheusSpec:
+ additionalScrapeConfigsSecret:
+ enabled: true
+ name: additional-scrape-configs
+ key: additional-scrape-configs.yaml
+ storageSpec:
+ volumeClaimTemplate:
+ spec:
+ storageClassName: ""
+ accessModes: ["ReadWriteOnce"]
+ resources:
+ requests:
+ storage: 10Gi
+ selector:
+ matchLabels:
+ type: local
+ app: prometheus
+
+grafana:
+ persistence:
+ enabled: true
+ type: pvc
+ existingClaim: "grafana-data-pvc"
+
+ initChownData:
+ enabled: false
+
+ podSecurityContext:
+ fsGroup: 911
+ runAsUser: 911
+ runAsGroup: 911
+</pre>
+<br />
+<span class='quote'>Updated Mon 09 Mar: Added section about ZFS monitoring for FreeBSD servers</span><br />
+<br />
+<h2 style='display: inline' id='zfs-monitoring-for-freebsd-servers'>ZFS Monitoring for FreeBSD Servers</h2><br />
+<br />
+<span>The FreeBSD servers (f0, f1, f2) that provide NFS storage to the k3s cluster have ZFS filesystems. Monitoring ZFS is crucial for understanding storage performance and cache efficiency.</span><br />
+<br />
+<h3 style='display: inline' id='node-exporter-zfs-collector'>Node Exporter ZFS Collector</h3><br />
+<br />
+<span>The node_exporter running on each FreeBSD server (v1.9.1) includes a built-in ZFS collector that exposes metrics via sysctls. The ZFS collector is enabled by default and provides:</span><br />
+<br />
+<ul>
+<li>ARC (Adaptive Replacement Cache) statistics</li>
+<li>Cache hit/miss rates</li>
+<li>Memory usage and allocation</li>
+<li>MRU/MFU cache breakdown</li>
+<li>Data vs metadata distribution</li>
+</ul><br />
+<h3 style='display: inline' id='verifying-zfs-metrics'>Verifying ZFS Metrics</h3><br />
+<br />
+<span>On any FreeBSD server, check that ZFS metrics are being exposed:</span><br />
+<br />
+<pre>
+paul@f0:~ % curl -s http://localhost:9100/metrics | grep node_zfs_arcstats | wc -l
+ 69
+</pre>
+<br />
+<span>The metrics are automatically scraped by Prometheus through the existing static configuration in additional-scrape-configs.yaml, which targets all FreeBSD servers on port 9100 with the os: freebsd label.</span><br />
+<br />
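For context, the static scrape job for these hosts looks roughly like the sketch below (the job name is illustrative; the WireGuard IPs and the os: freebsd label match the ones used earlier in this post):

```yaml
- job_name: 'freebsd-nodes'     # illustrative name
  static_configs:
    - targets:
        - 192.168.2.130:9100    # f0
        - 192.168.2.131:9100    # f1
        - 192.168.2.132:9100    # f2
      labels:
        os: freebsd
```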
+<h3 style='display: inline' id='zfs-recording-rules'>ZFS Recording Rules</h3><br />
+<br />
+<span>I created recording rules in zfs-recording-rules.yaml for easier dashboard consumption:</span><br />
+<br />
+<pre>
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+ name: freebsd-zfs-rules
+ namespace: monitoring
+ labels:
+ release: prometheus
+spec:
+ groups:
+ - name: freebsd-zfs-arc
+ interval: 30s
+ rules:
+ - record: node_zfs_arc_hit_rate_percent
+ expr: |
+ 100 * (
+ rate(node_zfs_arcstats_hits_total{os="freebsd"}[5m]) /
+ (rate(node_zfs_arcstats_hits_total{os="freebsd"}[5m]) +
+ rate(node_zfs_arcstats_misses_total{os="freebsd"}[5m]))
+ )
+ labels:
+ os: freebsd
+ - record: node_zfs_arc_memory_usage_percent
+ expr: |
+ 100 * (
+ node_zfs_arcstats_size_bytes{os="freebsd"} /
+ node_zfs_arcstats_c_max_bytes{os="freebsd"}
+ )
+ labels:
+ os: freebsd
+ # Additional rules for metadata %, target %, MRU/MFU %, etc.
+</pre>
+<br />
+<span>These recording rules calculate:</span><br />
+<br />
+<ul>
+<li>ARC hit rate percentage</li>
+<li>ARC memory usage percentage (current vs maximum)</li>
+<li>ARC target percentage (target vs maximum)</li>
+<li>Metadata vs data percentages</li>
+<li>MRU vs MFU cache percentages</li>
+<li>Demand data and metadata hit rates</li>
+</ul><br />
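The hit-rate rule is plain counter arithmetic. A minimal standalone sketch in Python (not part of the deployment; it just mirrors the PromQL expression) of the same computation from counter increases over a window:

```python
# Mirrors the node_zfs_arc_hit_rate_percent recording rule:
# 100 * rate(hits) / (rate(hits) + rate(misses)).
# Inputs are the counter increases over the same time window.
def arc_hit_rate_percent(hits_delta: float, misses_delta: float) -> float:
    total = hits_delta + misses_delta
    if total == 0:
        # No ARC activity in the window; report 0 rather than divide by zero.
        return 0.0
    return 100.0 * hits_delta / total


if __name__ == "__main__":
    # 9500 hits and 500 misses over the window -> 95% hit rate
    print(arc_hit_rate_percent(9500, 500))  # → 95.0
```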
+<h3 style='display: inline' id='grafana-dashboards'>Grafana Dashboards</h3><br />
+<br />
+<span>I created two comprehensive ZFS monitoring dashboards (zfs-dashboards.yaml):</span><br />
+<br />
+<span><b>Dashboard 1: FreeBSD ZFS (per-host detailed view)</b></span><br />
+<br />
+<span>Includes variables to select:</span><br />
+<ul>
+<li>FreeBSD server (f0, f1, or f2)</li>
+<li>ZFS pool (zdata, zroot, or all)</li>
+</ul><br />
+<span><b>Pool Overview Row:</b></span><br />
+<ul>
+<li>Pool Capacity gauge (with thresholds: green &lt;70%, yellow &lt;85%, red &gt;85%)</li>
+<li>Pool Health status (ONLINE/DEGRADED/FAULTED with color coding)</li>
+<li>Total Pool Size stat</li>
+<li>Free Space stat</li>
+<li>Pool Space Usage Over Time (stacked: used + free)</li>
+<li>Pool Capacity Trend time series</li>
+</ul><br />
+<span><b>Dataset Statistics Row:</b></span><br />
+<ul>
+<li>Table showing all datasets with columns: Pool, Dataset, Used, Available, Referenced</li>
+<li>Automatically filters by selected pool</li>
+</ul><br />
+<span><b>ARC Cache Statistics Row:</b></span><br />
+<ul>
+<li>ARC Hit Rate gauge (red &lt;70%, yellow &lt;90%, green &gt;=90%)</li>
+<li>ARC Size time series (current, target, max)</li>
+<li>ARC Memory Usage percentage gauge</li>
+<li>ARC Hits vs Misses rate</li>
+<li>ARC Data vs Metadata stacked time series</li>
+</ul><br />
+<span><b>Dashboard 2: FreeBSD ZFS Summary (cluster-wide overview)</b></span><br />
+<br />
+<span><b>Cluster-Wide Pool Statistics Row:</b></span><br />
+<ul>
+<li>Total Storage Capacity across all servers</li>
+<li>Total Used space</li>
+<li>Total Free space</li>
+<li>Average Pool Capacity gauge</li>
+<li>Pool Health Status (worst case across cluster)</li>
+<li>Total Pool Space Usage Over Time</li>
+<li>Per-Pool Capacity time series (all pools on all hosts)</li>
+</ul><br />
+<span><b>Per-Host Pool Breakdown Row:</b></span><br />
+<ul>
+<li>Bar gauge showing capacity by host and pool</li>
+<li>Table with all pools: Host, Pool, Size, Used, Free, Capacity %, Health</li>
+</ul><br />
+<span><b>Cluster-Wide ARC Statistics Row:</b></span><br />
+<ul>
+<li>Average ARC Hit Rate gauge across all hosts</li>
+<li>ARC Hit Rate by Host time series</li>
+<li>Total ARC Size Across Cluster</li>
+<li>Total ARC Hits vs Misses (cluster-wide sum)</li>
+<li>ARC Size by Host</li>
+</ul><br />
+<span><b>Dashboard Visualization:</b></span><br />
+<br />
+<a href='./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-dashboard.png'><img alt='ZFS monitoring dashboard in Grafana showing pool capacity, health, and I/O throughput' title='ZFS monitoring dashboard in Grafana showing pool capacity, health, and I/O throughput' src='./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-dashboard.png' /></a><br />
+<br />
+<a href='./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-arc-stats.png'><img alt='ZFS ARC cache statistics showing hit rate, memory usage, and size trends' title='ZFS ARC cache statistics showing hit rate, memory usage, and size trends' src='./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-arc-stats.png' /></a><br />
+<br />
+<a href='./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-datasets.png'><img alt='ZFS datasets table and ARC data vs metadata breakdown' title='ZFS datasets table and ARC data vs metadata breakdown' src='./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-datasets.png' /></a><br />
+<br />
+<h3 style='display: inline' id='deployment'>Deployment</h3><br />
+<br />
+<span>I applied the resources to the cluster:</span><br />
+<br />
+<pre>
+cd /home/paul/git/conf/f3s/prometheus
+kubectl apply -f zfs-recording-rules.yaml
+kubectl apply -f zfs-dashboards.yaml
+</pre>
+<br />
+<span>I updated the Justfile to include the ZFS recording rules in the install and upgrade targets:</span><br />
+<br />
+<pre>
+install:
+ kubectl apply -f persistent-volumes.yaml
+ kubectl create secret generic additional-scrape-configs --from-file=additional-scrape-configs.yaml -n monitoring --dry-run=client -o yaml | kubectl apply -f -
+ helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring -f persistence-values.yaml
+ kubectl apply -f freebsd-recording-rules.yaml
+ kubectl apply -f openbsd-recording-rules.yaml
+ kubectl apply -f zfs-recording-rules.yaml
+ just -f grafana-ingress/Justfile install
+</pre>
+<br />
+<h3 style='display: inline' id='verifying-zfs-metrics-in-prometheus'>Verifying ZFS Metrics in Prometheus</h3><br />
+<br />
+<span>Check that ZFS metrics are being collected:</span><br />
+<br />
+<pre>
+kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- \
+ wget -qO- &#39;http://localhost:9090/api/v1/query?query=node_zfs_arcstats_size_bytes&#39;
+</pre>
+<br />
+<span>Check recording rules are calculating correctly:</span><br />
+<br />
+<pre>
+kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- \
+ wget -qO- &#39;http://localhost:9090/api/v1/query?query=node_zfs_arc_memory_usage_percent&#39;
+</pre>
+<br />
+<span>Example output shows memory usage percentage for each FreeBSD server:</span><br />
+<br />
+<pre>
+"result":[
+ {"metric":{"instance":"192.168.2.130:9100","os":"freebsd"},"value":[...,"37.58"]},
+ {"metric":{"instance":"192.168.2.131:9100","os":"freebsd"},"value":[...,"12.85"]},
+ {"metric":{"instance":"192.168.2.132:9100","os":"freebsd"},"value":[...,"13.44"]}
+]
+</pre>
+<br />
+<h3 style='display: inline' id='accessing-the-dashboards'>Accessing the Dashboards</h3><br />
+<br />
+<span>The dashboards are automatically imported by the Grafana sidecar and accessible at:</span><br />
+<br />
+<a class='textlink' href='https://grafana.f3s.buetow.org'>https://grafana.f3s.buetow.org</a><br />
+<br />
+<span>Navigate to Dashboards and search for:</span><br />
+<ul>
+<li>"FreeBSD ZFS" - detailed per-host view with pool and dataset breakdowns</li>
+<li>"FreeBSD ZFS Summary" - cluster-wide overview of all ZFS storage</li>
+</ul><br />
+<h3 style='display: inline' id='key-metrics-to-monitor'>Key Metrics to Monitor</h3><br />
+<br />
+<span><b>ARC Hit Rate:</b> Should typically be above 90% for optimal performance. Lower hit rates indicate the ARC cache is too small or the workload has poor locality.</span><br />
+<br />
+<span><b>ARC Memory Usage:</b> Shows how much of the maximum ARC size is being used. If consistently at or near maximum, the ARC is effectively utilizing available memory.</span><br />
+<br />
+<span><b>Data vs Metadata:</b> Typically data should dominate, but workloads with many small files will show higher metadata percentages.</span><br />
+<br />
+<span><b>MRU vs MFU:</b> Most Recently Used vs Most Frequently Used cache. The ratio depends on workload characteristics.</span><br />
+<br />
+<span><b>Pool Capacity:</b> Monitor pool usage to ensure adequate free space. ZFS performance degrades when pools exceed 80% capacity.</span><br />
+<br />
+<span><b>Pool Health:</b> Should always show ONLINE (green). DEGRADED (yellow) indicates a disk issue requiring attention. FAULTED (red) requires immediate action.</span><br />
+<br />
+<span><b>Dataset Usage:</b> Track which datasets are consuming the most space to identify growth trends and plan capacity.</span><br />
+<br />
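The capacity and health thresholds above translate naturally into alerting rules. A hedged sketch of a PrometheusRule (the alert names are my own; zfs_pool_capacity_percent and zfs_pool_health come from the textfile collector described in the next section):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: freebsd-zfs-alerts   # illustrative name
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: freebsd-zfs-alerts
      rules:
        - alert: ZfsPoolCapacityHigh
          expr: zfs_pool_capacity_percent > 80
          for: 30m
          annotations:
            summary: "ZFS pool {{ $labels.pool }} on {{ $labels.instance }} above 80% capacity"
        - alert: ZfsPoolNotOnline
          expr: zfs_pool_health > 0
          for: 5m
          annotations:
            summary: "ZFS pool {{ $labels.pool }} on {{ $labels.instance }} is not ONLINE"
```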
+<h3 style='display: inline' id='zfs-pool-and-dataset-metrics-via-textfile-collector'>ZFS Pool and Dataset Metrics via Textfile Collector</h3><br />
+<br />
+<span>To complement the ARC statistics from node_exporter&#39;s built-in ZFS collector, I added pool capacity and dataset metrics using the textfile collector feature.</span><br />
+<br />
+<span>Created a script at /usr/local/bin/zfs_pool_metrics.sh on each FreeBSD server:</span><br />
+<br />
+<pre>
+#!/bin/sh
+# ZFS Pool and Dataset Metrics Collector for Prometheus
+
+OUTPUT_FILE="/var/tmp/node_exporter/zfs_pools.prom.$$"
+FINAL_FILE="/var/tmp/node_exporter/zfs_pools.prom"
+
+mkdir -p /var/tmp/node_exporter
+
+{
+ # Pool metrics
+ echo "# HELP zfs_pool_size_bytes Total size of ZFS pool"
+ echo "# TYPE zfs_pool_size_bytes gauge"
+ echo "# HELP zfs_pool_allocated_bytes Allocated space in ZFS pool"
+ echo "# TYPE zfs_pool_allocated_bytes gauge"
+ echo "# HELP zfs_pool_free_bytes Free space in ZFS pool"
+ echo "# TYPE zfs_pool_free_bytes gauge"
+ echo "# HELP zfs_pool_capacity_percent Capacity percentage"
+ echo "# TYPE zfs_pool_capacity_percent gauge"
+ echo "# HELP zfs_pool_health Pool health (0=ONLINE, 1=DEGRADED, 2=FAULTED, 6=other)"
+ echo "# TYPE zfs_pool_health gauge"
+
+ zpool list -Hp -o name,size,allocated,free,capacity,health | \
+ while IFS=$&#39;\t&#39; read name size alloc free cap health; do
+ case "$health" in
+ ONLINE) health_val=0 ;;
+ DEGRADED) health_val=1 ;;
+ FAULTED) health_val=2 ;;
+ *) health_val=6 ;;
+ esac
+ cap_num=$(echo "$cap" | sed &#39;s/%//&#39;)
+
+ echo "zfs_pool_size_bytes{pool=\"$name\"} $size"
+ echo "zfs_pool_allocated_bytes{pool=\"$name\"} $alloc"
+ echo "zfs_pool_free_bytes{pool=\"$name\"} $free"
+ echo "zfs_pool_capacity_percent{pool=\"$name\"} $cap_num"
+ echo "zfs_pool_health{pool=\"$name\"} $health_val"
+ done
+
+ # Dataset metrics
+ echo "# HELP zfs_dataset_used_bytes Used space in dataset"
+ echo "# TYPE zfs_dataset_used_bytes gauge"
+ echo "# HELP zfs_dataset_available_bytes Available space"
+ echo "# TYPE zfs_dataset_available_bytes gauge"
+ echo "# HELP zfs_dataset_referenced_bytes Referenced space"
+ echo "# TYPE zfs_dataset_referenced_bytes gauge"
+
+ zfs list -Hp -t filesystem -o name,used,available,referenced | \
+  while read name used avail ref; do
+ pool=$(echo "$name" | cut -d/ -f1)
+ echo "zfs_dataset_used_bytes{pool=\"$pool\",dataset=\"$name\"} $used"
+ echo "zfs_dataset_available_bytes{pool=\"$pool\",dataset=\"$name\"} $avail"
+ echo "zfs_dataset_referenced_bytes{pool=\"$pool\",dataset=\"$name\"} $ref"
+ done
+} &gt; "$OUTPUT_FILE"
+
+mv "$OUTPUT_FILE" "$FINAL_FILE"
+</pre>
+<br />
+<span>Deployed to all FreeBSD servers:</span><br />
+<br />
+<pre>
+for host in f0 f1 f2; do
+ scp /tmp/zfs_pool_metrics.sh paul@$host:/tmp/
+ ssh paul@$host &#39;doas mv /tmp/zfs_pool_metrics.sh /usr/local/bin/ &amp;&amp; \
+ doas chmod +x /usr/local/bin/zfs_pool_metrics.sh&#39;
+done
+</pre>
+<br />
+<span>Set up cron jobs to run every minute (note that crontab - replaces the entire crontab, so append instead if other entries already exist):</span><br />
+<br />
+<pre>
+for host in f0 f1 f2; do
+ ssh paul@$host &#39;echo "* * * * * /usr/local/bin/zfs_pool_metrics.sh &gt;/dev/null 2&gt;&amp;1" | \
+ doas crontab -&#39;
+done
+</pre>
+<br />
+<span>The textfile collector (already configured with --collector.textfile.directory=/var/tmp/node_exporter) automatically picks up the metrics.</span><br />
+<br />
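+<span>On FreeBSD this flag lives in /etc/rc.conf; a sketch, assuming the node_exporter port&#39;s rc script reads node_exporter_args:</span><br />
+<br />
+<pre>
+# /etc/rc.conf excerpt (sketch)
+node_exporter_enable="YES"
+node_exporter_args="--collector.textfile.directory=/var/tmp/node_exporter"
+</pre>
+<br />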
+<span>Verify metrics are being exposed:</span><br />
+<br />
+<pre>
+paul@f0:~ % curl -s http://localhost:9100/metrics | grep "^zfs_pool" | head -5
+zfs_pool_allocated_bytes{pool="zdata"} 6.47622733824e+11
+zfs_pool_allocated_bytes{pool="zroot"} 5.3338578944e+10
+zfs_pool_capacity_percent{pool="zdata"} 64
+zfs_pool_capacity_percent{pool="zroot"} 10
+zfs_pool_free_bytes{pool="zdata"} 3.48809678848e+11
+</pre>
+<br />
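+<span>These metrics make the earlier guidelines directly alertable. Two example PromQL expressions that could back alert rules:</span><br />
+<br />
+<pre>
+# Pools above the 80% capacity guideline
+zfs_pool_capacity_percent &gt; 80
+
+# Any pool that is not ONLINE
+zfs_pool_health &gt; 0
+</pre>
+<br />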
+<span class='quote'>Updated Mon 09 Mar: Added section about distributed tracing with Grafana Tempo</span><br />
+<br />
+<h2 style='display: inline' id='distributed-tracing-with-grafana-tempo'>Distributed Tracing with Grafana Tempo</h2><br />
+<br />
+<span>After implementing logs (Loki) and metrics (Prometheus), the final pillar of observability is distributed tracing. Grafana Tempo provides it, making request flows across microservices visible.</span><br />
+<br />
+<span>What does tracing with Tempo look like in Grafana? Have a look at my X-RAG blog post:</span><br />
+<br />
+<a class='textlink' href='./2025-12-24-x-rag-observability-hackathon.html'>X-RAG Observability Hackathon</a><br />
+<br />
+<h3 style='display: inline' id='why-distributed-tracing'>Why Distributed Tracing?</h3><br />
+<br />
+<span>In a microservices architecture, a single user request may traverse multiple services. Distributed tracing:</span><br />
+<br />
+<ul>
+<li>Tracks requests across service boundaries</li>
+<li>Identifies performance bottlenecks</li>
+<li>Visualizes service dependencies</li>
+<li>Correlates with logs and metrics</li>
+<li>Helps debug complex distributed systems</li>
+</ul><br />
+<h3 style='display: inline' id='deploying-grafana-tempo'>Deploying Grafana Tempo</h3><br />
+<br />
+<span>Tempo is deployed in monolithic mode, following the same pattern as Loki&#39;s SingleBinary deployment.</span><br />
+<br />
+<span>**Configuration Strategy**</span><br />
+<br />
+<span>**Deployment Mode:** Monolithic (all components in one process)</span><br />
+<ul>
+<li>Simpler operation than microservices mode</li>
+<li>Suitable for the cluster scale</li>
+<li>Consistent with Loki deployment pattern</li>
+</ul><br />
+<span>**Storage:** Filesystem backend using hostPath</span><br />
+<ul>
+<li>10Gi storage at /data/nfs/k3svolumes/tempo/data</li>
+<li>7-day retention (168h)</li>
+<li>Local filesystem storage is the simplest option for monolithic mode</li>
+</ul><br />
+<span>**OTLP Receivers:** Standard OpenTelemetry Protocol ports</span><br />
+<ul>
+<li>gRPC: 4317</li>
+<li>HTTP: 4318</li>
+<li>Bind to 0.0.0.0 to avoid Tempo 2.7+ localhost-only binding issue</li>
+</ul><br />
+<span>**Tempo Deployment Files**</span><br />
+<br />
+<span>Created in /home/paul/git/conf/f3s/tempo/:</span><br />
+<br />
+<span>**values.yaml** - Helm chart configuration:</span><br />
+<br />
+<pre>
+tempo:
+ retention: 168h
+ storage:
+ trace:
+ backend: local
+ local:
+ path: /var/tempo/traces
+ wal:
+ path: /var/tempo/wal
+ receivers:
+ otlp:
+ protocols:
+ grpc:
+ endpoint: 0.0.0.0:4317
+ http:
+ endpoint: 0.0.0.0:4318
+
+persistence:
+ enabled: true
+ size: 10Gi
+ storageClassName: ""
+
+resources:
+ limits:
+ cpu: 1000m
+ memory: 2Gi
+ requests:
+ cpu: 500m
+ memory: 1Gi
+</pre>
+<br />
+<span>**persistent-volumes.yaml** - Storage configuration:</span><br />
+<br />
+<pre>
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+ name: tempo-data-pv
+spec:
+ capacity:
+ storage: 10Gi
+ accessModes:
+ - ReadWriteOnce
+ persistentVolumeReclaimPolicy: Retain
+ hostPath:
+ path: /data/nfs/k3svolumes/tempo/data
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+ name: tempo-data-pvc
+ namespace: monitoring
+spec:
+ storageClassName: ""
+ accessModes:
+ - ReadWriteOnce
+ resources:
+ requests:
+ storage: 10Gi
+</pre>
+<br />
+<span>**Grafana Datasource Provisioning**</span><br />
+<br />
+<span>All Grafana datasources (Prometheus, Alertmanager, Loki, Tempo) are provisioned via a unified ConfigMap that is directly mounted to the Grafana pod. This approach ensures datasources are loaded on startup without requiring sidecar-based discovery.</span><br />
+<br />
+<span>In /home/paul/git/conf/f3s/prometheus/grafana-datasources-all.yaml:</span><br />
+<br />
+<pre>
+apiVersion: v1
+kind: ConfigMap
+metadata:
+ name: grafana-datasources-all
+ namespace: monitoring
+data:
+ datasources.yaml: |
+ apiVersion: 1
+ datasources:
+ - name: Prometheus
+ type: prometheus
+ uid: prometheus
+ url: http://prometheus-kube-prometheus-prometheus.monitoring:9090/
+ access: proxy
+ isDefault: true
+ - name: Alertmanager
+ type: alertmanager
+ uid: alertmanager
+ url: http://prometheus-kube-prometheus-alertmanager.monitoring:9093/
+ - name: Loki
+ type: loki
+ uid: loki
+ url: http://loki.monitoring.svc.cluster.local:3100
+ - name: Tempo
+ type: tempo
+ uid: tempo
+ url: http://tempo.monitoring.svc.cluster.local:3200
+ jsonData:
+ tracesToLogsV2:
+ datasourceUid: loki
+ spanStartTimeShift: -1h
+ spanEndTimeShift: 1h
+ tracesToMetrics:
+ datasourceUid: prometheus
+ serviceMap:
+ datasourceUid: prometheus
+ nodeGraph:
+ enabled: true
+</pre>
+<br />
+<span>The kube-prometheus-stack Helm values (persistence-values.yaml) are configured to:</span><br />
+<ul>
+<li>Disable sidecar-based datasource provisioning</li>
+<li>Mount grafana-datasources-all ConfigMap directly to /etc/grafana/provisioning/datasources/</li>
+</ul><br />
+<span>This direct mounting approach is simpler and more reliable than sidecar-based discovery.</span><br />
+<br />
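+<span>A sketch of the relevant grafana section in persistence-values.yaml (field names per the upstream Grafana Helm chart; verify against your chart version):</span><br />
+<br />
+<pre>
+grafana:
+  sidecar:
+    datasources:
+      enabled: false
+  extraConfigmapMounts:
+    - name: grafana-datasources-all
+      mountPath: /etc/grafana/provisioning/datasources
+      configMap: grafana-datasources-all
+      readOnly: true
+</pre>
+<br />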
+<span>**Installation**</span><br />
+<br />
+<pre>
+cd /home/paul/git/conf/f3s/tempo
+just install
+</pre>
+<br />
+<span>Verify Tempo is running:</span><br />
+<br />
+<pre>
+kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo
+kubectl exec -n monitoring &lt;tempo-pod&gt; -- wget -qO- http://localhost:3200/ready
+</pre>
+<br />
+<h3 style='display: inline' id='configuring-grafana-alloy-for-trace-collection'>Configuring Grafana Alloy for Trace Collection</h3><br />
+<br />
+<span>Updated /home/paul/git/conf/f3s/loki/alloy-values.yaml to add OTLP receivers for traces while maintaining existing log collection.</span><br />
+<br />
+<span>**OTLP Receiver Configuration**</span><br />
+<br />
+<span>Added to Alloy configuration after the log collection pipeline:</span><br />
+<br />
+<pre>
+// OTLP receiver for traces via gRPC and HTTP
+otelcol.receiver.otlp "default" {
+ grpc {
+ endpoint = "0.0.0.0:4317"
+ }
+ http {
+ endpoint = "0.0.0.0:4318"
+ }
+ output {
+ traces = [otelcol.processor.batch.default.input]
+ }
+}
+
+// Batch processor for efficient trace forwarding
+otelcol.processor.batch "default" {
+ timeout = "5s"
+ send_batch_size = 100
+ send_batch_max_size = 200
+ output {
+ traces = [otelcol.exporter.otlp.tempo.input]
+ }
+}
+
+// OTLP exporter to send traces to Tempo
+otelcol.exporter.otlp "tempo" {
+ client {
+ endpoint = "tempo.monitoring.svc.cluster.local:4317"
+ tls {
+ insecure = true
+ }
+ compression = "gzip"
+ }
+}
+</pre>
+<br />
+<span>The batch processor reduces network overhead by accumulating spans before forwarding to Tempo.</span><br />
+<br />
+<span>**Upgrade Alloy**</span><br />
+<br />
+<pre>
+cd /home/paul/git/conf/f3s/loki
+just upgrade
+</pre>
+<br />
+<span>Verify OTLP receivers are listening:</span><br />
+<br />
+<pre>
+kubectl logs -n monitoring -l app.kubernetes.io/name=alloy | grep -i "otlp.*receiver"
+kubectl exec -n monitoring &lt;alloy-pod&gt; -- netstat -ln | grep -E &#39;:(4317|4318)&#39;
+</pre>
+<br />
+<h3 style='display: inline' id='demo-tracing-application'>Demo Tracing Application</h3><br />
+<br />
+<span>Created a three-tier Python application to demonstrate distributed tracing in action.</span><br />
+<br />
+<span>**Application Architecture**</span><br />
+<br />
+<pre>
+User → Frontend (Flask:5000) → Middleware (Flask:5001) → Backend (Flask:5002)
+ ↓ ↓ ↓
+ Alloy (OTLP:4317) → Tempo → Grafana
+</pre>
+<br />
+<span>**Frontend Service:**</span><br />
+<ul>
+<li>Receives HTTP requests at /api/process</li>
+<li>Forwards to middleware service</li>
+<li>Creates parent span for the entire request</li>
+</ul><br />
+<span>**Middleware Service:**</span><br />
+<ul>
+<li>Transforms data at /api/transform</li>
+<li>Calls backend service</li>
+<li>Creates child span linked to frontend</li>
+</ul><br />
+<span>**Backend Service:**</span><br />
+<ul>
+<li>Returns data at /api/data</li>
+<li>Simulates database query (100ms sleep)</li>
+<li>Creates leaf span in the trace</li>
+</ul><br />
+<span>**OpenTelemetry Instrumentation**</span><br />
+<br />
+<span>All services use Python OpenTelemetry libraries:</span><br />
+<br />
+<span>**Dependencies:**</span><br />
+<pre>
+flask==3.0.0
+requests==2.31.0
+opentelemetry-distro==0.49b0
+opentelemetry-exporter-otlp==1.28.0
+opentelemetry-instrumentation-flask==0.49b0
+opentelemetry-instrumentation-requests==0.49b0
+</pre>
+<br />
+<span>**Auto-instrumentation pattern** (used in all services):</span><br />
+<br />
+<!-- Generator: GNU source-highlight 3.1.9
+by Lorenzo Bettini
+http://www.lorenzobettini.it
+http://www.gnu.org/software/src-highlite -->
+<pre><font color="#ababab">from</font><font color="#ff0000"> opentelemetry </font><font color="#ababab">import</font><font color="#ff0000"> trace</font>
+<font color="#ababab">from</font><font color="#ff0000"> opentelemetry</font><font color="#F3E651">.</font><font color="#ff0000">sdk</font><font color="#F3E651">.</font><font color="#ff0000">trace </font><font color="#ababab">import</font><font color="#ff0000"> TracerProvider</font>
+<font color="#ababab">from</font><font color="#ff0000"> opentelemetry</font><font color="#F3E651">.</font><font color="#ff0000">exporter</font><font color="#F3E651">.</font><font color="#ff0000">otlp</font><font color="#F3E651">.</font><font color="#ff0000">proto</font><font color="#F3E651">.</font><font color="#ff0000">grpc</font><font color="#F3E651">.</font><font color="#ff0000">trace_exporter </font><font color="#ababab">import</font><font color="#ff0000"> OTLPSpanExporter</font>
+<font color="#ababab">from</font><font color="#ff0000"> opentelemetry</font><font color="#F3E651">.</font><font color="#ff0000">instrumentation</font><font color="#F3E651">.</font><font color="#ff0000">flask </font><font color="#ababab">import</font><font color="#ff0000"> FlaskInstrumentor</font>
+<font color="#ababab">from</font><font color="#ff0000"> opentelemetry</font><font color="#F3E651">.</font><font color="#ff0000">instrumentation</font><font color="#F3E651">.</font><font color="#ff0000">requests </font><font color="#ababab">import</font><font color="#ff0000"> RequestsInstrumentor</font>
+<font color="#ababab">from</font><font color="#ff0000"> opentelemetry</font><font color="#F3E651">.</font><font color="#ff0000">sdk</font><font color="#F3E651">.</font><font color="#ff0000">resources </font><font color="#ababab">import</font><font color="#ff0000"> Resource</font>
+
+<i><font color="#ababab"># Define service identity</font></i>
+<font color="#ff0000">resource </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#7bc710">Resource</font><font color="#F3E651">(</font><font color="#ff0000">attributes</font><font color="#F3E651">={</font>
+<font color="#ff0000"> </font><font color="#bb00ff">"service.name"</font><font color="#F3E651">:</font><font color="#ff0000"> </font><font color="#bb00ff">"frontend"</font><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#bb00ff">"service.namespace"</font><font color="#F3E651">:</font><font color="#ff0000"> </font><font color="#bb00ff">"tracing-demo"</font><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#bb00ff">"service.version"</font><font color="#F3E651">:</font><font color="#ff0000"> </font><font color="#bb00ff">"1.0.0"</font>
+<font color="#F3E651">})</font>
+
+<font color="#ff0000">provider </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#7bc710">TracerProvider</font><font color="#F3E651">(</font><font color="#ff0000">resource</font><font color="#F3E651">=</font><font color="#ff0000">resource</font><font color="#F3E651">)</font>
+
+<i><font color="#ababab"># Export to Alloy</font></i>
+<font color="#ff0000">otlp_exporter </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#7bc710">OTLPSpanExporter</font><font color="#F3E651">(</font>
+<font color="#ff0000"> endpoint</font><font color="#F3E651">=</font><font color="#bb00ff">"http://alloy.monitoring.svc.cluster.local:4317"</font><font color="#F3E651">,</font>
+<font color="#ff0000"> insecure</font><font color="#F3E651">=</font><font color="#ff0000">True</font>
+<font color="#F3E651">)</font>
+
+<font color="#ff0000">processor </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#7bc710">BatchSpanProcessor</font><font color="#F3E651">(</font><font color="#ff0000">otlp_exporter</font><font color="#F3E651">)</font>
+<font color="#ff0000">provider</font><font color="#F3E651">.</font><font color="#7bc710">add_span_processor</font><font color="#F3E651">(</font><font color="#ff0000">processor</font><font color="#F3E651">)</font>
+<font color="#ff0000">trace</font><font color="#F3E651">.</font><font color="#7bc710">set_tracer_provider</font><font color="#F3E651">(</font><font color="#ff0000">provider</font><font color="#F3E651">)</font>
+
+<i><font color="#ababab"># Auto-instrument Flask and requests</font></i>
+<font color="#7bc710">FlaskInstrumentor</font><font color="#F3E651">().</font><font color="#7bc710">instrument_app</font><font color="#F3E651">(</font><font color="#ff0000">app</font><font color="#F3E651">)</font>
+<font color="#7bc710">RequestsInstrumentor</font><font color="#F3E651">().</font><font color="#7bc710">instrument</font><font color="#F3E651">()</font>
+</pre>
+<br />
+<span>The auto-instrumentation automatically:</span><br />
+<ul>
+<li>Creates spans for HTTP requests</li>
+<li>Propagates trace context via W3C Trace Context headers</li>
+<li>Links parent and child spans across service boundaries</li>
+</ul><br />
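+<span>The propagation itself boils down to a traceparent HTTP header of the form version-trace_id-parent_id-flags. A standard-library sketch of composing and parsing such a header (the OpenTelemetry propagators do this for you; the helper names here are made up for illustration):</span><br />
+<br />
+<pre>
+import re
+
+# W3C traceparent: 2-hex version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags
+TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")
+
+def parse_traceparent(header):
+    """Split a traceparent header into its four fields."""
+    m = TRACEPARENT_RE.match(header)
+    if m is None:
+        raise ValueError("invalid traceparent: " + header)
+    version, trace_id, parent_id, flags = m.groups()
+    return {"version": version, "trace_id": trace_id, "parent_id": parent_id, "flags": flags}
+
+def build_traceparent(trace_id, parent_id, sampled=True):
+    """Compose a version-00 traceparent header."""
+    flags = "01" if sampled else "00"
+    return "00-" + trace_id + "-" + parent_id + "-" + flags
+
+header = build_traceparent("4be1151c0bdcd5625ac7e02b98d95bd5", "00f067aa0ba902b7")
+print(parse_traceparent(header)["trace_id"])
+</pre>
+<br />
+<span>Each downstream service parses the incoming header and creates its child spans under the same trace_id, which is how all spans of one request end up in a single trace.</span><br />
+<br />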
+<span>**Deployment**</span><br />
+<br />
+<span>Created Helm chart in /home/paul/git/conf/f3s/tracing-demo/ with three separate deployments, services, and an ingress.</span><br />
+<br />
+<span>Build and deploy:</span><br />
+<br />
+<pre>
+cd /home/paul/git/conf/f3s/tracing-demo
+just build
+just import
+just install
+</pre>
+<br />
+<span>Verify deployment:</span><br />
+<br />
+<pre>
+kubectl get pods -n services | grep tracing-demo
+kubectl get ingress -n services tracing-demo-ingress
+</pre>
+<br />
+<span>Access the application at:</span><br />
+<br />
+<a class='textlink' href='http://tracing-demo.f3s.buetow.org'>http://tracing-demo.f3s.buetow.org</a><br />
+<br />
+<h3 style='display: inline' id='visualizing-traces-in-grafana'>Visualizing Traces in Grafana</h3><br />
+<br />
+<span>The Tempo datasource is provisioned through the directly mounted grafana-datasources-all ConfigMap described above, so it is available right after Grafana starts.</span><br />
+<br />
+<span>**Accessing Traces**</span><br />
+<br />
+<span>Navigate to Grafana → Explore → Select "Tempo" datasource</span><br />
+<br />
+<span>**Search Interface:**</span><br />
+<ul>
+<li>Search by Trace ID</li>
+<li>Search by service name</li>
+<li>Search by tags</li>
+</ul><br />
+<span>**TraceQL Queries:**</span><br />
+<br />
+<span>Find all traces from demo app:</span><br />
+<pre>
+{ resource.service.namespace = "tracing-demo" }
+</pre>
+<br />
+<span>Find slow requests (&gt;200ms):</span><br />
+<pre>
+{ duration &gt; 200ms }
+</pre>
+<br />
+<span>Find traces from specific service:</span><br />
+<pre>
+{ resource.service.name = "frontend" }
+</pre>
+<br />
+<span>Find errors:</span><br />
+<pre>
+{ status = error }
+</pre>
+<br />
+<span>Complex query - demo traces containing a span with a 5xx HTTP status code:</span><br />
+<pre>
+{ resource.service.namespace = "tracing-demo" } &amp;&amp; { span.http.status_code &gt;= 500 }
+</pre>
+<br />
+<span>**Service Graph Visualization**</span><br />
+<br />
+<span>The service graph shows visual connections between services:</span><br />
+<br />
+<span>1. Navigate to Explore → Tempo</span><br />
+<span>2. Enable "Service Graph" view</span><br />
+<span>3. Shows: Frontend → Middleware → Backend with request rates</span><br />
+<br />
+<span>The service graph uses Prometheus metrics generated from trace data.</span><br />
+<br />
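+<span>For the graph to populate, Tempo&#39;s metrics-generator must be enabled and remote-write its metrics to Prometheus. A sketch for the tempo Helm values, assuming Prometheus runs with the remote-write receiver enabled:</span><br />
+<br />
+<pre>
+tempo:
+  metricsGenerator:
+    enabled: true
+    remoteWriteUrl: http://prometheus-kube-prometheus-prometheus.monitoring:9090/api/v1/write
+</pre>
+<br />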
+<h3 style='display: inline' id='correlation-between-observability-signals'>Correlation Between Observability Signals</h3><br />
+<br />
+<span>Tempo integrates with Loki and Prometheus to provide unified observability.</span><br />
+<br />
+<span>**Traces-to-Logs**</span><br />
+<br />
+<span>Click on any span in a trace to see related logs:</span><br />
+<br />
+<span>1. View trace in Grafana</span><br />
+<span>2. Click on a span</span><br />
+<span>3. Select "Logs for this span"</span><br />
+<span>4. Loki shows logs filtered by:</span><br />
+<span> * Time range (span duration ± 1 hour)</span><br />
+<span> * Service name</span><br />
+<span> * Namespace</span><br />
+<span> * Pod</span><br />
+<br />
+<span>This helps correlate what the service was doing when the span was created.</span><br />
+<br />
+<span>**Traces-to-Metrics**</span><br />
+<br />
+<span>View Prometheus metrics for services in the trace:</span><br />
+<br />
+<span>1. View trace in Grafana</span><br />
+<span>2. Select "Metrics" tab</span><br />
+<span>3. Shows metrics like:</span><br />
+<span> * Request rate</span><br />
+<span> * Error rate</span><br />
+<span> * Duration percentiles</span><br />
+<br />
+<span>**Logs-to-Traces**</span><br />
+<br />
+<span>From logs, you can jump to related traces:</span><br />
+<br />
+<span>1. In Loki, logs that contain trace IDs are automatically linked</span><br />
+<span>2. Click the trace ID to view the full trace</span><br />
+<span>3. See the complete request flow</span><br />
+<br />
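+<span>This linking is configured as a derived field on the Loki datasource. A sketch for the provisioning ConfigMap, assuming log lines contain trace_id= followed by the hex ID:</span><br />
+<br />
+<pre>
+- name: Loki
+  type: loki
+  uid: loki
+  url: http://loki.monitoring.svc.cluster.local:3100
+  jsonData:
+    derivedFields:
+      - name: TraceID
+        matcherRegex: "trace_id=([0-9a-f]+)"
+        url: "$${__value.raw}"
+        datasourceUid: tempo
+</pre>
+<br />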
+<h3 style='display: inline' id='generating-traces-for-testing'>Generating Traces for Testing</h3><br />
+<br />
+<span>Test the demo application:</span><br />
+<br />
+<pre>
+curl http://tracing-demo.f3s.buetow.org/api/process
+</pre>
+<br />
+<span>Load test (generates 50 traces):</span><br />
+<br />
+<pre>
+cd /home/paul/git/conf/f3s/tracing-demo
+just load-test
+</pre>
+<br />
+<span>Each request creates a distributed trace spanning all three services.</span><br />
+<br />
+<h3 style='display: inline' id='verifying-the-complete-pipeline'>Verifying the Complete Pipeline</h3><br />
+<br />
+<span>Check the trace flow end-to-end:</span><br />
+<br />
+<span>**1. Application generates traces:**</span><br />
+<pre>
+kubectl logs -n services -l app=tracing-demo-frontend | grep -i trace
+</pre>
+<br />
+<span>**2. Alloy receives traces:**</span><br />
+<pre>
+kubectl logs -n monitoring -l app.kubernetes.io/name=alloy | grep -i otlp
+</pre>
+<br />
+<span>**3. Tempo stores traces:**</span><br />
+<pre>
+kubectl logs -n monitoring -l app.kubernetes.io/name=tempo | grep -i trace
+</pre>
+<br />
+<span>**4. Grafana displays traces:**</span><br />
+<span>Navigate to Explore → Tempo → Search for traces</span><br />
+<br />
+<h3 style='display: inline' id='practical-example-viewing-a-distributed-trace'>Practical Example: Viewing a Distributed Trace</h3><br />
+<br />
+<span>Let&#39;s generate a trace and examine it in Grafana.</span><br />
+<br />
+<span>**1. Generate a trace by calling the demo application:**</span><br />
+<br />
+<pre>
+curl -H "Host: tracing-demo.f3s.buetow.org" http://r0/api/process
+</pre>
+<br />
+<span>**Response (HTTP 200):**</span><br />
+<br />
+<!-- Generator: GNU source-highlight 3.1.9
+by Lorenzo Bettini
+http://www.lorenzobettini.it
+http://www.gnu.org/software/src-highlite -->
+<pre><font color="#F3E651">{</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">middleware_response</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#F3E651">{</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">backend_data</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#F3E651">{</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">data</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#F3E651">{</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">id</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#bb00ff">12345</font><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">query_time_ms</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#bb00ff">100.0</font><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">timestamp</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">2025-12-28T18:35:01.064538</font><font color="#ff0000">"</font><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">value</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">Sample data from backend service</font><font color="#ff0000">"</font>
+<font color="#ff0000"> </font><font color="#F3E651">},</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">service</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">backend</font><font color="#ff0000">"</font>
+<font color="#ff0000"> </font><font color="#F3E651">},</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">middleware_processed</font><font color="#ff0000">"</font><font color="#ff0000">: </font><b><font color="#ffffff">true</font></b><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">original_data</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#F3E651">{</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">source</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">GET request</font><font color="#ff0000">"</font>
+<font color="#ff0000"> </font><font color="#F3E651">},</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">transformation_time_ms</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#bb00ff">50</font>
+<font color="#ff0000"> </font><font color="#F3E651">},</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">request_data</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#F3E651">{</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">source</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">GET request</font><font color="#ff0000">"</font>
+<font color="#ff0000"> </font><font color="#F3E651">},</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">service</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">frontend</font><font color="#ff0000">"</font><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">status</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">success</font><font color="#ff0000">"</font>
+<font color="#F3E651">}</font>
+</pre>
+<br />
+<span>**2. Find the trace in Tempo via API:**</span><br />
+<br />
+<span>After a few seconds (for batch export), search for recent traces:</span><br />
+<br />
+<pre>
+kubectl exec -n monitoring tempo-0 -- wget -qO- \
+ &#39;http://localhost:3200/api/search?tags=service.namespace%3Dtracing-demo&amp;limit=5&#39; 2&gt;/dev/null | \
+ python3 -m json.tool
+</pre>
+<br />
+<span>Returns traces including:</span><br />
+<br />
+<!-- Generator: GNU source-highlight 3.1.9
+by Lorenzo Bettini
+http://www.lorenzobettini.it
+http://www.gnu.org/software/src-highlite -->
+<pre><font color="#F3E651">{</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">traceID</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">4be1151c0bdcd5625ac7e02b98d95bd5</font><font color="#ff0000">"</font><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">rootServiceName</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">frontend</font><font color="#ff0000">"</font><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">rootTraceName</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">GET /api/process</font><font color="#ff0000">"</font><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">durationMs</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#bb00ff">221</font>
+<font color="#F3E651">}</font>
+</pre>
+<br />
+<span>**3. Fetch complete trace details:**</span><br />
+<br />
+<pre>
+kubectl exec -n monitoring tempo-0 -- wget -qO- \
+ &#39;http://localhost:3200/api/traces/4be1151c0bdcd5625ac7e02b98d95bd5&#39; 2&gt;/dev/null | \
+ python3 -m json.tool
+</pre>
+<br />
+<span>**Trace structure (8 spans across 3 services):**</span><br />
+<br />
+<pre>
+Trace ID: 4be1151c0bdcd5625ac7e02b98d95bd5
+Services: 3 (frontend, middleware, backend)
+
+Service: frontend
+ └─ GET /api/process 221.10ms (HTTP server span)
+ └─ frontend-process 216.23ms (custom business logic span)
+ └─ POST 209.97ms (HTTP client span to middleware)
+
+Service: middleware
+ └─ POST /api/transform 186.02ms (HTTP server span)
+ └─ middleware-transform 180.96ms (custom business logic span)
+ └─ GET 127.52ms (HTTP client span to backend)
+
+Service: backend
+ └─ GET /api/data 103.93ms (HTTP server span)
+ └─ backend-get-data 102.11ms (custom business logic span with 100ms sleep)
+</pre>
+<br />
+<span>**4. View the trace in Grafana UI:**</span><br />
+<br />
+<span>Navigate to: Grafana → Explore → Tempo datasource</span><br />
+<br />
+<span>Search using TraceQL:</span><br />
+<pre>
+{ resource.service.namespace = "tracing-demo" }
+</pre>
+<br />
+<span>Or directly open the trace by pasting the trace ID in the search box:</span><br />
+<pre>
+4be1151c0bdcd5625ac7e02b98d95bd5
+</pre>
+<br />
+<span>**5. Trace visualization:**</span><br />
+<br />
+<span>The trace waterfall view in Grafana shows the complete request flow with timing:</span><br />
+<br />
+<a href='./f3s-kubernetes-with-freebsd-part-8/grafana-tempo-trace.png'><img alt='Distributed trace visualization in Grafana Tempo showing Frontend → Middleware → Backend spans' title='Distributed trace visualization in Grafana Tempo showing Frontend → Middleware → Backend spans' src='./f3s-kubernetes-with-freebsd-part-8/grafana-tempo-trace.png' /></a><br />
+<br />
+<span>For additional examples of Tempo trace visualization, see also:</span><br />
+<br />
+<a class='textlink' href='https://foo.zone/gemfeed/2025-12-24-x-rag-observability-hackathon.html'>X-RAG Observability Hackathon (more Grafana Tempo screenshots)</a><br />
+<br />
+<span>The trace reveals the distributed request flow:</span><br />
+<ul>
+<li>**Frontend (221ms)**: Receives GET /api/process, executes business logic, calls middleware</li>
+<li>**Middleware (186ms)**: Receives POST /api/transform, transforms data, calls backend</li>
+<li>**Backend (104ms)**: Receives GET /api/data, simulates database query with 100ms sleep</li>
+<li>**Total request time**: 221ms end-to-end</li>
+<li>**Span propagation**: W3C Trace Context headers automatically link all spans</li>
+</ul><br />
+<span>**6. Service graph visualization:**</span><br />
+<br />
+<span>The service graph is automatically generated from traces and shows service dependencies. For examples of service graph visualization in Grafana, see the screenshots in the X-RAG Observability Hackathon blog post.</span><br />
+<br />
+<a class='textlink' href='https://foo.zone/gemfeed/2025-12-24-x-rag-observability-hackathon.html'>X-RAG Observability Hackathon (includes service graph screenshots)</a><br />
+<br />
+<span>This visualization helps identify:</span><br />
+<ul>
+<li>Request rates between services</li>
+<li>Average latency for each hop</li>
+<li>Error rates (if any)</li>
+<li>Service dependencies and communication patterns</li>
+</ul><br />
+<h3 style='display: inline' id='storage-and-retention'>Storage and Retention</h3><br />
+<br />
+<span>Monitor Tempo storage usage:</span><br />
+<br />
+<pre>
+kubectl exec -n monitoring &lt;tempo-pod&gt; -- df -h /var/tempo
+</pre>
+<br />
+<span>With 10Gi storage and 7-day retention, the system handles moderate trace volumes. If storage fills up:</span><br />
+<br />
+<ul>
+<li>Reduce retention to 72h (3 days)</li>
+<li>Implement sampling in Alloy</li>
+<li>Increase PV size</li>
+</ul><br />
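+<span>Sampling can be added in Alloy by placing a probabilistic sampler between the OTLP receiver and the batch processor. A sketch keeping 25% of traces (the receiver&#39;s output would then point at this component instead of the batch processor):</span><br />
+<br />
+<pre>
+// Keep roughly one in four traces
+otelcol.processor.probabilistic_sampler "default" {
+  sampling_percentage = 25
+  output {
+    traces = [otelcol.processor.batch.default.input]
+  }
+}
+</pre>
+<br />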
+<h3 style='display: inline' id='complete-observability-stack'>Complete Observability Stack</h3><br />
+<br />
+<span>The f3s cluster now has complete observability:</span><br />
+<br />
+<span>**Metrics** (Prometheus):</span><br />
+<ul>
+<li>Cluster resource usage</li>
+<li>Application metrics</li>
+<li>Node metrics (FreeBSD ZFS, OpenBSD edge)</li>
+<li>etcd health</li>
+</ul><br />
+<span>**Logs** (Loki):</span><br />
+<ul>
+<li>All pod logs</li>
+<li>Structured log collection</li>
+<li>Log aggregation and search</li>
+</ul><br />
+<span>**Traces** (Tempo):</span><br />
+<ul>
+<li>Distributed request tracing</li>
+<li>Service dependency mapping</li>
+<li>Performance profiling</li>
+<li>Error tracking</li>
+</ul><br />
+<span>**Visualization** (Grafana):</span><br />
+<ul>
+<li>Unified dashboards</li>
+<li>Correlation between metrics, logs, and traces</li>
+<li>Service graphs</li>
+<li>Alerts</li>
+</ul><br />
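+<span>The logs-to-traces correlation is wired up through a derived field on the Loki data source. A minimal provisioning sketch (the data source UID, URL, and trace ID pattern are assumptions about this setup):</span><br />
+<br />
+<pre>
+# Hypothetical Grafana data source provisioning excerpt
+apiVersion: 1
+datasources:
+  - name: Loki
+    type: loki
+    url: http://loki:3100
+    jsonData:
+      derivedFields:
+        - name: TraceID
+          matcherRegex: 'trace_id=(\w+)'
+          url: '$${__value.raw}'
+          datasourceUid: tempo
+</pre>
+<br />
+<span>With this in place, a log line containing a trace ID gets a link in Grafana Explore that jumps straight to the matching Tempo trace.</span><br />
+<br />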
+<h3 style='display: inline' id='configuration-files'>Configuration Files</h3><br />
+<br />
+<span>All configuration files are available on Codeberg:</span><br />
+<br />
+<a class='textlink' href='https://codeberg.org/snonux/conf/src/branch/master/f3s/tempo'>Tempo configuration</a><br />
+<a class='textlink' href='https://codeberg.org/snonux/conf/src/branch/master/f3s/loki'>Alloy configuration (updated for traces)</a><br />
+<a class='textlink' href='https://codeberg.org/snonux/conf/src/branch/master/f3s/tracing-demo'>Demo tracing application</a><br />
+<br />
<h2 style='display: inline' id='summary'>Summary</h2><br />
<br />
-<span>With Prometheus, Grafana, Loki, and Alloy deployed, I now have complete visibility into the k3s cluster, the FreeBSD storage servers, and the OpenBSD edge relays:</span><br />
+<span>With Prometheus, Grafana, Loki, Alloy, and Tempo deployed, I now have complete visibility into the k3s cluster, the FreeBSD storage servers, and the OpenBSD edge relays:</span><br />
<br />
<ul>
-<li>metrics: Prometheus collects and stores time-series data from all components</li>
+<li>Metrics: Prometheus collects and stores time-series data from all components, including etcd and ZFS</li>
<li>Logs: Loki aggregates logs from all containers, searchable via Grafana</li>
-<li>Visualisation: Grafana provides dashboards and exploration tools</li>
+<li>Traces: Tempo provides distributed request tracing with service dependency mapping</li>
+<li>Visualisation: Grafana provides dashboards and exploration tools with correlation between all three signals</li>
<li>Alerting: Alertmanager can notify on conditions defined in Prometheus rules</li>
</ul><br />
<span>This observability stack runs entirely on the home lab infrastructure, with data persisted to the NFS share. It&#39;s lightweight enough for a three-node cluster but provides the same capabilities as production-grade setups.</span><br />
<br />
+<a class='textlink' href='https://codeberg.org/snonux/conf/src/branch/master/f3s/prometheus'>Prometheus configuration on Codeberg</a><br />
+<br />
<span>Other *BSD-related posts:</span><br />
<br />
<a class='textlink' href='./2025-12-07-f3s-kubernetes-with-freebsd-part-8.html'>2025-12-07 f3s: Kubernetes with FreeBSD - Part 8: Observability (You are currently reading this)</a><br />