Diffstat (limited to 'gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.html')
-rw-r--r-- gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.html | 1037
1 files changed, 1027 insertions, 10 deletions
diff --git a/gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.html b/gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.html
index 53bc16d9..0cda1b53 100644
--- a/gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.html
+++ b/gemfeed/2025-12-07-f3s-kubernetes-with-freebsd-part-8.html
@@ -18,7 +18,7 @@
</p>
<h1 style='display: inline' id='f3s-kubernetes-with-freebsd---part-8-observability'>f3s: Kubernetes with FreeBSD - Part 8: Observability</h1><br />
<br />
-<span class='quote'>Published at 2025-12-06T23:58:24+02:00</span><br />
+<span class='quote'>Published at 2025-12-06T23:58:24+02:00, last updated Mon 09 Mar 09:33:08 EET 2026</span><br />
<br />
<span>This is the 8th blog post about the f3s series for my self-hosting demands in a home lab. f3s? The "f" stands for FreeBSD, and the "3s" stands for k3s, the Kubernetes distribution I use on FreeBSD-based physical machines.</span><br />
<br />
@@ -60,23 +60,56 @@
<li>⇢ ⇢ <a href='#adding-freebsd-hosts-to-prometheus'>Adding FreeBSD hosts to Prometheus</a></li>
<li>⇢ ⇢ <a href='#freebsd-memory-metrics-compatibility'>FreeBSD memory metrics compatibility</a></li>
<li>⇢ ⇢ <a href='#disk-io-metrics-limitation'>Disk I/O metrics limitation</a></li>
+<li>⇢ <a href='#zfs-monitoring-for-freebsd-servers'>ZFS Monitoring for FreeBSD Servers</a></li>
+<li>⇢ ⇢ <a href='#node-exporter-zfs-collector'>Node Exporter ZFS Collector</a></li>
+<li>⇢ ⇢ <a href='#verifying-zfs-metrics'>Verifying ZFS Metrics</a></li>
+<li>⇢ ⇢ <a href='#zfs-recording-rules'>ZFS Recording Rules</a></li>
+<li>⇢ ⇢ <a href='#grafana-dashboards'>Grafana Dashboards</a></li>
+<li>⇢ ⇢ <a href='#deployment'>Deployment</a></li>
+<li>⇢ ⇢ <a href='#verifying-zfs-metrics-in-prometheus'>Verifying ZFS Metrics in Prometheus</a></li>
+<li>⇢ ⇢ <a href='#key-metrics-to-monitor'>Key Metrics to Monitor</a></li>
+<li>⇢ ⇢ <a href='#zfs-pool-and-dataset-metrics-via-textfile-collector'>ZFS Pool and Dataset Metrics via Textfile Collector</a></li>
<li>⇢ <a href='#monitoring-external-openbsd-hosts'>Monitoring external OpenBSD hosts</a></li>
<li>⇢ ⇢ <a href='#installing-node-exporter-on-openbsd'>Installing Node Exporter on OpenBSD</a></li>
<li>⇢ ⇢ <a href='#adding-openbsd-hosts-to-prometheus'>Adding OpenBSD hosts to Prometheus</a></li>
<li>⇢ ⇢ <a href='#openbsd-memory-metrics-compatibility'>OpenBSD memory metrics compatibility</a></li>
+<li>⇢ <a href='#distributed-tracing-with-grafana-tempo'>Distributed Tracing with Grafana Tempo</a></li>
+<li>⇢ ⇢ <a href='#why-distributed-tracing'>Why Distributed Tracing?</a></li>
+<li>⇢ ⇢ <a href='#deploying-grafana-tempo'>Deploying Grafana Tempo</a></li>
+<li>⇢ ⇢ ⇢ <a href='#-configuration-strategy'>Configuration Strategy</a></li>
+<li>⇢ ⇢ ⇢ <a href='#-tempo-deployment-files'>Tempo Deployment Files</a></li>
+<li>⇢ ⇢ ⇢ <a href='#-installation'>Installation</a></li>
+<li>⇢ ⇢ <a href='#configuring-grafana-alloy-for-trace-collection'>Configuring Grafana Alloy for Trace Collection</a></li>
+<li>⇢ ⇢ ⇢ <a href='#-otlp-receiver-configuration'>OTLP Receiver Configuration</a></li>
+<li>⇢ ⇢ ⇢ <a href='#-upgrade-alloy'>Upgrade Alloy</a></li>
+<li>⇢ ⇢ <a href='#demo-tracing-application'>Demo Tracing Application</a></li>
+<li>⇢ ⇢ ⇢ <a href='#-application-architecture'>Application Architecture</a></li>
+<li>⇢ ⇢ <a href='#visualizing-traces-in-grafana'>Visualizing Traces in Grafana</a></li>
+<li>⇢ ⇢ ⇢ <a href='#-accessing-traces'>Accessing Traces</a></li>
+<li>⇢ ⇢ ⇢ <a href='#-service-graph-visualization'>Service Graph Visualization</a></li>
+<li>⇢ ⇢ <a href='#correlation-between-observability-signals'>Correlation Between Observability Signals</a></li>
+<li>⇢ ⇢ ⇢ <a href='#-traces-to-logs'>Traces-to-Logs</a></li>
+<li>⇢ ⇢ ⇢ <a href='#-traces-to-metrics'>Traces-to-Metrics</a></li>
+<li>⇢ ⇢ ⇢ <a href='#-logs-to-traces'>Logs-to-Traces</a></li>
+<li>⇢ ⇢ <a href='#generating-traces-for-testing'>Generating Traces for Testing</a></li>
+<li>⇢ ⇢ <a href='#verifying-the-complete-pipeline'>Verifying the Complete Pipeline</a></li>
+<li>⇢ ⇢ <a href='#practical-example-viewing-a-distributed-trace'>Practical Example: Viewing a Distributed Trace</a></li>
+<li>⇢ ⇢ <a href='#storage-and-retention'>Storage and Retention</a></li>
+<li>⇢ ⇢ <a href='#configuration-files'>Configuration Files</a></li>
<li>⇢ <a href='#summary'>Summary</a></li>
</ul><br />
<h2 style='display: inline' id='introduction'>Introduction</h2><br />
<br />
-<span>In this blog post, I set up a complete observability stack for the k3s cluster. Observability is crucial for understanding what&#39;s happening inside the cluster—whether its tracking resource usage, debugging issues, or analysing application behaviour. The stack consists of four main components, all deployed into the <span class='inlinecode'>monitoring</span> namespace:</span><br />
+<span>In this blog post, I set up a complete observability stack for the k3s cluster. Observability is crucial for understanding what&#39;s happening inside the cluster, whether it&#39;s tracking resource usage, debugging issues, or analysing application behaviour. The stack consists of five main components, all deployed into the <span class='inlinecode'>monitoring</span> namespace:</span><br />
<br />
<ul>
<li>Prometheus: time-series database for metrics collection and alerting</li>
<li>Grafana: visualisation and dashboarding frontend</li>
<li>Loki: log aggregation system (like Prometheus, but for logs)</li>
-<li>Alloy: telemetry collector that ships logs from all pods to Loki</li>
+<li>Alloy: telemetry collector that ships logs from all pods to Loki and traces to Tempo</li>
+<li>Tempo: distributed tracing backend for request flow analysis across microservices</li>
</ul><br />
-<span>Together, these form the "PLG" stack (Prometheus, Loki, Grafana), which is a popular open-source alternative to commercial observability platforms.</span><br />
+<span>Together, these form the "PLG" stack (Prometheus, Loki, Grafana), a popular open-source alternative to commercial observability platforms, extended here with Tempo for distributed tracing.</span><br />
<br />
<span>All manifests for the f3s stack live in my configuration repository:</span><br />
<br />
@@ -120,6 +153,7 @@ http://www.gnu.org/software/src-highlite -->
<li><span class='inlinecode'>/data/nfs/k3svolumes/prometheus/data</span> — Prometheus time-series database</li>
<li><span class='inlinecode'>/data/nfs/k3svolumes/grafana/data</span> — Grafana configuration, dashboards, and plugins</li>
<li><span class='inlinecode'>/data/nfs/k3svolumes/loki/data</span> — Loki log chunks and index</li>
+<li><span class='inlinecode'>/data/nfs/k3svolumes/tempo/data</span> — Tempo trace data and WAL</li>
</ul><br />
<span>Each path gets a corresponding <span class='inlinecode'>PersistentVolume</span> and <span class='inlinecode'>PersistentVolumeClaim</span> in Kubernetes, allowing pods to mount them as regular volumes. Because the underlying storage is ZFS with replication, we get snapshots and redundancy for free.</span><br />
<br />
@@ -218,7 +252,7 @@ kubeControllerManager:
insecureSkipVerify: true
</pre>
<br />
-<span>By default, k3s binds the controller-manager to localhost only, so the "Kubernetes / Controller Manager" dashboard in Grafana will show no data. To expose the metrics endpoint, add the following to <span class='inlinecode'>/etc/rancher/k3s/config.yaml</span> on each k3s server node:</span><br />
+<span>By default, k3s binds the controller-manager to localhost only and doesn&#39;t expose etcd metrics, so the "Kubernetes / Controller Manager" and "etcd" dashboards in Grafana will show no data. To fix both, add the following to <span class='inlinecode'>/etc/rancher/k3s/config.yaml</span> on each k3s server node:</span><br />
<br />
<!-- Generator: GNU source-highlight 3.1.9
by Lorenzo Bettini
@@ -227,11 +261,26 @@ http://www.gnu.org/software/src-highlite -->
<pre><font color="#F3E651">[</font><font color="#ff0000">root@r0 </font><font color="#F3E651">~]</font><i><font color="#ababab"># cat &gt;&gt; /etc/rancher/k3s/config.yaml &lt;&lt; 'EOF'</font></i>
<font color="#ff0000">kube-controller-manager-arg</font><font color="#F3E651">:</font>
<font color="#ff0000"> - bind-address</font><font color="#F3E651">=</font><font color="#bb00ff">0.0</font><font color="#F3E651">.</font><font color="#bb00ff">0.0</font>
+<font color="#ff0000">etcd-expose-metrics</font><font color="#F3E651">:</font><font color="#ff0000"> </font><b><font color="#ffffff">true</font></b>
<font color="#ff0000">EOF</font>
<font color="#F3E651">[</font><font color="#ff0000">root@r0 </font><font color="#F3E651">~]</font><i><font color="#ababab"># systemctl restart k3s</font></i>
</pre>
<br />
-<span>Repeat for <span class='inlinecode'>r1</span> and <span class='inlinecode'>r2</span>. After restarting all nodes, the controller-manager metrics endpoint will be accessible and Prometheus can scrape it.</span><br />
+<span>Repeat for <span class='inlinecode'>r1</span> and <span class='inlinecode'>r2</span>. After restarting all nodes, the controller-manager metrics endpoint is accessible and etcd metrics are available on port 2381, so Prometheus can scrape both.</span><br />
+<br />
+<span>Verify etcd metrics are exposed:</span><br />
+<br />
+<!-- Generator: GNU source-highlight 3.1.9
+by Lorenzo Bettini
+http://www.lorenzobettini.it
+http://www.gnu.org/software/src-highlite -->
+<pre><font color="#F3E651">[</font><font color="#ff0000">root@r0 </font><font color="#F3E651">~]</font><i><font color="#ababab"># curl -s http://127.0.0.1:2381/metrics | grep etcd_server_has_leader</font></i>
+<font color="#ff0000">etcd_server_has_leader </font><font color="#bb00ff">1</font>
+</pre>
+<br />
+<span>The full <span class='inlinecode'>persistence-values.yaml</span> and all other Prometheus configuration files are available on Codeberg:</span><br />
+<br />
+<a class='textlink' href='https://codeberg.org/snonux/conf/src/branch/master/f3s/prometheus'>codeberg.org/snonux/conf/f3s/prometheus</a><br />
<br />
<span>The persistent volume definitions bind to specific paths on the NFS share using <span class='inlinecode'>hostPath</span> volumes—the same pattern used for other services in Part 7:</span><br />
<br />
@@ -258,6 +307,8 @@ http://www.gnu.org/software/src-highlite -->
<br />
<a href='./f3s-kubernetes-with-freebsd-part-8/grafana-dashboard.png'><img alt='Grafana dashboard showing cluster metrics' title='Grafana dashboard showing cluster metrics' src='./f3s-kubernetes-with-freebsd-part-8/grafana-dashboard.png' /></a><br />
<br />
+<a href='./f3s-kubernetes-with-freebsd-part-8/grafana-etcd-dashboard.png'><img alt='Grafana etcd dashboard showing cluster health, RPC rate, disk sync duration, and peer round trip times' title='Grafana etcd dashboard showing cluster health, RPC rate, disk sync duration, and peer round trip times' src='./f3s-kubernetes-with-freebsd-part-8/grafana-etcd-dashboard.png' /></a><br />
+<br />
<h2 style='display: inline' id='installing-loki-and-alloy'>Installing Loki and Alloy</h2><br />
<br />
<span>While Prometheus handles metrics, Loki handles logs. It&#39;s designed to be cost-effective and easy to operate—it doesn&#39;t index the contents of logs, only the metadata (labels), making it very efficient for storage.</span><br />
@@ -409,8 +460,11 @@ http://www.gnu.org/software/src-highlite -->
<font color="#ff0000">prometheus-prometheus-node-exporter-2nsg9 </font><font color="#bb00ff">1</font><font color="#F3E651">/</font><font color="#bb00ff">1</font><font color="#ff0000"> Running </font><font color="#bb00ff">0</font><font color="#ff0000"> 42d</font>
<font color="#ff0000">prometheus-prometheus-node-exporter-mqr</font><font color="#bb00ff">25</font><font color="#ff0000"> </font><font color="#bb00ff">1</font><font color="#F3E651">/</font><font color="#bb00ff">1</font><font color="#ff0000"> Running </font><font color="#bb00ff">0</font><font color="#ff0000"> 42d</font>
<font color="#ff0000">prometheus-prometheus-node-exporter-wp4ds </font><font color="#bb00ff">1</font><font color="#F3E651">/</font><font color="#bb00ff">1</font><font color="#ff0000"> Running </font><font color="#bb00ff">0</font><font color="#ff0000"> 42d</font>
+<font color="#ff0000">tempo-</font><font color="#bb00ff">0</font><font color="#ff0000"> </font><font color="#bb00ff">1</font><font color="#F3E651">/</font><font color="#bb00ff">1</font><font color="#ff0000"> Running </font><font color="#bb00ff">0</font><font color="#ff0000"> 1d</font>
</pre>
<br />
+<span>Note: Tempo (<span class='inlinecode'>tempo-0</span>) is deployed later in this post in the "Distributed Tracing with Grafana Tempo" section. It is included in the pod listing here for completeness.</span><br />
+<br />
<span>And the services:</span><br />
<br />
<!-- Generator: GNU source-highlight 3.1.9
@@ -429,6 +483,7 @@ http://www.gnu.org/software/src-highlite -->
<font color="#ff0000">prometheus-kube-prometheus-prometheus ClusterIP </font><font color="#bb00ff">10.43</font><font color="#F3E651">.</font><font color="#bb00ff">152.163</font><font color="#ff0000"> </font><font color="#bb00ff">9090</font><font color="#ff0000">/TCP</font><font color="#F3E651">,</font><font color="#bb00ff">8080</font><font color="#ff0000">/TCP</font>
<font color="#ff0000">prometheus-kube-state-metrics ClusterIP </font><font color="#bb00ff">10.43</font><font color="#F3E651">.</font><font color="#bb00ff">64.26</font><font color="#ff0000"> </font><font color="#bb00ff">8080</font><font color="#ff0000">/TCP</font>
<font color="#ff0000">prometheus-prometheus-node-exporter ClusterIP </font><font color="#bb00ff">10.43</font><font color="#F3E651">.</font><font color="#bb00ff">127.242</font><font color="#ff0000"> </font><font color="#bb00ff">9100</font><font color="#ff0000">/TCP</font>
+<font color="#ff0000">tempo ClusterIP </font><font color="#bb00ff">10.43</font><font color="#F3E651">.</font><font color="#bb00ff">91.44</font><font color="#ff0000"> </font><font color="#bb00ff">3200</font><font color="#ff0000">/TCP</font><font color="#F3E651">,</font><font color="#bb00ff">4317</font><font color="#ff0000">/TCP</font><font color="#F3E651">,</font><font color="#bb00ff">4318</font><font color="#ff0000">/TCP</font>
</pre>
<br />
<span>Let me break down what each pod does:</span><br />
@@ -457,6 +512,9 @@ http://www.gnu.org/software/src-highlite -->
<ul>
<li><span class='inlinecode'>prometheus-prometheus-node-exporter-...</span>: three Node Exporter pods running as a DaemonSet, one on each node. They expose hardware and OS-level metrics: CPU usage, memory, disk I/O, filesystem usage, network statistics, and more. These feed the "Node Exporter" dashboards in Grafana.</li>
</ul><br />
+<ul>
+<li><span class='inlinecode'>tempo-0</span>: the Grafana Tempo instance for distributed tracing. It receives trace data from Alloy via OTLP (OpenTelemetry Protocol), stores traces on the NFS-backed persistent volume, and serves queries to Grafana. Tempo is covered in detail in the "Distributed Tracing with Grafana Tempo" section later in this post.</li>
+</ul><br />
<h2 style='display: inline' id='using-the-observability-stack'>Using the observability stack</h2><br />
<br />
<h3 style='display: inline' id='viewing-metrics-in-grafana'>Viewing metrics in Grafana</h3><br />
@@ -642,7 +700,313 @@ spec:
<br />
<span>Unlike memory metrics, disk I/O metrics (<span class='inlinecode'>node_disk_read_bytes_total</span>, <span class='inlinecode'>node_disk_written_bytes_total</span>, etc.) are not available on FreeBSD. The Linux diskstats collector that provides these metrics doesn&#39;t have a FreeBSD equivalent in the node_exporter.</span><br />
<br />
-<span>The disk I/O panels in the Node Exporter dashboards will show "No data" for FreeBSD hosts. FreeBSD does expose ZFS-specific metrics (<span class='inlinecode'>node_zfs_arcstats_*</span>) for ARC cache performance, and per-dataset I/O stats are available via <span class='inlinecode'>sysctl kstat.zfs</span>, but mapping these to the Linux-style metrics the dashboards expect is non-trivial. Creating custom ZFS-specific dashboards is left as an exercise for another day.</span><br />
+<span>The disk I/O panels in the Node Exporter dashboards will show "No data" for FreeBSD hosts. FreeBSD does expose ZFS-specific metrics (<span class='inlinecode'>node_zfs_arcstats_*</span>) for ARC cache performance, and per-dataset I/O stats are available via <span class='inlinecode'>sysctl kstat.zfs</span>, but mapping these to the Linux-style metrics the dashboards expect is non-trivial. To address this, I created custom ZFS-specific dashboards, covered in the next section.</span><br />
+<br />
+<h2 style='display: inline' id='zfs-monitoring-for-freebsd-servers'>ZFS Monitoring for FreeBSD Servers</h2><br />
+<br />
+<span>The FreeBSD servers (f0, f1, f2) that provide NFS storage to the k3s cluster store their data on ZFS. Monitoring ZFS is crucial for understanding storage performance and cache efficiency.</span><br />
+<br />
+<h3 style='display: inline' id='node-exporter-zfs-collector'>Node Exporter ZFS Collector</h3><br />
+<br />
+<span>The node_exporter running on each FreeBSD server (v1.9.1) includes a built-in ZFS collector that reads its statistics from sysctls. The collector is enabled by default and provides:</span><br />
+<br />
+<ul>
+<li>ARC (Adaptive Replacement Cache) statistics</li>
+<li>Cache hit/miss rates</li>
+<li>Memory usage and allocation</li>
+<li>MRU/MFU cache breakdown</li>
+<li>Data vs metadata distribution</li>
+</ul><br />
+<h3 style='display: inline' id='verifying-zfs-metrics'>Verifying ZFS Metrics</h3><br />
+<br />
+<span>On any FreeBSD server, check that ZFS metrics are being exposed:</span><br />
+<br />
+<pre>
+paul@f0:~ % curl -s http://localhost:9100/metrics | grep node_zfs_arcstats | wc -l
+ 69
+</pre>
+<br />
+<span>Prometheus scrapes these metrics automatically through the existing static configuration in <span class='inlinecode'>additional-scrape-configs.yaml</span>, which targets all FreeBSD servers on port 9100 with the <span class='inlinecode'>os: freebsd</span> label.</span><br />
+<br />
+<h3 style='display: inline' id='zfs-recording-rules'>ZFS Recording Rules</h3><br />
+<br />
+<span>I created recording rules in <span class='inlinecode'>zfs-recording-rules.yaml</span> to keep the dashboard queries simple:</span><br />
+<br />
+<pre>
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+ name: freebsd-zfs-rules
+ namespace: monitoring
+ labels:
+ release: prometheus
+spec:
+ groups:
+ - name: freebsd-zfs-arc
+ interval: 30s
+ rules:
+ - record: node_zfs_arc_hit_rate_percent
+ expr: |
+ 100 * (
+ rate(node_zfs_arcstats_hits_total{os="freebsd"}[5m]) /
+ (rate(node_zfs_arcstats_hits_total{os="freebsd"}[5m]) +
+ rate(node_zfs_arcstats_misses_total{os="freebsd"}[5m]))
+ )
+ labels:
+ os: freebsd
+ - record: node_zfs_arc_memory_usage_percent
+ expr: |
+ 100 * (
+ node_zfs_arcstats_size_bytes{os="freebsd"} /
+ node_zfs_arcstats_c_max_bytes{os="freebsd"}
+ )
+ labels:
+ os: freebsd
+ # Additional rules for metadata %, target %, MRU/MFU %, etc.
+</pre>
+<br />
+<span>These recording rules calculate:</span><br />
+<br />
+<ul>
+<li>ARC hit rate percentage</li>
+<li>ARC memory usage percentage (current vs maximum)</li>
+<li>ARC target percentage (target vs maximum)</li>
+<li>Metadata vs data percentages</li>
+<li>MRU vs MFU cache percentages</li>
+<li>Demand data and metadata hit rates</li>
+</ul><br />
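+<span>As a worked illustration of the hit rate rule, the following is a minimal sketch in plain shell. The function name <span class='inlinecode'>arc_hit_rate_percent</span> and the sample numbers are made up for this example and are not part of the f3s repository; the sketch only reproduces the arithmetic of the recording rule, 100 * hits / (hits + misses), outside of Prometheus:</span><br />
+<br />
```shell
#!/bin/sh
# Hypothetical helper (not part of the f3s repository): computes the same
# ratio as the node_zfs_arc_hit_rate_percent recording rule,
# 100 * hits / (hits + misses), from two numbers given as arguments.
arc_hit_rate_percent() {
    hits="$1"
    misses="$2"
    awk -v h="$hits" -v m="$misses" 'BEGIN {
        total = h + m
        # With no ARC accesses at all the ratio is undefined.
        if (total == 0) { print "NaN"; exit }
        printf "%.2f\n", 100 * h / total
    }'
}

# Example with made-up per-second rates: 950 hits/s and 50 misses/s.
arc_hit_rate_percent 950 50   # prints 95.00
```
+<br />
+<span>On a FreeBSD host, the lifetime counters could be fed in via <span class='inlinecode'>sysctl -n kstat.zfs.misc.arcstats.hits</span> and <span class='inlinecode'>sysctl -n kstat.zfs.misc.arcstats.misses</span>. That yields an average since boot rather than the 5-minute rate the recording rule uses.</span><br />
+<br />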
+<h3 style='display: inline' id='grafana-dashboards'>Grafana Dashboards</h3><br />
+<br />
+<span>I created two ZFS monitoring dashboards (<span class='inlinecode'>zfs-dashboards.yaml</span>):</span><br />
+<br />
+<span><b>Dashboard 1: FreeBSD ZFS (per-host detailed view)</b></span><br />
+<br />
+<span>Includes variables to select:</span><br />
+<br />
+<ul>
+<li>FreeBSD server (f0, f1, or f2)</li>
+<li>ZFS pool (zdata, zroot, or all)</li>
+</ul><br />
+<span>Pool Overview Row:</span><br />
+<br />
+<ul>
+<li>Pool Capacity gauge (with thresholds: green &lt;70%, yellow &lt;85%, red &gt;=85%)</li>
+<li>Pool Health status (ONLINE/DEGRADED/FAULTED with color coding)</li>
+<li>Total Pool Size stat</li>
+<li>Free Space stat</li>
+<li>Pool Space Usage Over Time (stacked: used + free)</li>
+<li>Pool Capacity Trend time series</li>
+</ul><br />
+<span>Dataset Statistics Row:</span><br />
+<br />
+<ul>
+<li>Table showing all datasets with columns: Pool, Dataset, Used, Available, Referenced</li>
+<li>Automatically filters by selected pool</li>
+</ul><br />
+<span>ARC Cache Statistics Row:</span><br />
+<br />
+<ul>
+<li>ARC Hit Rate gauge (red &lt;70%, yellow &lt;90%, green &gt;=90%)</li>
+<li>ARC Size time series (current, target, max)</li>
+<li>ARC Memory Usage percentage gauge</li>
+<li>ARC Hits vs Misses rate</li>
+<li>ARC Data vs Metadata stacked time series</li>
+</ul><br />
+<span><b>Dashboard 2: FreeBSD ZFS Summary (cluster-wide overview)</b></span><br />
+<br />
+<span>Cluster-Wide Pool Statistics Row:</span><br />
+<br />
+<ul>
+<li>Total Storage Capacity across all servers</li>
+<li>Total Used space</li>
+<li>Total Free space</li>
+<li>Average Pool Capacity gauge</li>
+<li>Pool Health Status (worst case across cluster)</li>
+<li>Total Pool Space Usage Over Time</li>
+<li>Per-Pool Capacity time series (all pools on all hosts)</li>
+</ul><br />
+<span>Per-Host Pool Breakdown Row:</span><br />
+<br />
+<ul>
+<li>Bar gauge showing capacity by host and pool</li>
+<li>Table with all pools: Host, Pool, Size, Used, Free, Capacity %, Health</li>
+</ul><br />
+<span>Cluster-Wide ARC Statistics Row:</span><br />
+<br />
+<ul>
+<li>Average ARC Hit Rate gauge across all hosts</li>
+<li>ARC Hit Rate by Host time series</li>
+<li>Total ARC Size Across Cluster</li>
+<li>Total ARC Hits vs Misses (cluster-wide sum)</li>
+<li>ARC Size by Host</li>
+</ul><br />
+<span>Dashboard Visualization:</span><br />
+<br />
+<a href='./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-dashboard.png'><img alt='ZFS monitoring dashboard in Grafana showing pool capacity, health, and I/O throughput' title='ZFS monitoring dashboard in Grafana showing pool capacity, health, and I/O throughput' src='./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-dashboard.png' /></a><br />
+<a href='./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-arc-stats.png'><img alt='ZFS ARC cache statistics showing hit rate, memory usage, and size trends' title='ZFS ARC cache statistics showing hit rate, memory usage, and size trends' src='./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-arc-stats.png' /></a><br />
+<a href='./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-datasets.png'><img alt='ZFS datasets table and ARC data vs metadata breakdown' title='ZFS datasets table and ARC data vs metadata breakdown' src='./f3s-kubernetes-with-freebsd-part-8/grafana-zfs-datasets.png' /></a><br />
+<br />
+<h3 style='display: inline' id='deployment'>Deployment</h3><br />
+<br />
+<span>I applied the resources to the cluster:</span><br />
+<br />
+<pre>
+cd /home/paul/git/conf/f3s/prometheus
+kubectl apply -f zfs-recording-rules.yaml
+kubectl apply -f zfs-dashboards.yaml
+</pre>
+<br />
+<span>I also updated the Justfile so that the install and upgrade targets apply the ZFS recording rules:</span><br />
+<br />
+<pre>
+install:
+ kubectl apply -f persistent-volumes.yaml
+ kubectl create secret generic additional-scrape-configs --from-file=additional-scrape-configs.yaml -n monitoring --dry-run=client -o yaml | kubectl apply -f -
+ helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring -f persistence-values.yaml
+ kubectl apply -f freebsd-recording-rules.yaml
+ kubectl apply -f openbsd-recording-rules.yaml
+ kubectl apply -f zfs-recording-rules.yaml
+ just -f grafana-ingress/Justfile install
+</pre>
+<br />
+<h3 style='display: inline' id='verifying-zfs-metrics-in-prometheus'>Verifying ZFS Metrics in Prometheus</h3><br />
+<br />
+<span>Check that ZFS metrics are being collected:</span><br />
+<br />
+<pre>
+kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- \
+ wget -qO- &#39;http://localhost:9090/api/v1/query?query=node_zfs_arcstats_size_bytes&#39;
+</pre>
+<br />
+<span>Check that the recording rules are being evaluated:</span><br />
+<br />
+<pre>
+kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- \
+ wget -qO- &#39;http://localhost:9090/api/v1/query?query=node_zfs_arc_memory_usage_percent&#39;
+</pre>
+<br />
+<span>The example output shows the ARC memory usage percentage for each FreeBSD server:</span><br />
+<br />
+<pre>
+"result":[
+ {"metric":{"instance":"192.168.2.130:9100","os":"freebsd"},"value":[...,"37.58"]},
+ {"metric":{"instance":"192.168.2.131:9100","os":"freebsd"},"value":[...,"12.85"]},
+ {"metric":{"instance":"192.168.2.132:9100","os":"freebsd"},"value":[...,"13.44"]}
+]
+</pre>
+<br />
+<h3 style='display: inline' id='key-metrics-to-monitor'>Key Metrics to Monitor</h3><br />
+<br />
+<ul>
+<li>ARC Hit Rate: Should typically be above 90% for optimal performance. A lower hit rate suggests that the ARC is too small or that the workload has poor locality.</li>
+<li>ARC Memory Usage: Shows how much of the maximum ARC size is being used. If consistently at or near maximum, the ARC is effectively utilizing available memory.</li>
+<li>Data vs Metadata: Typically data should dominate, but workloads with many small files will show higher metadata percentages.</li>
+<li>MRU vs MFU: Most Recently Used vs Most Frequently Used cache. The ratio depends on workload characteristics.</li>
+<li>Pool Capacity: Monitor pool usage to ensure adequate free space. ZFS performance tends to degrade once a pool fills beyond roughly 80% capacity.</li>
+<li>Pool Health: Should always show ONLINE (green). DEGRADED (yellow) indicates a disk issue requiring attention. FAULTED (red) requires immediate action.</li>
+<li>Dataset Usage: Track which datasets are consuming the most space to identify growth trends and plan capacity.</li>
+</ul><br />
+<h3 style='display: inline' id='zfs-pool-and-dataset-metrics-via-textfile-collector'>ZFS Pool and Dataset Metrics via Textfile Collector</h3><br />
+<br />
+<span>To complement the ARC statistics from node_exporter&#39;s built-in ZFS collector, I added pool capacity and dataset metrics using the textfile collector feature.</span><br />
+<br />
+<span>I created a script at <span class='inlinecode'>/usr/local/bin/zfs_pool_metrics.sh</span> on each FreeBSD server:</span><br />
+<br />
+<pre>
+#!/bin/sh
+# ZFS Pool and Dataset Metrics Collector for Prometheus
+
+OUTPUT_FILE="/var/tmp/node_exporter/zfs_pools.prom.$$"
+FINAL_FILE="/var/tmp/node_exporter/zfs_pools.prom"
+
+mkdir -p /var/tmp/node_exporter
+
+{
+ # Pool metrics
+ echo "# HELP zfs_pool_size_bytes Total size of ZFS pool"
+ echo "# TYPE zfs_pool_size_bytes gauge"
+ echo "# HELP zfs_pool_allocated_bytes Allocated space in ZFS pool"
+ echo "# TYPE zfs_pool_allocated_bytes gauge"
+ echo "# HELP zfs_pool_free_bytes Free space in ZFS pool"
+ echo "# TYPE zfs_pool_free_bytes gauge"
+ echo "# HELP zfs_pool_capacity_percent Capacity percentage"
+ echo "# TYPE zfs_pool_capacity_percent gauge"
+    echo "# HELP zfs_pool_health Pool health (0=ONLINE, 1=DEGRADED, 2=FAULTED, 6=any other state)"
+ echo "# TYPE zfs_pool_health gauge"
+
+ zpool list -Hp -o name,size,allocated,free,capacity,health | \
+ while IFS=$&#39;\t&#39; read name size alloc free cap health; do
+ case "$health" in
+ ONLINE) health_val=0 ;;
+ DEGRADED) health_val=1 ;;
+ FAULTED) health_val=2 ;;
+ *) health_val=6 ;;
+ esac
+ cap_num=$(echo "$cap" | sed &#39;s/%//&#39;)
+
+ echo "zfs_pool_size_bytes{pool=\"$name\"} $size"
+ echo "zfs_pool_allocated_bytes{pool=\"$name\"} $alloc"
+ echo "zfs_pool_free_bytes{pool=\"$name\"} $free"
+ echo "zfs_pool_capacity_percent{pool=\"$name\"} $cap_num"
+ echo "zfs_pool_health{pool=\"$name\"} $health_val"
+ done
+
+ # Dataset metrics
+ echo "# HELP zfs_dataset_used_bytes Used space in dataset"
+ echo "# TYPE zfs_dataset_used_bytes gauge"
+ echo "# HELP zfs_dataset_available_bytes Available space"
+ echo "# TYPE zfs_dataset_available_bytes gauge"
+ echo "# HELP zfs_dataset_referenced_bytes Referenced space"
+ echo "# TYPE zfs_dataset_referenced_bytes gauge"
+
+ zfs list -Hp -t filesystem -o name,used,available,referenced | \
+ while IFS=$&#39;\t&#39; read name used avail ref; do
+ pool=$(echo "$name" | cut -d/ -f1)
+ echo "zfs_dataset_used_bytes{pool=\"$pool\",dataset=\"$name\"} $used"
+ echo "zfs_dataset_available_bytes{pool=\"$pool\",dataset=\"$name\"} $avail"
+ echo "zfs_dataset_referenced_bytes{pool=\"$pool\",dataset=\"$name\"} $ref"
+ done
+} &gt; "$OUTPUT_FILE"
+
+mv "$OUTPUT_FILE" "$FINAL_FILE"
+</pre>
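+<br />
+<span>Because a malformed line in a textfile collector file can cause node_exporter to reject that file, a quick format check can be useful while developing such a script. The following is a minimal sketch, not part of the original setup: the function name <span class='inlinecode'>check_prom_file</span> is hypothetical, and the pattern only covers the simple <span class='inlinecode'>name{labels} number</span> lines that the script above emits, not the full exposition format:</span><br />
+<br />
```shell
#!/bin/sh
# Hypothetical sanity check (not part of the original script): prints every
# line that is neither a comment, nor blank, nor of the simple form
#   metric_name{label="value",...} number
# and returns non-zero if at least one such line exists.
check_prom_file() {
    file="$1"
    pattern='^[a-zA-Z_:][a-zA-Z0-9_:]*(\{[^}]*\})? [-+]?[0-9][0-9.eE+-]*$'
    # First grep drops comments and blank lines, second grep prints
    # whatever does not look like a metric sample.
    if grep -Ev -e '^#' -e '^$' "$file" | grep -Ev -e "$pattern"; then
        return 1    # grep printed at least one malformed line
    fi
    return 0
}

# Possible usage on a FreeBSD host:
#   check_prom_file /var/tmp/node_exporter/zfs_pools.prom && echo "looks fine"
```
+<br />
+<span>A typical failure this would catch is a sample line with an empty value, for example when a property could not be read.</span><br />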
+<br />
+<span>I deployed the script to all FreeBSD servers:</span><br />
+<br />
+<pre>
+for host in f0 f1 f2; do
+ scp /tmp/zfs_pool_metrics.sh paul@$host:/tmp/
+ ssh paul@$host &#39;doas mv /tmp/zfs_pool_metrics.sh /usr/local/bin/ &amp;&amp; \
+ doas chmod +x /usr/local/bin/zfs_pool_metrics.sh&#39;
+done
+</pre>
+<br />
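+<span>I set up cron jobs to run the script every minute. Note that piping into <span class='inlinecode'>crontab -</span> replaces root&#39;s existing crontab, so this is only safe when it has no other entries:</span><br />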
+<br />
+<pre>
+for host in f0 f1 f2; do
+ ssh paul@$host &#39;echo "* * * * * /usr/local/bin/zfs_pool_metrics.sh &gt;/dev/null 2&gt;&amp;1" | \
+ doas crontab -&#39;
+done
+</pre>
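+<br />
+<span>If root already has cron entries on a host, the new line can be merged into the existing crontab instead of overwriting it. The following is a minimal sketch under that assumption; the helper name <span class='inlinecode'>add_cron_entry</span> is hypothetical and not part of the original setup:</span><br />
+<br />
```shell
#!/bin/sh
# Hypothetical helper: reads an existing crontab on stdin and prints it with
# the given entry appended, unless an identical line is already present.
# The entry is passed through the environment so awk does not interpret
# backslashes in it.
add_cron_entry() {
    ENTRY="$1" awk '
        { print }
        $0 == ENVIRON["ENTRY"] { found = 1 }
        END { if (!found) print ENVIRON["ENTRY"] }
    '
}

# Possible usage on one host (crontab -l fails harmlessly if no crontab
# exists yet, in which case only the new entry is installed):
#   doas crontab -l 2>/dev/null \
#     | add_cron_entry '* * * * * /usr/local/bin/zfs_pool_metrics.sh >/dev/null 2>&1' \
#     | doas crontab -
```
+<br />
+<span>Running it a second time leaves the crontab unchanged, so the loop over the hosts stays safe to repeat.</span><br />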
+<br />
+<span>The textfile collector (already configured with --collector.textfile.directory=/var/tmp/node_exporter) automatically picks up the metrics.</span><br />
+<br />
+<span>Verify metrics are being exposed:</span><br />
+<br />
+<pre>
+paul@f0:~ % curl -s http://localhost:9100/metrics | grep "^zfs_pool" | head -5
+zfs_pool_allocated_bytes{pool="zdata"} 6.47622733824e+11
+zfs_pool_allocated_bytes{pool="zroot"} 5.3338578944e+10
+zfs_pool_capacity_percent{pool="zdata"} 64
+zfs_pool_capacity_percent{pool="zroot"} 10
+zfs_pool_free_bytes{pool="zdata"} 3.48809678848e+11
+</pre>
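+<br />
+<span>node_exporter prints large values in scientific notation, which is hard to read at a glance. As a small convenience sketch, the values can be converted to GiB with awk; the function name <span class='inlinecode'>bytes_to_gib</span> is made up for this example:</span><br />
+<br />
```shell
#!/bin/sh
# Hypothetical helper: converts a byte count (plain or in scientific
# notation, as printed by node_exporter) to GiB with one decimal place.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

# The allocated bytes of the zdata pool from the output above:
bytes_to_gib 6.47622733824e+11   # prints 603.1
```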
+<br />
+<span>All ZFS-related configuration files are available on Codeberg:</span><br />
+<br />
+<a class='textlink' href='https://codeberg.org/snonux/conf/src/branch/master/f3s/prometheus/zfs-recording-rules.yaml'>zfs-recording-rules.yaml on Codeberg</a><br />
+<a class='textlink' href='https://codeberg.org/snonux/conf/src/branch/master/f3s/prometheus/zfs-dashboards.yaml'>zfs-dashboards.yaml on Codeberg</a><br />
<br />
<h2 style='display: inline' id='monitoring-external-openbsd-hosts'>Monitoring external OpenBSD hosts</h2><br />
<br />
@@ -769,18 +1133,671 @@ spec:
<br />
<span>After running <span class='inlinecode'>just upgrade</span>, the OpenBSD hosts appear in Prometheus targets and the Node Exporter dashboards.</span><br />
<br />
+<h2 style='display: inline' id='distributed-tracing-with-grafana-tempo'>Distributed Tracing with Grafana Tempo</h2><br />
+<br />
+<span>After implementing logs (Loki) and metrics (Prometheus), the final pillar of observability is distributed tracing. Grafana Tempo provides distributed tracing capabilities that help understand request flows across microservices.</span><br />
+<br />
+<span>For a preview of what distributed tracing with Tempo looks like in Grafana, see the X-RAG blog post:</span><br />
+<br />
+<a class='textlink' href='./2025-12-24-x-rag-observability-hackathon.html'>X-RAG Observability Hackathon</a><br />
+<br />
+<h3 style='display: inline' id='why-distributed-tracing'>Why Distributed Tracing?</h3><br />
+<br />
+<span>In a microservices architecture, a single user request may traverse multiple services. Distributed tracing:</span><br />
+<br />
+<ul>
+<li>Tracks requests across service boundaries</li>
+<li>Identifies performance bottlenecks</li>
+<li>Visualizes service dependencies</li>
+<li>Correlates with logs and metrics</li>
+<li>Helps debug complex distributed systems</li>
+</ul><br />
+<h3 style='display: inline' id='deploying-grafana-tempo'>Deploying Grafana Tempo</h3><br />
+<br />
+<span>Tempo is deployed in monolithic mode, following the same pattern as Loki&#39;s SingleBinary deployment.</span><br />
+<br />
+<span>#### Configuration Strategy</span><br />
+<br />
+<span>**Deployment Mode:** Monolithic (all components in one process)</span><br />
+<ul>
+<li>Simpler operation than microservices mode</li>
+<li>Suitable for the cluster scale</li>
+<li>Consistent with Loki deployment pattern</li>
+</ul><br />
+<span>**Storage:** Filesystem backend using hostPath</span><br />
+<ul>
+<li>10Gi storage at /data/nfs/k3svolumes/tempo/data</li>
+<li>7-day retention (168h)</li>
+<li>Local filesystem storage works in monolithic mode; object storage (S3, GCS, Azure) is also supported but not needed at this scale</li>
+</ul><br />
+<span>**OTLP Receivers:** Standard OpenTelemetry Protocol ports</span><br />
+<ul>
+<li>gRPC: 4317</li>
+<li>HTTP: 4318</li>
+<li>Bind to 0.0.0.0, because Tempo 2.7+ binds the OTLP receivers to localhost by default</li>
+</ul><br />
+<span>#### Tempo Deployment Files</span><br />
+<br />
+<span>The deployment files live in /home/paul/git/conf/f3s/tempo/:</span><br />
+<br />
+<span>**values.yaml** - Helm chart configuration:</span><br />
+<br />
+<pre>
+tempo:
+  retention: 168h
+  storage:
+    trace:
+      backend: local
+      local:
+        path: /var/tempo/traces
+      wal:
+        path: /var/tempo/wal
+  receivers:
+    otlp:
+      protocols:
+        grpc:
+          endpoint: 0.0.0.0:4317
+        http:
+          endpoint: 0.0.0.0:4318
+
+persistence:
+  enabled: true
+  size: 10Gi
+  storageClassName: ""
+
+resources:
+  limits:
+    cpu: 1000m
+    memory: 2Gi
+  requests:
+    cpu: 500m
+    memory: 1Gi
+</pre>
+<br />
+<span>**persistent-volumes.yaml** - Storage configuration:</span><br />
+<br />
+<pre>
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+  name: tempo-data-pv
+spec:
+  capacity:
+    storage: 10Gi
+  accessModes:
+    - ReadWriteOnce
+  persistentVolumeReclaimPolicy: Retain
+  hostPath:
+    path: /data/nfs/k3svolumes/tempo/data
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: tempo-data-pvc
+  namespace: monitoring
+spec:
+  storageClassName: ""
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 10Gi
+</pre>
+<br />
+<span>**Grafana Datasource Provisioning**</span><br />
+<br />
+<span>All Grafana datasources (Prometheus, Alertmanager, Loki, Tempo) are provisioned via a unified ConfigMap that is directly mounted to the Grafana pod. This approach ensures datasources are loaded on startup without requiring sidecar-based discovery.</span><br />
+<br />
+<span>In /home/paul/git/conf/f3s/prometheus/grafana-datasources-all.yaml:</span><br />
+<br />
+<pre>
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: grafana-datasources-all
+  namespace: monitoring
+data:
+  datasources.yaml: |
+    apiVersion: 1
+    datasources:
+      - name: Prometheus
+        type: prometheus
+        uid: prometheus
+        url: http://prometheus-kube-prometheus-prometheus.monitoring:9090/
+        access: proxy
+        isDefault: true
+      - name: Alertmanager
+        type: alertmanager
+        uid: alertmanager
+        url: http://prometheus-kube-prometheus-alertmanager.monitoring:9093/
+      - name: Loki
+        type: loki
+        uid: loki
+        url: http://loki.monitoring.svc.cluster.local:3100
+      - name: Tempo
+        type: tempo
+        uid: tempo
+        url: http://tempo.monitoring.svc.cluster.local:3200
+        jsonData:
+          tracesToLogsV2:
+            datasourceUid: loki
+            spanStartTimeShift: -1h
+            spanEndTimeShift: 1h
+          tracesToMetrics:
+            datasourceUid: prometheus
+          serviceMap:
+            datasourceUid: prometheus
+          nodeGraph:
+            enabled: true
+</pre>
+<br />
+<span>The kube-prometheus-stack Helm values (persistence-values.yaml) are configured to:</span><br />
+<ul>
+<li>Disable sidecar-based datasource provisioning</li>
+<li>Mount grafana-datasources-all ConfigMap directly to /etc/grafana/provisioning/datasources/</li>
+</ul><br />
+<span>This direct mounting approach is simpler and more reliable than sidecar-based discovery.</span><br />
+<br />
+<span>#### Installation</span><br />
+<br />
+<pre>
+cd /home/paul/git/conf/f3s/tempo
+just install
+</pre>
+<br />
+<span>Verify Tempo is running:</span><br />
+<br />
+<pre>
+kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo
+kubectl exec -n monitoring &lt;tempo-pod&gt; -- wget -qO- http://localhost:3200/ready
+</pre>
+<br />
+<h3 style='display: inline' id='configuring-grafana-alloy-for-trace-collection'>Configuring Grafana Alloy for Trace Collection</h3><br />
+<br />
+<span>I updated /home/paul/git/conf/f3s/loki/alloy-values.yaml to add OTLP receivers for traces while keeping the existing log collection.</span><br />
+<br />
+<span>#### OTLP Receiver Configuration</span><br />
+<br />
+<span>The following blocks were added to the Alloy configuration, after the log collection pipeline:</span><br />
+<br />
+<pre>
+// OTLP receiver for traces via gRPC and HTTP
+otelcol.receiver.otlp "default" {
+  grpc {
+    endpoint = "0.0.0.0:4317"
+  }
+  http {
+    endpoint = "0.0.0.0:4318"
+  }
+  output {
+    traces = [otelcol.processor.batch.default.input]
+  }
+}
+
+// Batch processor for efficient trace forwarding
+otelcol.processor.batch "default" {
+  timeout = "5s"
+  send_batch_size = 100
+  send_batch_max_size = 200
+  output {
+    traces = [otelcol.exporter.otlp.tempo.input]
+  }
+}
+
+// OTLP exporter to send traces to Tempo
+otelcol.exporter.otlp "tempo" {
+  client {
+    endpoint = "tempo.monitoring.svc.cluster.local:4317"
+    tls {
+      insecure = true
+    }
+    compression = "gzip"
+  }
+}
+</pre>
+<br />
+<span>The batch processor reduces network overhead by accumulating spans before forwarding to Tempo.</span><br />
+<br />
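+<span>The effect of batching can be sketched in a few lines of Python. This is only an illustrative model, not how Alloy implements it: the helper <span class='inlinecode'>count_exports</span> is hypothetical, and it models just the size trigger (<span class='inlinecode'>send_batch_size</span>), ignoring the 5s timeout.</span><br />
+<br />
```python
import math


def count_exports(total_spans, batch_size):
    """Number of export calls needed when spans are sent in batches.

    Hypothetical helper: models only the size trigger (send_batch_size),
    not the timeout, so it assumes a steady stream of spans.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(total_spans / batch_size)


# 50 load-test requests with 8 spans each, as in the demo trace in this post
spans = 50 * 8
print(count_exports(spans, 1))    # one export call per span
print(count_exports(spans, 100))  # send_batch_size = 100
```
+<br />
+<span>Under these assumptions, the 400 spans of a 50-request load test need 4 export calls instead of 400.</span><br />
+<br />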
+<span>#### Upgrade Alloy</span><br />
+<br />
+<pre>
+cd /home/paul/git/conf/f3s/loki
+just upgrade
+</pre>
+<br />
+<span>Verify OTLP receivers are listening:</span><br />
+<br />
+<pre>
+kubectl logs -n monitoring -l app.kubernetes.io/name=alloy | grep -i "otlp.*receiver"
+kubectl exec -n monitoring &lt;alloy-pod&gt; -- netstat -ln | grep -E &#39;:(4317|4318)&#39;
+</pre>
+<br />
+<h3 style='display: inline' id='demo-tracing-application'>Demo Tracing Application</h3><br />
+<br />
+<span>I created a three-tier Python application to demonstrate distributed tracing in action.</span><br />
+<br />
+<span>#### Application Architecture</span><br />
+<br />
+<pre>
+User → Frontend (Flask:5000) → Middleware (Flask:5001) → Backend (Flask:5002)
+                 ↓                        ↓                        ↓
+                         Alloy (OTLP:4317) → Tempo → Grafana
+</pre>
+<br />
+<span>Frontend Service:</span><br />
+<br />
+<ul>
+<li>Receives HTTP requests at /api/process</li>
+<li>Forwards to middleware service</li>
+<li>Creates parent span for the entire request</li>
+</ul><br />
+<span>Middleware Service:</span><br />
+<br />
+<ul>
+<li>Transforms data at /api/transform</li>
+<li>Calls backend service</li>
+<li>Creates child span linked to frontend</li>
+</ul><br />
+<span>Backend Service:</span><br />
+<br />
+<ul>
+<li>Returns data at /api/data</li>
+<li>Simulates database query (100ms sleep)</li>
+<li>Creates leaf span in the trace</li>
+</ul><br />
+<span>OpenTelemetry Instrumentation:</span><br />
+<br />
+<span>All services use Python OpenTelemetry libraries:</span><br />
+<br />
+<span>**Dependencies:**</span><br />
+<pre>
+flask==3.0.0
+requests==2.31.0
+opentelemetry-distro==0.49b0
+opentelemetry-exporter-otlp==1.28.0
+opentelemetry-instrumentation-flask==0.49b0
+opentelemetry-instrumentation-requests==0.49b0
+</pre>
+<br />
+<span>**Auto-instrumentation pattern** (used in all services):</span><br />
+<br />
+<!-- Generator: GNU source-highlight 3.1.9
+by Lorenzo Bettini
+http://www.lorenzobettini.it
+http://www.gnu.org/software/src-highlite -->
+<pre><font color="#ababab">from</font><font color="#ff0000"> opentelemetry </font><font color="#ababab">import</font><font color="#ff0000"> trace</font>
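+<font color="#ababab">from</font><font color="#ff0000"> opentelemetry</font><font color="#F3E651">.</font><font color="#ff0000">sdk</font><font color="#F3E651">.</font><font color="#ff0000">trace </font><font color="#ababab">import</font><font color="#ff0000"> TracerProvider</font>
+<font color="#ababab">from</font><font color="#ff0000"> opentelemetry</font><font color="#F3E651">.</font><font color="#ff0000">sdk</font><font color="#F3E651">.</font><font color="#ff0000">trace</font><font color="#F3E651">.</font><font color="#ff0000">export </font><font color="#ababab">import</font><font color="#ff0000"> BatchSpanProcessor</font>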
+<font color="#ababab">from</font><font color="#ff0000"> opentelemetry</font><font color="#F3E651">.</font><font color="#ff0000">exporter</font><font color="#F3E651">.</font><font color="#ff0000">otlp</font><font color="#F3E651">.</font><font color="#ff0000">proto</font><font color="#F3E651">.</font><font color="#ff0000">grpc</font><font color="#F3E651">.</font><font color="#ff0000">trace_exporter </font><font color="#ababab">import</font><font color="#ff0000"> OTLPSpanExporter</font>
+<font color="#ababab">from</font><font color="#ff0000"> opentelemetry</font><font color="#F3E651">.</font><font color="#ff0000">instrumentation</font><font color="#F3E651">.</font><font color="#ff0000">flask </font><font color="#ababab">import</font><font color="#ff0000"> FlaskInstrumentor</font>
+<font color="#ababab">from</font><font color="#ff0000"> opentelemetry</font><font color="#F3E651">.</font><font color="#ff0000">instrumentation</font><font color="#F3E651">.</font><font color="#ff0000">requests </font><font color="#ababab">import</font><font color="#ff0000"> RequestsInstrumentor</font>
+<font color="#ababab">from</font><font color="#ff0000"> opentelemetry</font><font color="#F3E651">.</font><font color="#ff0000">sdk</font><font color="#F3E651">.</font><font color="#ff0000">resources </font><font color="#ababab">import</font><font color="#ff0000"> Resource</font>
+
+<i><font color="#ababab"># Define service identity</font></i>
+<font color="#ff0000">resource </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#7bc710">Resource</font><font color="#F3E651">(</font><font color="#ff0000">attributes</font><font color="#F3E651">={</font>
+<font color="#ff0000">    </font><font color="#bb00ff">"service.name"</font><font color="#F3E651">:</font><font color="#ff0000"> </font><font color="#bb00ff">"frontend"</font><font color="#F3E651">,</font>
+<font color="#ff0000">    </font><font color="#bb00ff">"service.namespace"</font><font color="#F3E651">:</font><font color="#ff0000"> </font><font color="#bb00ff">"tracing-demo"</font><font color="#F3E651">,</font>
+<font color="#ff0000">    </font><font color="#bb00ff">"service.version"</font><font color="#F3E651">:</font><font color="#ff0000"> </font><font color="#bb00ff">"1.0.0"</font>
+<font color="#F3E651">})</font>
+
+<font color="#ff0000">provider </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#7bc710">TracerProvider</font><font color="#F3E651">(</font><font color="#ff0000">resource</font><font color="#F3E651">=</font><font color="#ff0000">resource</font><font color="#F3E651">)</font>
+
+<i><font color="#ababab"># Export to Alloy</font></i>
+<font color="#ff0000">otlp_exporter </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#7bc710">OTLPSpanExporter</font><font color="#F3E651">(</font>
+<font color="#ff0000">    endpoint</font><font color="#F3E651">=</font><font color="#bb00ff">"http://alloy.monitoring.svc.cluster.local:4317"</font><font color="#F3E651">,</font>
+<font color="#ff0000">    insecure</font><font color="#F3E651">=</font><font color="#ff0000">True</font>
+<font color="#F3E651">)</font>
+
+<font color="#ff0000">processor </font><font color="#F3E651">=</font><font color="#ff0000"> </font><font color="#7bc710">BatchSpanProcessor</font><font color="#F3E651">(</font><font color="#ff0000">otlp_exporter</font><font color="#F3E651">)</font>
+<font color="#ff0000">provider</font><font color="#F3E651">.</font><font color="#7bc710">add_span_processor</font><font color="#F3E651">(</font><font color="#ff0000">processor</font><font color="#F3E651">)</font>
+<font color="#ff0000">trace</font><font color="#F3E651">.</font><font color="#7bc710">set_tracer_provider</font><font color="#F3E651">(</font><font color="#ff0000">provider</font><font color="#F3E651">)</font>
+
+<i><font color="#ababab"># Auto-instrument Flask and requests</font></i>
+<font color="#7bc710">FlaskInstrumentor</font><font color="#F3E651">().</font><font color="#7bc710">instrument_app</font><font color="#F3E651">(</font><font color="#ff0000">app</font><font color="#F3E651">)</font>
+<font color="#7bc710">RequestsInstrumentor</font><font color="#F3E651">().</font><font color="#7bc710">instrument</font><font color="#F3E651">()</font>
+</pre>
+<br />
+<span>The auto-instrumentation automatically:</span><br />
+<ul>
+<li>Creates spans for HTTP requests</li>
+<li>Propagates trace context via W3C Trace Context headers</li>
+<li>Links parent and child spans across service boundaries</li>
+</ul><br />
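+<span>To make the propagation step more concrete, here is a minimal sketch of the W3C <span class='inlinecode'>traceparent</span> header that carries the trace context between services. The helper names are hypothetical and the span ID is a made-up example value; the real OpenTelemetry propagator handles more cases (for example <span class='inlinecode'>tracestate</span> and rejecting all-zero IDs).</span><br />
+<br />
```python
import re

# Version "00", 32 hex chars trace ID, 16 hex chars parent span ID, 2 hex chars flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id, span_id, sampled=True):
    """Format a version-00 traceparent header value (hypothetical helper)."""
    flags = "01" if sampled else "00"
    return "00-{}-{}-{}".format(trace_id, span_id, flags)


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if malformed."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # The lowest bit of the flags byte is the "sampled" flag
    return trace_id, span_id, int(flags, 16) % 2 == 1


# Trace ID taken from the example trace in this post; the span ID is made up
header = build_traceparent("4be1151c0bdcd5625ac7e02b98d95bd5", "00f067aa0ba902b7")
print(header)
print(parse_traceparent(header))
```
+<br />
+<span>Every service that receives this header reuses the trace ID and records the span ID as the parent of its own span, which is how the spans end up in one trace.</span><br />
+<br />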
+<span>Deployment:</span><br />
+<br />
+<span>I created a Helm chart in /home/paul/git/conf/f3s/tracing-demo/ with three separate deployments, services, and an ingress.</span><br />
+<br />
+<span>Build and deploy:</span><br />
+<br />
+<pre>
+cd /home/paul/git/conf/f3s/tracing-demo
+just build
+just import
+just install
+</pre>
+<br />
+<span>Verify deployment:</span><br />
+<br />
+<pre>
+kubectl get pods -n services | grep tracing-demo
+kubectl get ingress -n services tracing-demo-ingress
+</pre>
+<br />
+<span>Access the application at:</span><br />
+<br />
+<a class='textlink' href='http://tracing-demo.f3s.buetow.org'>http://tracing-demo.f3s.buetow.org</a><br />
+<br />
+<h3 style='display: inline' id='visualizing-traces-in-grafana'>Visualizing Traces in Grafana</h3><br />
+<br />
+<span>The Tempo datasource is provisioned when Grafana starts, from the directly mounted grafana-datasources-all ConfigMap described above.</span><br />
+<br />
+<span>#### Accessing Traces</span><br />
+<br />
+<span>Navigate to Grafana → Explore → Select "Tempo" datasource</span><br />
+<br />
+<span>**Search Interface:**</span><br />
+<ul>
+<li>Search by Trace ID</li>
+<li>Search by service name</li>
+<li>Search by tags</li>
+</ul><br />
+<span>**TraceQL Queries:**</span><br />
+<br />
+<span>Find all traces from demo app:</span><br />
+<pre>
+{ resource.service.namespace = "tracing-demo" }
+</pre>
+<br />
+<span>Find slow requests (&gt;200ms):</span><br />
+<pre>
+{ duration &gt; 200ms }
+</pre>
+<br />
+<span>Find traces from specific service:</span><br />
+<pre>
+{ resource.service.name = "frontend" }
+</pre>
+<br />
+<span>Find errors:</span><br />
+<pre>
+{ status = error }
+</pre>
+<br />
+<span>Combined query - demo traces that also contain a span with an HTTP 5xx status code:</span><br />
+<pre>
+{ resource.service.namespace = "tracing-demo" } &amp;&amp; { span.http.status_code &gt;= 500 }
+</pre>
+<br />
+<span>#### Service Graph Visualization</span><br />
+<br />
+<span>The service graph shows visual connections between services:</span><br />
+<br />
+<span>1. Navigate to Explore → Tempo</span><br />
+<span>2. Enable "Service Graph" view</span><br />
+<span>3. Shows: Frontend → Middleware → Backend with request rates</span><br />
+<br />
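+<span>The service graph is built from Prometheus metrics that Tempo&#39;s metrics generator derives from trace data, so the metrics generator must be enabled for this view to show data.</span><br />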
+<br />
+<h3 style='display: inline' id='correlation-between-observability-signals'>Correlation Between Observability Signals</h3><br />
+<br />
+<span>Tempo integrates with Loki and Prometheus to provide unified observability.</span><br />
+<br />
+<span>#### Traces-to-Logs</span><br />
+<br />
+<span>Click on any span in a trace to see related logs:</span><br />
+<br />
+<span>1. View trace in Grafana</span><br />
+<span>2. Click on a span</span><br />
+<span>3. Select "Logs for this span"</span><br />
+<span>4. Loki shows logs filtered by:</span><br />
+<span> * Time range (span duration ± 1 hour)</span><br />
+<span> * Service name</span><br />
+<span> * Namespace</span><br />
+<span> * Pod</span><br />
+<br />
+<span>This helps correlate what the service was doing when the span was created.</span><br />
+<br />
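+<span>The time range of the log query follows from the <span class='inlinecode'>spanStartTimeShift</span> and <span class='inlinecode'>spanEndTimeShift</span> values in the datasource configuration. A small sketch of that arithmetic, assuming a made-up span start time and a hypothetical helper name:</span><br />
+<br />
```python
from datetime import datetime, timedelta

# Shifts as configured in the Tempo datasource (tracesToLogsV2)
SPAN_START_SHIFT = timedelta(hours=-1)
SPAN_END_SHIFT = timedelta(hours=1)


def log_query_window(span_start, span_end):
    """Return the (start, end) time range for the log query (hypothetical helper)."""
    return span_start + SPAN_START_SHIFT, span_end + SPAN_END_SHIFT


# Example span: made-up start time, 221 ms long
start = datetime(2025, 12, 28, 18, 35, 1)
end = start + timedelta(milliseconds=221)
window_start, window_end = log_query_window(start, end)
print(window_start.isoformat(), window_end.isoformat())
```
+<br />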
+<span>#### Traces-to-Metrics</span><br />
+<br />
+<span>View Prometheus metrics for services in the trace:</span><br />
+<br />
+<span>1. View trace in Grafana</span><br />
+<span>2. Select "Metrics" tab</span><br />
+<span>3. Shows metrics like:</span><br />
+<span> * Request rate</span><br />
+<span> * Error rate</span><br />
+<span> * Duration percentiles</span><br />
+<br />
+<span>#### Logs-to-Traces</span><br />
+<br />
+<span>From logs, you can jump to related traces:</span><br />
+<br />
+<span>1. In Loki, logs that contain trace IDs are automatically linked</span><br />
+<span>2. Click the trace ID to view the full trace</span><br />
+<span>3. See the complete request flow</span><br />
+<br />
+<h3 style='display: inline' id='generating-traces-for-testing'>Generating Traces for Testing</h3><br />
+<br />
+<span>Test the demo application:</span><br />
+<br />
+<pre>
+curl http://tracing-demo.f3s.buetow.org/api/process
+</pre>
+<br />
+<span>Load test (generates 50 traces):</span><br />
+<br />
+<pre>
+cd /home/paul/git/conf/f3s/tracing-demo
+just load-test
+</pre>
+<br />
+<span>Each request creates a distributed trace spanning all three services.</span><br />
+<br />
+<h3 style='display: inline' id='verifying-the-complete-pipeline'>Verifying the Complete Pipeline</h3><br />
+<br />
+<span>Check the trace flow end-to-end:</span><br />
+<br />
+<span>**1. Application generates traces:**</span><br />
+<pre>
+kubectl logs -n services -l app=tracing-demo-frontend | grep -i trace
+</pre>
+<br />
+<span>**2. Alloy receives traces:**</span><br />
+<pre>
+kubectl logs -n monitoring -l app.kubernetes.io/name=alloy | grep -i otlp
+</pre>
+<br />
+<span>**3. Tempo stores traces:**</span><br />
+<pre>
+kubectl logs -n monitoring -l app.kubernetes.io/name=tempo | grep -i trace
+</pre>
+<br />
+<span>**4. Grafana displays traces:**</span><br />
+<span>Navigate to Explore → Tempo → Search for traces</span><br />
+<br />
+<h3 style='display: inline' id='practical-example-viewing-a-distributed-trace'>Practical Example: Viewing a Distributed Trace</h3><br />
+<br />
+<span>Let&#39;s generate a trace and examine it in Grafana.</span><br />
+<br />
+<span>**1. Generate a trace by calling the demo application:**</span><br />
+<br />
+<pre>
+curl -H "Host: tracing-demo.f3s.buetow.org" http://r0/api/process
+</pre>
+<br />
+<span>**Response (HTTP 200):**</span><br />
+<br />
+<!-- Generator: GNU source-highlight 3.1.9
+by Lorenzo Bettini
+http://www.lorenzobettini.it
+http://www.gnu.org/software/src-highlite -->
+<pre><font color="#F3E651">{</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">middleware_response</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#F3E651">{</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">backend_data</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#F3E651">{</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">data</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#F3E651">{</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">id</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#bb00ff">12345</font><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">query_time_ms</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#bb00ff">100.0</font><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">timestamp</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">2025-12-28T18:35:01.064538</font><font color="#ff0000">"</font><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">value</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">Sample data from backend service</font><font color="#ff0000">"</font>
+<font color="#ff0000"> </font><font color="#F3E651">},</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">service</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">backend</font><font color="#ff0000">"</font>
+<font color="#ff0000"> </font><font color="#F3E651">},</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">middleware_processed</font><font color="#ff0000">"</font><font color="#ff0000">: </font><b><font color="#ffffff">true</font></b><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">original_data</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#F3E651">{</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">source</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">GET request</font><font color="#ff0000">"</font>
+<font color="#ff0000"> </font><font color="#F3E651">},</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">transformation_time_ms</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#bb00ff">50</font>
+<font color="#ff0000"> </font><font color="#F3E651">},</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">request_data</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#F3E651">{</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">source</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">GET request</font><font color="#ff0000">"</font>
+<font color="#ff0000"> </font><font color="#F3E651">},</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">service</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">frontend</font><font color="#ff0000">"</font><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">status</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">success</font><font color="#ff0000">"</font>
+<font color="#F3E651">}</font>
+</pre>
+<br />
+<span>**2. Find the trace in Tempo via API:**</span><br />
+<br />
+<span>After a few seconds (for batch export), search for recent traces:</span><br />
+<br />
+<pre>
+kubectl exec -n monitoring tempo-0 -- wget -qO- \
+ &#39;http://localhost:3200/api/search?tags=service.namespace%3Dtracing-demo&amp;limit=5&#39; 2&gt;/dev/null | \
+ python3 -m json.tool
+</pre>
+<br />
+<span>Returns traces including:</span><br />
+<br />
+<!-- Generator: GNU source-highlight 3.1.9
+by Lorenzo Bettini
+http://www.lorenzobettini.it
+http://www.gnu.org/software/src-highlite -->
+<pre><font color="#F3E651">{</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">traceID</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">4be1151c0bdcd5625ac7e02b98d95bd5</font><font color="#ff0000">"</font><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">rootServiceName</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">frontend</font><font color="#ff0000">"</font><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">rootTraceName</font><font color="#ff0000">"</font><font color="#ff0000">:</font><font color="#ff0000"> "</font><font color="#bb00ff">GET /api/process</font><font color="#ff0000">"</font><font color="#F3E651">,</font>
+<font color="#ff0000"> </font><font color="#ff0000">"</font><font color="#ff0000">durationMs</font><font color="#ff0000">"</font><font color="#ff0000">: </font><font color="#bb00ff">221</font>
+<font color="#F3E651">}</font>
+</pre>
+<br />
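+<span>The percent-encoded query string in the search command from step 2 does not have to be written by hand. A minimal sketch using the Python standard library; the helper name <span class='inlinecode'>tempo_search_url</span> is hypothetical and only covers a single tag:</span><br />
+<br />
```python
from urllib.parse import urlencode


def tempo_search_url(base_url, tag, value, limit=5):
    """Build a Tempo tag-search URL (hypothetical helper around /api/search)."""
    # urlencode percent-encodes the "=" inside the tags value as %3D
    query = urlencode({"tags": "{}={}".format(tag, value), "limit": limit})
    return "{}/api/search?{}".format(base_url, query)


print(tempo_search_url("http://localhost:3200", "service.namespace", "tracing-demo"))
```
+<br />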
+<span>**3. Fetch complete trace details:**</span><br />
+<br />
+<pre>
+kubectl exec -n monitoring tempo-0 -- wget -qO- \
+ &#39;http://localhost:3200/api/traces/4be1151c0bdcd5625ac7e02b98d95bd5&#39; 2&gt;/dev/null | \
+ python3 -m json.tool
+</pre>
+<br />
+<span>**Trace structure (8 spans across 3 services):**</span><br />
+<br />
+<pre>
+Trace ID: 4be1151c0bdcd5625ac7e02b98d95bd5
+Services: 3 (frontend, middleware, backend)
+
+Service: frontend
+  └─ GET /api/process 221.10ms (HTTP server span)
+     └─ frontend-process 216.23ms (custom business logic span)
+        └─ POST 209.97ms (HTTP client span to middleware)
+
+Service: middleware
+  └─ POST /api/transform 186.02ms (HTTP server span)
+     └─ middleware-transform 180.96ms (custom business logic span)
+        └─ GET 127.52ms (HTTP client span to backend)
+
+Service: backend
+  └─ GET /api/data 103.93ms (HTTP server span)
+     └─ backend-get-data 102.11ms (custom business logic span with 100ms sleep)
+</pre>
+<br />
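+<span>Because each span in this trace has exactly one child, the time spent in a span itself is simply its duration minus the duration of its child. The following sketch does that subtraction for the numbers above; it assumes this strictly linear chain and would not work for spans with several children.</span><br />
+<br />
```python
# (service, span name, duration in ms), ordered from root span to leaf span,
# copied from the trace structure shown above
CHAIN = [
    ("frontend", "GET /api/process", 221.10),
    ("frontend", "frontend-process", 216.23),
    ("frontend", "POST", 209.97),
    ("middleware", "POST /api/transform", 186.02),
    ("middleware", "middleware-transform", 180.96),
    ("middleware", "GET", 127.52),
    ("backend", "GET /api/data", 103.93),
    ("backend", "backend-get-data", 102.11),
]


def self_times(chain):
    """Time spent in each span itself, excluding its single child span."""
    # The leaf span has no child, so nothing is subtracted from it
    child_durations = [span[2] for span in chain[1:]] + [0.0]
    return [
        (service, name, round(duration - child, 2))
        for (service, name, duration), child in zip(chain, child_durations)
    ]


for service, name, own_ms in self_times(CHAIN):
    print(service, name, own_ms)
```
+<br />
+<span>The two largest values are the backend sleep (about 102ms) and the middleware transformation (about 53ms, which fits the reported transformation_time_ms of 50). The roughly 24ms between each HTTP client span and the matching server span is time spent outside the application code, such as network transfer and request handling.</span><br />
+<br />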
+<span>**4. View the trace in Grafana UI:**</span><br />
+<br />
+<span>Navigate to: Grafana → Explore → Tempo datasource</span><br />
+<br />
+<span>Search using TraceQL:</span><br />
+<pre>
+{ resource.service.namespace = "tracing-demo" }
+</pre>
+<br />
+<span>Or directly open the trace by pasting the trace ID in the search box:</span><br />
+<pre>
+4be1151c0bdcd5625ac7e02b98d95bd5
+</pre>
+<br />
+<span>**5. Trace visualization:**</span><br />
+<br />
+<span>The trace waterfall view in Grafana shows the complete request flow with timing:</span><br />
+<br />
+<a href='./f3s-kubernetes-with-freebsd-part-8/grafana-tempo-trace.png'><img alt='Distributed trace visualization in Grafana Tempo showing Frontend → Middleware → Backend spans' title='Distributed trace visualization in Grafana Tempo showing Frontend → Middleware → Backend spans' src='./f3s-kubernetes-with-freebsd-part-8/grafana-tempo-trace.png' /></a><br />
+<br />
+<span>For additional examples of Tempo trace visualization, see also:</span><br />
+<br />
+<a class='textlink' href='https://foo.zone/gemfeed/2025-12-24-x-rag-observability-hackathon.html'>X-RAG Observability Hackathon (more Grafana Tempo screenshots)</a><br />
+<br />
+<span>The trace reveals the distributed request flow:</span><br />
+<br />
+<ul>
+<li>Frontend (221ms): Receives GET /api/process, executes business logic, calls middleware</li>
+<li>Middleware (186ms): Receives POST /api/transform, transforms data, calls backend</li>
+<li>Backend (104ms): Receives GET /api/data, simulates database query with 100ms sleep</li>
+<li>Total request time: 221ms end-to-end</li>
+<li>Span propagation: W3C Trace Context headers automatically link all spans</li>
+</ul><br />
+<span>**6. Service graph visualization:**</span><br />
+<br />
+<span>The service graph is automatically generated from traces and shows service dependencies. For examples of service graph visualization in Grafana, see the screenshots in the X-RAG Observability Hackathon blog post.</span><br />
+<br />
+<a class='textlink' href='./2025-12-24-x-rag-observability-hackathon.html'>X-RAG Observability Hackathon (includes service graph screenshots)</a><br />
+<br />
+<span>This visualization helps identify:</span><br />
+<br />
+<ul>
+<li>Request rates between services</li>
+<li>Average latency for each hop</li>
+<li>Error rates (if any)</li>
+<li>Service dependencies and communication patterns</li>
+</ul><br />
+<h3 style='display: inline' id='storage-and-retention'>Storage and Retention</h3><br />
+<br />
+<span>Monitor Tempo storage usage:</span><br />
+<br />
+<pre>
+kubectl exec -n monitoring &lt;tempo-pod&gt; -- df -h /var/tempo
+</pre>
+<br />
+<span>With 10Gi storage and 7-day retention, the system handles moderate trace volumes. If storage fills up:</span><br />
+<br />
+<ul>
+<li>Reduce retention to 72h (3 days)</li>
+<li>Implement sampling in Alloy</li>
+<li>Increase PV size</li>
+</ul><br />
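+<span>A rough way to judge whether the volume is large enough is to compute the average write rate that would fill it exactly once per retention period. This is a back-of-the-envelope sketch with a hypothetical helper name; it ignores compression, the WAL, and compaction overhead, so treat the result as an order of magnitude only.</span><br />
+<br />
```python
GIB = 1024 ** 3


def max_ingest_bytes_per_second(capacity_gib, retention_hours):
    """Average write rate that fills the volume once per retention period."""
    if retention_hours <= 0:
        raise ValueError("retention_hours must be positive")
    return capacity_gib * GIB / (retention_hours * 3600)


# 10Gi volume with the current 168h retention and the reduced 72h retention
for hours in (168, 72):
    rate = max_ingest_bytes_per_second(10, hours)
    print("{}h retention: about {:.1f} KiB/s".format(hours, rate / 1024))
```
+<br />
+<span>If the sustained trace volume stays well below that rate, the current settings are fine; otherwise one of the options above applies.</span><br />
+<br />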
+<h3 style='display: inline' id='configuration-files'>Configuration Files</h3><br />
+<br />
+<span>All configuration files are available on Codeberg:</span><br />
+<br />
+<a class='textlink' href='https://codeberg.org/snonux/conf/src/branch/master/f3s/tempo'>Tempo configuration</a><br />
+<a class='textlink' href='https://codeberg.org/snonux/conf/src/branch/master/f3s/loki'>Alloy configuration (updated for traces)</a><br />
+<a class='textlink' href='https://codeberg.org/snonux/conf/src/branch/master/f3s/tracing-demo'>Demo tracing application</a><br />
+<br />
<h2 style='display: inline' id='summary'>Summary</h2><br />
<br />
-<span>With Prometheus, Grafana, Loki, and Alloy deployed, I now have complete visibility into the k3s cluster, the FreeBSD storage servers, and the OpenBSD edge relays:</span><br />
+<span>With Prometheus, Grafana, Loki, Alloy, and Tempo deployed, I now have complete visibility into the k3s cluster, the FreeBSD storage servers, and the OpenBSD edge relays:</span><br />
<br />
<ul>
-<li>metrics: Prometheus collects and stores time-series data from all components</li>
+<li>Metrics: Prometheus collects and stores time-series data from all components, including etcd and ZFS</li>
<li>Logs: Loki aggregates logs from all containers, searchable via Grafana</li>
-<li>Visualisation: Grafana provides dashboards and exploration tools</li>
+<li>Traces: Tempo provides distributed request tracing with service dependency mapping</li>
+<li>Visualisation: Grafana provides dashboards and exploration tools with correlation between all three signals</li>
<li>Alerting: Alertmanager can notify on conditions defined in Prometheus rules</li>
</ul><br />
<span>This observability stack runs entirely on the home lab infrastructure, with data persisted to the NFS share. It&#39;s lightweight enough for a three-node cluster but provides the same capabilities as production-grade setups.</span><br />
<br />
+<span>All configuration files are available on Codeberg:</span><br />
+<br />
+<a class='textlink' href='https://codeberg.org/snonux/conf/src/branch/master/f3s/prometheus'>Prometheus, Grafana, and recording rules configuration</a><br />
+<a class='textlink' href='https://codeberg.org/snonux/conf/src/branch/master/f3s/loki'>Loki and Alloy configuration</a><br />
+<a class='textlink' href='https://codeberg.org/snonux/conf/src/branch/master/f3s/tempo'>Tempo configuration</a><br />
+<a class='textlink' href='https://codeberg.org/snonux/conf/src/branch/master/f3s/tracing-demo'>Demo tracing application</a><br />
+<br />
<span>Other *BSD-related posts:</span><br />
<br />
<a class='textlink' href='./2025-12-07-f3s-kubernetes-with-freebsd-part-8.html'>2025-12-07 f3s: Kubernetes with FreeBSD - Part 8: Observability (You are currently reading this)</a><br />