| author | Paul Buetow <paul@buetow.org> | 2025-12-28 16:33:29 +0200 |
|---|---|---|
| committer | Paul Buetow <paul@buetow.org> | 2025-12-28 16:38:41 +0200 |
| commit | e542041d7f2da8bbfc76d4a2cadd693bcf2b8f49 (patch) | |
| tree | 9c8e5fed1079ab122bb86d6ae3e4d9d80a91b16f | |
| parent | 61652fd1a49cdebda4894de41b7333b2b572ac6b (diff) | |
Add draft: Distributed tracing with Grafana Tempo and Alloy
This blog post draft documents the integration of Grafana Tempo into the
f3s Kubernetes cluster's observability stack. It covers:
- Deploying Grafana Tempo in monolithic mode with OTLP receivers
- Configuring Grafana Alloy to collect and forward traces to Tempo
- Creating a three-tier Python demo application (Frontend → Middleware → Backend)
with OpenTelemetry instrumentation
- Correlating traces with logs (Loki) and metrics (Prometheus) in Grafana
- Using TraceQL to query and explore distributed traces
- Service graph visualization for understanding microservice dependencies
Part of the f3s FreeBSD + Kubernetes observability series.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
| -rw-r--r-- | gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-X-OBSERVABILITY2.gmi.tpl | 791 |
1 file changed, 787 insertions, 4 deletions
diff --git a/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-X-OBSERVABILITY2.gmi.tpl b/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-X-OBSERVABILITY2.gmi.tpl
index 4968211f..612e0fa5 100644
--- a/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-X-OBSERVABILITY2.gmi.tpl
+++ b/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-X-OBSERVABILITY2.gmi.tpl
@@ -119,13 +119,796 @@ grafana:
     runAsGroup: 911
 ```
+## ZFS Monitoring for FreeBSD Servers
+
+The FreeBSD servers (f0, f1, f2) that provide NFS storage to the k3s cluster have ZFS filesystems. Monitoring ZFS is crucial for understanding storage performance and cache efficiency.
+
+### Node Exporter ZFS Collector
+
+The node_exporter (v1.9.1) running on each FreeBSD server includes a built-in ZFS collector that exposes metrics via sysctls. The ZFS collector is enabled by default and provides:
+
+* ARC (Adaptive Replacement Cache) statistics
+* Cache hit/miss rates
+* Memory usage and allocation
+* MRU/MFU cache breakdown
+* Data vs metadata distribution
+
+### Verifying ZFS Metrics
+
+On any FreeBSD server, check that ZFS metrics are being exposed:
+
+```
+paul@f0:~ % curl -s http://localhost:9100/metrics | grep node_zfs_arcstats | wc -l
+      69
+```
+
+The metrics are automatically scraped by Prometheus through the existing static configuration in additional-scrape-configs.yaml, which targets all FreeBSD servers on port 9100 with the os: freebsd label.
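As an editorial aside: the `curl | grep | wc -l` check above works on the Prometheus exposition format, which is simple enough to parse by hand. A minimal Python sketch (the sample text is made up for illustration, not real output from f0):

```python
# Sketch: parse Prometheus exposition-format text and pick out the
# node_zfs_arcstats_* samples, mirroring the `grep | wc -l` check.

def parse_metrics(text):
    """Return {metric_name_with_labels: value} for non-comment lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name_and_labels, _, value = line.rpartition(" ")
        samples[name_and_labels] = float(value)
    return samples

# Hypothetical sample, shaped like node_exporter output
example = """\
# HELP node_zfs_arcstats_hits ZFS ARC hits
# TYPE node_zfs_arcstats_hits untyped
node_zfs_arcstats_hits 1.2345e+07
node_zfs_arcstats_misses 54321
node_memory_size_bytes 3.4e+10
"""

arc = {k: v for k, v in parse_metrics(example).items()
       if k.startswith("node_zfs_arcstats")}
print(len(arc))  # number of ARC series in the sample
```

The real endpoint exposes 69 such series per host; the same filtering idea underlies the recording rules and dashboards below.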
+
+### ZFS Recording Rules
+
+Created recording rules for easier dashboard consumption in zfs-recording-rules.yaml:
+
+```
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: freebsd-zfs-rules
+  namespace: monitoring
+  labels:
+    release: prometheus
+spec:
+  groups:
+  - name: freebsd-zfs-arc
+    interval: 30s
+    rules:
+    - record: node_zfs_arc_hit_rate_percent
+      expr: |
+        100 * (
+          rate(node_zfs_arcstats_hits_total{os="freebsd"}[5m]) /
+          (rate(node_zfs_arcstats_hits_total{os="freebsd"}[5m]) +
+           rate(node_zfs_arcstats_misses_total{os="freebsd"}[5m]))
+        )
+      labels:
+        os: freebsd
+    - record: node_zfs_arc_memory_usage_percent
+      expr: |
+        100 * (
+          node_zfs_arcstats_size_bytes{os="freebsd"} /
+          node_zfs_arcstats_c_max_bytes{os="freebsd"}
+        )
+      labels:
+        os: freebsd
+    # Additional rules for metadata %, target %, MRU/MFU %, etc.
+```
+
+These recording rules calculate:
+
+* ARC hit rate percentage
+* ARC memory usage percentage (current vs maximum)
+* ARC target percentage (target vs maximum)
+* Metadata vs data percentages
+* MRU vs MFU cache percentages
+* Demand data and metadata hit rates
+
+### Grafana Dashboards
+
+Created two comprehensive ZFS monitoring dashboards (zfs-dashboards.yaml):
+
+**Dashboard 1: FreeBSD ZFS (per-host detailed view)**
+
+Includes variables to select:
+* FreeBSD server (f0, f1, or f2)
+* ZFS pool (zdata, zroot, or all)
+
+**Pool Overview Row:**
+* Pool Capacity gauge (with thresholds: green <70%, yellow <85%, red >85%)
+* Pool Health status (ONLINE/DEGRADED/FAULTED with color coding)
+* Total Pool Size stat
+* Free Space stat
+* Pool Space Usage Over Time (stacked: used + free)
+* Pool Capacity Trend time series
+
+**Dataset Statistics Row:**
+* Table showing all datasets with columns: Pool, Dataset, Used, Available, Referenced
+* Automatically filters by selected pool
+
+**ARC Cache Statistics Row:**
+* ARC Hit Rate gauge (red <70%, yellow <90%, green >=90%)
+* ARC Size time series (current, target, max)
+* ARC Memory Usage percentage gauge
+* ARC Hits vs Misses rate
+* ARC Data vs Metadata stacked time series
+
+**Dashboard 2: FreeBSD ZFS Summary (cluster-wide overview)**
+
+**Cluster-Wide Pool Statistics Row:**
+* Total Storage Capacity across all servers
+* Total Used space
+* Total Free space
+* Average Pool Capacity gauge
+* Pool Health Status (worst case across cluster)
+* Total Pool Space Usage Over Time
+* Per-Pool Capacity time series (all pools on all hosts)
+
+**Per-Host Pool Breakdown Row:**
+* Bar gauge showing capacity by host and pool
+* Table with all pools: Host, Pool, Size, Used, Free, Capacity %, Health
+
+**Cluster-Wide ARC Statistics Row:**
+* Average ARC Hit Rate gauge across all hosts
+* ARC Hit Rate by Host time series
+* Total ARC Size Across Cluster
+* Total ARC Hits vs Misses (cluster-wide sum)
+* ARC Size by Host
+
+### Deployment
+
+Applied the resources to the cluster:
+
+```
+cd /home/paul/git/conf/f3s/prometheus
+kubectl apply -f zfs-recording-rules.yaml
+kubectl apply -f zfs-dashboards.yaml
+```
+
+Updated the Justfile to include the ZFS recording rules in the install and upgrade targets:
+
+```
+install:
+    kubectl apply -f persistent-volumes.yaml
+    kubectl create secret generic additional-scrape-configs --from-file=additional-scrape-configs.yaml -n monitoring --dry-run=client -o yaml | kubectl apply -f -
+    helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring -f persistence-values.yaml
+    kubectl apply -f freebsd-recording-rules.yaml
+    kubectl apply -f openbsd-recording-rules.yaml
+    kubectl apply -f zfs-recording-rules.yaml
+    just -f grafana-ingress/Justfile install
+```
+
+### Verifying ZFS Metrics in Prometheus
+
+Check that ZFS metrics are being collected:
+
+```
+kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- \
+  wget -qO- 'http://localhost:9090/api/v1/query?query=node_zfs_arcstats_size_bytes'
+```
+
+Check recording rules are calculating correctly:
+
+```
+kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- \
+  wget -qO- 'http://localhost:9090/api/v1/query?query=node_zfs_arc_memory_usage_percent'
+```
+
+Example output shows memory usage percentage for each FreeBSD server:
+
+```
+"result":[
+  {"metric":{"instance":"192.168.2.130:9100","os":"freebsd"},"value":[...,"37.58"]},
+  {"metric":{"instance":"192.168.2.131:9100","os":"freebsd"},"value":[...,"12.85"]},
+  {"metric":{"instance":"192.168.2.132:9100","os":"freebsd"},"value":[...,"13.44"]}
+]
+```
+
+### Accessing the Dashboards
+
+The dashboards are automatically imported by the Grafana sidecar and accessible at:
+
+=> https://grafana.f3s.buetow.org
+
+Navigate to Dashboards and search for:
+* "FreeBSD ZFS" - detailed per-host view with pool and dataset breakdowns
+* "FreeBSD ZFS Summary" - cluster-wide overview of all ZFS storage
+
+### Key Metrics to Monitor
+
+**ARC Hit Rate:** Should typically be above 90% for optimal performance. Lower hit rates indicate the ARC cache is too small or the workload has poor locality.
+
+**ARC Memory Usage:** Shows how much of the maximum ARC size is being used. If consistently at or near maximum, the ARC is effectively utilizing available memory.
+
+**Data vs Metadata:** Typically data should dominate, but workloads with many small files will show higher metadata percentages.
+
+**MRU vs MFU:** Most Recently Used vs Most Frequently Used cache. The ratio depends on workload characteristics.
+
+**Pool Capacity:** Monitor pool usage to ensure adequate free space. ZFS performance degrades when pools exceed 80% capacity.
+
+**Pool Health:** Should always show ONLINE (green). DEGRADED (yellow) indicates a disk issue requiring attention. FAULTED (red) requires immediate action.
+
+**Dataset Usage:** Track which datasets are consuming the most space to identify growth trends and plan capacity.
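As an editorial aside: the arithmetic behind the two recording rules shown above is easy to sanity-check offline. A small Python sketch with made-up input values (not measurements from f0/f1/f2):

```python
# Sketch of the recording-rule math: ARC hit rate and ARC memory usage
# percentages, computed from raw arcstats values.

def arc_hit_rate_percent(hits_rate, misses_rate):
    # 100 * hits / (hits + misses), guarding against a completely idle cache
    total = hits_rate + misses_rate
    return 100.0 * hits_rate / total if total else 0.0

def arc_memory_usage_percent(size_bytes, c_max_bytes):
    # 100 * current ARC size / configured maximum (arcstats c_max)
    return 100.0 * size_bytes / c_max_bytes

# Hypothetical numbers: 950 hits/s vs 50 misses/s, 3 GB ARC out of 8 GB max
print(round(arc_hit_rate_percent(950.0, 50.0), 2))   # 95.0
print(round(arc_memory_usage_percent(3e9, 8e9), 2))  # 37.5
```

The same formulas run inside Prometheus; precomputing them as recording rules keeps the dashboard queries cheap.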
+
+### ZFS Pool and Dataset Metrics via Textfile Collector
+
+To complement the ARC statistics from node_exporter's built-in ZFS collector, I added pool capacity and dataset metrics using the textfile collector feature.
+
+Created a script at /usr/local/bin/zfs_pool_metrics.sh on each FreeBSD server:
+
+```
+#!/bin/sh
+# ZFS Pool and Dataset Metrics Collector for Prometheus
+
+OUTPUT_FILE="/var/tmp/node_exporter/zfs_pools.prom.$$"
+FINAL_FILE="/var/tmp/node_exporter/zfs_pools.prom"
+
+mkdir -p /var/tmp/node_exporter
+
+{
+    # Pool metrics
+    echo "# HELP zfs_pool_size_bytes Total size of ZFS pool"
+    echo "# TYPE zfs_pool_size_bytes gauge"
+    echo "# HELP zfs_pool_allocated_bytes Allocated space in ZFS pool"
+    echo "# TYPE zfs_pool_allocated_bytes gauge"
+    echo "# HELP zfs_pool_free_bytes Free space in ZFS pool"
+    echo "# TYPE zfs_pool_free_bytes gauge"
+    echo "# HELP zfs_pool_capacity_percent Capacity percentage"
+    echo "# TYPE zfs_pool_capacity_percent gauge"
+    echo "# HELP zfs_pool_health Pool health (0=ONLINE, 1=DEGRADED, 2=FAULTED)"
+    echo "# TYPE zfs_pool_health gauge"
+
+    zpool list -Hp -o name,size,allocated,free,capacity,health | \
+    while IFS=$'\t' read name size alloc free cap health; do
+        case "$health" in
+            ONLINE) health_val=0 ;;
+            DEGRADED) health_val=1 ;;
+            FAULTED) health_val=2 ;;
+            *) health_val=6 ;;
+        esac
+        cap_num=$(echo "$cap" | sed 's/%//')
+
+        echo "zfs_pool_size_bytes{pool=\"$name\"} $size"
+        echo "zfs_pool_allocated_bytes{pool=\"$name\"} $alloc"
+        echo "zfs_pool_free_bytes{pool=\"$name\"} $free"
+        echo "zfs_pool_capacity_percent{pool=\"$name\"} $cap_num"
+        echo "zfs_pool_health{pool=\"$name\"} $health_val"
+    done
+
+    # Dataset metrics
+    echo "# HELP zfs_dataset_used_bytes Used space in dataset"
+    echo "# TYPE zfs_dataset_used_bytes gauge"
+    echo "# HELP zfs_dataset_available_bytes Available space"
+    echo "# TYPE zfs_dataset_available_bytes gauge"
+    echo "# HELP zfs_dataset_referenced_bytes Referenced space"
+    echo "# TYPE zfs_dataset_referenced_bytes gauge"
+
+    zfs list -Hp -t filesystem -o name,used,available,referenced | \
+    while IFS=$'\t' read name used avail ref; do
+        pool=$(echo "$name" | cut -d/ -f1)
+        echo "zfs_dataset_used_bytes{pool=\"$pool\",dataset=\"$name\"} $used"
+        echo "zfs_dataset_available_bytes{pool=\"$pool\",dataset=\"$name\"} $avail"
+        echo "zfs_dataset_referenced_bytes{pool=\"$pool\",dataset=\"$name\"} $ref"
+    done
+} > "$OUTPUT_FILE"
+
+mv "$OUTPUT_FILE" "$FINAL_FILE"
+```
+
+Deployed to all FreeBSD servers:
+
+```
+for host in f0 f1 f2; do
+    scp /tmp/zfs_pool_metrics.sh paul@$host:/tmp/
+    ssh paul@$host 'doas mv /tmp/zfs_pool_metrics.sh /usr/local/bin/ && \
+        doas chmod +x /usr/local/bin/zfs_pool_metrics.sh'
+done
+```
+
+Set up cron jobs to run every minute:
+
+```
+for host in f0 f1 f2; do
+    ssh paul@$host 'echo "* * * * * /usr/local/bin/zfs_pool_metrics.sh >/dev/null 2>&1" | \
+        doas crontab -'
+done
+```
+
+The textfile collector (already configured with --collector.textfile.directory=/var/tmp/node_exporter) automatically picks up the metrics.
+
+Verify metrics are being exposed:
+
+```
+paul@f0:~ % curl -s http://localhost:9100/metrics | grep "^zfs_pool" | head -5
+zfs_pool_allocated_bytes{pool="zdata"} 6.47622733824e+11
+zfs_pool_allocated_bytes{pool="zroot"} 5.3338578944e+10
+zfs_pool_capacity_percent{pool="zdata"} 64
+zfs_pool_capacity_percent{pool="zroot"} 10
+zfs_pool_free_bytes{pool="zdata"} 3.48809678848e+11
+```
+
 ## Summary
 
-Enabled etcd metrics monitoring for the k3s embedded etcd by:
+Enhanced the f3s cluster observability by:
 
-* Adding etcd-expose-metrics: true to /etc/rancher/k3s/config.yaml on each control-plane node
-* Configuring Prometheus to scrape etcd on port 2381
+* Enabling etcd metrics monitoring for the k3s embedded etcd
+* Implementing comprehensive ZFS monitoring for FreeBSD storage servers
+* Creating recording rules for calculated metrics (ARC hit rates, memory usage, etc.)
+* Deploying Grafana dashboards for visualization
+* Configuring automatic dashboard import via ConfigMap labels
 
-The etcd dashboard now provides visibility into cluster health, leader elections, and Raft consensus metrics.
+The monitoring stack now provides visibility into both cluster control plane health (etcd) and storage performance (ZFS).
 
 => https://codeberg.org/snonux/conf/src/branch/master/f3s/prometheus prometheus configuration on Codeberg
+
+## Distributed Tracing with Grafana Tempo
+
+After implementing logs (Loki) and metrics (Prometheus), the final pillar of observability is distributed tracing. Grafana Tempo fills this gap, helping to understand request flows across microservices.
+
+### Why Distributed Tracing?
+
+In a microservices architecture, a single user request may traverse multiple services. Distributed tracing:
+
+* Tracks requests across service boundaries
+* Identifies performance bottlenecks
+* Visualizes service dependencies
+* Correlates with logs and metrics
+* Helps debug complex distributed systems
+
+### Deploying Grafana Tempo
+
+Tempo is deployed in monolithic mode, following the same pattern as Loki's SingleBinary deployment.
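As an editorial aside: the mechanism that makes tracing "track requests across service boundaries" is the W3C Trace Context `traceparent` header, which every instrumented service forwards. A minimal sketch of how it decomposes (the IDs below are hypothetical, taken in the shape the spec uses):

```python
# Sketch: decompose a W3C Trace Context `traceparent` header, the glue
# that lets spans from different services join the same trace.

def parse_traceparent(header):
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,                # "00" for the current spec revision
        "trace_id": trace_id,              # 32 hex chars, shared by all spans of a trace
        "parent_span_id": parent_span_id,  # 16 hex chars, the calling span
        "sampled": flags == "01",          # the sampling decision travels along
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"], ctx["sampled"])
```

In the demo application below, the OpenTelemetry auto-instrumentation injects and extracts this header automatically; no manual parsing is required.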
+
+#### Configuration Strategy
+
+**Deployment Mode:** Monolithic (all components in one process)
+* Simpler operation than microservices mode
+* Suitable for the cluster scale
+* Consistent with Loki deployment pattern
+
+**Storage:** Filesystem backend using hostPath
+* 10Gi storage at /data/nfs/k3svolumes/tempo/data
+* 7-day retention (168h)
+* Local filesystem storage keeps the single-binary deployment simple
+
+**OTLP Receivers:** Standard OpenTelemetry Protocol ports
+* gRPC: 4317
+* HTTP: 4318
+* Bind to 0.0.0.0 to avoid the Tempo 2.7+ localhost-only binding issue
+
+#### Tempo Deployment Files
+
+Created in /home/paul/git/conf/f3s/tempo/:
+
+**values.yaml** - Helm chart configuration:
+
+```
+tempo:
+  retention: 168h
+  storage:
+    trace:
+      backend: local
+      local:
+        path: /var/tempo/traces
+      wal:
+        path: /var/tempo/wal
+  receivers:
+    otlp:
+      protocols:
+        grpc:
+          endpoint: 0.0.0.0:4317
+        http:
+          endpoint: 0.0.0.0:4318
+
+persistence:
+  enabled: true
+  size: 10Gi
+  storageClassName: ""
+
+resources:
+  limits:
+    cpu: 1000m
+    memory: 2Gi
+  requests:
+    cpu: 500m
+    memory: 1Gi
+```
+
+**persistent-volumes.yaml** - Storage configuration:
+
+```
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+  name: tempo-data-pv
+spec:
+  capacity:
+    storage: 10Gi
+  accessModes:
+  - ReadWriteOnce
+  persistentVolumeReclaimPolicy: Retain
+  hostPath:
+    path: /data/nfs/k3svolumes/tempo/data
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: tempo-data-pvc
+  namespace: monitoring
+spec:
+  storageClassName: ""
+  accessModes:
+  - ReadWriteOnce
+  resources:
+    requests:
+      storage: 10Gi
+```
+
+**datasource-configmap.yaml** - Grafana integration:
+
+```
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: tempo-grafana-datasource
+  namespace: monitoring
+  labels:
+    grafana_datasource: "1"
+data:
+  tempo-datasource.yaml: |-
+    apiVersion: 1
+    datasources:
+    - name: "Tempo"
+      type: tempo
+      uid: tempo
+      url: http://tempo.monitoring.svc.cluster.local:3200
+      jsonData:
+        tracesToLogsV2:
+          datasourceUid: 'loki'
+        tracesToMetrics:
+          datasourceUid: 'prometheus'
+        serviceMap:
+          datasourceUid: 'prometheus'
+```
+
+The ConfigMap label grafana_datasource: "1" enables automatic discovery by the Grafana sidecar, just like the Prometheus datasource configuration.
+
+#### Installation
+
+```
+cd /home/paul/git/conf/f3s/tempo
+just install
+```
+
+Verify Tempo is running:
+
+```
+kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo
+kubectl exec -n monitoring <tempo-pod> -- wget -qO- http://localhost:3200/ready
+```
+
+### Configuring Grafana Alloy for Trace Collection
+
+Updated /home/paul/git/conf/f3s/loki/alloy-values.yaml to add OTLP receivers for traces while maintaining existing log collection.
+
+#### OTLP Receiver Configuration
+
+Added to the Alloy configuration after the log collection pipeline:
+
+```
+// OTLP receiver for traces via gRPC and HTTP
+otelcol.receiver.otlp "default" {
+  grpc {
+    endpoint = "0.0.0.0:4317"
+  }
+  http {
+    endpoint = "0.0.0.0:4318"
+  }
+  output {
+    traces = [otelcol.processor.batch.default.input]
+  }
+}
+
+// Batch processor for efficient trace forwarding
+otelcol.processor.batch "default" {
+  timeout = "5s"
+  send_batch_size = 100
+  send_batch_max_size = 200
+  output {
+    traces = [otelcol.exporter.otlp.tempo.input]
+  }
+}
+
+// OTLP exporter to send traces to Tempo
+otelcol.exporter.otlp "tempo" {
+  client {
+    endpoint = "tempo.monitoring.svc.cluster.local:4317"
+    tls {
+      insecure = true
+    }
+    compression = "gzip"
+  }
+}
+```
+
+The batch processor reduces network overhead by accumulating spans before forwarding to Tempo.
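As an editorial aside: the batching behaviour configured above (flush on `send_batch_size`, or on `timeout` for stragglers) can be sketched in a few lines. This toy model ignores the timeout path and concurrency for brevity, and uses a batch size of 3 purely for illustration:

```python
# Illustrative model of a span batch processor: buffer spans, flush to the
# exporter once the batch size is reached. Not the Alloy implementation.

class BatchProcessor:
    def __init__(self, exporter, batch_size=100):
        self.exporter = exporter      # callable that receives a list of spans
        self.batch_size = batch_size
        self.buffer = []

    def on_span(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # In the real processor, a timeout (5s above) also triggers this
        if self.buffer:
            self.exporter(list(self.buffer))
            self.buffer.clear()

exported = []
bp = BatchProcessor(exported.append, batch_size=3)
for i in range(7):
    bp.on_span(f"span-{i}")
bp.flush()  # ship the remainder, as the timeout would
print([len(batch) for batch in exported])  # [3, 3, 1]
```

Seven spans leave as three exporter calls instead of seven, which is exactly the network saving the batch processor buys.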
+
+#### Upgrade Alloy
+
+```
+cd /home/paul/git/conf/f3s/loki
+just upgrade
+```
+
+Verify OTLP receivers are listening:
+
+```
+kubectl logs -n monitoring -l app.kubernetes.io/name=alloy | grep -i "otlp.*receiver"
+kubectl exec -n monitoring <alloy-pod> -- netstat -ln | grep -E ':(4317|4318)'
+```
+
+### Demo Tracing Application
+
+Created a three-tier Python application to demonstrate distributed tracing in action.
+
+#### Application Architecture
+
+```
+User → Frontend (Flask:5000) → Middleware (Flask:5001) → Backend (Flask:5002)
+            ↓                        ↓                        ↓
+                    Alloy (OTLP:4317) → Tempo → Grafana
+```
+
+**Frontend Service:**
+* Receives HTTP requests at /api/process
+* Forwards to middleware service
+* Creates parent span for the entire request
+
+**Middleware Service:**
+* Transforms data at /api/transform
+* Calls backend service
+* Creates child span linked to frontend
+
+**Backend Service:**
+* Returns data at /api/data
+* Simulates database query (100ms sleep)
+* Creates leaf span in the trace
+
+#### OpenTelemetry Instrumentation
+
+All services use Python OpenTelemetry libraries:
+
+**Dependencies:**
+```
+flask==3.0.0
+requests==2.31.0
+opentelemetry-distro==0.49b0
+opentelemetry-exporter-otlp==1.28.0
+opentelemetry-instrumentation-flask==0.49b0
+opentelemetry-instrumentation-requests==0.49b0
+```
+
+**Auto-instrumentation pattern** (used in all services):
+
+```python
+from opentelemetry import trace
+from opentelemetry.sdk.trace import TracerProvider
+from opentelemetry.sdk.trace.export import BatchSpanProcessor
+from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
+from opentelemetry.instrumentation.flask import FlaskInstrumentor
+from opentelemetry.instrumentation.requests import RequestsInstrumentor
+from opentelemetry.sdk.resources import Resource
+
+# Define service identity
+resource = Resource(attributes={
+    "service.name": "frontend",
+    "service.namespace": "tracing-demo",
+    "service.version": "1.0.0"
+})
+
+provider = TracerProvider(resource=resource)
+
+# Export to Alloy
+otlp_exporter = OTLPSpanExporter(
+    endpoint="http://alloy.monitoring.svc.cluster.local:4317",
+    insecure=True
+)
+
+processor = BatchSpanProcessor(otlp_exporter)
+provider.add_span_processor(processor)
+trace.set_tracer_provider(provider)
+
+# Auto-instrument Flask and requests
+FlaskInstrumentor().instrument_app(app)
+RequestsInstrumentor().instrument()
+```
+
+The auto-instrumentation automatically:
+* Creates spans for HTTP requests
+* Propagates trace context via W3C Trace Context headers
+* Links parent and child spans across service boundaries
+
+#### Deployment
+
+Created a Helm chart in /home/paul/git/conf/f3s/tracing-demo/ with three separate deployments, services, and an ingress.
+
+Build and deploy:
+
+```
+cd /home/paul/git/conf/f3s/tracing-demo
+just build
+just import
+just install
+```
+
+Verify deployment:
+
+```
+kubectl get pods -n services | grep tracing-demo
+kubectl get ingress -n services tracing-demo-ingress
+```
+
+Access the application at:
+
+=> http://tracing-demo.f3s.buetow.org
+
+### Visualizing Traces in Grafana
+
+The Tempo datasource is automatically discovered by Grafana through the ConfigMap label.
+
+#### Accessing Traces
+
+Navigate to Grafana → Explore → Select "Tempo" datasource
+
+**Search Interface:**
+* Search by Trace ID
+* Search by service name
+* Search by tags
+
+**TraceQL Queries:**
+
+Find all traces from the demo app:
+```
+{ resource.service.namespace = "tracing-demo" }
+```
+
+Find slow requests (>200ms):
+```
+{ duration > 200ms }
+```
+
+Find traces from a specific service:
+```
+{ resource.service.name = "frontend" }
+```
+
+Find errors:
+```
+{ status = error }
+```
+
+Complex query - demo traces that hit a server error:
+```
+{ resource.service.namespace = "tracing-demo" } && { span.http.status_code >= 500 }
+```
+
+#### Service Graph Visualization
+
+The service graph shows visual connections between services:
+
+1. Navigate to Explore → Tempo
+2. Enable the "Service Graph" view
+3. Observe the graph: Frontend → Middleware → Backend, annotated with request rates
+
+The service graph uses Prometheus metrics generated from trace data.
+
+### Correlation Between Observability Signals
+
+Tempo integrates with Loki and Prometheus to provide unified observability.
+
+#### Traces-to-Logs
+
+Click on any span in a trace to see related logs:
+
+1. View trace in Grafana
+2. Click on a span
+3. Select "Logs for this span"
+4. Loki shows logs filtered by:
+   * Time range (span duration ± 1 hour)
+   * Service name
+   * Namespace
+   * Pod
+
+This helps correlate what the service was doing when the span was created.
+
+#### Traces-to-Metrics
+
+View Prometheus metrics for services in the trace:
+
+1. View trace in Grafana
+2. Select "Metrics" tab
+3. Shows metrics like:
+   * Request rate
+   * Error rate
+   * Duration percentiles
+
+#### Logs-to-Traces
+
+From logs, you can jump to related traces:
+
+1. In Loki, logs that contain trace IDs are automatically linked
+2. Click the trace ID to view the full trace
+3. See the complete request flow
+
+### Generating Traces for Testing
+
+Test the demo application:
+
+```
+curl http://tracing-demo.f3s.buetow.org/api/process
+```
+
+Load test (generates 50 traces):
+
+```
+cd /home/paul/git/conf/f3s/tracing-demo
+just load-test
+```
+
+Each request creates a distributed trace spanning all three services.
+
+### Verifying the Complete Pipeline
+
+Check the trace flow end-to-end:
+
+**1. Application generates traces:**
+```
+kubectl logs -n services -l app=tracing-demo-frontend | grep -i trace
+```
+
+**2. Alloy receives traces:**
+```
+kubectl logs -n monitoring -l app.kubernetes.io/name=alloy | grep -i otlp
+```
+
+**3. Tempo stores traces:**
+```
+kubectl logs -n monitoring -l app.kubernetes.io/name=tempo | grep -i trace
+```
+
+**4. Grafana displays traces:**
+Navigate to Explore → Tempo → Search for traces
+
+### Storage and Retention
+
+Monitor Tempo storage usage:
+
+```
+kubectl exec -n monitoring <tempo-pod> -- df -h /var/tempo
+```
+
+With 10Gi storage and 7-day retention, the system handles moderate trace volumes. If storage fills up:
+
+* Reduce retention to 72h (3 days)
+* Implement sampling in Alloy
+* Increase PV size
+
+### Complete Observability Stack
+
+The f3s cluster now has complete observability:
+
+**Metrics** (Prometheus):
+* Cluster resource usage
+* Application metrics
+* Node metrics (FreeBSD ZFS, OpenBSD edge)
+* etcd health
+
+**Logs** (Loki):
+* All pod logs
+* Structured log collection
+* Log aggregation and search
+
+**Traces** (Tempo):
+* Distributed request tracing
+* Service dependency mapping
+* Performance profiling
+* Error tracking
+
+**Visualization** (Grafana):
+* Unified dashboards
+* Correlation between metrics, logs, and traces
+* Service graphs
+* Alerts
+
+### Configuration Files
+
+All configuration files are available on Codeberg:
+
+=> https://codeberg.org/snonux/conf/src/branch/master/f3s/tempo Tempo configuration
+=> https://codeberg.org/snonux/conf/src/branch/master/f3s/loki Alloy configuration (updated for traces)
+=> https://codeberg.org/snonux/conf/src/branch/master/f3s/tracing-demo Demo tracing application
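As an editorial aside: whether 10Gi comfortably covers the 7-day retention window depends on span volume and size, which is simple arithmetic to check. The ingest figures below are assumptions for illustration, not measurements from the f3s cluster:

```python
# Back-of-the-envelope PV sizing: how many days of traces fit in the
# volume at a given ingest rate? Ingest numbers are hypothetical.

GIB = 1024 ** 3

def retention_days(pv_bytes, spans_per_day, avg_span_bytes):
    # days of traces the volume can hold before old blocks must be dropped
    return pv_bytes / (spans_per_day * avg_span_bytes)

# Assume ~1 million spans/day at ~500 stored bytes each (a rough guess
# for compressed block data)
days = retention_days(10 * GIB, 1_000_000, 500)
print(round(days, 1))  # ~21.5 under these assumptions
```

Under these assumed rates the volume holds roughly three weeks of traces, so the configured 168h retention (not disk exhaustion) is the binding limit; a heavier span volume shifts that balance toward the mitigation options listed above.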
