more about the draft

author: Paul Buetow <paul@buetow.org> 2025-07-02 00:35:36 +0300
committer: Paul Buetow <paul@buetow.org> 2025-07-02 00:35:36 +0300
commit: 4ea9075d69f1411fffe0511ee3e78b930fbbafb8
tree: d2c03061aabe60c1ba040f7330b76b5dfa6351ee
parent: 638aa9523d57c7071d311909ef805bdda80098b3
2 files changed, 727 insertions, 52 deletions
diff --git a/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-6.gmi b/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-6.gmi
index 6e3f9395..086179e6 100644
--- a/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-6.gmi
+++ b/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-6.gmi
@@ -23,6 +23,7 @@ This is the sixth blog post about the f3s series for self-hosting demands in a h
 * ⇢ ⇢ ⇢ Migrating Bhyve VMs to encrypted `bhyve` ZFS volume
 * ⇢ ⇢ CARP
 * ⇢ ⇢ ZFS Replication with zrepl
+* ⇢ ⇢ ⇢ Why zrepl instead of HAST?
 * ⇢ ⇢ ⇢ Installing zrepl
 * ⇢ ⇢ ⇢ Checking ZFS pools
 * ⇢ ⇢ ⇢ Configuring zrepl with WireGuard tunnel
@@ -31,6 +32,13 @@ This is the sixth blog post about the f3s series for self-hosting demands in a h
 * ⇢ ⇢ ⇢ Enabling and starting zrepl services
 * ⇢ ⇢ ⇢ Verifying replication
 * ⇢ ⇢ ⇢ Monitoring replication
+* ⇢ ⇢ ⇢ A note about the Bhyve VM replication
+* ⇢ ⇢ ⇢ Quick status check commands
+* ⇢ ⇢ ⇢ Verifying replication after reboot
+* ⇢ ⇢ ⇢ Important note about failover limitations
+* ⇢ ⇢ ⇢ Mounting the NFS datasets
+* ⇢ ⇢ ⇢ Failback scenario: Syncing changes from f1 back to f0
+* ⇢ ⇢ ⇢ Testing the failback scenario
 
 ## Introduction
 
@@ -202,11 +210,27 @@ reboot or run doas kldload carp0
 
 In this section, we'll set up automatic ZFS replication from f0 to f1 using zrepl. This ensures our data is replicated across nodes for redundancy.
 
+### Why zrepl instead of HAST?
+
+While HAST (Highly Available Storage) is FreeBSD's native solution for high-availability storage, I've chosen zrepl for several important reasons:
+
+1. **HAST can cause ZFS corruption**: HAST operates at the block level and doesn't understand ZFS's transactional semantics. During failover, in-flight transactions can lead to corrupted zpools. I've experienced this firsthand - the automatic failover would trigger while ZFS was still writing, resulting in an unmountable pool.
+
+2. **ZFS-aware replication**: zrepl understands ZFS datasets and snapshots. It replicates at the dataset level, ensuring each snapshot is a consistent point-in-time copy. This is fundamentally safer than block-level replication.
+
+3. **Snapshot history**: With zrepl, you get multiple recovery points (every 5 minutes in our setup). If corruption occurs, you can roll back to any previous snapshot. HAST only gives you the current state.
+
+4. **Easier recovery**: When something goes wrong with zrepl, you still have intact snapshots on both sides. With HAST, a corrupted primary often means a corrupted secondary too.
+
+5. **Network flexibility**: zrepl works over any TCP connection (in our case, WireGuard), while HAST requires dedicated network configuration.
+
+The 5-minute replication window is perfectly acceptable for my personal use cases. This isn't a high-frequency trading system or a real-time database - it's storage for personal projects, development work, and home lab experiments. Losing at most 5 minutes of work in a disaster scenario is a reasonable trade-off for the reliability and simplicity of snapshot-based replication.
+
 ### Installing zrepl
 
 First, install zrepl on both hosts:
 
-```sh
+```
 # On f0
 paul@f0:~ % doas pkg install -y zrepl
 
@@ -255,6 +279,13 @@ paul@f1:~ % ifconfig wg0 | grep inet
 
 ### Configuring zrepl on f0 (source)
 
+First, create a dedicated dataset for NFS data that will be replicated:
+
+```sh
+# Create the nfsdata dataset that will hold all data exposed via NFS
+paul@f0:~ % doas zfs create zdata/enc/nfsdata
+```
+
 Create the zrepl configuration on f0:
 
 ```sh
@@ -266,39 +297,40 @@ global:
       format: human
 
 jobs:
-  - name: "f0_to_f1"
+  - name: f0_to_f1
     type: push
     connect:
       type: tcp
       address: "192.168.2.131:8888"
-    filesystems: {
-      "zdata/enc": true
-    }
+    filesystems:
+      "zdata/enc/nfsdata": true
+      "zroot/bhyve/fedora": true
     send:
       encrypted: true
     snapshotting:
       type: periodic
       prefix: zrepl_
-      interval: 10m
+      interval: 5m
     pruning:
       keep_sender:
         - type: last_n
           count: 10
-        - type: grid
-          grid: 1x1h(keep=all) | 24x1h | 7x1d | 4x7d | 6x30d
-          regex: "^zrepl_.*"
       keep_receiver:
-        - type: grid
-          grid: 1x1h(keep=all) | 24x1h | 7x1d | 4x7d | 6x30d
-          regex: "^zrepl_.*"
+        - type: last_n
+          count: 10
 EOF
 ```
 
+Note: We're specifically replicating `zdata/enc/nfsdata` instead of the entire `zdata/enc` dataset. This dedicated dataset will contain all the data we later want to expose via NFS, keeping a clear separation between replicated NFS data and other local encrypted data.
+
 ### Configuring zrepl on f1 (sink)
 
 Create the zrepl configuration on f1:
 
 ```sh
+# First create a dedicated sink dataset
+paul@f1:~ % doas zfs create zdata/sink
+
 paul@f1:~ % doas tee /usr/local/etc/zrepl/zrepl.yml <<'EOF'
 global:
   logging:
@@ -317,7 +349,7 @@ jobs:
     recv:
       placeholder:
         encryption: inherit
-    root_fs: "zdata/enc"
+    root_fs: "zdata/sink"
 EOF
 ```
 
@@ -344,17 +376,31 @@ Starting zrepl.
 Check the replication status:
 
 ```sh
-# On f0, check zrepl status
-paul@f0:~ % doas zrepl status
+# On f0, check zrepl status (use raw mode for non-tty)
+paul@f0:~ % doas zrepl status --mode raw | grep -A2 "Replication"
+"Replication":{"StartAt":"2025-07-01T22:31:48.712143123+03:00"...
 
-# Check for zrepl snapshots
-paul@f0:~ % doas zfs list -t snapshot -r zdata/enc | grep zrepl
+# Check if services are running
+paul@f0:~ % doas service zrepl status
+zrepl is running as pid 2649.
 
-# On f1, verify the replicated datasets
-paul@f1:~ % doas zfs list -r zdata/enc
+paul@f1:~ % doas service zrepl status
+zrepl is running as pid 2574.
 
-# Check zrepl logs for any errors
-paul@f0:~ % doas tail -f /var/log/zrepl.log
+# Check for zrepl snapshots on source
+paul@f0:~ % doas zfs list -t snapshot -r zdata/enc | grep zrepl
+zdata/enc@zrepl_20250701_193148_000    0B      -   176K  -
+
+# On f1, verify the replicated datasets  
+paul@f1:~ % doas zfs list -r zdata | grep f0
+zdata/f0             576K   899G   200K  none
+zdata/f0/zdata       376K   899G   200K  none
+zdata/f0/zdata/enc   176K   899G   176K  none
+
+# Check replicated snapshots on f1
+paul@f1:~ % doas zfs list -t snapshot -r zdata | grep zrepl
+zdata/f0/zdata/enc@zrepl_20250701_193148_000     0B      -   176K  -
+zdata/f0/zdata/enc@zrepl_20250701_194148_000     0B      -   176K  -
 ```
 
 ### Monitoring replication
@@ -369,7 +415,331 @@ paul@f0:~ % doas zrepl status --mode interactive
 paul@f0:~ % doas zrepl status --job f0_to_f1
 ```
 
-With this setup, zdata/enc on f0 will be automatically replicated to f1 every 10 minutes, with encrypted snapshots preserved on both sides. The pruning policy ensures that we keep recent snapshots while managing disk space efficiently.
+With this setup, both `zdata/enc/nfsdata` and `zroot/bhyve/fedora` on f0 will be automatically replicated to f1 every 5 minutes, with encrypted snapshots preserved on both sides. The pruning policy ensures that we keep the last 10 snapshots while managing disk space efficiently.
+
+The replicated data appears on f1 under `zdata/sink/` with the source host and dataset hierarchy preserved:
+
+* `zdata/enc/nfsdata` → `zdata/sink/f0/zdata/enc/nfsdata`
+* `zroot/bhyve/fedora` → `zdata/sink/f0/zroot/bhyve/fedora`
+
+This is by design - zrepl preserves the complete path from the source to ensure there are no conflicts when replicating from multiple sources. The replication uses the WireGuard tunnel for secure, encrypted transport between nodes.
+
+### A note about the Bhyve VM replication
+
+While replicating a Bhyve VM (Fedora in this case) is slightly off-topic for the f3s series, I've included it here as it demonstrates zrepl's flexibility. This is a development VM I use occasionally to log in remotely for certain development tasks. Having it replicated ensures I have a backup copy available on f1 if needed.
+
+### Quick status check commands
+
+Here are the essential commands to monitor replication status:
+
+```sh
+# On the source node (f0) - check if replication is active
+paul@f0:~ % doas zrepl status --job f0_to_f1 | grep -E '(State|Last)'
+State: done
+LastError: 
+
+# List all zrepl snapshots on source
+paul@f0:~ % doas zfs list -t snapshot | grep zrepl
+zdata/enc/nfsdata@zrepl_20250701_202530_000             0B      -   200K  -
+zroot/bhyve/fedora@zrepl_20250701_202530_000            0B      -  2.97G  -
+
+# On the sink node (f1) - verify received datasets
+paul@f1:~ % doas zfs list -r zdata/sink
+NAME                                   USED  AVAIL  REFER  MOUNTPOINT
+zdata/sink                             3.0G   896G   200K  /data/sink
+zdata/sink/f0                          3.0G   896G   200K  none
+zdata/sink/f0/zdata                    472K   896G   200K  none
+zdata/sink/f0/zdata/enc                272K   896G   200K  none
+zdata/sink/f0/zdata/enc/nfsdata        176K   896G   176K  none
+zdata/sink/f0/zroot                    2.9G   896G   200K  none
+zdata/sink/f0/zroot/bhyve              2.9G   896G   200K  none
+zdata/sink/f0/zroot/bhyve/fedora       2.9G   896G  2.97G  none
+
+# Check received snapshots on sink
+paul@f1:~ % doas zfs list -t snapshot -r zdata/sink | grep zrepl | wc -l
+       3
+
+# Monitor replication progress in real-time (on source)
+paul@f0:~ % doas zrepl status --mode interactive
+
+# Check last replication time (on source)
+paul@f0:~ % doas zrepl status --job f0_to_f1 | grep -A1 "Replication"
+Replication:
+  Status: Idle (last run: 2025-07-01T22:41:48)
+
+# View zrepl logs for troubleshooting
+paul@f0:~ % doas tail -20 /var/log/zrepl.log | grep -E '(error|warn|replication)'
+```
+
+These commands provide a quick way to verify that:
+
+* Replication jobs are running without errors
+* Snapshots are being created on the source
+* Data is being received on the sink
+* The replication schedule is being followed
+
+### Verifying replication after reboot
+
+The zrepl service is configured to start automatically at boot. After rebooting both hosts:
+
+```sh
+paul@f0:~ % uptime
+11:17PM  up 1 min, 0 users, load averages: 0.16, 0.06, 0.02
+
+paul@f0:~ % doas service zrepl status
+zrepl is running as pid 2366.
+
+paul@f1:~ % doas service zrepl status
+zrepl is running as pid 2309.
+
+# Check that new snapshots are being created and replicated
+paul@f0:~ % doas zfs list -t snapshot | grep zrepl | tail -2
+zdata/enc/nfsdata@zrepl_20250701_202530_000                0B      -   200K  -
+zroot/bhyve/fedora@zrepl_20250701_202530_000               0B      -  2.97G  -
+
+paul@f1:~ % doas zfs list -t snapshot -r zdata/sink | grep 202530
+zdata/sink/f0/zdata/enc/nfsdata@zrepl_20250701_202530_000      0B      -   176K  -
+zdata/sink/f0/zroot/bhyve/fedora@zrepl_20250701_202530_000     0B      -  2.97G  -
+```
+
+The timestamps confirm that replication resumed automatically after the reboot, ensuring continuous data protection.
+
+### Important note about failover limitations
+
+The current zrepl setup provides **backup/disaster recovery** but not automatic failover. The replicated datasets on f1 are not mounted by default (`mountpoint=none`). In case f0 fails:
+
+```sh
+# Manual steps needed on f1 to activate the replicated data:
+paul@f1:~ % doas zfs set mountpoint=/data/nfs zdata/sink/f0/zdata/enc/nfsdata
+paul@f1:~ % doas zfs mount zdata/sink/f0/zdata/enc/nfsdata
+```
+
+However, this creates a **split-brain problem**: when f0 comes back online, both systems would have diverged data. Resolving this requires careful manual intervention to:
+
+1. Stop the original replication
+2. Sync changes from f1 back to f0
+3. Re-establish normal replication
+
+For true high-availability NFS, you might consider:
+
+* **Shared storage** (like iSCSI) with proper clustering
+* **GlusterFS** or similar distributed filesystems
+* **Manual failover with ZFS replication** (as we have here)
+
+Note: While HAST+CARP is often suggested for HA storage, it can cause filesystem corruption in practice, especially with ZFS. The block-level replication of HAST doesn't understand ZFS's transactional model, leading to inconsistent states during failover. 
+
+The current zrepl setup, despite requiring manual intervention, is actually safer because:
+
+* ZFS snapshots are always consistent
+* Replication is ZFS-aware (not just block-level)
+* You have full control over the failover process
+* Diverged data after a failover has to be merged manually, but no automatic failover can corrupt the pool
+
+### Mounting the NFS datasets
+
+To make the `nfsdata` dataset accessible on both nodes, we need to mount it on each of them. On f0, this is straightforward:
+
+```sh
+# On f0 - set mountpoint for the primary nfsdata
+paul@f0:~ % doas zfs set mountpoint=/data/nfs zdata/enc/nfsdata
+paul@f0:~ % doas mkdir -p /data/nfs
+
+# Verify it's mounted
+paul@f0:~ % df -h /data/nfs
+Filesystem           Size    Used   Avail Capacity  Mounted on
+zdata/enc/nfsdata    899G    204K    899G     0%    /data/nfs
+```
+
+On f1, we need to handle the encryption key and mount the standby copy:
+
+```sh
+# On f1 - first check encryption status
+paul@f1:~ % doas zfs get keystatus zdata/sink/f0/zdata/enc/nfsdata
+NAME                             PROPERTY   VALUE        SOURCE
+zdata/sink/f0/zdata/enc/nfsdata  keystatus  unavailable  -
+
+# Load the encryption key (using f0's key stored on the USB)
+paul@f1:~ % doas zfs load-key -L file:///keys/f0.lan.buetow.org:zdata.key \
+    zdata/sink/f0/zdata/enc/nfsdata
+
+# Set mountpoint and mount (same path as f0 for easier failover)
+paul@f1:~ % doas mkdir -p /data/nfs
+paul@f1:~ % doas zfs set mountpoint=/data/nfs zdata/sink/f0/zdata/enc/nfsdata
+paul@f1:~ % doas zfs mount zdata/sink/f0/zdata/enc/nfsdata
+
+# Make it read-only to prevent accidental writes that would break replication
+paul@f1:~ % doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata
+
+# Verify
+paul@f1:~ % df -h /data/nfs
+Filesystem                         Size    Used   Avail Capacity  Mounted on
+zdata/sink/f0/zdata/enc/nfsdata    896G    204K    896G     0%    /data/nfs
+```
+
+Note: The dataset is mounted at the same path (`/data/nfs`) on both hosts to simplify failover procedures. The dataset on f1 is set to `readonly=on` to prevent accidental modifications that would break replication.
+
+**CRITICAL WARNING**: Do NOT write to `/data/nfs/` on f1! Any modifications will break the replication. If you accidentally write to it, you'll see this error:
+
+```
+cannot receive incremental stream: destination zdata/sink/f0/zdata/enc/nfsdata has been modified
+since most recent snapshot
+```
+
+To fix a broken replication after accidental writes:
+```sh
+# Roll back to the last common snapshot (this discards the local changes on f1)
+paul@f1:~ % doas zfs rollback zdata/sink/f0/zdata/enc/nfsdata@zrepl_20250701_204054_000
+
+# Then make the dataset read-only to prevent further accidental writes
+paul@f1:~ % doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata
+```
+
+To ensure the encryption key is loaded automatically after reboot on f1:
+```sh
+paul@f1:~ % doas sysrc zfskeys_datasets="zdata/sink/f0/zdata/enc/nfsdata"
+```
+
+### Failback scenario: Syncing changes from f1 back to f0
+
+In a disaster recovery scenario where f0 has failed and f1 has taken over, you'll need to sync changes back when f0 returns. Here's how to fail back:
+
+```sh
+# On f1: First, make the dataset writable (if it was readonly)
+paul@f1:~ % doas zfs set readonly=off zdata/sink/f0/zdata/enc/nfsdata
+
+# Create a snapshot of the current state
+paul@f1:~ % doas zfs snapshot zdata/sink/f0/zdata/enc/nfsdata@failback
+
+# On f0: Stop any services using the dataset
+paul@f0:~ % doas service nfsd stop  # If NFS is running
+
+# Send the snapshot from f1 to f0, forcing a rollback
+# This WILL DESTROY any data on f0 that's not on f1!
+paul@f1:~ % doas zfs send -Rw zdata/sink/f0/zdata/enc/nfsdata@failback | \
+    ssh f0 "doas zfs recv -F zdata/enc/nfsdata"
+
+# Alternative: dry run (-n) to see what would be received first
+paul@f1:~ % doas zfs send -Rw zdata/sink/f0/zdata/enc/nfsdata@failback | \
+    ssh f0 "doas zfs recv -nv -F zdata/enc/nfsdata"
+
+# After successful sync, on f0:
+paul@f0:~ % doas zfs destroy zdata/enc/nfsdata@failback
+
+# On f1: Make it readonly again and destroy the failback snapshot
+paul@f1:~ % doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata
+paul@f1:~ % doas zfs destroy zdata/sink/f0/zdata/enc/nfsdata@failback
+
+# Before cleaning up snapshots, stop the zrepl services on both hosts - CRITICAL!
+paul@f0:~ % doas service zrepl stop
+paul@f1:~ % doas service zrepl stop
+
+# Clean up any zrepl snapshots on f0
+paul@f0:~ % doas zfs list -t snapshot -r zdata/enc/nfsdata | grep zrepl | \
+    awk '{print $1}' | xargs -I {} doas zfs destroy {}
+
+# Clean up and destroy the entire replicated structure on f1
+# First release any holds
+paul@f1:~ % doas zfs holds -r zdata/sink/f0 | grep -v NAME | \
+    awk '{print $2, $1}' | while read tag snap; do 
+        doas zfs release "$tag" "$snap"
+    done
+
+# Then destroy the entire f0 tree
+paul@f1:~ % doas zfs destroy -rf zdata/sink/f0
+
+# Create parent dataset structure on f1
+paul@f1:~ % doas zfs create -p zdata/sink/f0/zdata/enc
+
+# Create a fresh manual snapshot to establish baseline
+paul@f0:~ % doas zfs snapshot zdata/enc/nfsdata@manual_baseline
+
+# Send this snapshot to f1
+paul@f0:~ % doas zfs send -w zdata/enc/nfsdata@manual_baseline | \
+    ssh f1 "doas zfs recv zdata/sink/f0/zdata/enc/nfsdata"
+
+# Clean up the manual snapshot
+paul@f0:~ % doas zfs destroy zdata/enc/nfsdata@manual_baseline
+paul@f1:~ % doas zfs destroy zdata/sink/f0/zdata/enc/nfsdata@manual_baseline
+
+# Set mountpoint and make readonly on f1
+paul@f1:~ % doas zfs set mountpoint=/data/nfs zdata/sink/f0/zdata/enc/nfsdata
+paul@f1:~ % doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata
+
+# Load encryption key and mount on f1
+paul@f1:~ % doas zfs load-key -L file:///keys/f0.lan.buetow.org:zdata.key \
+    zdata/sink/f0/zdata/enc/nfsdata
+paul@f1:~ % doas zfs mount zdata/sink/f0/zdata/enc/nfsdata
+
+# Now restart zrepl services
+paul@f0:~ % doas service zrepl start
+paul@f1:~ % doas service zrepl start
+
+# Verify replication is working
+paul@f0:~ % doas zrepl status --job f0_to_f1
+```
+
+**Important notes about failback**:
+
+* The `-F` flag forces a rollback on f0, destroying any local changes
+* Replication often won't resume automatically after a forced receive
+* You must clean up old zrepl snapshots on both sides
+* Creating a manual snapshot helps re-establish the replication relationship
+* Always verify replication status after the failback procedure
+* The first replication after failback will be a full send of the current state
+
+### Testing the failback scenario
+
+Here's a real test of the failback procedure:
+
+```sh
+# Simulate failure: Stop replication on f0
+paul@f0:~ % doas service zrepl stop
+
+# On f1: Take over by making the dataset writable
+paul@f1:~ % doas zfs set readonly=off zdata/sink/f0/zdata/enc/nfsdata
+
+# Write some data on f1 during the "outage"
+paul@f1:~ % echo 'Data written on f1 during failover' | doas tee /data/nfs/failover-data.txt
+Data written on f1 during failover
+
+# Now perform failback when f0 comes back online
+# Create snapshot on f1
+paul@f1:~ % doas zfs snapshot zdata/sink/f0/zdata/enc/nfsdata@failback
+
+# Send data back to f0 (note: we had to send to a temporary dataset due to holds)
+paul@f1:~ % doas zfs send -Rw zdata/sink/f0/zdata/enc/nfsdata@failback | \
+    ssh f0 "doas zfs recv -F zdata/enc/nfsdata_temp"
+
+# On f0: Rename datasets to complete failback
+paul@f0:~ % doas zfs set mountpoint=none zdata/enc/nfsdata
+paul@f0:~ % doas zfs rename zdata/enc/nfsdata zdata/enc/nfsdata_old
+paul@f0:~ % doas zfs rename zdata/enc/nfsdata_temp zdata/enc/nfsdata
+
+# Load encryption key and mount
+paul@f0:~ % doas zfs load-key -L file:///keys/f0.lan.buetow.org:zdata.key zdata/enc/nfsdata
+paul@f0:~ % doas zfs mount zdata/enc/nfsdata
+
+# Verify the data from f1 is now on f0
+paul@f0:~ % ls -la /data/nfs/
+total 18
+drwxr-xr-x  2 root wheel  4 Jul  2 00:01 .
+drwxr-xr-x  4 root wheel  4 Jul  1 23:41 ..
+-rw-r--r--  1 root wheel 35 Jul  2 00:01 failover-data.txt
+-rw-r--r--  1 root wheel 12 Jul  1 23:34 hello.txt
+```
+
+Success! The failover data from f1 is now on f0. To resume normal replication, you would need to:
+
+1. Clean up old snapshots on both sides
+2. Create a new manual baseline snapshot
+3. Restart zrepl services
+
+**Key learnings from the test**:
+
+* The `-w` flag is essential for encrypted datasets
+* Dataset holds can complicate the process (consider sending to a temporary dataset)
+* The encryption key must be loaded after receiving the dataset
+* Always verify data integrity before resuming normal operations
 
 ZFS auto scrubbing....~?
 
diff --git a/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-6.gmi.tpl b/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-6.gmi.tpl
index 64fa1cf1..60638faa 100644
--- a/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-6.gmi.tpl
+++ b/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-6.gmi.tpl
@@ -180,11 +180,27 @@ reboot or run doas kldload carp0
 
 In this section, we'll set up automatic ZFS replication from f0 to f1 using zrepl. This ensures our data is replicated across nodes for redundancy.
 
+### Why zrepl instead of HAST?
+
+While HAST (Highly Available Storage) is FreeBSD's native solution for high-availability storage, I've chosen zrepl for several important reasons:
+
+1. **HAST can cause ZFS corruption**: HAST operates at the block level and doesn't understand ZFS's transactional semantics. During failover, in-flight transactions can lead to corrupted zpools. I've experienced this firsthand - the automatic failover would trigger while ZFS was still writing, resulting in an unmountable pool.
+
+2. **ZFS-aware replication**: zrepl understands ZFS datasets and snapshots. It replicates at the dataset level, ensuring each snapshot is a consistent point-in-time copy. This is fundamentally safer than block-level replication.
+
+3. **Snapshot history**: With zrepl, you get multiple recovery points (every 5 minutes in our setup). If corruption occurs, you can roll back to any previous snapshot. HAST only gives you the current state.
+
+4. **Easier recovery**: When something goes wrong with zrepl, you still have intact snapshots on both sides. With HAST, a corrupted primary often means a corrupted secondary too.
+
+5. **Network flexibility**: zrepl works over any TCP connection (in our case, WireGuard), while HAST requires dedicated network configuration.
+
+The 5-minute replication window is perfectly acceptable for my personal use cases. This isn't a high-frequency trading system or a real-time database - it's storage for personal projects, development work, and home lab experiments. Losing at most 5 minutes of work in a disaster scenario is a reasonable trade-off for the reliability and simplicity of snapshot-based replication.
+
 ### Installing zrepl
 
 First, install zrepl on both hosts:
 
-```sh
+```
 # On f0
 paul@f0:~ % doas pkg install -y zrepl
 
@@ -233,6 +249,13 @@ paul@f1:~ % ifconfig wg0 | grep inet
 
 ### Configuring zrepl on f0 (source)
 
+First, create a dedicated dataset for NFS data that will be replicated:
+
+```sh
+# Create the nfsdata dataset that will hold all data exposed via NFS
+paul@f0:~ % doas zfs create zdata/enc/nfsdata
+```
+
 Create the zrepl configuration on f0:
 
 ```sh
@@ -250,13 +273,14 @@ jobs:
       type: tcp
       address: "192.168.2.131:8888"
     filesystems:
-      "zdata/enc": true
+      "zdata/enc/nfsdata": true
+      "zroot/bhyve/fedora": true
     send:
       encrypted: true
     snapshotting:
       type: periodic
       prefix: zrepl_
-      interval: 10m
+      interval: 5m
     pruning:
       keep_sender:
         - type: last_n
@@ -267,11 +291,16 @@ jobs:
 EOF
 ```
 
+Note: We're specifically replicating `zdata/enc/nfsdata` instead of the entire `zdata/enc` dataset. This dedicated dataset will contain all the data we later want to expose via NFS, keeping a clear separation between replicated NFS data and other local encrypted data.
+
 ### Configuring zrepl on f1 (sink)
 
 Create the zrepl configuration on f1:
 
 ```sh
+# First create a dedicated sink dataset
+paul@f1:~ % doas zfs create zdata/sink
+
 paul@f1:~ % doas tee /usr/local/etc/zrepl/zrepl.yml <<'EOF'
 global:
   logging:
@@ -290,7 +319,7 @@ jobs:
     recv:
       placeholder:
         encryption: inherit
-    root_fs: "zdata/enc"
+    root_fs: "zdata/sink"
 EOF
 ```
 
@@ -332,18 +361,16 @@ zrepl is running as pid 2574.
 paul@f0:~ % doas zfs list -t snapshot -r zdata/enc | grep zrepl
 zdata/enc@zrepl_20250701_193148_000    0B      -   176K  -
 
-# On f1, verify the replicated datasets
-paul@f1:~ % doas zfs list -r zdata/enc
-NAME                     USED  AVAIL  REFER  MOUNTPOINT
-zdata/enc                776K   899G   200K  /data/enc
-zdata/enc/f0             576K   899G   200K  none
-zdata/enc/f0/zdata       376K   899G   200K  none
-zdata/enc/f0/zdata/enc   176K   899G   176K  none
+# On f1, verify the replicated datasets  
+paul@f1:~ % doas zfs list -r zdata | grep f0
+zdata/f0             576K   899G   200K  none
+zdata/f0/zdata       376K   899G   200K  none
+zdata/f0/zdata/enc   176K   899G   176K  none
 
 # Check replicated snapshots on f1
-paul@f1:~ % doas zfs list -t snapshot -r zdata/enc
-NAME                                               USED  AVAIL  REFER  MOUNTPOINT
-zdata/enc/f0/zdata/enc@zrepl_20250701_193148_000     0B      -   176K  -
+paul@f1:~ % doas zfs list -t snapshot -r zdata | grep zrepl
+zdata/f0/zdata/enc@zrepl_20250701_193148_000     0B      -   176K  -
+zdata/f0/zdata/enc@zrepl_20250701_194148_000     0B      -   176K  -
 ```
 
 ### Monitoring replication
@@ -358,9 +385,18 @@ paul@f0:~ % doas zrepl status --mode interactive
 paul@f0:~ % doas zrepl status --job f0_to_f1
 ```
 
-With this setup, zdata/enc on f0 will be automatically replicated to f1 every 10 minutes, with encrypted snapshots preserved on both sides. The pruning policy ensures that we keep the last 10 snapshots while managing disk space efficiently.
+With this setup, both `zdata/enc/nfsdata` and `zroot/bhyve/fedora` on f0 will be automatically replicated to f1 every 5 minutes, with encrypted snapshots preserved on both sides. The pruning policy ensures that we keep the last 10 snapshots while managing disk space efficiently.
+
+The replicated data appears on f1 under `zdata/sink/` with the source host and dataset hierarchy preserved:
+
+* `zdata/enc/nfsdata` → `zdata/sink/f0/zdata/enc/nfsdata`
+* `zroot/bhyve/fedora` → `zdata/sink/f0/zroot/bhyve/fedora`
+
+This is by design - zrepl preserves the complete path from the source to ensure there are no conflicts when replicating from multiple sources. The replication uses the WireGuard tunnel for secure, encrypted transport between nodes.
+
+### A note about the Bhyve VM replication
 
-The replicated data appears on f1 under `zdata/enc/f0/zdata/enc` to maintain the source dataset hierarchy. The replication uses the WireGuard tunnel for secure, encrypted transport between nodes.
+While replicating a Bhyve VM (Fedora in this case) is slightly off-topic for the f3s series, I've included it here as it demonstrates zrepl's flexibility. This is a development VM I use occasionally to log in remotely for certain development tasks. Having it replicated ensures I have a backup copy available on f1 if needed.
 
 ### Quick status check commands
 
@@ -373,20 +409,25 @@ State: done
 LastError: 
 
 # List all zrepl snapshots on source
-paul@f0:~ % doas zfs list -t snapshot -r zdata/enc | grep zrepl
-zdata/enc@zrepl_20250701_193148_000    0B      -   176K  -
-zdata/enc@zrepl_20250701_194148_000    0B      -   176K  -
+paul@f0:~ % doas zfs list -t snapshot | grep zrepl
+zdata/enc/nfsdata@zrepl_20250701_202530_000             0B      -   200K  -
+zroot/bhyve/fedora@zrepl_20250701_202530_000            0B      -  2.97G  -
 
 # On the sink node (f1) - verify received datasets
-paul@f1:~ % doas zfs list -r zdata/enc/f0
-NAME                     USED  AVAIL  REFER  MOUNTPOINT
-zdata/enc/f0             576K   899G   200K  none
-zdata/enc/f0/zdata       376K   899G   200K  none
-zdata/enc/f0/zdata/enc   176K   899G   176K  none
+paul@f1:~ % doas zfs list -r zdata/sink
+NAME                                   USED  AVAIL  REFER  MOUNTPOINT
+zdata/sink                             3.0G   896G   200K  /data/sink
+zdata/sink/f0                          3.0G   896G   200K  none
+zdata/sink/f0/zdata                    472K   896G   200K  none
+zdata/sink/f0/zdata/enc                272K   896G   200K  none
+zdata/sink/f0/zdata/enc/nfsdata        176K   896G   176K  none
+zdata/sink/f0/zroot                    2.9G   896G   200K  none
+zdata/sink/f0/zroot/bhyve              2.9G   896G   200K  none
+zdata/sink/f0/zroot/bhyve/fedora       2.9G   896G  2.97G  none
 
 # Check received snapshots on sink
-paul@f1:~ % doas zfs list -t snapshot -r zdata/enc | wc -l
-       2
+paul@f1:~ % doas zfs list -t snapshot -r zdata/sink | grep zrepl | wc -l
+       3
 
 # Monitor replication progress in real-time (on source)
 paul@f0:~ % doas zrepl status --mode interactive
@@ -401,10 +442,274 @@ paul@f0:~ % doas tail -20 /var/log/zrepl.log | grep -E '(error|warn|replication)
 ```
 
 These commands provide a quick way to verify that:
-- Replication jobs are running without errors
-- Snapshots are being created on the source
-- Data is being received on the sink
-- The replication schedule is being followed
+
+* Replication jobs are running without errors
+* Snapshots are being created on the source
+* Data is being received on the sink
+* The replication schedule is being followed
+
+### Verifying replication after reboot
+
+The zrepl service is configured to start automatically at boot. After rebooting both hosts:
+
+```sh
+paul@f0:~ % uptime
+11:17PM  up 1 min, 0 users, load averages: 0.16, 0.06, 0.02
+
+paul@f0:~ % doas service zrepl status
+zrepl is running as pid 2366.
+
+paul@f1:~ % doas service zrepl status
+zrepl is running as pid 2309.
+
+# Check that new snapshots are being created and replicated
+paul@f0:~ % doas zfs list -t snapshot | grep zrepl | tail -2
+zdata/enc/nfsdata@zrepl_20250701_202530_000                0B      -   200K  -
+zroot/bhyve/fedora@zrepl_20250701_202530_000               0B      -  2.97G  -
+
+paul@f1:~ % doas zfs list -t snapshot -r zdata/sink | grep 202530
+zdata/sink/f0/zdata/enc/nfsdata@zrepl_20250701_202530_000      0B      -   176K  -
+zdata/sink/f0/zroot/bhyve/fedora@zrepl_20250701_202530_000     0B      -  2.97G  -
+```
+
+The timestamps confirm that replication resumed automatically after the reboot, ensuring continuous data protection.
+
+### Important note about failover limitations
+
+The current zrepl setup provides **backup/disaster recovery** but not automatic failover. The replicated datasets on f1 are not mounted by default (`mountpoint=none`). In case f0 fails:
+
+```sh
+# Manual steps needed on f1 to activate the replicated data:
+paul@f1:~ % doas zfs set mountpoint=/data/nfs zdata/sink/f0/zdata/enc/nfsdata
+paul@f1:~ % doas zfs mount zdata/sink/f0/zdata/enc/nfsdata
+```
+
+However, this creates a **split-brain problem**: when f0 comes back online, both systems would have diverged data. Resolving this requires careful manual intervention to:
+
+1. Stop the original replication
+2. Sync changes from f1 back to f0
+3. Re-establish normal replication
+
+For true high-availability NFS, you might consider:
+
+* **Shared storage** (like iSCSI) with proper clustering
+* **GlusterFS** or similar distributed filesystems
+
+The alternative, which we have here, is **manual failover with ZFS replication**.
+
+Note: While HAST+CARP is often suggested for HA storage, it can cause filesystem corruption in practice, especially with ZFS. The block-level replication of HAST doesn't understand ZFS's transactional model, leading to inconsistent states during failover.
+
+The current zrepl setup, despite requiring manual intervention, is actually safer because:
+
+* ZFS snapshots are always consistent
+* Replication is ZFS-aware (not just block-level)
+* You have full control over the failover process
+* No automatic failover can kick in mid-write and corrupt the pool; if the two copies diverge, the next receive fails instead
+
+### Mounting the NFS datasets
+
+To make the `nfsdata` dataset accessible on both nodes, we need to mount it on each of them. On f0, this is straightforward:
+
+```sh
+# On f0 - set mountpoint for the primary nfsdata
+paul@f0:~ % doas zfs set mountpoint=/data/nfs zdata/enc/nfsdata
+paul@f0:~ % doas mkdir -p /data/nfs
+
+# Verify it's mounted
+paul@f0:~ % df -h /data/nfs
+Filesystem           Size    Used   Avail Capacity  Mounted on
+zdata/enc/nfsdata    899G    204K    899G     0%    /data/nfs
+```
+
+On f1, we need to handle the encryption key and mount the standby copy:
+
+```sh
+# On f1 - first check encryption status
+paul@f1:~ % doas zfs get keystatus zdata/sink/f0/zdata/enc/nfsdata
+NAME                             PROPERTY   VALUE        SOURCE
+zdata/sink/f0/zdata/enc/nfsdata  keystatus  unavailable  -
+
+# Load the encryption key (using f0's key stored on the USB)
+paul@f1:~ % doas zfs load-key -L file:///keys/f0.lan.buetow.org:zdata.key \
+    zdata/sink/f0/zdata/enc/nfsdata
+
+# Set mountpoint and mount (same path as f0 for easier failover)
+paul@f1:~ % doas mkdir -p /data/nfs
+paul@f1:~ % doas zfs set mountpoint=/data/nfs zdata/sink/f0/zdata/enc/nfsdata
+paul@f1:~ % doas zfs mount zdata/sink/f0/zdata/enc/nfsdata
+
+# Make it read-only to prevent accidental writes that would break replication
+paul@f1:~ % doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata
+
+# Verify
+paul@f1:~ % df -h /data/nfs
+Filesystem                         Size    Used   Avail Capacity  Mounted on
+zdata/sink/f0/zdata/enc/nfsdata    896G    204K    896G     0%    /data/nfs
+```
+
+Note: The dataset is mounted at the same path (`/data/nfs`) on both hosts to simplify failover procedures. The dataset on f1 is set to `readonly=on` to prevent accidental modifications that would break replication.
+
+**CRITICAL WARNING**: Do NOT write to `/data/nfs/` on f1! Any modification makes the replica diverge from f0 and breaks the replication. If the dataset was writable and got modified, the next incremental receive fails with this error:
+
+```
+cannot receive incremental stream: destination zdata/sink/f0/zdata/enc/nfsdata has been modified
+since most recent snapshot
+```
+
+To fix a broken replication after accidental writes:
+```sh
+# Step 1: Roll back to the most recent snapshot that also exists on f0 (loses local changes on f1)
+paul@f1:~ % doas zfs rollback zdata/sink/f0/zdata/enc/nfsdata@zrepl_20250701_204054_000
+
+# Step 2: Make it read-only so it can't happen again
+paul@f1:~ % doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata
+```
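+
+Picking the snapshot to roll back to can also be scripted. This is a minimal sketch under stated assumptions: `last_common_snapshot` is a hypothetical helper, both listings were saved to files with `zfs list -H -t snapshot -o name`, and only snapshots with the `zrepl_` prefix are considered, because their timestamped names sort chronologically.
+
+```sh
+# Hypothetical helper: print the newest zrepl snapshot name (the part after
+# the '@') that appears in both listings.
+#   $1 = file with snapshot names of the dataset on f0
+#   $2 = file with snapshot names of the replica on f1
+last_common_snapshot() {
+    sink_names=$(sed 's/^[^@]*@//' "$2")
+    sed 's/^[^@]*@//' "$1" | grep '^zrepl_' | LC_ALL=C sort |
+        while read -r name; do
+            # Keep only names that f1 has as well.
+            if printf '%s\n' "$sink_names" | grep -qxF "$name"; then
+                printf '%s\n' "$name"
+            fi
+        done | tail -1
+}
+```
+
+The printed name is the one to append after the `@` in the `zfs rollback` command. If f1 holds snapshots newer than that one, `zfs rollback` needs `-r`, which destroys those newer snapshots.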
+
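+To ensure the encryption key is loaded automatically after reboot on f1 (note that this replaces any existing `zfskeys_datasets` value, so list every dataset that should be unlocked at boot):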
+```sh
+paul@f1:~ % doas sysrc zfskeys_datasets="zdata/sink/f0/zdata/enc/nfsdata"
+```
+
+### Failback scenario: Syncing changes from f1 back to f0
+
+In a disaster recovery scenario where f0 has failed and f1 has taken over, you'll need to sync the changes back once f0 returns. Here's how to fail back:
+
+```sh
+# Stop zrepl on both hosts first - CRITICAL!
+paul@f0:~ % doas service zrepl stop
+paul@f1:~ % doas service zrepl stop
+
+# On f0: Stop any services using the dataset
+paul@f0:~ % doas service nfsd stop  # If NFS is running
+
+# On f1: Make the dataset writable (if it was readonly)
+paul@f1:~ % doas zfs set readonly=off zdata/sink/f0/zdata/enc/nfsdata
+
+# Create a snapshot of the current state
+paul@f1:~ % doas zfs snapshot zdata/sink/f0/zdata/enc/nfsdata@failback
+
+# Optional dry run: see what would be received without changing f0
+paul@f1:~ % doas zfs send -Rw zdata/sink/f0/zdata/enc/nfsdata@failback | \
+    ssh f0 "doas zfs recv -nv -F zdata/enc/nfsdata"
+
+# Send the snapshot from f1 to f0, forcing a rollback
+# (-w sends the encrypted dataset as a raw stream)
+# This WILL DESTROY any data on f0 that's not on f1!
+paul@f1:~ % doas zfs send -Rw zdata/sink/f0/zdata/enc/nfsdata@failback | \
+    ssh f0 "doas zfs recv -F zdata/enc/nfsdata"
+
+# After successful sync, on f0:
+paul@f0:~ % doas zfs destroy zdata/enc/nfsdata@failback
+
+# On f1: Make it readonly again and destroy the failback snapshot
+paul@f1:~ % doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata
+paul@f1:~ % doas zfs destroy zdata/sink/f0/zdata/enc/nfsdata@failback
+
+# Clean up any zrepl snapshots on f0
+paul@f0:~ % doas zfs list -t snapshot -r zdata/enc/nfsdata | grep zrepl | \
+    awk '{print $1}' | xargs -I {} doas zfs destroy {}
+
+# Clean up and destroy the entire replicated structure on f1
+# First release any holds
+paul@f1:~ % doas zfs holds -r zdata/sink/f0 | grep -v NAME | \
+    awk '{print $2, $1}' | while read tag snap; do 
+        doas zfs release "$tag" "$snap"
+    done
+
+# Then destroy the entire f0 tree
+paul@f1:~ % doas zfs destroy -rf zdata/sink/f0
+
+# Create parent dataset structure on f1
+paul@f1:~ % doas zfs create -p zdata/sink/f0/zdata/enc
+
+# Create a fresh manual snapshot to establish baseline
+paul@f0:~ % doas zfs snapshot zdata/enc/nfsdata@manual_baseline
+
+# Send this snapshot to f1
+paul@f0:~ % doas zfs send -w zdata/enc/nfsdata@manual_baseline | \
+    ssh f1 "doas zfs recv zdata/sink/f0/zdata/enc/nfsdata"
+
+# Clean up the manual snapshot
+paul@f0:~ % doas zfs destroy zdata/enc/nfsdata@manual_baseline
+paul@f1:~ % doas zfs destroy zdata/sink/f0/zdata/enc/nfsdata@manual_baseline
+
+# Set mountpoint and make readonly on f1
+paul@f1:~ % doas zfs set mountpoint=/data/nfs zdata/sink/f0/zdata/enc/nfsdata
+paul@f1:~ % doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata
+
+# Load encryption key and mount on f1
+paul@f1:~ % doas zfs load-key -L file:///keys/f0.lan.buetow.org:zdata.key \
+    zdata/sink/f0/zdata/enc/nfsdata
+paul@f1:~ % doas zfs mount zdata/sink/f0/zdata/enc/nfsdata
+
+# Now restart zrepl services
+paul@f0:~ % doas service zrepl start
+paul@f1:~ % doas service zrepl start
+
+# Verify replication is working
+paul@f0:~ % doas zrepl status --job f0_to_f1
+```
+
+**Important notes about failback**:
+
+* The `-F` flag forces a rollback on f0, destroying any local changes
+* Replication often won't resume automatically after a forced receive
+* You must clean up old zrepl snapshots on both sides
+* Creating a manual snapshot helps re-establish the replication relationship
+* Always verify replication status after the failback procedure
+* The first replication after failback will be a full send of the current state
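+
+Releasing holds is destructive, so it can help to review the commands before running them. The following is a minimal sketch, assuming the tab-separated output of `zfs holds -H` (snapshot name, tag, timestamp); `print_release_commands` is a hypothetical helper that only prints the `zfs release` commands instead of executing them.
+
+```sh
+# Hypothetical helper: read `zfs holds -H` style lines on stdin
+# (name<TAB>tag<TAB>timestamp) and print one release command per hold.
+print_release_commands() {
+    # Split on tabs only, because the timestamp column contains spaces.
+    while IFS="$(printf '\t')" read -r snap tag _rest; do
+        [ -n "$snap" ] && [ -n "$tag" ] || continue
+        printf 'zfs release %s %s\n' "$tag" "$snap"
+    done
+}
+```
+
+For example, `doas zfs holds -rH zdata/sink/f0 | print_release_commands` would list the commands for the replicated tree on f1. After checking the list, the commands can be run with `doas`.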
+
+### Testing the failback scenario
+
+Here's a real test of the failback procedure:
+
+```sh
+# Simulate failure: Stop replication on f0
+paul@f0:~ % doas service zrepl stop
+
+# On f1: Take over by making the dataset writable
+paul@f1:~ % doas zfs set readonly=off zdata/sink/f0/zdata/enc/nfsdata
+
+# Write some data on f1 during the "outage"
+paul@f1:~ % echo 'Data written on f1 during failover' | doas tee /data/nfs/failover-data.txt
+Data written on f1 during failover
+
+# Now perform failback when f0 comes back online
+# Create snapshot on f1
+paul@f1:~ % doas zfs snapshot zdata/sink/f0/zdata/enc/nfsdata@failback
+
+# Send data back to f0 (note: we had to send to a temporary dataset due to holds)
+paul@f1:~ % doas zfs send -Rw zdata/sink/f0/zdata/enc/nfsdata@failback | \
+    ssh f0 "doas zfs recv -F zdata/enc/nfsdata_temp"
+
+# On f0: Rename datasets to complete failback
+paul@f0:~ % doas zfs set mountpoint=none zdata/enc/nfsdata
+paul@f0:~ % doas zfs rename zdata/enc/nfsdata zdata/enc/nfsdata_old
+paul@f0:~ % doas zfs rename zdata/enc/nfsdata_temp zdata/enc/nfsdata
+
+# Load encryption key and mount
+paul@f0:~ % doas zfs load-key -L file:///keys/f0.lan.buetow.org:zdata.key zdata/enc/nfsdata
+paul@f0:~ % doas zfs mount zdata/enc/nfsdata
+
+# Verify the data from f1 is now on f0
+paul@f0:~ % ls -la /data/nfs/
+total 18
+drwxr-xr-x  2 root wheel  4 Jul  2 00:01 .
+drwxr-xr-x  4 root wheel  4 Jul  1 23:41 ..
+-rw-r--r--  1 root wheel 35 Jul  2 00:01 failover-data.txt
+-rw-r--r--  1 root wheel 12 Jul  1 23:34 hello.txt
+```
+
+Success! The failover data from f1 is now on f0. To resume normal replication, you would need to:
+
+1. Clean up old snapshots on both sides
+2. Create a new manual baseline snapshot
+3. Restart zrepl services
+
+**Key learnings from the test**:
+
+* The `-w` (raw send) flag is essential for encrypted datasets
+* Dataset holds can complicate the process (consider sending to a temporary dataset)
+* A raw receive leaves the dataset locked, so the encryption key must be loaded on f0 before the dataset can be mounted
+* Always verify data integrity before resuming normal operations
 
 ZFS auto scrubbing....~?