| author | Paul Buetow <paul@buetow.org> | 2025-07-13 16:05:45 +0300 |
|---|---|---|
| committer | Paul Buetow <paul@buetow.org> | 2025-07-13 16:05:45 +0300 |
| commit | 83614049a2ebc5ae59db268445f64df88617c0fc (patch) | |
| tree | ea33f012a35775490cae8d63a672df8046aabb75 /gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-6.gmi.tpl | |
| parent | 90056e3db5aaec38ded59df48aa68956021fe4da (diff) | |
Diffstat (limited to 'gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-6.gmi.tpl')
| -rw-r--r-- | gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-6.gmi.tpl | 1139 |
1 files changed, 154 insertions, 985 deletions
diff --git a/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-6.gmi.tpl b/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-6.gmi.tpl
index a39de653..00be8d8d 100644
--- a/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-6.gmi.tpl
+++ b/gemfeed/DRAFT-f3s-kubernetes-with-freebsd-part-6.gmi.tpl
@@ -14,7 +14,7 @@ This is the sixth blog post about the f3s series for self-hosting demands in a h
In the previous posts, we set up a FreeBSD-based Kubernetes cluster using k3s. While the base system works well, Kubernetes workloads often require persistent storage for databases, configuration files, and application data. Local storage on each node has significant limitations:
-* No data sharing: Pods on different nodes can't access the same data
+* No data sharing: Pods (once we run Kubernetes) on different nodes can't access the same data
* Pod mobility: If a pod moves to another node, it loses access to its data
* No redundancy: Hardware failure means data loss
@@ -23,10 +23,12 @@ This post implements a robust storage solution using:
* CARP: For high availability with automatic IP failover
* NFS over stunnel: For secure, encrypted network storage
* ZFS: For data integrity, encryption, and efficient snapshots
-* zrepl: For continuous ZFS replication between nodes
+* `zrepl`: For continuous ZFS replication between nodes
The end result is a highly available, encrypted storage system that survives node failures while providing shared storage to all Kubernetes pods.
+Contrary to what was mentioned in the first post of this blog series, we aren't using HAST but `zrepl` for data replication. Read more about it later in this blog post.
+
## Additional storage capacity
We add an additional 1TB of storage to each of the nodes (`f0`, `f1`, `f2`) in the form of an SSD drive. The Beelink mini PCs have enough space in the chassis for the additional drive.
@@ -139,7 +141,8 @@ After creation, these are copied to the other two nodes, `f1` and `f2`, into the
Let's encrypt our `zdata` ZFS pool.
Actually, we are not encrypting the whole pool, but everything within the `zdata/enc` data set:
```sh
-paul@f0:/keys % doas zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///keys/`hostname`:zdata.key zdata/enc
+paul@f0:/keys % doas zfs create -o encryption=on -o keyformat=raw -o \
+  keylocation=file:///keys/`hostname`:zdata.key zdata/enc
paul@f0:/ % zfs list | grep zdata
zdata 836K 899G 96K /data
zdata/enc 200K 899G 200K /data/enc
@@ -231,7 +234,7 @@ zroot/bhyve/rocky encryptionroot zroot/bhyve -
zroot/bhyve/rocky keystatus available -
```
-## ZFS Replication with zrepl
+## ZFS Replication with `zrepl`
Data replication is the cornerstone of high availability. While CARP handles IP failover (see later in this post), we need continuous data replication to ensure the backup server has current data when it becomes active. Without replication, failover would result in data loss or require shared storage (like iSCSI), which introduces a single point of failure.
@@ -239,23 +242,21 @@ Data replication is the cornerstone of high availability. While CARP handles IP
Our storage system has different replication needs:
-* NFS data (`/data/nfs/k3svolumes`): Contains active Kubernetes persistent volumes. Needs frequent replication (every minute) to minimize data loss during failover.
+* NFS data (`/data/nfs/k3svolumes`): Will soon contain the active Kubernetes persistent volumes. Needs frequent replication (every minute) to minimize data loss during failover.
* VM data (`/zroot/bhyve/fedora`): Contains VM images that change less frequently. Can tolerate longer replication intervals (every 10 minutes).
-The replication frequency determines your Recovery Point Objective (RPO) - the maximum acceptable data loss. With 1-minute replication, you lose at most 1 minute of changes during an unplanned failover.
-
-### Why `zrepl` instead of HAST?
+The 1-minute replication window is perfectly acceptable for my personal use cases.
This isn't a high-frequency trading system or a real-time database—it's storage for personal projects, development work, and home lab experiments. Losing at most 1 minute of work in a disaster scenario is a reasonable trade-off for the reliability and simplicity of snapshot-based replication. Also, in the case of "1 minute of data loss," I would very likely still have the data available on the client side.
-While HAST (Highly Available Storage) is FreeBSD's native solution for high-availability storage, I've chosen `zrepl` for several important reasons:
+Why use `zrepl` instead of HAST? While HAST (Highly Available Storage) is FreeBSD's native solution for high-availability storage and supports synchronous replication—thus eliminating the mentioned 1-minute window—I've chosen `zrepl` for several important reasons:
* HAST can cause ZFS corruption: HAST operates at the block level and doesn't understand ZFS's transactional semantics. During failover, in-flight transactions can lead to corrupted zpools. I've experienced this firsthand (though I might have configured something wrong) - the automatic failover would trigger while ZFS was still writing, resulting in an unmountable pool.
* ZFS-aware replication: `zrepl` understands ZFS datasets and snapshots. It replicates at the dataset level, ensuring each snapshot is a consistent point-in-time copy. This is fundamentally safer than block-level replication.
-* Snapshot history: With zrepl, you get multiple recovery points (every minute for NFS data in our setup). If corruption occurs, you can roll back to any previous snapshot. HAST only gives you the current state.
-* Easier recovery: When something goes wrong with zrepl, you still have intact snapshots on both sides. With HAST, a corrupted primary often means a corrupted secondary too.
+* Snapshot history: With `zrepl`, you get multiple recovery points (every minute for NFS data in our setup). If corruption occurs, you can roll back to any previous snapshot.
HAST only gives you the current state.
+* Easier recovery: When something goes wrong with `zrepl`, you still have intact snapshots on both sides. With HAST, a corrupted primary often means a corrupted secondary too.
-The 1-minute replication window is perfectly acceptable for my personal use cases. This isn't a high-frequency trading system or a real-time database—it's storage for personal projects, development work, and home lab experiments. Losing at most 1 minute of work in a disaster scenario is a reasonable trade-off for the reliability and simplicity of snapshot-based replication. Also, in the case of "1 minute of data loss," I would very likely still have the data available on the client side.
+=> https://wiki.freebsd.org/HighlyAvailableStorage FreeBSD HAST
-### Installing zrepl
+### Installing `zrepl`
First, install `zrepl` on both hosts involved (we will replicate data from `f0` to `f1`):
@@ -263,9 +264,7 @@ First, install `zrepl` on both hosts involved (we will replicate data from `f0`
```sh
paul@f0:~ % doas pkg install -y zrepl
```
-### Checking ZFS pools
-
-Verify the pools and datasets on both hosts:
+Then, we verify the pools and datasets on both hosts:
```sh
# On f0
@@ -289,9 +288,7 @@ NAME USED AVAIL REFER MOUNTPOINT
zdata/enc 200K 899G 200K /data/enc
```
-### Configuring `zrepl` with WireGuard tunnel
-
-Since we have a WireGuard tunnel between f0 and f1, we'll use TCP transport over the secure tunnel instead of SSH. First, check the WireGuard IP addresses:
+Since we have a WireGuard tunnel between `f0` and `f1`, we'll use TCP transport over the secure tunnel instead of SSH.
First, check the WireGuard IP addresses:
```sh
# Check WireGuard interface IPs
@@ -302,16 +299,14 @@ paul@f1:~ % ifconfig wg0 | grep inet
inet 192.168.2.131 netmask 0xffffff00
```
-### Configuring `zrepl` on f0 (source)
-
-First, create a dedicated dataset for NFS data that will be replicated:
+Let's create a dedicated dataset for NFS data that will be replicated:
```sh
# Create the nfsdata dataset that will hold all data exposed via NFS
paul@f0:~ % doas zfs create zdata/enc/nfsdata
```
-Afterwards, we create the `zrepl` configuration on f0:
+Afterwards, we create the `zrepl` configuration on `f0`:
```sh
paul@f0:~ % doas tee /usr/local/etc/zrepl/zrepl.yml <<'EOF'
@@ -371,7 +366,7 @@ EOF
* `f0_to_f1_nfsdata`: Replicates NFS data every minute for faster failover recovery
* `f0_to_f1_fedora`: Replicates Fedora VM every ten minutes (less critical)
-The Fedora VM is only used for development purposes, so it doesn't require as frequent replication as the NFS data. It's off-topic to this blog series, but it showcases, hows zrepl's flexibility in handling different datasets with varying replication needs.
+The Fedora VM is only used for development purposes, so it doesn't require as frequent replication as the NFS data. It's off-topic to this blog series, but it showcases `zrepl`'s flexibility in handling different datasets with varying replication needs.
Furthermore:
@@ -380,7 +375,7 @@ Furthermore:
### Configuring `zrepl` on `f1` (sink)
-On `f1` we configure `zrepl` to receive the data as follows:
+On `f1` (the sink, meaning it's the node receiving the replication data), we configure `zrepl` to receive the data as follows:
```sh
# First create a dedicated sink dataset
@@ -410,7 +405,7 @@ EOF
### Enabling and starting `zrepl` services
-Enable and start `zrepl` on both hosts:
+We then enable and start `zrepl` on both hosts via:
```sh
# On f0
@@ -426,8 +421,6 @@ paul@f1:~ % doas service zrepl start
Starting zrepl.
```
-### Verifying replication
-
To check the replication status, we run:
```sh
@@ -470,16 +463,16 @@ You can monitor the replication progress with:
paul@f0:~ % doas zrepl status
```
-TODO: Add screenshot of the above..
+=> ./f3s-kubernetes-with-freebsd-part-6/zrepl.png zrepl status
-With this setup, both `zdata/enc/nfsdata` and `zroot/bhyve/fedora` on f0 will be automatically replicated to f1 every 1 (or 10 in case of the Fedora VM) minutes, with encrypted snapshots preserved on both sides. The pruning policy ensures that we keep the last 10 snapshots while managing disk space efficiently.
+With this setup, both `zdata/enc/nfsdata` and `zroot/bhyve/fedora` on `f0` will be automatically replicated to `f1` every minute (or every 10 minutes in the case of the Fedora VM), with encrypted snapshots preserved on both sides. The pruning policy ensures that we keep the last 10 snapshots while managing disk space efficiently.
-The replicated data appears on f1 under `zdata/sink/` with the source host and dataset hierarchy preserved:
+The replicated data appears on `f1` under `zdata/sink/` with the source host and dataset hierarchy preserved:
* `zdata/enc/nfsdata` → `zdata/sink/f0/zdata/enc/nfsdata`
* `zroot/bhyve/fedora` → `zdata/sink/f0/zroot/bhyve/fedora`
-This is by design - `zrepl` preserves the complete path from the source to ensure there are no conflicts when replicating from multiple sources. The replication uses the WireGuard tunnel for secure, encrypted transport between nodes.
+This is by design - `zrepl` preserves the complete path from the source to ensure there are no conflicts when replicating from multiple sources.
### Verifying replication after reboot
@@ -511,25 +504,22 @@ zdata/sink/f0/zroot/bhyve/fedora@zrepl_20250701_202530_000 0B - 2.97G .
```
-The timestamps confirm that replication resumed automatically after the reboot, ensuring continuous data protection.
+The timestamps confirm that replication resumed automatically after the reboot, ensuring continuous data protection. We can also write a test file to the NFS data directory on `f0` and see whether it shows up on `f1` after a minute.
### Understanding Failover Limitations and Design Decisions
-#### Automatic failover only to a read-only
-
-This storage system intentionally failovers to a read-only copy of the replica in case the primary goes down. This is due to the nature that zrepl only replicates the data in one direction and if we mounted the data set on the sink node read-write, it would make the ZFS data-set diverge from the original and the replication would break. It can still be mounted read-write on the sink node in case of a real issue on the primary node, but that step is left intentional manualy. So we don't need to manually fix the replication later on.
+Our system intentionally fails over to a read-only copy of the replica if the primary goes down. This is because `zrepl` replicates the data in one direction only: if we mounted the dataset read-write on the sink node, it would diverge from the original and replication would break. The dataset can still be mounted read-write on the sink node in case of a real issue on the primary node, but that step is intentionally left manual, so we don't have to repair the replication afterwards.
So in summary:
-* Split-brain prevention: Automatic failover can cause both nodes to become active simultaneously if network communication fails. This leads to data divergence that's extremely difficult to resolve.
+* Split-brain prevention: Automatic failover to a read-write copy can cause both nodes to become active simultaneously if network communication fails. This leads to data divergence that's extremely difficult to resolve.
* False positive protection: Temporary network issues or high load can trigger unwanted failovers.
Manual intervention ensures failovers only occur when truly necessary.
-* Data integrity over availability: For storage systems, data consistency is paramount. A few minutes of downtime is preferable to data corruption or loss.
-* Simplified recovery: With manual failover, you always know which dataset is authoritative, making recovery straightforward.
-
+* Data integrity over availability: For storage systems, data consistency is paramount. A few minutes of downtime is preferable to data corruption in this specific use case.
+* Simplified recovery: With manual failover, you always know which dataset is authoritative, making recovery more straightforward.
### Mounting the NFS datasets
-To make the nfsdata accessible on both nodes, we need to mount them. On f0, this is straightforward:
+To make the NFS data accessible on both nodes, we need to mount the datasets. On `f0`, this is straightforward:
```sh
# On f0 - set mountpoint for the primary nfsdata
@@ -542,7 +532,7 @@ Filesystem Size Used Avail Capacity Mounted on
zdata/enc/nfsdata 899G 204K 899G 0% /data/nfs
```
-On f1, we need to handle the encryption key and mount the standby copy:
+On `f1`, we need to handle the encryption key and mount the standby copy:
```sh
# On f1 - first check encryption status
@@ -568,38 +558,32 @@ Filesystem Size Used Avail Capacity Mounted on
zdata/sink/f0/zdata/enc/nfsdata 896G 204K 896G 0% /data/nfs
```
-Note: The dataset is mounted at the same path (`/data/nfs`) on both hosts to simplify failover procedures. The dataset on f1 is set to `readonly=on` to prevent accidental modifications that would break replication.
+Note: The dataset is mounted at the same path (`/data/nfs`) on both hosts to simplify failover procedures. The dataset on `f1` is set to `readonly=on` to prevent accidental modifications that, as mentioned already, would break replication. If we wrote to it anyway, replication from `f0` to `f1` would fail like this:
-CRITICAL WARNING: Do NOT write to `/data/nfs/` on f1!
Any modifications will break the replication. That's why it is mounted as read-only there (I have the feeloing I mentioned this in this blog post already!)! If you accidentally write to it, you'll see this error:
+> cannot receive incremental stream: destination zdata/sink/f0/zdata/enc/nfsdata has been modified since most recent snapshot
-```
-cannot receive incremental stream: destination zdata/sink/f0/zdata/enc/nfsdata has been modified
-since most recent snapshot
-```
+To fix a broken replication after accidental writes, we can do:
-To fix a broken replication after accidental writes:
```sh
# Option 1: Rollback to the last common snapshot (loses local changes)
paul@f1:~ % doas zfs rollback zdata/sink/f0/zdata/enc/nfsdata@zrepl_20250701_204054_000
-# Option 2: Make it read-only to prevent accidents
+# Option 2: Make it read-only to prevent accidents again
paul@f1:~ % doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata
```
+And replication should work again!
-### Troubleshooting: Files not appearing in replication
+### Troubleshooting: Files not appearing in replication
-If you write files to `/data/nfs/` on f0 but they don't appear on f1, check:
-Is the dataset actually mounted on f0?
+If you write files to `/data/nfs/` on `f0` but they don't appear on `f1`, first check whether the dataset is actually mounted on `f0`:
```sh
paul@f0:~ % doas zfs list -o name,mountpoint,mounted | grep nfsdata
zdata/enc/nfsdata /data/nfs yes
```
-If it shows `no`, the dataset isn't mounted! This means files are being written to the root filesystem, not ZFS.
-
-Check if encryption key is loaded:
+If it shows `no`, the dataset isn't mounted! This means files are being written to the root filesystem, not ZFS.
Next, we should check whether the encryption key is loaded:
```sh
paul@f0:~ % doas zfs get keystatus zdata/enc/nfsdata
@@ -610,7 +594,7 @@ paul@f0:~ % doas zfs load-key -L file:///keys/f0.lan.buetow.org:zdata.key zdata/
paul@f0:~ % doas zfs mount zdata/enc/nfsdata
```
-Verify files are in the snapshot (not just the directory):
+You can also verify that files are in the snapshot (not just the directory):
```sh
paul@f0:~ % ls -la /data/nfs/.zfs/snapshot/zrepl_*/
@@ -644,53 +628,31 @@ paul@f1:~ % doas zfs set keylocation=file:///keys/f0.lan.buetow.org:zdata.key zd
Important notes:
-* Each encryption root needs its own key load entry - child datasets don't inherit key loading
-* The replicated dataset on f1 uses the same encryption key as the source on f0
+* Each encryption root needs its own key load entry
+* The replicated dataset on `f1` uses the same encryption key as the source on `f0`
* Always verify datasets are mounted after reboot with `zfs list -o name,mounted`
-* Critical: Always ensure the replicated dataset on f1 remains read-only with `doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata`
-
-### Troubleshooting: Replication broken due to modified destination
-
-If you see the error "cannot receive incremental stream: destination has been modified since most recent snapshot", it means the read-only flag was accidentally removed on f1.
To fix without a full resync: - -```sh -# Stop `zrepl` on both servers -paul@f0:~ % doas service `zrepl` stop -paul@f1:~ % doas service `zrepl` stop - -# Find the last common snapshot -paul@f0:~ % doas zfs list -t snapshot -o name,creation zdata/enc/nfsdata -paul@f1:~ % doas zfs list -t snapshot -o name,creation zdata/sink/f0/zdata/enc/nfsdata - -# Rollback f1 to the last common snapshot (example: @zrepl_20250705_000007_000) -paul@f1:~ % doas zfs rollback -r zdata/sink/f0/zdata/enc/nfsdata@zrepl_20250705_000007_000 - -# Ensure the dataset is read-only -paul@f1:~ % doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata - -# Restart zrepl -paul@f0:~ % doas service `zrepl` start -paul@f1:~ % doas service `zrepl` start -``` +* Critical: Always ensure the replicated dataset on `f1` remains read-only with `doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata` ## CARP (Common Address Redundancy Protocol) -High availability is crucial for storage systems. If the storage server goes down, all pods lose access to their persistent data. CARP provides a solution by creating a virtual IP address that automatically moves between servers during failures. +High availability is crucial for storage systems. If the storage server goes down, all NFS clients (which will also be Kubernetes pods later on in this series) lose access to their persistent data. CARP provides a solution by creating a virtual IP address that automatically moves between servers during failures. This means that clients point to that VIP for NFS mounts and are always contacting the current master node. ### How CARP Works -CARP allows two hosts to share a virtual IP address (VIP). The hosts communicate using multicast to elect a MASTER, while the other remain as BACKUP. When the MASTER fails, a BACKUP automatically promotes itself, and the VIP moves to the new MASTER. This happens within seconds. +In our case, CARP allows two hosts (`f0` and `f1`) to share a virtual IP address (VIP). 
The hosts communicate using multicast to elect a MASTER, while the other remains BACKUP. When the MASTER fails, a BACKUP automatically promotes itself, and the VIP moves to the new MASTER. This happens within seconds.
Key benefits for our storage system:
-* Automatic failover: No manual intervention is required for basic failures, although there are a few limitations. The backup will only have read-only access to the available data, as we will learn later. However, we could manually promote it to read-write if needed.
+* Automatic failover: No manual intervention is required for basic failures, although there are a few limitations. The backup will only have read-only access by default to the available data, as we learned already.
* Transparent to clients: Pods continue using the same IP address
-* Works with stunnel: Behind the VIP there will be a `stunnel` process running, which ensures encrypted connections follow the active server
-* Simple configuration
+* Works with `stunnel`: Behind the VIP there will be a `stunnel` process running, which ensures encrypted connections follow the active server.
+
+=> https://docs-archive.freebsd.org/doc/13.0-RELEASE/usr/local/share/doc/freebsd/en/books/handbook/carp.html FreeBSD CARP
+=> https://www.stunnel.org/ Stunnel
### Configuring CARP
-First, add the CARP configuration to `/etc/rc.conf` on both f0 and f1:
+First, we add the CARP configuration to `/etc/rc.conf` on both `f0` and `f1`:
```sh
# The virtual IP 192.168.1.138 will float between f0 and f1
@@ -715,8 +677,9 @@ This allows clients to connect to `f3s-storage-ha` regardless of which physical
To properly manage services during failover, we need to detect CARP state changes. FreeBSD's devd system can notify us when CARP transitions between MASTER and BACKUP states.
-Add this to `/etc/devd.conf` on both f0 and f1:
+Add this to `/etc/devd.conf` on both `f0` and `f1`:
+```sh
paul@f0:~ % cat <<END | doas tee -a /etc/devd.conf
notify 0 {
match "system" "CARP";
@@ -726,7 +689,10 @@ notify 0 {
};
END
+paul@f0:~ % doas service devd restart
+```
+
-Next, create the CARP control script that will restart stunnel when CARP state changes:
+Next, we create the CARP control script that will restart stunnel when CARP state changes:
```sh
paul@f0:~ % doas tee /usr/local/bin/carpcontrol.sh <<'EOF'
@@ -736,20 +702,9 @@ paul@f0:~ % doas tee /usr/local/bin/carpcontrol.sh <<'EOF'
case "$1" in
MASTER)
logger "CARP state changed to MASTER, starting services"
-service rpcbind start >/dev/null 2>&1
-service mountd start >/dev/null 2>&1
-service nfsd start >/dev/null 2>&1
-service nfsuserd start >/dev/null 2>&1
-service stunnel restart >/dev/null 2>&1
-logger "CARP MASTER: NFS and stunnel services started"
;;
BACKUP)
logger "CARP state changed to BACKUP, stopping services"
-service stunnel stop >/dev/null 2>&1
-service nfsd stop >/dev/null 2>&1
-service mountd stop >/dev/null 2>&1
-service nfsuserd stop >/dev/null 2>&1
-logger "CARP BACKUP: NFS and stunnel services stopped"
;;
*)
logger "CARP state changed to $1 (unhandled)"
@@ -765,9 +720,9 @@ paul@f1:~ % doas mv /tmp/carpcontrol.sh /usr/local/bin/
paul@f1:~ % doas chmod +x /usr/local/bin/carpcontrol.sh
```
-Note that we perform several tasks in the `carpcontrol.sh` script, which starts and/or stops all the services required for an NFS server running over an encrypted tunnel (via `stunnel`). We will set up all those services later in this blog post!
+Note that `carpcontrol.sh` doesn't do anything useful yet. We will fill in some more details (starting and stopping services upon failover) later in this blog post.
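To sanity-check that devd actually invokes the hook, we can force a CARP state transition by hand and watch the log. This is only a sketch: the interface `em0` and vhid `1` are assumptions here, so substitute your actual CARP interface and vhid.

```sh
# Assumption: vhid 1 is configured on em0; adjust to your setup.
# Demote this node to BACKUP (the peer should take over as MASTER):
paul@f0:~ % doas ifconfig em0 vhid 1 state backup

# devd should now have run carpcontrol.sh; its logger output lands in
# /var/log/messages:
paul@f0:~ % grep CARP /var/log/messages | tail -n 2
```

Afterwards, setting the state back to `master` (or simply letting the preferred node win the election again) returns the VIP.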
-To enable CARP in /boot/loader.conf, run: +To enable CARP in `/boot/loader.conf`, run: ```sh paul@f0:~ % echo 'carp_load="YES"' | doas tee -a /boot/loader.conf @@ -778,34 +733,11 @@ carp_load="YES" Then reboot both hosts or run `doas kldload carp` to load the module immediately. -## Future Storage Explorations - -While `zrepl` provides excellent snapshot-based replication for disaster recovery, there are other storage technologies worth exploring for the f3s project: - -### MinIO for S3-Compatible Object Storage - -MinIO is a high-performance, S3-compatible object storage system that could complement our ZFS-based storage. Some potential use cases: - -* S3 API compatibility: Many modern applications expect S3-style object storage APIs. MinIO could provide this interface while using our ZFS storage as the backend. -* Multi-site replication: MinIO supports active-active replication across multiple sites, which could work well with our f0/f1/f2 node setup. -* Kubernetes native: MinIO has excellent Kubernetes integration with operators and CSI drivers, making it ideal for the f3s k3s environment. - -### MooseFS for Distributed High Availability - -MooseFS is a fault-tolerant, distributed file system that could provide true high-availability storage: - -* True HA: Unlike our current setup which requires manual failover, MooseFS provides automatic failover with no single point of failure. -* POSIX compliance: Applications can use MooseFS like any regular filesystem, no code changes needed. -* Flexible redundancy: Configure different replication levels per directory or file, optimizing storage efficiency. -* FreeBSD support: MooseFS has native FreeBSD support, making it a natural fit for the f3s project. - -Both technologies could potentially run on top of our encrypted ZFS volumes, combining ZFS's data integrity and encryption features with distributed storage capabilities. 
This would be particularly interesting for workloads that need either S3-compatible APIs (MinIO) or transparent distributed POSIX storage (MooseFS). - ## NFS Server Configuration -With ZFS replication in place, we can now set up NFS servers on both f0 and f1 to export the replicated data. Since native NFS over TLS (RFC 9289) has compatibility issues between Linux and FreeBSD, we'll use stunnel to provide encryption. +With ZFS replication in place, we can now set up NFS servers on both `f0` and `f1` to export the replicated data. Since native NFS over TLS (RFC 9289) has compatibility issues between Linux and FreeBSD (not digging into the details here, but I couldn't get it to work), we'll use stunnel to provide encryption. -### Setting up NFS on f0 (Primary) +### Setting up NFS on `f0` (Primary) First, enable the NFS services in rc.conf: @@ -822,7 +754,7 @@ paul@f0:~ % doas sysrc rpcbind_enable=YES rpcbind_enable: NO -> YES ``` -Create a dedicated directory for Kubernetes volumes: +And we also create a dedicated directory for Kubernetes volumes: ```sh # First ensure the dataset is mounted @@ -833,11 +765,9 @@ zdata/enc/nfsdata mounted yes - # Create the k3svolumes directory paul@f0:~ % doas mkdir -p /data/nfs/k3svolumes paul@f0:~ % doas chmod 755 /data/nfs/k3svolumes - -# This directory will be replicated to f1 automatically ``` -Create the /etc/exports file. Since we're using stunnel for encryption, ALL clients must connect through stunnel, which appears as localhost (127.0.0.1) to the NFS server: +We also create the `/etc/exports` file. 
Since we're using stunnel for encryption, ALL clients must connect through stunnel, which appears as localhost (`127.0.0.1`) to the NFS server: ```sh paul@f0:~ % doas tee /etc/exports <<'EOF' @@ -849,16 +779,10 @@ EOF The exports configuration: * `V4: /data/nfs -sec=sys`: Sets the NFSv4 root directory to /data/nfs -* `/data/nfs -alldirs`: Allows mounting any subdirectory under /data/nfs -* `-maproot=root`: Maps root user from client to root on server (needed for Kubernetes and ownership changes) -* `-network 127.0.0.1`: Only accepts connections from localhost (stunnel) +* `-maproot=root`: Maps root user from client to root on server +* `-network 127.0.0.1`: Only accepts connections from localhost (`stunnel`) -Note: -* ALL clients (r0, r1, r2, laptop) must connect through stunnel for encryption -* Stunnel proxies connections through localhost, so only 127.0.0.1 needs access -* With NFSv4, clients mount using relative paths (e.g., `/k3svolumes` instead of `/data/nfs/k3svolumes`) - -Start the NFS services: +To start the NFS services, we run: ```sh paul@f0:~ % doas service rpcbind start @@ -873,17 +797,16 @@ Starting nfsuserd. ### Configuring Stunnel for NFS Encryption with CARP Failover -#### Why Not Native NFS over TLS? - -FreeBSD 13+ supports native NFS over TLS (RFC 9289), which would be the ideal solution. However, there are significant compatibility challenges: - -* Linux client support is incomplete: Most Linux distributions don't fully support NFS over TLS yet -* Certificate management differs: FreeBSD and Linux handle TLS certificates differently for NFS -* Kernel module requirements: Requires specific kernel modules that may not be available - -Stunnel provides a more compatible solution that works reliably across all operating systems while offering equivalent security. 
+Using stunnel with client certificate authentication for NFS encryption provides several advantages: -#### Stunnel Architecture with CARP +* Compatibility: Works with any NFS version and between different operating systems +* Strong encryption: Uses TLS/SSL with configurable cipher suites +* Transparent: Applications don't need modification, encryption happens at transport layer +* Performance: Minimal overhead (~2% in benchmarks) +* Flexibility: Can encrypt any TCP-based protocol, not just NFS +* Strong Authentication: Client certificates provide cryptographic proof of identity +* Access Control: Only clients with valid certificates signed by your CA can connect +* Certificate Revocation: You can revoke access by removing certificates from the CA Stunnel integrates seamlessly with our CARP setup: @@ -899,7 +822,7 @@ Stunnel integrates seamlessly with our CARP setup: The key insight is that stunnel binds to the CARP VIP. When CARP fails over, the VIP moves to the new MASTER, and stunnel starts there automatically. Clients maintain their connection to the same IP throughout. -#### Creating a Certificate Authority for Client Authentication +### Creating a Certificate Authority for Client Authentication First, create a CA to sign both server and client certificates: @@ -930,7 +853,7 @@ paul@f0:~ % doas sh -c 'for client in r0 r1 r2 earth; do done' ``` -#### Install and Configure Stunnel on f0 +### Install and Configure Stunnel on `f0` ```sh # Install stunnel @@ -965,13 +888,14 @@ Starting stunnel. 
``` The configuration includes: + * `verify = 2`: Verify client certificate and fail if not provided * `requireCert = yes`: Client must present a valid certificate * `CAfile`: Path to the CA certificate that signed the client certificates -### Setting up NFS on f1 (Standby) +### Setting up NFS on `f1` (Standby) -Repeat the same configuration on f1: +Repeat the same configuration on `f1`: ```sh paul@f1:~ % doas sysrc nfs_server_enable=YES @@ -1000,7 +924,7 @@ paul@f1:~ % doas service nfsuserd start Starting nfsuserd. ``` -Configure stunnel on f1: +And to configure stunnel on `f1`, we run: ```sh # Install stunnel @@ -1009,6 +933,7 @@ paul@f1:~ % doas pkg install -y stunnel # Copy certificates from f0 paul@f0:~ % doas tar -cf /tmp/stunnel-certs.tar -C /usr/local/etc/stunnel server-cert.pem server-key.pem ca paul@f0:~ % scp /tmp/stunnel-certs.tar f1:/tmp/ + paul@f1:~ % cd /usr/local/etc/stunnel && doas tar -xf /tmp/stunnel-certs.tar # Configure stunnel server on f1 with client certificate authentication @@ -1039,19 +964,16 @@ Stopping stunnel. Starting stunnel. ``` -### How Stunnel Works with CARP +### CARP Control Script for Clean Failover With stunnel configured to bind to the CARP VIP (192.168.1.138), only the server that is currently the CARP MASTER will accept stunnel connections. This provides automatic failover for encrypted NFS: -* When f0 is CARP MASTER: stunnel on f0 accepts connections on 192.168.1.138:2323 -* When f1 becomes CARP MASTER: stunnel on f1 starts accepting connections on 192.168.1.138:2323 +* When `f0` is CARP MASTER: stunnel on `f0` accepts connections on `192.168.1.138:2323` +* When `f1` becomes CARP MASTER: stunnel on `f1` starts accepting connections on `192.168.1.138:2323` * The backup server's stunnel process will fail to bind to the VIP and won't accept connections -This ensures that clients always connect to the active NFS server through the CARP VIP. 
- -### CARP Control Script for Clean Failover +This ensures that clients always connect to the active NFS server through the CARP VIP. To ensure clean failover behavior and prevent stale file handles, we'll update our `carpcontrol.sh` script so that: -To ensure clean failover behavior and prevent stale file handles, we'll create a control script that: * Stops NFS services on BACKUP nodes (preventing split-brain scenarios) * Starts NFS services only on the MASTER node * Manages stunnel binding to the CARP VIP @@ -1089,32 +1011,11 @@ esac EOF paul@f0:~ % doas chmod +x /usr/local/bin/carpcontrol.sh - -# Add to devd configuration -paul@f0:~ % doas tee -a /etc/devd.conf <<'EOF' - -# CARP state change notifications -notify 0 { - match "system" "CARP"; - match "subsystem" "[0-9]+@[a-z]+[0-9]+"; - match "type" "(MASTER|BACKUP)"; - action "/usr/local/bin/carpcontrol.sh $type"; -}; -EOF - -# Restart devd to apply changes -paul@f0:~ % doas service devd restart ``` -This enhanced script ensures that: -* Only the MASTER node runs NFS and stunnel services -* BACKUP nodes have all services stopped, preventing any client connections -* Failovers are clean with no possibility of accessing the wrong server -* Stale file handles are minimized because the old server immediately stops responding - ### CARP Management Script -To simplify CARP state management and failover testing, create this helper script on both f0 and f1: +To simplify CARP state management and failover testing, create this helper script on both `f0` and `f1`: ```sh # Create the CARP management script @@ -1252,24 +1153,9 @@ paul@f0:~ % doas carp auto-failback enable Auto-failback ENABLED (removed /data/nfs/nfs.NO_AUTO_FAILBACK) ``` -This enhanced script: -- Shows warnings when auto-failback is disabled -- Provides easy control over the auto-failback feature -- Makes failover testing and maintenance simpler - ### Automatic Failback After Reboot -When f0 reboots (planned or unplanned), f1 takes over as CARP MASTER. 
To ensure f0 automatically reclaims its primary role once it's fully operational, we'll implement an automatic failback mechanism.
-
-#### Why Automatic Failback?
-
-- **Primary node preference**: f0 has the primary storage; it should be MASTER when available
-- **Post-reboot automation**: Eliminates manual intervention after every f0 reboot
-- **Maintenance flexibility**: Can be disabled when you want f1 to remain MASTER
-
-#### The Auto-Failback Script
-
-Create this script on f0 only (not on f1):
+When `f0` reboots (planned or unplanned), `f1` takes over as CARP MASTER. To ensure `f0` automatically reclaims its primary role once it's fully operational, we'll implement an automatic failback mechanism with the following script:

```sh
paul@f0:~ % doas tee /usr/local/bin/carp-auto-failback.sh <<'EOF'
@@ -1297,7 +1183,7 @@ if ! mount | grep -q "on /data/nfs "; then
     exit 0
 fi
 
-# Check if marker file exists (identifies this as primary storage)
+# Check if marker file exists (identifies that the ZFS dataset is properly mounted)
 if [ ! -f "$MARKER_FILE" ]; then
     log_message "SKIP: Marker file $MARKER_FILE not found"
     exit 0
@@ -1333,135 +1219,70 @@ EOF
 paul@f0:~ % doas chmod +x /usr/local/bin/carp-auto-failback.sh
 ```
 
-#### Setting Up the Marker File
-
-The marker file identifies f0's primary storage. Create it once:
+The marker file identifies that the ZFS dataset is properly mounted. We create it with:
 
 ```sh
 paul@f0:~ % doas touch /data/nfs/nfs.DO_NOT_REMOVE
 ```
 
-This file will be replicated to f1, but since f1 mounts the dataset at a different path, it won't trigger failback there.
-
-#### Configuring Cron
-
-Add a cron job to check every minute:
+We add a cron job to check every minute:
 
 ```sh
 paul@f0:~ % echo "* * * * * /usr/local/bin/carp-auto-failback.sh" | doas crontab -
 ```
 
-#### Managing Automatic Failback
+The enhanced CARP script provides integrated control over auto-failback.
To temporarily disable automatic failback (e.g., for `f0` maintenance), we run: -The enhanced CARP script provides integrated control over auto-failback: - -**To temporarily disable automatic failback** (e.g., for f0 maintenance): ```sh paul@f0:~ % doas carp auto-failback disable Auto-failback DISABLED (created /data/nfs/nfs.NO_AUTO_FAILBACK) ``` -**To re-enable automatic failback**: +And to re-enable it: + ```sh paul@f0:~ % doas carp auto-failback enable Auto-failback ENABLED (removed /data/nfs/nfs.NO_AUTO_FAILBACK) ``` -**To check if auto-failback is enabled**: +To check whether auto-failback is enabled we run: + ```sh paul@f0:~ % doas carp CARP state on re0 (vhid 1): MASTER # If disabled, you'll see: WARNING: Auto-failback is DISABLED ``` -**To monitor failback attempts**: -```sh -paul@f0:~ % tail -f /var/log/carp-auto-failback.log -``` - -#### How It Works - -1. **After f0 reboots**: f1 is MASTER, f0 boots as BACKUP -2. **Cron runs every minute**: Checks if conditions are met -3. **Safety checks**: - - Is f0 currently BACKUP? (don't run if already MASTER) - - Is /data/nfs mounted? (ZFS datasets are ready) - - Does marker file exist? (confirms this is primary storage) - - Is failback blocked? (admin can prevent failback) - - Are NFS services running? (system is fully ready) -4. **Failback occurs**: Typically 2-3 minutes after boot completes -5. **Logging**: All attempts logged for troubleshooting - -This ensures f0 automatically resumes its role as primary storage server after any reboot, while providing administrative control when needed. - -### Verifying Stunnel and CARP Status - -First, check which host is currently CARP MASTER: +The failback attempts are logged to `/var/log/carp-auto-failback.log`! 
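Stripped of the actual system calls, the decision logic of `carp-auto-failback.sh` is just a chain of guard checks plus a timestamped log helper. Here is a simplified, self-contained sketch (the function names and the argument-based interface are mine for illustration; the real script inspects `ifconfig`, `mount`, and service status directly):

```shell
#!/bin/sh
# Simplified sketch of the auto-failback decision chain.
# Names and the argument-passing interface are illustrative only.

LOG_FILE="${LOG_FILE:-/tmp/carp-auto-failback-demo.log}"

log_message() {
    # Timestamped append, mirroring what the real script logs
    echo "$(date '+%Y-%m-%d %H:%M:%S') $1" >> "$LOG_FILE"
}

# Usage: should_failback STATE MOUNTED MARKER BLOCKED NFS
# STATE is MASTER/BACKUP; the other arguments are "yes"/"no".
should_failback() {
    [ "$1" = "BACKUP" ] || { log_message "SKIP: not in BACKUP state"; return 1; }
    [ "$2" = "yes" ]    || { log_message "SKIP: /data/nfs not mounted"; return 1; }
    [ "$3" = "yes" ]    || { log_message "SKIP: marker file not found"; return 1; }
    [ "$4" = "no" ]     || { log_message "SKIP: auto-failback disabled"; return 1; }
    [ "$5" = "yes" ]    || { log_message "SKIP: NFS services not running"; return 1; }
    log_message "OK: all checks passed, promoting to MASTER"
    return 0
}

# Dry runs with fake inputs:
should_failback BACKUP yes yes no yes && echo "would fail back"
should_failback BACKUP yes yes yes yes || echo "stays BACKUP (failback disabled)"
```

The ordering matters in the real script for the same reason it does here: checking the CARP state first means a node that is already MASTER returns immediately without logging noise for the remaining checks.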
-
-```sh
-# On f0 - check CARP status
-paul@f0:~ % ifconfig re0 | grep carp
-	inet 192.168.1.130 netmask 0xffffff00 broadcast 192.168.1.255
-	inet 192.168.1.138 netmask 0xffffffff broadcast 192.168.1.138 vhid 1
+So, in summary:
 
-# If f0 is MASTER, verify stunnel is listening on the VIP
-paul@f0:~ % doas sockstat -l | grep 2323
-stunnel stunnel 1234 3 tcp4 192.168.1.138:2323 *:*
+* After `f0` reboots: `f1` is MASTER, `f0` boots as BACKUP
+* Cron runs every minute: Checks that `f0` is currently BACKUP (don't run if already MASTER), that `/data/nfs` is mounted (ZFS datasets are ready), that the marker file exists (confirms this is primary storage), that failback isn't blocked (admin can prevent failback), and that NFS services are running (system is fully ready)
+* Failback occurs: Typically 2-3 minutes after boot completes
+* Logging: All attempts logged for troubleshooting
 
-# On f1 - check CARP status
-paul@f1:~ % ifconfig re0 | grep carp
-	inet 192.168.1.131 netmask 0xffffff00 broadcast 192.168.1.255
+This ensures `f0` automatically resumes its role as primary storage server after any reboot, while providing administrative control when needed.
 
-# If f1 is BACKUP, stunnel won't be able to bind to the VIP
-paul@f1:~ % doas tail /var/log/messages | grep stunnel
-Jul 4 12:34:56 f1 stunnel: [!] bind: 192.168.1.138:2323: Can't assign requested address (49)
-```
-
-### Verifying NFS Exports
-
-Check that the exports are active on both servers:
-
-```sh
-# On f0
-paul@f0:~ % doas showmount -e localhost
-Exports list on localhost:
-/data/nfs 127.0.0.1
-
-# On f1
-paul@f1:~ % doas showmount -e localhost
-Exports list on localhost:
-/data/nfs 127.0.0.1
-```
-
-### Client Configuration for Stunnel
+## Client Configuration for Stunnel
 
 To mount NFS shares with stunnel encryption, clients need to install and configure stunnel with their client certificates.
-#### Preparing Client Certificates
+### Configuring Rocky Linux Clients (r0, r1, r2)
 
-On f0, prepare the client certificate packages:
-
-```sh
-# Create combined certificate/key files for each client
-paul@f0:~ % cd /usr/local/etc/stunnel/ca
-paul@f0:~ % doas sh -c 'for client in r0 r1 r2 earth; do
-  cat ${client}-cert.pem ${client}-key.pem > /tmp/${client}-stunnel.pem
-done'
-```
-
-#### Configuring Rocky Linux Clients (r0, r1, r2)
+On the Rocky Linux VMs, we run:
 
 ```sh
 # Install stunnel on client (example for r0)
-[root@r0 ~]# dnf install -y stunnel
+[root@r0 ~]# dnf install -y stunnel nfs-utils
 
 # Copy client certificate and CA certificate from f0
-[root@r0 ~]# scp f0:/tmp/r0-stunnel.pem /etc/stunnel/
+[root@r0 ~]# scp f0:/usr/local/etc/stunnel/ca/r0-key.pem /etc/stunnel/
 [root@r0 ~]# scp f0:/usr/local/etc/stunnel/ca/ca-cert.pem /etc/stunnel/
 
 # Configure stunnel client with certificate authentication
 [root@r0 ~]# tee /etc/stunnel/stunnel.conf <<'EOF'
-cert = /etc/stunnel/r0-stunnel.pem
+cert = /etc/stunnel/r0-key.pem
 CAfile = /etc/stunnel/ca-cert.pem
 client = yes
 verify = 2
@@ -1477,11 +1298,11 @@ EOF
 # Repeat for r1 and r2 with their respective certificates
 ```
 
-Note: Each client must use its own certificate file (r0-stunnel.pem, r1-stunnel.pem, r2-stunnel.pem, or earth-stunnel.pem).
+Note: Each client must use its own certificate/key file (`r0-key.pem`, `r1-key.pem`, `r2-key.pem`, or `earth-key.pem` - the latter belongs to `earth`, my laptop, which can also mount the NFS shares).
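To see why only certificates signed by our CA are accepted, the trust check can be replayed offline. The following sketch creates a throwaway CA and client certificate in a temp directory and runs the same chain verification that stunnel's `verify = 2` performs on connect (all paths here are scratch files, not the real `/etc/stunnel` ones):

```shell
#!/bin/sh
# Throwaway CA + client cert, then chain verification.
tmp=$(mktemp -d)

# 1. Self-signed CA (the role of ca-cert.pem / ca-key.pem on f0)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=demo-ca" \
    -keyout "$tmp/ca-key.pem" -out "$tmp/ca-cert.pem" 2>/dev/null

# 2. Client key + signing request (what each of r0, r1, r2, earth gets)
openssl req -newkey rsa:2048 -nodes -subj "/CN=r0" \
    -keyout "$tmp/r0-key.pem" -out "$tmp/r0.csr" 2>/dev/null

# 3. The CA signs the client request
openssl x509 -req -in "$tmp/r0.csr" -CA "$tmp/ca-cert.pem" \
    -CAkey "$tmp/ca-key.pem" -CAcreateserial -days 1 \
    -out "$tmp/r0-cert.pem" 2>/dev/null

# 4. The check stunnel performs: does the cert chain to the trusted CA?
# Prints something like: /tmp/.../r0-cert.pem: OK
openssl verify -CAfile "$tmp/ca-cert.pem" "$tmp/r0-cert.pem"

# 5. A certificate NOT signed by this CA fails the same check
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=rogue" \
    -keyout "$tmp/rogue-key.pem" -out "$tmp/rogue-cert.pem" 2>/dev/null
openssl verify -CAfile "$tmp/ca-cert.pem" "$tmp/rogue-cert.pem" || true
```

This is exactly the property the setup relies on: a client's IP address or hostname alone is worthless without a key pair whose certificate chains to our CA.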
### Testing NFS Mount with Stunnel -Mount NFS through the stunnel encrypted tunnel: +To mount NFS through the stunnel encrypted tunnel, we run: ```sh # Create mount point @@ -1493,67 +1314,13 @@ Mount NFS through the stunnel encrypted tunnel: # Verify mount [root@r0 ~]# mount | grep k3svolumes 127.0.0.1:/data/nfs/k3svolumes on /data/nfs/k3svolumes type nfs4 (rw,relatime,vers=4.2,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=2323,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1) - # For persistent mount, add to /etc/fstab: 127.0.0.1:/data/nfs/k3svolumes /data/nfs/k3svolumes nfs4 port=2323,_netdev 0 0 ``` -Note: The mount uses localhost (127.0.0.1) because stunnel is listening locally and forwarding the encrypted traffic to the remote server. +Note: The mount uses localhost (`127.0.0.1`) because stunnel is listening locally and forwarding the encrypted traffic to the remote server. -Verify the file was written and replicated: - -```sh -# Check on f0 -paul@f0:~ % cat /data/nfs/test-r0.txt -Test from r0 - -# After replication interval (5 minutes), check on f1 -paul@f1:~ % cat /data/nfs/test-r0.txt -Test from r0 -``` - -### Important: Encryption Keys for Replicated Datasets - -When using encrypted ZFS datasets with raw sends (send -w), the replicated datasets on f1 need the encryption keys loaded to access the data: - -```sh -# Check encryption status on f1 -paul@f1:~ % doas zfs get keystatus zdata/sink/f0/zdata/enc/nfsdata -NAME PROPERTY VALUE SOURCE -zdata/sink/f0/zdata/enc/nfsdata keystatus unavailable - - -# Load the encryption key (uses the same key as f0) -paul@f1:~ % doas zfs load-key -L file:///keys/f0.lan.buetow.org:zdata.key zdata/sink/f0/zdata/enc/nfsdata - -# Mount the dataset -paul@f1:~ % doas zfs mount zdata/sink/f0/zdata/enc/nfsdata - -# Configure automatic key loading on boot -paul@f1:~ % doas sysrc zfskeys_datasets="zdata/enc zroot/bhyve zdata/sink/f0/zdata/enc/nfsdata" -zfskeys_datasets: -> zdata/enc 
zroot/bhyve zdata/sink/f0/zdata/enc/nfsdata -``` - -This ensures that after a reboot, f1 will automatically load the encryption keys and mount all encrypted datasets, including the replicated ones. - -### NFS Failover with CARP and Stunnel - -With NFS servers running on both f0 and f1 and stunnel bound to the CARP VIP: - -* Automatic failover: When f0 fails, CARP automatically promotes f1 to MASTER -* Stunnel failover: The carpcontrol.sh script automatically starts stunnel on the new MASTER -* Client transparency: Clients always connect to 192.168.1.138:2323, which routes to the active server -* No connection disruption: Existing NFS mounts continue working through the same VIP -* Data consistency: ZFS replication ensures f1 has recent data (within 5-minute window) -* Read-only replica: The replicated dataset on f1 is always mounted read-only to prevent breaking replication -* Manual intervention required for full RW failover: When f1 becomes MASTER, you must: - 1. Stop `zrepl` to prevent conflicts: `doas service `zrepl` stop` - 2. Make the replicated dataset writable: `doas zfs set readonly=off zdata/sink/f0/zdata/enc/nfsdata` - 3. Ensure encryption keys are loaded (should be automatic with zfskeys_enable) - 4. NFS will automatically start serving read/write requests through the VIP - -Important: The `/data/nfs` mount on f1 remains read-only during normal operation to ensure replication integrity. In case of a failover, clients can still read data immediately, but write operations require the manual steps above to promote f1 to full read-write mode. 
-
-### Testing CARP Failover
+### Testing CARP Failover with Mounted Clients and Stale File Handles

To test the failover process:

@@ -1574,27 +1341,16 @@ stunnel stunnel 4567 3 tcp4 192.168.1.138:2323 *:*
[root@r0 ~]# echo "Test after failover" > /data/nfs/k3svolumes/failover-test.txt
```

-### Handling Stale File Handles After Failover
-
-After a CARP failover, NFS clients may experience "Stale file handle" errors because they cached file handles from the previous server. To resolve this:
+After a CARP failover, NFS clients may experience "Stale file handle" errors because they cached file handles from the previous server. To resolve this manually, we can run:
 
-Manual recovery (immediate fix):
```sh
# Force unmount and remount
[root@r0 ~]# umount -f /data/nfs/k3svolumes
[root@r0 ~]# mount /data/nfs/k3svolumes
```

-Automatic recovery options:
+For the automatic recovery, we create a script:

-1. Use soft mounts with shorter timeouts in `/etc/fstab`:
-```
-127.0.0.1:/k3svolumes /data/nfs/k3svolumes nfs4 port=2323,_netdev,soft,timeo=10,retrans=2,intr 0 0
-```
-
-2. 
Create an automatic recovery system using systemd timers (checks every 10 seconds): - -First, create the monitoring script: ```sh [root@r0 ~]# cat > /usr/local/bin/check-nfs-mount.sh << 'EOF' #!/bin/bash @@ -1644,7 +1400,8 @@ EOF [root@r0 ~]# chmod +x /usr/local/bin/check-nfs-mount.sh ``` -Create the systemd service: +And we create the systemd service as follows: + ```sh [root@r0 ~]# cat > /etc/systemd/system/nfs-mount-monitor.service << 'EOF' [Unit] @@ -1659,7 +1416,8 @@ StandardError=journal EOF ``` -Create the systemd timer (runs every 10 seconds): +And we also create the systemd timer (runs every 10 seconds): + ```sh [root@r0 ~]# cat > /etc/systemd/system/nfs-mount-monitor.timer << 'EOF' [Unit] @@ -1676,7 +1434,8 @@ WantedBy=timers.target EOF ``` -Enable and start the timer: +To enable and start the timer, we run: + ```sh [root@r0 ~]# systemctl daemon-reload [root@r0 ~]# systemctl enable nfs-mount-monitor.timer @@ -1693,9 +1452,7 @@ Enable and start the timer: [root@r0 ~]# journalctl -u nfs-mount-monitor -f ``` -3. For Kubernetes, use liveness probes that restart pods when NFS becomes stale - -Note: Stale file handles are inherent to NFS failover because file handles are server-specific. The best approach depends on your application's tolerance for brief disruptions. +Note: Stale file handles are inherent to NFS failover because file handles are server-specific. The best approach depends on your application's tolerance for brief disruptions. Of course, all the changes above made on `r0` also must be applied to `r1` and `r2`! ### Complete Failover Test @@ -1726,651 +1483,63 @@ Jul 06 10:15:33 r0 nfs-monitor[1234]: NFS mount fixed at Sun Jul 6 10:15:33 EEST ``` Failover Timeline: + * 0 seconds: CARP failover triggered * 0-2 seconds: Clients get "Stale file handle" errors (not hanging) * 3-10 seconds: Soft mounts ensure quick failure of operations * Within 10 seconds: Automatic recovery via systemd timer Benefits of the Optimized Setup: -1. 
No hanging processes - Soft mounts fail quickly -2. Clean failover - Old server stops serving immediately -3. Fast automatic recovery - No manual intervention needed -4. Predictable timing - Recovery within 10 seconds with systemd timer -5. Better visibility - systemd journal provides detailed logs - -Important Considerations: -* Recent writes (within 5 minutes) may not be visible after failover due to replication lag -* Applications should handle brief NFS errors gracefully -* For zero-downtime requirements, consider synchronous replication or distributed storage - -### Verifying Replication Status - -To check if replication is working correctly: - -```sh -# Check replication status -paul@f0:~ % doas `zrepl` status - -# Check recent snapshots on source -paul@f0:~ % doas zfs list -t snapshot -o name,creation zdata/enc/nfsdata | tail -5 - -# Check recent snapshots on destination -paul@f1:~ % doas zfs list -t snapshot -o name,creation zdata/sink/f0/zdata/enc/nfsdata | tail -5 - -# Verify data appears on f1 (should be read-only) -paul@f1:~ % ls -la /data/nfs/k3svolumes/ -``` - -Important: If you see "connection refused" errors in `zrepl` logs, ensure: -* Both servers have `zrepl` running (`doas service `zrepl` status`) -* No firewall or hosts.allow rules are blocking port 8888 -* WireGuard is up if using WireGuard IPs for replication - -### Post-Reboot Verification - -After rebooting the FreeBSD servers, verify the complete stack: - -```sh -# Check CARP status on all servers -paul@f0:~ % ifconfig re0 | grep carp -paul@f1:~ % ifconfig re0 | grep carp - -# Verify stunnel is running on the MASTER -paul@f0:~ % doas sockstat -l | grep 2323 - -# Check NFS is exported -paul@f0:~ % doas showmount -e localhost - -# Verify all r servers have NFS mounted -[root@r0 ~]# mount | grep nfs -[root@r1 ~]# mount | grep nfs -[root@r2 ~]# mount | grep nfs - -# Test write access -[root@r0 ~]# echo "Test after reboot $(date)" > /data/nfs/k3svolumes/test-reboot.txt - -# Verify `zrepl` is 
running and replicating -paul@f0:~ % doas service `zrepl` status -paul@f1:~ % doas service `zrepl` status -``` - -### Integration with Kubernetes - -In your Kubernetes manifests, you can now create PersistentVolumes using the NFS servers: - -``` -apiVersion: v1 -kind: PersistentVolume -metadata: - name: nfs-pv -spec: - capacity: - storage: 100Gi - accessModes: - - ReadWriteMany - nfs: - server: 192.168.1.138 # f3s-storage-ha.lan (CARP virtual IP) - path: /data/nfs/k3svolumes - mountOptions: - - nfsvers=4 - - tcp - - hard - - intr -``` - -Using the CARP virtual IP (192.168.1.138) instead of direct server IPs ensures that Kubernetes workloads continue to access storage even if the primary NFS server fails. For encryption, configure stunnel on the Kubernetes nodes. - -### Security Benefits of Stunnel with Client Certificates - -Using stunnel with client certificate authentication for NFS encryption provides several advantages: - -* Compatibility: Works with any NFS version and between different operating systems -* Strong encryption: Uses TLS/SSL with configurable cipher suites -* Transparent: Applications don't need modification, encryption happens at transport layer -* Performance: Minimal overhead (~2% in benchmarks) -* Flexibility: Can encrypt any TCP-based protocol, not just NFS -* Strong Authentication: Client certificates provide cryptographic proof of identity -* Access Control: Only clients with valid certificates signed by your CA can connect -* Certificate Revocation: You can revoke access by removing certificates from the CA -### Laptop/Workstation Access +* No hanging processes - Soft mounts fail quickly +* Clean failover - Old server stops serving immediately +* Fast automatic recovery - No manual intervention needed +* Predictable timing - Recovery within 10 seconds with systemd timer +* Better visibility - systemd journal provides detailed logs -For development workstations like "earth" (laptop), the same stunnel configuration works, but there's an 
important caveat with NFSv4: - -```sh -# Install stunnel -sudo dnf install stunnel - -# Configure stunnel (/etc/stunnel/stunnel.conf) -cert = /etc/stunnel/earth-stunnel.pem -CAfile = /etc/stunnel/ca-cert.pem -client = yes -verify = 2 - -[nfs-ha] -accept = 127.0.0.1:2323 -connect = 192.168.1.138:2323 - -# Enable and start stunnel -sudo systemctl enable --now stunnel - -# Mount NFS through stunnel -sudo mount -t nfs4 -o port=2323 127.0.0.1:/ /data/nfs - -# Make persistent in /etc/fstab -127.0.0.1:/ /data/nfs nfs4 port=2323,hard,intr,_netdev 0 0 -``` - -#### Important: NFSv4 and Stunnel on Newer Linux Clients - -On newer Linux distributions (like Fedora 42+), NFSv4 only uses the specified port for initial mount negotiation, but then establishes data connections directly to port 2049, bypassing stunnel. This doesn't occur on Rocky Linux 9 VMs, which properly route all traffic through the specified port. - -To ensure all NFS traffic goes through the encrypted tunnel on affected systems, you need to use iptables: - -```sh -# Redirect all NFS traffic to the CARP VIP through stunnel -sudo iptables -t nat -A OUTPUT -d 192.168.1.138 -p tcp --dport 2049 -j DNAT --to-destination 127.0.0.1:2323 - -# Make it persistent (example for Fedora) -sudo dnf install iptables-services -sudo service iptables save -sudo systemctl enable iptables - -# Or create a startup script -cat > ~/setup-nfs-stunnel.sh << 'EOF' -#!/bin/bash -# Ensure NFSv4 data connections go through stunnel -sudo iptables -t nat -D OUTPUT -d 192.168.1.138 -p tcp --dport 2049 -j DNAT --to-destination 127.0.0.1:2323 2>/dev/null -sudo iptables -t nat -A OUTPUT -d 192.168.1.138 -p tcp --dport 2049 -j DNAT --to-destination 127.0.0.1:2323 -EOF -chmod +x ~/setup-nfs-stunnel.sh -``` - -To verify all traffic is encrypted: -```sh -# Check active connections -sudo ss -tnp | grep -E ":2049|:2323" -# You should see connections to localhost:2323 (stunnel), not direct to the CARP VIP - -# Monitor stunnel logs -journalctl -u stunnel 
-f -# You should see connection logs for all NFS operations -``` - -Note: The laptop has full access to `/data/nfs` with the `-alldirs` export option, while Kubernetes nodes are restricted to `/data/nfs/k3svolumes`. - -The client certificate requirement ensures that: -* Only authorized clients (r0, r1, r2, and earth) can establish stunnel connections -* Each client has a unique identity that can be individually managed -* Stolen IP addresses alone cannot grant access without the corresponding certificate -* Access can be revoked without changing the server configuration - -The combination of ZFS encryption at rest and stunnel in transit ensures data is protected throughout its lifecycle. - -This configuration provides a solid foundation for shared storage in the f3s Kubernetes cluster, with automatic replication and encrypted transport. - -## Mounting NFS on Rocky Linux 9 - -### Installing and Configuring NFS Clients on r0, r1, and r2 - -First, install the necessary packages on all three Rocky Linux nodes: - -```sh -# On r0, r1, and r2 -dnf install -y nfs-utils stunnel -``` - -### Configuring Stunnel Client on All Nodes - -Copy the certificate and configure stunnel on each Rocky Linux node: - -```sh -# On r0 -scp f0:/usr/local/etc/stunnel/stunnel.pem /etc/stunnel/ -tee /etc/stunnel/stunnel.conf <<'EOF' -cert = /etc/stunnel/stunnel.pem -client = yes - -[nfs-ha] -accept = 127.0.0.1:2323 -connect = 192.168.1.138:2323 -EOF - -systemctl enable --now stunnel - -# Repeat the same configuration on r1 and r2 -``` - -### Setting Up NFS Mounts - -Create mount points and configure persistent mounts on all nodes: - -```sh -# On r0, r1, and r2 -mkdir -p /data/nfs/k3svolumes - -# Add to /etc/fstab for persistent mount (note the NFSv4 relative path) -echo '127.0.0.1:/k3svolumes /data/nfs/k3svolumes nfs4 port=2323,hard,intr,_netdev 0 0' >> /etc/fstab - -# Mount the share -mount /data/nfs/k3svolumes -``` - -### Comprehensive NFS Mount Testing - -Here's a detailed test plan to verify 
NFS mounts are working correctly on all nodes: - -#### Test 1: Verify Mount Status on All Nodes - -```sh -# On r0 -[root@r0 ~]# mount | grep k3svolumes -# Expected output: -# 127.0.0.1:/data/nfs/k3svolumes on /data/nfs/k3svolumes type nfs4 (rw,relatime,vers=4.2,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=2323,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1) - -# On r1 -[root@r1 ~]# mount | grep k3svolumes -# Should show similar output - -# On r2 -[root@r2 ~]# mount | grep k3svolumes -# Should show similar output -``` - -#### Test 2: Verify Stunnel Connectivity - -```sh -# On r0 -[root@r0 ~]# systemctl status stunnel -# Should show: Active: active (running) - -[root@r0 ~]# ss -tnl | grep 2323 -# Should show: LISTEN 0 128 127.0.0.1:2323 0.0.0.0:* - -# Test connection to CARP VIP -[root@r0 ~]# nc -zv 192.168.1.138 2323 -# Should show: Connection to 192.168.1.138 2323 port [tcp/*] succeeded! - -# Repeat on r1 and r2 -``` - -#### Test 3: File Creation and Visibility Test - -```sh -# On r0 - Create test file -[root@r0 ~]# echo "Test from r0 - $(date)" > /data/nfs/k3svolumes/test-r0.txt -[root@r0 ~]# ls -la /data/nfs/k3svolumes/test-r0.txt -# Should show the file with timestamp - -# On r1 - Create test file and check r0's file -[root@r1 ~]# echo "Test from r1 - $(date)" > /data/nfs/k3svolumes/test-r1.txt -[root@r1 ~]# ls -la /data/nfs/k3svolumes/ -# Should show both test-r0.txt and test-r1.txt - -# On r2 - Create test file and check all files -[root@r2 ~]# echo "Test from r2 - $(date)" > /data/nfs/k3svolumes/test-r2.txt -[root@r2 ~]# ls -la /data/nfs/k3svolumes/ -# Should show all three files: test-r0.txt, test-r1.txt, test-r2.txt -``` - -#### Test 4: Verify Files on Storage Servers - -```sh -# On f0 (primary storage) -paul@f0:~ % ls -la /data/nfs/k3svolumes/ -# Should show all three test files - -# Wait 5 minutes for replication, then check on f1 -paul@f1:~ % ls -la /data/nfs/k3svolumes/ -# Should show all three test files 
(after replication) -``` - -#### Test 5: Performance and Concurrent Access Test - -```sh -# On r0 - Write large file -[root@r0 ~]# dd if=/dev/zero of=/data/nfs/k3svolumes/test-large-r0.dat bs=1M count=100 -# Should complete without errors - -# On r1 - Read the file while r2 writes -[root@r1 ~]# dd if=/data/nfs/k3svolumes/test-large-r0.dat of=/dev/null bs=1M & -# Simultaneously on r2 -[root@r2 ~]# dd if=/dev/zero of=/data/nfs/k3svolumes/test-large-r2.dat bs=1M count=100 - -# Check for any errors or performance issues -``` - -#### Test 6: Directory Operations Test - -```sh -# On r0 - Create directory structure -[root@r0 ~]# mkdir -p /data/nfs/k3svolumes/test-dir/subdir1/subdir2 -[root@r0 ~]# echo "Deep file" > /data/nfs/k3svolumes/test-dir/subdir1/subdir2/deep.txt - -# On r1 - Verify and add files -[root@r1 ~]# ls -la /data/nfs/k3svolumes/test-dir/subdir1/subdir2/ -[root@r1 ~]# echo "Another file from r1" > /data/nfs/k3svolumes/test-dir/subdir1/file-r1.txt - -# On r2 - Verify complete structure -[root@r2 ~]# find /data/nfs/k3svolumes/test-dir -type f -# Should show both files -``` - -#### Test 7: Permission and Ownership Test - -```sh -# On r0 - Create files with different permissions -[root@r0 ~]# touch /data/nfs/k3svolumes/test-perms-644.txt -[root@r0 ~]# chmod 644 /data/nfs/k3svolumes/test-perms-644.txt -[root@r0 ~]# touch /data/nfs/k3svolumes/test-perms-755.txt -[root@r0 ~]# chmod 755 /data/nfs/k3svolumes/test-perms-755.txt - -# On r1 and r2 - Verify permissions are preserved -[root@r1 ~]# ls -l /data/nfs/k3svolumes/test-perms-*.txt -[root@r2 ~]# ls -l /data/nfs/k3svolumes/test-perms-*.txt -# Permissions should match what was set on r0 -``` - -#### Test 8: Failover Test (Optional but Recommended) - -```sh -# On f0 - Trigger CARP failover -paul@f0:~ % doas ifconfig re0 vhid 1 state backup - -# On all Rocky nodes - Verify mounts still work -[root@r0 ~]# echo "Test during failover from r0 - $(date)" > /data/nfs/k3svolumes/failover-test-r0.txt -[root@r1 ~]# echo 
"Test during failover from r1 - $(date)" > /data/nfs/k3svolumes/failover-test-r1.txt -[root@r2 ~]# echo "Test during failover from r2 - $(date)" > /data/nfs/k3svolumes/failover-test-r2.txt - -# Verify all files are accessible -[root@r0 ~]# ls -la /data/nfs/k3svolumes/failover-test-*.txt - -# On f1 - Verify it's now MASTER -paul@f1:~ % ifconfig re0 | grep carp -# Should show the VIP 192.168.1.138 - -# Restore f0 as MASTER -paul@f0:~ % doas ifconfig re0 vhid 1 state master -``` - -### Troubleshooting Common Issues - -#### Mount Hangs or Times Out - -```sh -# Check stunnel connectivity -systemctl status stunnel -ss -tnl | grep 2323 -telnet 127.0.0.1 2323 - -# Check if you can reach the CARP VIP -ping 192.168.1.138 -nc -zv 192.168.1.138 2323 - -# Check for firewall issues -iptables -L -n | grep 2323 -``` - -#### Permission Denied Errors - -```sh -# Verify the export allows your IP -# On f0 or f1 -doas showmount -e localhost - -# Check if SELinux is blocking (on Rocky Linux) -getenforce -# If enforcing, try: -setenforce 0 # Temporary for testing -# Or add proper SELinux context: -setsebool -P use_nfs_home_dirs 1 -``` - -#### Files Not Visible Across Nodes - -```sh -# Force NFS cache refresh -# On the affected node -umount /data/nfs/k3svolumes -mount /data/nfs/k3svolumes - -# Check NFS version -nfsstat -m -# Should show NFSv4 -``` - -#### I/O Errors When Accessing NFS Mount - -I/O errors can have several causes: - -1. Missing localhost in exports (most common with stunnel): - - Since stunnel proxies connections, the NFS server sees requests from 127.0.0.1 - - Ensure your exports include localhost access: - ``` - /data/nfs/k3svolumes -maproot=root -network 127.0.0.1 -mask 255.255.255.255 - ``` - -2. 
Stunnel connection issues or CARP failover:
-
-```sh
-# On the affected node (e.g., r0)
-# Check stunnel is running
-systemctl status stunnel
-
-# Restart stunnel to re-establish connection
-systemctl restart stunnel
-
-# Force remount
-umount -f -l /data/nfs/k3svolumes
-mount -t nfs4 -o port=2323,soft 127.0.0.1:/data/nfs/k3svolumes /data/nfs/k3svolumes
-
-# Check which FreeBSD host is CARP MASTER
-# On f0
-ssh f0 "ifconfig re0 | grep carp"
-# On f1
-ssh f1 "ifconfig re0 | grep carp"
-
-# Verify stunnel on MASTER is bound to VIP
-# On the MASTER host
-ssh <master-host> "sockstat -l | grep 2323"
-
-# Debug stunnel connection
-openssl s_client -connect 192.168.1.138:2323 </dev/null
-
-# If I/O errors persist, check the logs
-journalctl -u stunnel -n 50
-dmesg | tail -20 | grep -i nfs
-```
-
-### Comprehensive Production Test Results
-
-After implementing all the improvements (enhanced CARP control script, soft mounts, and automatic recovery), here's a complete test of the setup including reboots and failovers:
-
-#### Test Scenario: Full System Reboot and Failover
-
-```
-1. Initial state: Rebooted all servers (f0, f1, f2)
-   - Result: f1 became CARP MASTER after reboot (not always f0)
-   - NFS accessible and writable from all clients
-
-2. Created test file from laptop:
-   paul@earth:~ % echo "Post-reboot test at $(date)" > /data/nfs/k3svolumes/reboot-test.txt
-
-3. Verified 1-minute replication to f1:
-   - File appeared on f1 within 70 seconds
-   - Content identical on both servers
-
-4. Performed failover from f0 to f1:
-   paul@f0:~ % doas ifconfig re0 vhid 1 state backup
-   - f1 immediately became MASTER
-   - Clients experienced "Stale file handle" errors
-   - With soft mounts: No hanging, immediate error response
-
-5. Recovery time:
-   - Manual recovery: Immediate with umount/mount
-   - Automatic recovery: Within 10 seconds via systemd timer
-   - No data loss during failover
-
-6. Failback to f0:
-   paul@f1:~ % doas ifconfig re0 vhid 1 state backup
-   - f0 reclaimed MASTER status
-   - Similar stale handle behavior
-   - Recovery within 10 seconds
-```
-
-#### Key Findings
-
-1. CARP Master Selection: After reboot, either f0 or f1 can become MASTER. This is normal CARP behavior and doesn't affect functionality.
-
-2. Stale File Handles: Despite all optimizations, NFS clients still experience stale file handles during failover. This is inherent to NFS protocol design. However:
-   - Soft mounts prevent hanging
-   - Automatic recovery works reliably
-   - No data loss occurs
-
-3. Replication Timing: The 1-minute replication interval for NFS data ensures a minimal data loss window during unplanned failovers. The Fedora VM replication runs every 10 minutes, which is sufficient for less critical VM data.
-
-4. Service Management: The enhanced carpcontrol.sh script successfully stops services on BACKUP nodes, preventing split-brain scenarios.
-
-## Performance Considerations
-
-### Encryption Overhead
-
-Stunnel adds CPU overhead for TLS encryption/decryption. On modern hardware, the impact is minimal:
-
-* Beelink Mini PCs: With hardware AES acceleration, expect 5-10% CPU overhead
-* Network throughput: Gigabit Ethernet is usually the bottleneck, not TLS
-* Latency: Adds <1ms in LAN environments
-
-For reference, with AES-256-GCM on a typical mini PC:
-* Sequential reads: ~110 MB/s (near line-speed for gigabit)
-* Sequential writes: ~105 MB/s
-* Random 4K IOPS: ~15% reduction compared to unencrypted
-
-### Replication Bandwidth
-
-ZFS replication with `zrepl` is efficient, only sending changed blocks:
-
-* Initial sync: Full dataset size (can be large)
-* Incremental: Typically <1% of dataset size per snapshot
-* Network usage: With 1-minute intervals and moderate changes, expect 10-50 MB/minute
-
-To monitor replication bandwidth:
-```sh
-# On f0, check network usage on WireGuard interface
-doas systat -ifstat 1
-# Look for wg0 traffic during replication
-```
-
-### NFS Tuning
-
-For optimal performance with Kubernetes workloads:
-
-```sh
-# On NFS server (f0/f1) - /etc/sysctl.conf
-vfs.nfsd.async=1           # Enable async writes (careful with data integrity)
-vfs.nfsd.cachetcp=1        # Cache TCP connections
-vfs.nfsd.tcphighwater=64   # Increase TCP connection limit
-
-# On NFS clients - mount options
-rsize=131072,wsize=131072  # Larger read/write buffers
-hard,intr                  # Hard mount with interruption (this setup uses soft mounts instead for faster failover recovery)
-vers=4.2                   # Use latest NFSv4.2 for best performance
-```
-
-### ZFS Tuning
-
-Key ZFS settings for NFS storage:
-
-```sh
-# Set on the NFS dataset
-zfs set compression=lz4 zdata/enc/nfsdata          # Fast compression
-zfs set atime=off zdata/enc/nfsdata                # Disable access time updates
-zfs set redundant_metadata=most zdata/enc/nfsdata  # Protect metadata
-```
-
-### Monitoring
-
-Monitor system performance to identify bottlenecks:
-
-```sh
-# CPU and memory
-doas top -P
-
-# Disk I/O
-doas gstat -p
-
-# Network traffic
-doas netstat -w 1 -h
-
-# ZFS statistics
-doas zpool iostat -v 1
-
-# NFS statistics
-doas nfsstat -s -w 1
-```
-
-### Cleanup After Testing
+Important Considerations:
-
-```sh
-# Remove test files (run on any node)
-rm -f /data/nfs/k3svolumes/test-*.txt
-rm -f /data/nfs/k3svolumes/test-large-*.dat
-rm -f /data/nfs/k3svolumes/failover-test-*.txt
-rm -f /data/nfs/k3svolumes/test-perms-*.txt
-rm -rf /data/nfs/k3svolumes/test-dir
-```
-
-This comprehensive testing ensures that:
-* All nodes can mount the NFS share
-* Files created on one node are visible on all others
-* The encrypted stunnel connection is working
-* Permissions and ownership are preserved
-* The setup can handle concurrent access
-* Failover works correctly (if tested)
+* Recent writes (within 1 minute) may not be visible after failover due to replication lag
+* Applications should handle brief NFS errors gracefully
+* For zero-downtime requirements, consider synchronous replication or distributed storage (see "Future storage explorations" section later in this blog post)

## Conclusion

We've built a robust, encrypted storage system for our FreeBSD-based Kubernetes cluster that provides:

-### What We Achieved
-
* High Availability: CARP ensures the storage VIP moves automatically during failures
* Data Protection: ZFS encryption protects data at rest, stunnel protects data in transit
-* Continuous Replication: 1-minute RPO for critical data, automated via zrepl
+* Continuous Replication: 1-minute RPO for the NFS data, automated via `zrepl`
* Secure Access: Client certificate authentication prevents unauthorized access
-* Kubernetes Integration: Shared storage accessible from all cluster nodes

-### Architecture Benefits
+Some key lessons learned are:
+
+* Stunnel vs Native NFS/TLS: While native encryption would be ideal, stunnel provides better cross-platform compatibility
+* Manual vs Automatic Failover: For storage systems, controlled failover often prevents more problems than it causes
+* Client Compatibility: Different NFS implementations behave differently - test thoroughly
+
+## Future storage explorations

-This design prioritizes data integrity over pure availability:
-* Manual failover prevents split-brain scenarios
-* Certificate-based authentication provides strong security
-* Encrypted replication protects data even over untrusted networks
-* ZFS snapshots enable point-in-time recovery
+While `zrepl` provides excellent snapshot-based replication for disaster recovery, there are other storage technologies worth exploring for the f3s project:

-### Lessons Learned
+### MinIO for S3-compatible object storage

-1. Stunnel vs Native NFS/TLS: While native encryption would be ideal, stunnel provides better cross-platform compatibility
-2. Manual vs Automatic Failover: For storage systems, controlled failover often prevents more problems than it causes
-3. Replication Frequency: Balance between data protection (RPO) and system load
-4. Client Compatibility: Different NFS implementations behave differently - test thoroughly
+MinIO is a high-performance, S3-compatible object storage system that could complement our ZFS-based storage. Some potential use cases:

-### Next Steps
+* S3 API compatibility: Many modern applications expect S3-style object storage APIs. MinIO could provide this interface while using our ZFS storage as the backend.
+* Multi-site replication: MinIO supports active-active replication across multiple sites, which could work well with our f0/f1/f2 node setup.
+* Kubernetes native: MinIO has excellent Kubernetes integration with operators and CSI drivers, making it ideal for the f3s k3s environment.

-With reliable storage in place, we can now:
-* Deploy stateful applications on Kubernetes
-* Set up databases with persistent volumes
-* Create shared configuration stores
-* Implement backup strategies using ZFS snapshots
+### MooseFS for distributed high availability

-The storage layer is the foundation for any serious Kubernetes deployment. By building it on FreeBSD with ZFS, CARP, and stunnel, we get enterprise-grade features on commodity hardware.
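Since the clients use soft NFS mounts, a failover shows up as brief I/O errors rather than hangs, so scripts writing to the share can simply retry. The following sketch is illustrative only; the `nfs_retry` helper name, retry count, and delay are not part of the actual setup:

```sh
#!/bin/sh
# Retry a command that may fail transiently while the NFS VIP fails over.
# Assumes soft mounts, so I/O returns an error (e.g. EIO or ESTALE)
# instead of hanging indefinitely.
nfs_retry() {
    _tries=0
    until "$@"; do
        _tries=$((_tries + 1))
        if [ "$_tries" -ge 5 ]; then
            echo "nfs_retry: giving up after $_tries attempts: $*" >&2
            return 1
        fi
        sleep 2  # give CARP and the automatic remount time to settle
    done
}

# Example usage (path from the setup above):
# nfs_retry cp /tmp/app.conf /data/nfs/k3svolumes/
```

A cron job or systemd unit copying data to the share could use such a wrapper to ride out the short failover window instead of failing outright.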
+MooseFS is a fault-tolerant, distributed file system that could provide true high-availability storage:
-
+* True HA: Unlike our current setup, which requires manual failover, MooseFS provides automatic failover with no single point of failure.
+* POSIX compliance: Applications can use MooseFS like any regular filesystem, no code changes needed.
+* Flexible redundancy: Configure different replication levels per directory or file, optimizing storage efficiency.
+* FreeBSD support: MooseFS has native FreeBSD support, making it a natural fit for the f3s project.

-### References
-
-* FreeBSD CARP documentation: https://docs.freebsd.org/en/books/handbook/advanced-networking/#carp
-* ZFS encryption guide: https://docs.freebsd.org/en/books/handbook/zfs/#zfs-encryption
-* Stunnel documentation: https://www.stunnel.org/docs.html
-* `zrepl` documentation: https://zrepl.github.io/

+Both technologies could potentially run on top of our encrypted ZFS volumes, combining ZFS's data integrity and encryption features with distributed storage capabilities. This would be particularly interesting for workloads that need either S3-compatible APIs (MinIO) or transparent distributed POSIX storage (MooseFS).

Other *BSD-related posts:
