# f3s: Kubernetes with FreeBSD - Part 6: Storage
> Published at 2025-04-04T23:21:01+03:00
This is the sixth blog post about the f3s series for self-hosting demands in a home lab. f3s? The "f" stands for FreeBSD, and the "3s" stands for k3s, the Kubernetes distribution used on FreeBSD-based physical machines.
<< template::inline::index f3s-kubernetes-with-freebsd-part
=> ./f3s-kubernetes-with-freebsd-part-1/f3slogo.png f3s logo
<< template::inline::toc
## Introduction
In this blog post, we are going to extend the Beelink mini PCs with some additional storage.
Each node receives a dedicated data SSD. Note that two different SSD models are in use (a Samsung 870 EVO in f0 and a Crucial BX500 in f1): since the data is replicated between the nodes, mixing models reduces the chance that the same firmware bug or wear-out pattern affects both copies at the same time.
## ZFS encryption keys
### UFS on USB keys
Each node has a small USB flash drive (the `Generic Flash Disk`, showing up as `da0`) that will hold the ZFS encryption keys. First, identify the devices:
```
paul@f0:/ % doas camcontrol devlist
<512GB SSD D910R170> at scbus0 target 0 lun 0 (pass0,ada0)
<Samsung SSD 870 EVO 1TB SVT03B6Q> at scbus1 target 0 lun 0 (pass1,ada1)
<Generic Flash Disk 8.07> at scbus2 target 0 lun 0 (da0,pass2)
paul@f0:/ %
```
```
paul@f1:/ % doas camcontrol devlist
<512GB SSD D910R170> at scbus0 target 0 lun 0 (pass0,ada0)
<CT1000BX500SSD1 M6CR072> at scbus1 target 0 lun 0 (pass1,ada1)
<Generic Flash Disk 8.07> at scbus2 target 0 lun 0 (da0,pass2)
paul@f1:/ %
```
```sh
paul@f0:/ % doas newfs /dev/da0
/dev/da0: 15000.0MB (30720000 sectors) block size 32768, fragment size 4096
using 24 cylinder groups of 625.22MB, 20007 blks, 80128 inodes.
with soft updates
super-block backups (for fsck_ffs -b #) at:
192, 1280640, 2561088, 3841536, 5121984, 6402432, 7682880, 8963328, 10243776,
11524224, 12804672, 14085120, 15365568, 16646016, 17926464, 19206912, 20487360,
...
paul@f0:/ % echo '/dev/da0 /keys ufs rw 0 2' | doas tee -a /etc/fstab
/dev/da0 /keys ufs rw 0 2
paul@f0:/ % doas mkdir /keys
paul@f0:/ % doas mount /keys
paul@f0:/ % df | grep keys
/dev/da0 14877596 8 13687384 0% /keys
```
### Generating encryption keys
```sh
paul@f0:/keys % doas openssl rand -out /keys/f0.lan.buetow.org:bhyve.key 32
paul@f0:/keys % doas openssl rand -out /keys/f1.lan.buetow.org:bhyve.key 32
paul@f0:/keys % doas openssl rand -out /keys/f2.lan.buetow.org:bhyve.key 32
paul@f0:/keys % doas openssl rand -out /keys/f0.lan.buetow.org:zdata.key 32
paul@f0:/keys % doas openssl rand -out /keys/f1.lan.buetow.org:zdata.key 32
paul@f0:/keys % doas openssl rand -out /keys/f2.lan.buetow.org:zdata.key 32
paul@f0:/keys % doas chown root *
paul@f0:/keys % doas chmod 400 *
paul@f0:/keys % ls -l
total 20
-r-------- 1 root wheel 32 May 25 13:07 f0.lan.buetow.org:bhyve.key
-r-------- 1 root wheel 32 May 25 13:07 f1.lan.buetow.org:bhyve.key
-r-------- 1 root wheel 32 May 25 13:07 f2.lan.buetow.org:bhyve.key
-r-------- 1 root wheel 32 May 25 13:07 f0.lan.buetow.org:zdata.key
-r-------- 1 root wheel 32 May 25 13:07 f1.lan.buetow.org:zdata.key
-r-------- 1 root wheel 32 May 25 13:07 f2.lan.buetow.org:zdata.key
```
Copy these key files to `/keys` on all three nodes.
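Since `keyformat=raw` expects exactly 32 bytes of key material, it is worth sanity-checking a key file's size before depending on it. A minimal, self-contained sketch (it uses a throwaway file from `/dev/urandom` rather than touching the real keys in `/keys`):

```sh
#!/bin/sh
# Generate a throwaway 32-byte raw key and verify its size, mirroring
# what `zfs load-key` with keyformat=raw expects of the real key files.
key=$(mktemp)
dd if=/dev/urandom of="$key" bs=32 count=1 2>/dev/null
size=$(wc -c < "$key" | tr -d ' ')
if [ "$size" -eq 32 ]; then
    echo "OK: $size bytes"
else
    echo "BAD: $size bytes"
fi
rm -f "$key"
```

Running the same `wc -c` check against each file in `/keys` is a quick way to catch a truncated or empty key before a reboot depends on it.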
### Configuring `zdata` ZFS pool and encryption
```sh
paul@f0:/keys % doas zpool create -m /data zdata /dev/ada1
paul@f0:/keys % doas zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///keys/`hostname`:zdata.key zdata/enc
paul@f0:/ % zfs list | grep zdata
zdata 836K 899G 96K /data
zdata/enc 200K 899G 200K /data/enc
paul@f0:/keys % zfs get all zdata/enc | grep -E -i '(encryption|key)'
zdata/enc encryption aes-256-gcm -
zdata/enc keylocation file:///keys/f0.lan.buetow.org:zdata.key local
zdata/enc keyformat raw -
zdata/enc encryptionroot zdata/enc -
zdata/enc keystatus available -
```
### Migrating Bhyve VMs to encrypted `bhyve` ZFS volume
Run the following on all three nodes:
```sh
paul@f0:/keys % doas vm stop rocky
Sending ACPI shutdown to rocky
paul@f0:/keys % doas vm list
NAME DATASTORE LOADER CPU MEMORY VNC AUTO STATE
rocky default uefi 4 14G - Yes [1] Stopped
paul@f0:/keys % doas zfs rename zroot/bhyve zroot/bhyve_old
paul@f0:/keys % doas zfs set mountpoint=/mnt zroot/bhyve_old
paul@f0:/keys % doas zfs snapshot zroot/bhyve_old/rocky@hamburger
paul@f0:/keys % doas zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///keys/`hostname`:bhyve.key zroot/bhyve
paul@f0:/keys % doas zfs set mountpoint=/zroot/bhyve zroot/bhyve
paul@f0:/keys % doas zfs set mountpoint=/zroot/bhyve/rocky zroot/bhyve/rocky
paul@f0:/keys % doas zfs send zroot/bhyve_old/rocky@hamburger | doas zfs recv zroot/bhyve/rocky
paul@f0:/keys % doas cp -Rp /mnt/.config /zroot/bhyve/
paul@f0:/keys % doas cp -Rp /mnt/.img /zroot/bhyve/
paul@f0:/keys % doas cp -Rp /mnt/.templates /zroot/bhyve/
paul@f0:/keys % doas cp -Rp /mnt/.iso /zroot/bhyve/
paul@f0:/keys % doas sysrc zfskeys_enable=YES
zfskeys_enable: -> YES
paul@f0:/keys % doas vm init
paul@f0:/keys % doas reboot
.
.
.
paul@f0:~ % doas vm list
paul@f0:~ % doas vm list
NAME DATASTORE LOADER CPU MEMORY VNC AUTO STATE
rocky default uefi 4 14G 0.0.0.0:5900 Yes [1] Running (2265)
```
```sh
paul@f0:~ % doas zfs destroy -R zroot/bhyve_old
paul@f0:~ % zfs get all zroot/bhyve | grep -E '(encryption|key)'
zroot/bhyve encryption aes-256-gcm -
zroot/bhyve keylocation file:///keys/f0.lan.buetow.org:bhyve.key local
zroot/bhyve keyformat raw -
zroot/bhyve encryptionroot zroot/bhyve -
zroot/bhyve keystatus available -
paul@f0:~ % zfs get all zroot/bhyve/rocky | grep -E '(encryption|key)'
zroot/bhyve/rocky encryption aes-256-gcm -
zroot/bhyve/rocky keylocation none default
zroot/bhyve/rocky keyformat raw -
zroot/bhyve/rocky encryptionroot zroot/bhyve -
zroot/bhyve/rocky keystatus available -
```
## CARP
Add the following to `/etc/rc.conf` on f0 and f1:
```
ifconfig_re0_alias0="inet vhid 1 pass testpass alias 192.168.1.138/32"
```
Add the following to `/etc/hosts` (on n0, n1, n2, r0, r1, r2):
```
192.168.1.138 f3s-storage-ha f3s-storage-ha.lan f3s-storage-ha.lan.buetow.org
192.168.2.138 f3s-storage-ha f3s-storage-ha.wg0 f3s-storage-ha.wg0.wan.buetow.org
```
Add the following hook to `/etc/devd.conf` on f0 and f1:
```sh
paul@f0:~ % cat <<END | doas tee -a /etc/devd.conf
notify 0 {
    match "system" "CARP";
    match "subsystem" "[0-9]+@[0-9a-z.]+";
    match "type" "(MASTER|BACKUP)";
    action "/usr/local/bin/carpcontrol.sh $subsystem $type";
};
END
```
Next, create the CARP control script that will restart stunnel when CARP state changes:
```sh
paul@f0:~ % doas tee /usr/local/bin/carpcontrol.sh <<'EOF'
#!/bin/sh
# CARP state change handler for storage failover
subsystem=$1
state=$2
logger "CARP state change: $subsystem is now $state"
case "$state" in
MASTER)
# Restart stunnel to bind to the VIP
service stunnel restart
logger "Restarted stunnel for MASTER state"
;;
BACKUP)
# Stop stunnel since we can't bind to VIP as BACKUP
service stunnel stop
logger "Stopped stunnel for BACKUP state"
;;
esac
EOF
paul@f0:~ % doas chmod +x /usr/local/bin/carpcontrol.sh
# Copy the same script to f1
paul@f0:~ % scp /usr/local/bin/carpcontrol.sh f1:/tmp/
paul@f1:~ % doas mv /tmp/carpcontrol.sh /usr/local/bin/
paul@f1:~ % doas chmod +x /usr/local/bin/carpcontrol.sh
```
Enable CARP in /boot/loader.conf:
```sh
paul@f0:~ % echo 'carp_load="YES"' | doas tee -a /boot/loader.conf
carp_load="YES"
paul@f1:~ % echo 'carp_load="YES"' | doas tee -a /boot/loader.conf
carp_load="YES"
```
Then reboot both hosts or run `doas kldload carp` to load the module immediately.
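The `$type` argument that devd hands to `carpcontrol.sh` is the plain CARP state string (`MASTER` or `BACKUP`). For ad-hoc checks outside of devd, the same state can be read from `ifconfig` output. A small sketch, shown here with a canned sample line since the exact interface output depends on your hardware:

```sh
#!/bin/sh
# Canned sample of a CARP status line as printed by ifconfig on FreeBSD;
# on a live host you would pipe the real output instead:
#   ifconfig re0 | awk '/carp:/ {print $2}'
sample='	carp: MASTER vhid 1 advbase 1 advskew 0'
state=$(echo "$sample" | awk '/carp:/ {print $2}')
echo "$state"
# -> MASTER
```

This is handy for verifying which node currently holds the 192.168.1.138 VIP without waiting for a devd event.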
## ZFS Replication with zrepl
In this section, we'll set up automatic ZFS replication from f0 to f1 using zrepl. This ensures our data is replicated across nodes for redundancy.
### Why zrepl instead of HAST?
While HAST (Highly Available Storage) is FreeBSD's native solution for high-availability storage, I've chosen zrepl for several important reasons:
1. **HAST can cause ZFS corruption**: HAST operates at the block level and doesn't understand ZFS's transactional semantics. During failover, in-flight transactions can lead to corrupted zpools. I've experienced this firsthand - the automatic failover would trigger while ZFS was still writing, resulting in an unmountable pool.
2. **ZFS-aware replication**: zrepl understands ZFS datasets and snapshots. It replicates at the dataset level, ensuring each snapshot is a consistent point-in-time copy. This is fundamentally safer than block-level replication.
3. **Snapshot history**: With zrepl, you get multiple recovery points (every minute for NFS data in our setup). If corruption occurs, you can roll back to any previous snapshot. HAST only gives you the current state.
4. **Easier recovery**: When something goes wrong with zrepl, you still have intact snapshots on both sides. With HAST, a corrupted primary often means a corrupted secondary too.
5. **Network flexibility**: zrepl works over any TCP connection (in our case, WireGuard), while HAST requires dedicated network configuration.
A replication delay of at most a few minutes (one minute for NFS data and ten minutes for the VM in this setup) is perfectly acceptable for my personal use cases. This isn't a high-frequency trading system or a real-time database - it's storage for personal projects, development work, and home lab experiments. Losing a few minutes of work in a disaster scenario is a reasonable trade-off for the reliability and simplicity of snapshot-based replication.
### Installing zrepl
First, install zrepl on both hosts:
```
# On f0
paul@f0:~ % doas pkg install -y zrepl
# On f1
paul@f1:~ % doas pkg install -y zrepl
```
### Checking ZFS pools
Verify the pools and datasets on both hosts:
```sh
# On f0
paul@f0:~ % doas zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
zdata 928G 1.03M 928G - - 0% 0% 1.00x ONLINE -
zroot 472G 26.7G 445G - - 0% 5% 1.00x ONLINE -
paul@f0:~ % doas zfs list -r zdata/enc
NAME USED AVAIL REFER MOUNTPOINT
zdata/enc 200K 899G 200K /data/enc
# On f1
paul@f1:~ % doas zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
zdata 928G 956K 928G - - 0% 0% 1.00x ONLINE -
zroot 472G 11.7G 460G - - 0% 2% 1.00x ONLINE -
paul@f1:~ % doas zfs list -r zdata/enc
NAME USED AVAIL REFER MOUNTPOINT
zdata/enc 200K 899G 200K /data/enc
```
### Configuring zrepl with WireGuard tunnel
Since we have a WireGuard tunnel between f0 and f1, we'll use TCP transport over the secure tunnel instead of SSH. First, check the WireGuard IP addresses:
```sh
# Check WireGuard interface IPs
paul@f0:~ % ifconfig wg0 | grep inet
inet 192.168.2.130 netmask 0xffffff00
paul@f1:~ % ifconfig wg0 | grep inet
inet 192.168.2.131 netmask 0xffffff00
```
### Configuring zrepl on f0 (source)
First, create a dedicated dataset for NFS data that will be replicated:
```sh
# Create the nfsdata dataset that will hold all data exposed via NFS
paul@f0:~ % doas zfs create zdata/enc/nfsdata
```
Create the zrepl configuration on f0:
```sh
paul@f0:~ % doas tee /usr/local/etc/zrepl/zrepl.yml <<'EOF'
global:
logging:
- type: stdout
level: info
format: human
jobs:
- name: f0_to_f1_nfsdata
type: push
connect:
type: tcp
address: "192.168.2.131:8888"
filesystems:
"zdata/enc/nfsdata": true
send:
encrypted: true
snapshotting:
type: periodic
prefix: zrepl_
interval: 1m
pruning:
keep_sender:
- type: last_n
count: 10
keep_receiver:
- type: last_n
count: 10
- name: f0_to_f1_fedora
type: push
connect:
type: tcp
address: "192.168.2.131:8888"
filesystems:
"zroot/bhyve/fedora": true
send:
encrypted: true
snapshotting:
type: periodic
prefix: zrepl_
interval: 10m
pruning:
keep_sender:
- type: last_n
count: 10
keep_receiver:
- type: last_n
count: 10
EOF
```
Key configuration notes:
* We're using two separate replication jobs with different intervals:
- `f0_to_f1_nfsdata`: Replicates NFS data every minute for faster failover recovery
- `f0_to_f1_fedora`: Replicates Fedora VM every 10 minutes (less critical for NFS operations)
* We're specifically replicating `zdata/enc/nfsdata` instead of the entire `zdata/enc` dataset. This dedicated dataset will contain all the data we later want to expose via NFS, keeping a clear separation between replicated NFS data and other local encrypted data.
* The `send: encrypted: true` option uses ZFS native encryption for the replication stream. While this adds CPU overhead, it ensures the data remains encrypted in transit. Since we're already using a WireGuard tunnel, you could optionally remove this for better performance if your security requirements allow.
### Configuring zrepl on f1 (sink)
Create the zrepl configuration on f1:
```sh
# First create a dedicated sink dataset
paul@f1:~ % doas zfs create zdata/sink
paul@f1:~ % doas tee /usr/local/etc/zrepl/zrepl.yml <<'EOF'
global:
logging:
- type: stdout
level: info
format: human
jobs:
- name: "sink"
type: sink
serve:
type: tcp
listen: "192.168.2.131:8888"
clients:
"192.168.2.130": "f0"
recv:
placeholder:
encryption: inherit
root_fs: "zdata/sink"
EOF
```
### Enabling and starting zrepl services
Enable and start zrepl on both hosts:
```sh
# On f0
paul@f0:~ % doas sysrc zrepl_enable=YES
zrepl_enable: -> YES
paul@f0:~ % doas service zrepl start
Starting zrepl.
# On f1
paul@f1:~ % doas sysrc zrepl_enable=YES
zrepl_enable: -> YES
paul@f1:~ % doas service zrepl start
Starting zrepl.
```
### Verifying replication
Check the replication status:
```sh
# On f0, check zrepl status (use raw mode for non-tty)
paul@f0:~ % doas zrepl status --mode raw | grep -A2 "Replication"
"Replication":{"StartAt":"2025-07-01T22:31:48.712143123+03:00"...
# Check if services are running
paul@f0:~ % doas service zrepl status
zrepl is running as pid 2649.
paul@f1:~ % doas service zrepl status
zrepl is running as pid 2574.
# Check for zrepl snapshots on source
paul@f0:~ % doas zfs list -t snapshot -r zdata/enc | grep zrepl
zdata/enc@zrepl_20250701_193148_000 0B - 176K -
# On f1, verify the replicated datasets
paul@f1:~ % doas zfs list -r zdata | grep f0
zdata/f0 576K 899G 200K none
zdata/f0/zdata 376K 899G 200K none
zdata/f0/zdata/enc 176K 899G 176K none
# Check replicated snapshots on f1
paul@f1:~ % doas zfs list -t snapshot -r zdata | grep zrepl
zdata/f0/zdata/enc@zrepl_20250701_193148_000 0B - 176K -
zdata/f0/zdata/enc@zrepl_20250701_194148_000 0B - 176K -
```
### Monitoring replication
You can monitor the replication progress with:
```sh
# Real-time status
paul@f0:~ % doas zrepl status --mode interactive
# Check specific job details
paul@f0:~ % doas zrepl status --job f0_to_f1_nfsdata
```
With this setup, `zdata/enc/nfsdata` on f0 will be automatically replicated to f1 every minute and `zroot/bhyve/fedora` every 10 minutes, with encrypted snapshots preserved on both sides. The pruning policy ensures that we keep the last 10 snapshots while managing disk space efficiently.
The replicated data appears on f1 under `zdata/sink/` with the source host and dataset hierarchy preserved:
* `zdata/enc/nfsdata` → `zdata/sink/f0/zdata/enc/nfsdata`
* `zroot/bhyve/fedora` → `zdata/sink/f0/zroot/bhyve/fedora`
This is by design - zrepl preserves the complete path from the source to ensure there are no conflicts when replicating from multiple sources. The replication uses the WireGuard tunnel for secure, encrypted transport between nodes.
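The mapping from a source dataset to its location under the sink is a simple path join of the sink job's `root_fs`, the client name from the `clients` map, and the full source dataset path. A sketch (the `sink_path` helper name is just for illustration):

```sh
#!/bin/sh
# Sketch: where a pushed dataset lands under the sink job's root_fs.
# zrepl nests the full source dataset path under <root_fs>/<client name>/.
sink_path() {
    root_fs=$1; client_name=$2; source_dataset=$3
    echo "${root_fs}/${client_name}/${source_dataset}"
}

sink_path zdata/sink f0 zdata/enc/nfsdata
# -> zdata/sink/f0/zdata/enc/nfsdata
sink_path zdata/sink f0 zroot/bhyve/fedora
# -> zdata/sink/f0/zroot/bhyve/fedora
```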
### A note about the Bhyve VM replication
While replicating a Bhyve VM (Fedora in this case) is slightly off-topic for the f3s series, I've included it here as it demonstrates zrepl's flexibility. This is a development VM I use occasionally to log in remotely for certain development tasks. Having it replicated ensures I have a backup copy available on f1 if needed.
### Quick status check commands
Here are the essential commands to monitor replication status:
```sh
# On the source node (f0) - check if replication is active
paul@f0:~ % doas zrepl status --job f0_to_f1_nfsdata | grep -E '(State|Last)'
State: done
LastError:
# List all zrepl snapshots on source
paul@f0:~ % doas zfs list -t snapshot | grep zrepl
zdata/enc/nfsdata@zrepl_20250701_202530_000 0B - 200K -
zroot/bhyve/fedora@zrepl_20250701_202530_000 0B - 2.97G -
# On the sink node (f1) - verify received datasets
paul@f1:~ % doas zfs list -r zdata/sink
NAME USED AVAIL REFER MOUNTPOINT
zdata/sink 3.0G 896G 200K /data/sink
zdata/sink/f0 3.0G 896G 200K none
zdata/sink/f0/zdata 472K 896G 200K none
zdata/sink/f0/zdata/enc 272K 896G 200K none
zdata/sink/f0/zdata/enc/nfsdata 176K 896G 176K none
zdata/sink/f0/zroot 2.9G 896G 200K none
zdata/sink/f0/zroot/bhyve 2.9G 896G 200K none
zdata/sink/f0/zroot/bhyve/fedora 2.9G 896G 2.97G none
# Check received snapshots on sink
paul@f1:~ % doas zfs list -t snapshot -r zdata/sink | grep zrepl | wc -l
3
# Monitor replication progress in real-time (on source)
paul@f0:~ % doas zrepl status --mode interactive
# Check last replication time (on source)
paul@f0:~ % doas zrepl status --job f0_to_f1_nfsdata | grep -A1 "Replication"
Replication:
Status: Idle (last run: 2025-07-01T22:41:48)
# View zrepl logs for troubleshooting
paul@f0:~ % doas tail -20 /var/log/zrepl.log | grep -E '(error|warn|replication)'
```
These commands provide a quick way to verify that:
* Replication jobs are running without errors
* Snapshots are being created on the source
* Data is being received on the sink
* The replication schedule is being followed
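Since zrepl encodes the creation time in the snapshot name (`zrepl_YYYYMMDD_HHMMSS_NNN`, with the `zrepl_` prefix configured above), the timestamp can be pulled out of a snapshot name with plain POSIX parameter expansion for ad-hoc freshness checks:

```sh
#!/bin/sh
# Extract the timestamp component from a zrepl snapshot name using
# POSIX parameter expansion (no external tools needed).
snap='zdata/enc/nfsdata@zrepl_20250701_202530_000'
ts=${snap#*@zrepl_}   # strip everything up to and including '@zrepl_'
ts=${ts%_*}           # strip the trailing sequence suffix
echo "$ts"
# -> 20250701_202530
```

Feeding the newest name from `zfs list -t snapshot -o name -s creation` through this gives a quick "how stale is my replica" answer without parsing zrepl's status output.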
### Verifying replication after reboot
The zrepl service is configured to start automatically at boot. After rebooting both hosts:
```sh
paul@f0:~ % uptime
11:17PM up 1 min, 0 users, load averages: 0.16, 0.06, 0.02
paul@f0:~ % doas service zrepl status
zrepl is running as pid 2366.
paul@f1:~ % doas service zrepl status
zrepl is running as pid 2309.
# Check that new snapshots are being created and replicated
paul@f0:~ % doas zfs list -t snapshot | grep zrepl | tail -2
zdata/enc/nfsdata@zrepl_20250701_202530_000 0B - 200K -
zroot/bhyve/fedora@zrepl_20250701_202530_000 0B - 2.97G -
paul@f1:~ % doas zfs list -t snapshot -r zdata/sink | grep 202530
zdata/sink/f0/zdata/enc/nfsdata@zrepl_20250701_202530_000 0B - 176K -
zdata/sink/f0/zroot/bhyve/fedora@zrepl_20250701_202530_000 0B - 2.97G -
```
The timestamps confirm that replication resumed automatically after the reboot, ensuring continuous data protection.
### Important note about failover limitations
The current zrepl setup provides **backup/disaster recovery** but not automatic failover. The replicated datasets on f1 are not mounted by default (`mountpoint=none`). In case f0 fails:
```sh
# Manual steps needed on f1 to activate the replicated data:
paul@f1:~ % doas zfs set mountpoint=/data/nfsdata zdata/sink/f0/zdata/enc/nfsdata
paul@f1:~ % doas zfs mount zdata/sink/f0/zdata/enc/nfsdata
```
However, this creates a **split-brain problem**: when f0 comes back online, both systems would have diverged data. Resolving this requires careful manual intervention to:
1. Stop the original replication
2. Sync changes from f1 back to f0
3. Re-establish normal replication
For true high-availability NFS, you might consider:
* **Shared storage** (like iSCSI) with proper clustering
* **GlusterFS** or similar distributed filesystems
* **Manual failover with ZFS replication** (as we have here)
Note: While HAST+CARP is often suggested for HA storage, it can cause filesystem corruption in practice, especially with ZFS. The block-level replication of HAST doesn't understand ZFS's transactional model, leading to inconsistent states during failover.
The current zrepl setup, despite requiring manual intervention, is actually safer because:
* ZFS snapshots are always consistent
* Replication is ZFS-aware (not just block-level)
* You have full control over the failover process
* No risk of split-brain corruption
### Mounting the NFS datasets
To make the nfsdata accessible on both nodes, we need to mount them. On f0, this is straightforward:
```sh
# On f0 - set mountpoint for the primary nfsdata
paul@f0:~ % doas zfs set mountpoint=/data/nfs zdata/enc/nfsdata
paul@f0:~ % doas mkdir -p /data/nfs
# Verify it's mounted
paul@f0:~ % df -h /data/nfs
Filesystem Size Used Avail Capacity Mounted on
zdata/enc/nfsdata 899G 204K 899G 0% /data/nfs
```
On f1, we need to handle the encryption key and mount the standby copy:
```sh
# On f1 - first check encryption status
paul@f1:~ % doas zfs get keystatus zdata/sink/f0/zdata/enc/nfsdata
NAME PROPERTY VALUE SOURCE
zdata/sink/f0/zdata/enc/nfsdata keystatus unavailable -
# Load the encryption key (using f0's key stored on the USB)
paul@f1:~ % doas zfs load-key -L file:///keys/f0.lan.buetow.org:zdata.key \
zdata/sink/f0/zdata/enc/nfsdata
# Set mountpoint and mount (same path as f0 for easier failover)
paul@f1:~ % doas mkdir -p /data/nfs
paul@f1:~ % doas zfs set mountpoint=/data/nfs zdata/sink/f0/zdata/enc/nfsdata
paul@f1:~ % doas zfs mount zdata/sink/f0/zdata/enc/nfsdata
# Make it read-only to prevent accidental writes that would break replication
paul@f1:~ % doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata
# Verify
paul@f1:~ % df -h /data/nfs
Filesystem Size Used Avail Capacity Mounted on
zdata/sink/f0/zdata/enc/nfsdata 896G 204K 896G 0% /data/nfs
```
Note: The dataset is mounted at the same path (`/data/nfs`) on both hosts to simplify failover procedures. The dataset on f1 is set to `readonly=on` to prevent accidental modifications that would break replication.
**CRITICAL WARNING**: Do NOT write to `/data/nfs/` on f1! Any modifications will break the replication. If you accidentally write to it, you'll see this error:
```
cannot receive incremental stream: destination zdata/sink/f0/zdata/enc/nfsdata has been modified
since most recent snapshot
```
To fix a broken replication after accidental writes:
```sh
# Option 1: Rollback to the last common snapshot (loses local changes)
paul@f1:~ % doas zfs rollback zdata/sink/f0/zdata/enc/nfsdata@zrepl_20250701_204054_000
# Option 2: Make it read-only to prevent accidents
paul@f1:~ % doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata
```
### Failback scenario: Syncing changes from f1 back to f0
In a disaster recovery scenario where f0 has failed and f1 has taken over, you'll need to sync changes back when f0 returns. Here's how to failback:
```sh
# On f1: First, make the dataset writable (if it was readonly)
paul@f1:~ % doas zfs set readonly=off zdata/sink/f0/zdata/enc/nfsdata
# Create a snapshot of the current state
paul@f1:~ % doas zfs snapshot zdata/sink/f0/zdata/enc/nfsdata@failback
# On f0: Stop any services using the dataset
paul@f0:~ % doas service nfsd stop # If NFS is running
# Send the snapshot from f1 to f0, forcing a rollback
# This WILL DESTROY any data on f0 that's not on f1!
paul@f1:~ % doas zfs send -R zdata/sink/f0/zdata/enc/nfsdata@failback | \
ssh f0 "doas zfs recv -F zdata/enc/nfsdata"
# Alternative: If you want to see what would be received first
paul@f1:~ % doas zfs send -R zdata/sink/f0/zdata/enc/nfsdata@failback | \
ssh f0 "doas zfs recv -nv -F zdata/enc/nfsdata"
# After successful sync, on f0:
paul@f0:~ % doas zfs destroy zdata/enc/nfsdata@failback
# On f1: Make it readonly again and destroy the failback snapshot
paul@f1:~ % doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata
paul@f1:~ % doas zfs destroy zdata/sink/f0/zdata/enc/nfsdata@failback
# Stop zrepl services first - CRITICAL!
paul@f0:~ % doas service zrepl stop
paul@f1:~ % doas service zrepl stop
# Clean up any zrepl snapshots on f0
paul@f0:~ % doas zfs list -t snapshot -r zdata/enc/nfsdata | grep zrepl | \
awk '{print $1}' | xargs -I {} doas zfs destroy {}
# Clean up and destroy the entire replicated structure on f1
# First release any holds (zfs holds operates on snapshots, so enumerate them first)
paul@f1:~ % doas zfs list -H -t snapshot -o name -r zdata/sink/f0 | \
while read snap; do
doas zfs holds -H "$snap" | awk '{print $2, $1}' | \
while read tag s; do doas zfs release "$tag" "$s"; done
done
# Then destroy the entire f0 tree
paul@f1:~ % doas zfs destroy -rf zdata/sink/f0
# Create parent dataset structure on f1
paul@f1:~ % doas zfs create -p zdata/sink/f0/zdata/enc
# Create a fresh manual snapshot to establish baseline
paul@f0:~ % doas zfs snapshot zdata/enc/nfsdata@manual_baseline
# Send this snapshot to f1
paul@f0:~ % doas zfs send -w zdata/enc/nfsdata@manual_baseline | \
ssh f1 "doas zfs recv zdata/sink/f0/zdata/enc/nfsdata"
# Clean up the manual snapshot
paul@f0:~ % doas zfs destroy zdata/enc/nfsdata@manual_baseline
paul@f1:~ % doas zfs destroy zdata/sink/f0/zdata/enc/nfsdata@manual_baseline
# Set mountpoint and make readonly on f1
paul@f1:~ % doas zfs set mountpoint=/data/nfs zdata/sink/f0/zdata/enc/nfsdata
paul@f1:~ % doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata
# Load encryption key and mount on f1
paul@f1:~ % doas zfs load-key -L file:///keys/f0.lan.buetow.org:zdata.key \
zdata/sink/f0/zdata/enc/nfsdata
paul@f1:~ % doas zfs mount zdata/sink/f0/zdata/enc/nfsdata
# Now restart zrepl services
paul@f0:~ % doas service zrepl start
paul@f1:~ % doas service zrepl start
# Verify replication is working
paul@f0:~ % doas zrepl status --job f0_to_f1
```
**Important notes about failback**:
* The `-F` flag forces a rollback on f0, destroying any local changes
* Replication often won't resume automatically after a forced receive
* You must clean up old zrepl snapshots on both sides
* Creating a manual snapshot helps re-establish the replication relationship
* Always verify replication status after the failback procedure
* The first replication after failback will be a full send of the current state
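To confirm that replication has actually picked up again after a failback, comparing the newest snapshot on both sides is a quick check (a sketch; the snapshot names on your system will differ):

```sh
# Newest snapshot on the source (f0)
paul@f0:~ % doas zfs list -H -t snapshot -o name -s creation zdata/enc/nfsdata | tail -1
# Newest snapshot on the sink (f1) - after the next replication interval,
# it should carry the same @zrepl_... suffix as the one on f0
paul@f1:~ % doas zfs list -H -t snapshot -o name -s creation \
    zdata/sink/f0/zdata/enc/nfsdata | tail -1
```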
### Testing the failback scenario
Here's a real test of the failback procedure:
```sh
# Simulate failure: Stop replication on f0
paul@f0:~ % doas service zrepl stop
# On f1: Take over by making the dataset writable
paul@f1:~ % doas zfs set readonly=off zdata/sink/f0/zdata/enc/nfsdata
# Write some data on f1 during the "outage"
paul@f1:~ % echo 'Data written on f1 during failover' | doas tee /data/nfs/failover-data.txt
Data written on f1 during failover
# Now perform failback when f0 comes back online
# Create snapshot on f1
paul@f1:~ % doas zfs snapshot zdata/sink/f0/zdata/enc/nfsdata@failback
# Send data back to f0 (note: we had to send to a temporary dataset due to holds)
paul@f1:~ % doas zfs send -Rw zdata/sink/f0/zdata/enc/nfsdata@failback | \
ssh f0 "doas zfs recv -F zdata/enc/nfsdata_temp"
# On f0: Rename datasets to complete failback
paul@f0:~ % doas zfs set mountpoint=none zdata/enc/nfsdata
paul@f0:~ % doas zfs rename zdata/enc/nfsdata zdata/enc/nfsdata_old
paul@f0:~ % doas zfs rename zdata/enc/nfsdata_temp zdata/enc/nfsdata
# Load encryption key and mount
paul@f0:~ % doas zfs load-key -L file:///keys/f0.lan.buetow.org:zdata.key zdata/enc/nfsdata
paul@f0:~ % doas zfs mount zdata/enc/nfsdata
# Verify the data from f1 is now on f0
paul@f0:~ % ls -la /data/nfs/
total 18
drwxr-xr-x 2 root wheel 4 Jul 2 00:01 .
drwxr-xr-x 4 root wheel 4 Jul 1 23:41 ..
-rw-r--r-- 1 root wheel 35 Jul 2 00:01 failover-data.txt
-rw-r--r-- 1 root wheel 12 Jul 1 23:34 hello.txt
```
Success! The failover data from f1 is now on f0. To resume normal replication, you would need to:
1. Clean up old snapshots on both sides
2. Create a new manual baseline snapshot
3. Restart zrepl services
**Key learnings from the test**:
* The `-w` flag is essential for encrypted datasets
* Dataset holds can complicate the process (consider sending to a temporary dataset)
* The encryption key must be loaded after receiving the dataset
* Always verify data integrity before resuming normal operations
### Troubleshooting: Files not appearing in replication
If you write files to `/data/nfs/` on f0 but they don't appear on f1, check:
```sh
# 1. Is the dataset actually mounted on f0?
paul@f0:~ % doas zfs list -o name,mountpoint,mounted | grep nfsdata
zdata/enc/nfsdata /data/nfs yes
# If it shows "no", the dataset isn't mounted!
# This means files are being written to the root filesystem, not ZFS
# 2. Check if encryption key is loaded
paul@f0:~ % doas zfs get keystatus zdata/enc/nfsdata
NAME PROPERTY VALUE SOURCE
zdata/enc/nfsdata keystatus available -
# If "unavailable", load the key:
paul@f0:~ % doas zfs load-key -L file:///keys/f0.lan.buetow.org:zdata.key zdata/enc/nfsdata
paul@f0:~ % doas zfs mount zdata/enc/nfsdata
# 3. Verify files are in the snapshot (not just the directory)
paul@f0:~ % ls -la /data/nfs/.zfs/snapshot/zrepl_*/
```
This issue commonly occurs after reboot if the encryption keys aren't configured to load automatically.
### Configuring automatic key loading on boot
To ensure all encrypted datasets are mounted automatically after reboot:
```sh
# On f0 - configure all encrypted datasets
paul@f0:~ % doas sysrc zfskeys_enable=YES
zfskeys_enable: NO -> YES
paul@f0:~ % doas sysrc zfskeys_datasets="zdata/enc zdata/enc/nfsdata zroot/bhyve"
zfskeys_datasets: -> zdata/enc zdata/enc/nfsdata zroot/bhyve
# Set correct key locations for all datasets
paul@f0:~ % doas zfs set keylocation=file:///keys/f0.lan.buetow.org:zdata.key zdata/enc/nfsdata
# On f1 - include the replicated dataset
paul@f1:~ % doas sysrc zfskeys_enable=YES
zfskeys_enable: NO -> YES
paul@f1:~ % doas sysrc zfskeys_datasets="zdata/enc zroot/bhyve zdata/sink/f0/zdata/enc/nfsdata"
zfskeys_datasets: -> zdata/enc zroot/bhyve zdata/sink/f0/zdata/enc/nfsdata
# Set key location for replicated dataset
paul@f1:~ % doas zfs set keylocation=file:///keys/f0.lan.buetow.org:zdata.key zdata/sink/f0/zdata/enc/nfsdata
```
Important notes:
* Each encryption root needs its own key load entry - child datasets don't inherit key loading
* The replicated dataset on f1 uses the same encryption key as the source on f0
* Always verify datasets are mounted after reboot with `zfs list -o name,mounted`
* **Critical**: Always ensure the replicated dataset on f1 remains read-only with `doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata`
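A quick post-reboot sanity check that the keys loaded and the datasets mounted might look like this (a sketch using the dataset names configured above):

```sh
# On f0: key should be available and the dataset mounted
paul@f0:~ % doas zfs get -H -o name,value keystatus zdata/enc/nfsdata
paul@f0:~ % doas zfs list -o name,mounted zdata/enc/nfsdata
# On f1: the replica should be mounted and still read-only
paul@f1:~ % doas zfs list -o name,mounted,readonly zdata/sink/f0/zdata/enc/nfsdata
```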
### Troubleshooting: Replication broken due to modified destination
If you see the error "cannot receive incremental stream: destination has been modified since most recent snapshot", it means the read-only flag was accidentally removed on f1. To fix without a full resync:
```sh
# Stop zrepl on both servers
paul@f0:~ % doas service zrepl stop
paul@f1:~ % doas service zrepl stop
# Find the last common snapshot
paul@f0:~ % doas zfs list -t snapshot -o name,creation zdata/enc/nfsdata
paul@f1:~ % doas zfs list -t snapshot -o name,creation zdata/sink/f0/zdata/enc/nfsdata
# Rollback f1 to the last common snapshot (example: @zrepl_20250705_000007_000)
paul@f1:~ % doas zfs rollback -r zdata/sink/f0/zdata/enc/nfsdata@zrepl_20250705_000007_000
# Ensure the dataset is read-only
paul@f1:~ % doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata
# Restart zrepl
paul@f0:~ % doas service zrepl start
paul@f1:~ % doas service zrepl start
```
### Forcing a full resync
If replication gets out of sync and incremental updates fail:
```sh
# Stop services
paul@f0:~ % doas service zrepl stop
paul@f1:~ % doas service zrepl stop
# On f1: Release any holds (zfs holds operates on snapshots, so enumerate them
# first), then destroy the dataset
paul@f1:~ % doas zfs list -H -t snapshot -o name -r zdata/sink/f0/zdata/enc/nfsdata | \
while read snap; do
doas zfs holds -H "$snap" | awk '{print $2, $1}' | \
while read tag s; do doas zfs release "$tag" "$s"; done
done
paul@f1:~ % doas zfs destroy -rf zdata/sink/f0/zdata/enc/nfsdata
# On f0: Create fresh snapshot
paul@f0:~ % doas zfs snapshot zdata/enc/nfsdata@resync
# Send full dataset
paul@f0:~ % doas zfs send -Rw zdata/enc/nfsdata@resync | \
ssh f1 "doas zfs recv zdata/sink/f0/zdata/enc/nfsdata"
# Configure f1
paul@f1:~ % doas zfs set mountpoint=/data/nfs zdata/sink/f0/zdata/enc/nfsdata
paul@f1:~ % doas zfs set readonly=on zdata/sink/f0/zdata/enc/nfsdata
paul@f1:~ % doas zfs load-key -L file:///keys/f0.lan.buetow.org:zdata.key \
zdata/sink/f0/zdata/enc/nfsdata
paul@f1:~ % doas zfs mount zdata/sink/f0/zdata/enc/nfsdata
# Clean up and restart
paul@f0:~ % doas zfs destroy zdata/enc/nfsdata@resync
paul@f1:~ % doas zfs destroy zdata/sink/f0/zdata/enc/nfsdata@resync
paul@f0:~ % doas service zrepl start
paul@f1:~ % doas service zrepl start
```
Two items remain on the to-do list for this setup:
* Set up periodic ZFS scrubbing on both pools
* Back up the encryption keys to the key locations (all keys on all three USB keys)
## Future Storage Explorations
While zrepl provides excellent snapshot-based replication for disaster recovery, there are other storage technologies worth exploring for the f3s project:
### MinIO for S3-Compatible Object Storage
MinIO is a high-performance, S3-compatible object storage system that could complement our ZFS-based storage. Some potential use cases:
* **S3 API compatibility**: Many modern applications expect S3-style object storage APIs. MinIO could provide this interface while using our ZFS storage as the backend.
* **Multi-site replication**: MinIO supports active-active replication across multiple sites, which could work well with our f0/f1/f2 node setup.
* **Kubernetes native**: MinIO has excellent Kubernetes integration with operators and CSI drivers, making it ideal for the f3s k3s environment.
### MooseFS for Distributed High Availability
MooseFS is a fault-tolerant, distributed file system that could provide true high-availability storage:
* **True HA**: Unlike our current setup which requires manual failover, MooseFS provides automatic failover with no single point of failure.
* **POSIX compliance**: Applications can use MooseFS like any regular filesystem, no code changes needed.
* **Flexible redundancy**: Configure different replication levels per directory or file, optimizing storage efficiency.
* **FreeBSD support**: MooseFS has native FreeBSD support, making it a natural fit for the f3s project.
Both technologies could potentially run on top of our encrypted ZFS volumes, combining ZFS's data integrity and encryption features with distributed storage capabilities. This would be particularly interesting for workloads that need either S3-compatible APIs (MinIO) or transparent distributed POSIX storage (MooseFS).
## NFS Server Configuration
With ZFS replication in place, we can now set up NFS servers on both f0 and f1 to export the replicated data. Since native NFS over TLS (RFC 9289) has compatibility issues between Linux and FreeBSD, we'll use stunnel to provide encryption.
### Setting up NFS on f0 (Primary)
First, enable the NFS services in rc.conf:
```sh
paul@f0:~ % doas sysrc nfs_server_enable=YES
nfs_server_enable: YES -> YES
paul@f0:~ % doas sysrc nfsv4_server_enable=YES
nfsv4_server_enable: YES -> YES
paul@f0:~ % doas sysrc nfsuserd_enable=YES
nfsuserd_enable: YES -> YES
paul@f0:~ % doas sysrc mountd_enable=YES
mountd_enable: NO -> YES
paul@f0:~ % doas sysrc rpcbind_enable=YES
rpcbind_enable: NO -> YES
```
Create a dedicated directory for Kubernetes volumes:
```sh
# First ensure the dataset is mounted
paul@f0:~ % doas zfs get mounted zdata/enc/nfsdata
NAME PROPERTY VALUE SOURCE
zdata/enc/nfsdata mounted yes -
# Create the k3svolumes directory
paul@f0:~ % doas mkdir -p /data/nfs/k3svolumes
paul@f0:~ % doas chmod 755 /data/nfs/k3svolumes
# This directory will be replicated to f1 automatically
```
Create the /etc/exports file to restrict Kubernetes nodes to only mount the k3svolumes subdirectory, while allowing the laptop full access. Since we're using stunnel, connections appear to come from localhost, so we must allow 127.0.0.1:
```sh
paul@f0:~ % doas tee /etc/exports <<'EOF'
V4: /data/nfs -sec=sys
/data/nfs/k3svolumes -maproot=root -network 192.168.1.120 -mask 255.255.255.255
/data/nfs/k3svolumes -maproot=root -network 192.168.1.121 -mask 255.255.255.255
/data/nfs/k3svolumes -maproot=root -network 192.168.1.122 -mask 255.255.255.255
/data/nfs/k3svolumes -maproot=root -network 127.0.0.1 -mask 255.255.255.255
/data/nfs -alldirs -maproot=root -network 192.168.1.4 -mask 255.255.255.255
/data/nfs -alldirs -maproot=root -network 192.168.1.22 -mask 255.255.255.255
/data/nfs -alldirs -maproot=root -network 127.0.0.1 -mask 255.255.255.255
EOF
```
The exports configuration:
* `V4: /data/nfs -sec=sys`: Sets the NFSv4 root directory to /data/nfs
* `/data/nfs/k3svolumes`: Specific subdirectory for Kubernetes volumes only
* `/data/nfs -alldirs`: Full access to all directories for the laptop and localhost
* `-maproot=root`: Map root user from client to root on server (needed for Kubernetes)
* `-network` and `-mask`: Restrict access to specific IPs:
* 192.168.1.120 (r0.lan) - k3svolumes only
* 192.168.1.121 (r1.lan) - k3svolumes only
* 192.168.1.122 (r2.lan) - k3svolumes only
* 127.0.0.1 (localhost) - needed for stunnel connections
* 192.168.1.4 (laptop WiFi) - full access to /data/nfs
* 192.168.1.22 (laptop Ethernet) - full access to /data/nfs
Note:
* **Critical**: 127.0.0.1 must be allowed because stunnel proxies connections through localhost
* With NFSv4, clients mount using relative paths (e.g., `/k3svolumes` instead of `/data/nfs/k3svolumes`)
* The CARP virtual IP (192.168.1.138) is not included - it's the server's IP, not a client
Start the NFS services:
```sh
paul@f0:~ % doas service rpcbind start
Starting rpcbind.
paul@f0:~ % doas service mountd start
Starting mountd.
paul@f0:~ % doas service nfsd start
Starting nfsd.
paul@f0:~ % doas service nfsuserd start
Starting nfsuserd.
```
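Note for later changes: the running daemons don't pick up edits to `/etc/exports` automatically; mountd has to re-read the file (the rc script's `reload` sends it a SIGHUP):

```sh
# Re-read /etc/exports after editing it
paul@f0:~ % doas service mountd reload
# Verify the updated exports list
paul@f0:~ % doas showmount -e localhost
```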
### Configuring Stunnel for NFS Encryption with CARP Failover
Since native NFS over TLS has compatibility issues between Linux and FreeBSD, we'll use stunnel to encrypt NFS traffic. Stunnel provides a transparent SSL/TLS tunnel for any TCP service. We'll configure stunnel to bind to the CARP virtual IP, ensuring automatic failover alongside NFS.
#### Creating a Certificate Authority for Client Authentication
First, create a CA to sign both server and client certificates:
```sh
# On f0 - Create CA
paul@f0:~ % doas mkdir -p /usr/local/etc/stunnel/ca
paul@f0:~ % cd /usr/local/etc/stunnel/ca
paul@f0:~ % doas openssl genrsa -out ca-key.pem 4096
paul@f0:~ % doas openssl req -new -x509 -days 3650 -key ca-key.pem -out ca-cert.pem \
-subj '/C=US/ST=State/L=City/O=F3S Storage/CN=F3S Stunnel CA'
# Create server certificate
paul@f0:~ % cd /usr/local/etc/stunnel
paul@f0:~ % doas openssl genrsa -out server-key.pem 4096
paul@f0:~ % doas openssl req -new -key server-key.pem -out server.csr \
-subj '/C=US/ST=State/L=City/O=F3S Storage/CN=f3s-storage-ha.lan'
paul@f0:~ % doas openssl x509 -req -days 3650 -in server.csr -CA ca/ca-cert.pem \
-CAkey ca/ca-key.pem -CAcreateserial -out server-cert.pem
# Create client certificates for authorized clients
paul@f0:~ % cd /usr/local/etc/stunnel/ca
paul@f0:~ % doas sh -c 'for client in r0 r1 r2 earth; do
openssl genrsa -out ${client}-key.pem 4096
openssl req -new -key ${client}-key.pem -out ${client}.csr \
-subj "/C=US/ST=State/L=City/O=F3S Storage/CN=${client}.lan.buetow.org"
openssl x509 -req -days 3650 -in ${client}.csr -CA ca-cert.pem \
-CAkey ca-key.pem -CAcreateserial -out ${client}-cert.pem
done'
```
#### Install and Configure Stunnel on f0
```sh
# Install stunnel
paul@f0:~ % doas pkg install -y stunnel
# Configure stunnel server with client certificate authentication
paul@f0:~ % doas tee /usr/local/etc/stunnel/stunnel.conf <<'EOF'
cert = /usr/local/etc/stunnel/server-cert.pem
key = /usr/local/etc/stunnel/server-key.pem
setuid = stunnel
setgid = stunnel
[nfs-tls]
accept = 192.168.1.138:2323
connect = 127.0.0.1:2049
CAfile = /usr/local/etc/stunnel/ca/ca-cert.pem
verify = 2
requireCert = yes
EOF
# Enable and start stunnel
paul@f0:~ % doas sysrc stunnel_enable=YES
stunnel_enable: -> YES
paul@f0:~ % doas service stunnel start
Starting stunnel.
# Restart stunnel to apply the CARP VIP binding
paul@f0:~ % doas service stunnel restart
Stopping stunnel.
Starting stunnel.
```
The configuration includes:
* `verify = 2`: Verify client certificate and fail if not provided
* `requireCert = yes`: Client must present a valid certificate
* `CAfile`: Path to the CA certificate that signed the client certificates
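Before distributing a client certificate, it's worth sanity-checking that it chains back to the CA and carries the expected CN (shown here for r0):

```sh
paul@f0:~ % cd /usr/local/etc/stunnel/ca
# Confirm the certificate was signed by our CA
paul@f0:~ % doas openssl verify -CAfile ca-cert.pem r0-cert.pem
r0-cert.pem: OK
# Inspect the subject CN
paul@f0:~ % doas openssl x509 -noout -subject -in r0-cert.pem
```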
### Setting up NFS on f1 (Standby)
Repeat the same configuration on f1:
```sh
paul@f1:~ % doas sysrc nfs_server_enable=YES
nfs_server_enable: NO -> YES
paul@f1:~ % doas sysrc nfsv4_server_enable=YES
nfsv4_server_enable: NO -> YES
paul@f1:~ % doas sysrc nfsuserd_enable=YES
nfsuserd_enable: NO -> YES
paul@f1:~ % doas sysrc mountd_enable=YES
mountd_enable: NO -> YES
paul@f1:~ % doas sysrc rpcbind_enable=YES
rpcbind_enable: NO -> YES
paul@f1:~ % doas tee /etc/exports <<'EOF'
V4: /data/nfs -sec=sys
/data/nfs/k3svolumes -maproot=root -network 192.168.1.120 -mask 255.255.255.255
/data/nfs/k3svolumes -maproot=root -network 192.168.1.121 -mask 255.255.255.255
/data/nfs/k3svolumes -maproot=root -network 192.168.1.122 -mask 255.255.255.255
/data/nfs/k3svolumes -maproot=root -network 127.0.0.1 -mask 255.255.255.255
/data/nfs -alldirs -maproot=root -network 192.168.1.4 -mask 255.255.255.255
/data/nfs -alldirs -maproot=root -network 127.0.0.1 -mask 255.255.255.255
EOF
paul@f1:~ % doas service rpcbind start
Starting rpcbind.
paul@f1:~ % doas service mountd start
Starting mountd.
paul@f1:~ % doas service nfsd start
Starting nfsd.
paul@f1:~ % doas service nfsuserd start
Starting nfsuserd.
```
Configure stunnel on f1:
```sh
# Install stunnel
paul@f1:~ % doas pkg install -y stunnel
# Copy certificates from f0
paul@f0:~ % doas tar -cf /tmp/stunnel-certs.tar -C /usr/local/etc/stunnel server-cert.pem server-key.pem ca
paul@f0:~ % scp /tmp/stunnel-certs.tar f1:/tmp/
paul@f1:~ % cd /usr/local/etc/stunnel && doas tar -xf /tmp/stunnel-certs.tar
# Configure stunnel server on f1 with client certificate authentication
paul@f1:~ % doas tee /usr/local/etc/stunnel/stunnel.conf <<'EOF'
cert = /usr/local/etc/stunnel/server-cert.pem
key = /usr/local/etc/stunnel/server-key.pem
setuid = stunnel
setgid = stunnel
[nfs-tls]
accept = 192.168.1.138:2323
connect = 127.0.0.1:2049
CAfile = /usr/local/etc/stunnel/ca/ca-cert.pem
verify = 2
requireCert = yes
EOF
# Enable and start stunnel
paul@f1:~ % doas sysrc stunnel_enable=YES
stunnel_enable: -> YES
paul@f1:~ % doas service stunnel start
Starting stunnel.
# Restart stunnel to apply the CARP VIP binding
paul@f1:~ % doas service stunnel restart
Stopping stunnel.
Starting stunnel.
```
### How Stunnel Works with CARP
With stunnel configured to bind to the CARP VIP (192.168.1.138), only the server that is currently the CARP MASTER will accept stunnel connections. This provides automatic failover for encrypted NFS:
* When f0 is CARP MASTER: stunnel on f0 accepts connections on 192.168.1.138:2323
* When f1 becomes CARP MASTER: stunnel on f1 starts accepting connections on 192.168.1.138:2323
* The backup server's stunnel process will fail to bind to the VIP and won't accept connections
This ensures that clients always connect to the active NFS server through the CARP VIP.
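From the client side, the TLS endpoint on the VIP can be exercised directly with `openssl s_client` (a sketch; it assumes the client certificate bundle and CA file have already been copied to `/etc/stunnel/` on the client, as done in the client setup):

```sh
# Only the current CARP MASTER completes the handshake
[root@r0 ~]# openssl s_client -connect 192.168.1.138:2323 \
    -cert /etc/stunnel/r0-stunnel.pem -CAfile /etc/stunnel/ca-cert.pem </dev/null
```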
### CARP Control Script for Clean Failover
To ensure clean failover behavior and prevent stale file handles, we'll create a control script that:
- Stops NFS services on BACKUP nodes (preventing split-brain scenarios)
- Starts NFS services only on the MASTER node
- Manages stunnel binding to the CARP VIP
This approach ensures clients can only connect to the active server, eliminating stale handles from the inactive server:
```sh
# Create CARP control script on both f0 and f1
paul@f0:~ % doas tee /usr/local/bin/carpcontrol.sh <<'EOF'
#!/bin/sh
# CARP state change control script
case "$1" in
MASTER)
logger "CARP state changed to MASTER, starting services"
service rpcbind start >/dev/null 2>&1
service mountd start >/dev/null 2>&1
service nfsd start >/dev/null 2>&1
service nfsuserd start >/dev/null 2>&1
service stunnel restart >/dev/null 2>&1
logger "CARP MASTER: NFS and stunnel services started"
;;
BACKUP)
logger "CARP state changed to BACKUP, stopping services"
service stunnel stop >/dev/null 2>&1
service nfsd stop >/dev/null 2>&1
service mountd stop >/dev/null 2>&1
service nfsuserd stop >/dev/null 2>&1
logger "CARP BACKUP: NFS and stunnel services stopped"
;;
*)
logger "CARP state changed to $1 (unhandled)"
;;
esac
EOF
paul@f0:~ % doas chmod +x /usr/local/bin/carpcontrol.sh
# Add to devd configuration
paul@f0:~ % doas tee -a /etc/devd.conf <<'EOF'
# CARP state change notifications
notify 0 {
match "system" "CARP";
match "subsystem" "[0-9]+@[a-z]+[0-9]+";
match "type" "(MASTER|BACKUP)";
action "/usr/local/bin/carpcontrol.sh $type";
};
EOF
# Restart devd to apply changes
paul@f0:~ % doas service devd restart
```
This enhanced script ensures that:
- Only the MASTER node runs NFS and stunnel services
- BACKUP nodes have all services stopped, preventing any client connections
- Failovers are clean with no possibility of accessing the wrong server
- Stale file handles are minimized because the old server immediately stops responding
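Before relying on devd to invoke it, the control script can be exercised by hand and the results checked in the system log:

```sh
# Simulate the state changes manually
paul@f0:~ % doas /usr/local/bin/carpcontrol.sh BACKUP
paul@f0:~ % doas /usr/local/bin/carpcontrol.sh MASTER
# The logger lines should show up in /var/log/messages
paul@f0:~ % grep CARP /var/log/messages | tail -4
```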
### CARP Management Script
To simplify CARP state management and failover testing, create this helper script on both f0 and f1:
```sh
# Create the CARP management script
paul@f0:~ % doas tee /usr/local/bin/carp <<'EOF'
#!/bin/sh
# CARP state management script
# Usage: carp [master|backup]
# Without arguments: shows current state
# Find the interface with CARP configured
CARP_IF=$(ifconfig -l | xargs -n1 | while read if; do
ifconfig "$if" 2>/dev/null | grep -q "carp:" && echo "$if" && break
done)
if [ -z "$CARP_IF" ]; then
echo "Error: No CARP interface found"
exit 1
fi
# Get CARP VHID
VHID=$(ifconfig "$CARP_IF" | grep "carp:" | sed -n 's/.*vhid \([0-9]*\).*/\1/p')
if [ -z "$VHID" ]; then
echo "Error: Could not determine CARP VHID"
exit 1
fi
# Function to get current state
get_state() {
ifconfig "$CARP_IF" | grep "carp:" | awk '{print $2}'
}
# Main logic
case "$1" in
"")
# No argument - show current state
STATE=$(get_state)
echo "CARP state on $CARP_IF (vhid $VHID): $STATE"
;;
master)
# Force to MASTER state
echo "Setting CARP to MASTER state..."
ifconfig "$CARP_IF" vhid "$VHID" state master
sleep 1
STATE=$(get_state)
echo "CARP state on $CARP_IF (vhid $VHID): $STATE"
;;
backup)
# Force to BACKUP state
echo "Setting CARP to BACKUP state..."
ifconfig "$CARP_IF" vhid "$VHID" state backup
sleep 1
STATE=$(get_state)
echo "CARP state on $CARP_IF (vhid $VHID): $STATE"
;;
*)
echo "Usage: $0 [master|backup]"
echo " Without arguments: show current CARP state"
echo " master: force this node to become CARP MASTER"
echo " backup: force this node to become CARP BACKUP"
exit 1
;;
esac
EOF
paul@f0:~ % doas chmod +x /usr/local/bin/carp
# Copy to f1 as well
paul@f0:~ % scp /usr/local/bin/carp f1:/tmp/
paul@f1:~ % doas cp /tmp/carp /usr/local/bin/carp && doas chmod +x /usr/local/bin/carp
```
Now you can easily manage CARP states:
```sh
# Check current CARP state
paul@f0:~ % doas carp
CARP state on re0 (vhid 1): MASTER
paul@f1:~ % doas carp
CARP state on re0 (vhid 1): BACKUP
# Force f0 to become BACKUP (triggers failover to f1)
paul@f0:~ % doas carp backup
Setting CARP to BACKUP state...
CARP state on re0 (vhid 1): BACKUP
# Force f0 to reclaim MASTER status
paul@f0:~ % doas carp master
Setting CARP to MASTER state...
CARP state on re0 (vhid 1): MASTER
```
This script makes failover testing much simpler than manually running `ifconfig` commands.
### Verifying Stunnel and CARP Status
First, check which host is currently CARP MASTER:
```sh
# On f0 - check CARP status
paul@f0:~ % ifconfig re0 | grep inet
inet 192.168.1.130 netmask 0xffffff00 broadcast 192.168.1.255
inet 192.168.1.138 netmask 0xffffffff broadcast 192.168.1.138 vhid 1
# If f0 is MASTER, verify stunnel is listening on the VIP
paul@f0:~ % doas sockstat -l | grep 2323
stunnel stunnel 1234 3 tcp4 192.168.1.138:2323 *:*
# On f1 - check CARP status
paul@f1:~ % ifconfig re0 | grep inet
inet 192.168.1.131 netmask 0xffffff00 broadcast 192.168.1.255
# If f1 is BACKUP, stunnel won't be able to bind to the VIP
paul@f1:~ % doas tail /var/log/messages | grep stunnel
Jul 4 12:34:56 f1 stunnel: [!] bind: 192.168.1.138:2323: Can't assign requested address (49)
```
### Verifying NFS Exports
Check that the exports are active on both servers:
```sh
# On f0
paul@f0:~ % doas showmount -e localhost
Exports list on localhost:
/data/nfs/k3svolumes 192.168.1.120 192.168.1.121 192.168.1.122
/data/nfs 192.168.1.4
# On f1
paul@f1:~ % doas showmount -e localhost
Exports list on localhost:
/data/nfs/k3svolumes 192.168.1.120 192.168.1.121 192.168.1.122
/data/nfs 192.168.1.4
```
### Client Configuration for Stunnel
To mount NFS shares with stunnel encryption, clients need to install and configure stunnel with their client certificates.
#### Preparing Client Certificates
On f0, prepare the client certificate packages:
```sh
# Create combined certificate/key files for each client
paul@f0:~ % cd /usr/local/etc/stunnel/ca
paul@f0:~ % doas sh -c 'for client in r0 r1 r2 earth; do
cat ${client}-cert.pem ${client}-key.pem > /tmp/${client}-stunnel.pem
chmod 600 /tmp/${client}-stunnel.pem  # the bundle contains a private key
done'
```
#### Configuring Rocky Linux Clients (r0, r1, r2)
```sh
# Install stunnel on client (example for r0)
[root@r0 ~]# dnf install -y stunnel
# Copy client certificate and CA certificate from f0
[root@r0 ~]# scp f0:/tmp/r0-stunnel.pem /etc/stunnel/
[root@r0 ~]# scp f0:/usr/local/etc/stunnel/ca/ca-cert.pem /etc/stunnel/
# Configure stunnel client with certificate authentication
[root@r0 ~]# tee /etc/stunnel/stunnel.conf <<'EOF'
cert = /etc/stunnel/r0-stunnel.pem
CAfile = /etc/stunnel/ca-cert.pem
client = yes
verify = 2
[nfs-ha]
accept = 127.0.0.1:2323
connect = 192.168.1.138:2323
EOF
# Enable and start stunnel
[root@r0 ~]# systemctl enable --now stunnel
# Repeat for r1 and r2 with their respective certificates
```
Note: Each client must use its own certificate file (r0-stunnel.pem, r1-stunnel.pem, r2-stunnel.pem, or earth-stunnel.pem).
### Testing NFS Mount with Stunnel
Mount NFS through the stunnel encrypted tunnel:
```sh
# Create mount point
[root@r0 ~]# mkdir -p /data/nfs/k3svolumes
# Mount through stunnel (using localhost and NFSv4)
[root@r0 ~]# mount -t nfs4 -o port=2323 127.0.0.1:/data/nfs/k3svolumes /data/nfs/k3svolumes
# Verify mount
[root@r0 ~]# mount | grep k3svolumes
127.0.0.1:/data/nfs/k3svolumes on /data/nfs/k3svolumes type nfs4 (rw,relatime,vers=4.2,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=2323,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1)
# For persistent mount, add to /etc/fstab:
127.0.0.1:/data/nfs/k3svolumes /data/nfs/k3svolumes nfs4 port=2323,_netdev 0 0
```
Note: The mount uses localhost (127.0.0.1) because stunnel is listening locally and forwarding the encrypted traffic to the remote server.
Write a test file from a client through the mount, then verify it was written on f0 and replicated to f1:
```sh
# Check on f0
paul@f0:~ % cat /data/nfs/test-r0.txt
Test from r0
# After replication interval (5 minutes), check on f1
paul@f1:~ % cat /data/nfs/test-r0.txt
Test from r0
```
### Important: Encryption Keys for Replicated Datasets
When using encrypted ZFS datasets with raw sends (send -w), the replicated datasets on f1 need the encryption keys loaded to access the data:
```sh
# Check encryption status on f1
paul@f1:~ % doas zfs get keystatus zdata/sink/f0/zdata/enc/nfsdata
NAME PROPERTY VALUE SOURCE
zdata/sink/f0/zdata/enc/nfsdata keystatus unavailable -
# Load the encryption key (uses the same key as f0)
paul@f1:~ % doas zfs load-key -L file:///keys/f0.lan.buetow.org:zdata.key zdata/sink/f0/zdata/enc/nfsdata
# Mount the dataset
paul@f1:~ % doas zfs mount zdata/sink/f0/zdata/enc/nfsdata
# Configure automatic key loading on boot
paul@f1:~ % doas sysrc zfskeys_datasets="zdata/enc zroot/bhyve zdata/sink/f0/zdata/enc/nfsdata"
zfskeys_datasets: -> zdata/enc zroot/bhyve zdata/sink/f0/zdata/enc/nfsdata
```
This ensures that after a reboot, f1 will automatically load the encryption keys and mount all encrypted datasets, including the replicated ones.
### NFS Failover with CARP and Stunnel
With NFS servers running on both f0 and f1 and stunnel bound to the CARP VIP:
* **Automatic failover**: When f0 fails, CARP automatically promotes f1 to MASTER
* **Stunnel failover**: The carpcontrol.sh script automatically starts stunnel on the new MASTER
* **Client transparency**: Clients always connect to 192.168.1.138:2323, which routes to the active server
* **No connection disruption**: Existing NFS mounts continue working through the same VIP
* **Data consistency**: ZFS replication ensures f1 has recent data (within 5-minute window)
* **Read-only replica**: The replicated dataset on f1 is always mounted read-only to prevent breaking replication
* **Manual intervention required for full RW failover**: When f1 becomes MASTER, you must:
1. Stop zrepl to prevent conflicts: `doas service zrepl stop`
2. Make the replicated dataset writable: `doas zfs set readonly=off zdata/sink/f0/zdata/enc/nfsdata`
3. Ensure encryption keys are loaded (should be automatic with zfskeys_enable)
4. NFS will automatically start serving read/write requests through the VIP
**Important**: The `/data/nfs` mount on f1 remains read-only during normal operation to ensure replication integrity. In case of a failover, clients can still read data immediately, but write operations require the manual steps above to promote f1 to full read-write mode.
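The manual promotion steps could be collected into a small helper script on f1. This is a hypothetical `promote-rw.sh`, not part of the setup above, sketched from the steps just listed:

```sh
#!/bin/sh
# /usr/local/bin/promote-rw.sh -- promote f1's replica to read-write (hypothetical helper)
set -e
DS=zdata/sink/f0/zdata/enc/nfsdata
# 1. Stop zrepl so the next incremental receive can't conflict
service zrepl stop
# 2. Make the replica writable
zfs set readonly=off "$DS"
# 3. Make sure the encryption key is loaded (no-op if already available)
zfs get -H -o value keystatus "$DS" | grep -q available || \
    zfs load-key -L file:///keys/f0.lan.buetow.org:zdata.key "$DS"
logger "f1 promoted to read-write NFS master"
```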
### Testing CARP Failover
To test the failover process:
```sh
# On f0 (current MASTER) - trigger failover
paul@f0:~ % doas ifconfig re0 vhid 1 state backup
# On f1 - verify it becomes MASTER
paul@f1:~ % ifconfig re0 | grep carp
carp: MASTER vhid 1 advbase 1 advskew 0
# Check stunnel is now listening on f1
paul@f1:~ % doas sockstat -l | grep 2323
stunnel stunnel 4567 3 tcp4 192.168.1.138:2323 *:*
# On client - verify NFS mount still works
[root@r0 ~]# ls /data/nfs/k3svolumes/
[root@r0 ~]# echo "Test after failover" > /data/nfs/k3svolumes/failover-test.txt
```
### Handling Stale File Handles After Failover
After a CARP failover, NFS clients may experience "Stale file handle" errors because they cached file handles from the previous server. To resolve this:
**Manual recovery (immediate fix):**
```sh
# Force unmount and remount
[root@r0 ~]# umount -f /data/nfs/k3svolumes
[root@r0 ~]# mount /data/nfs/k3svolumes
```
**Automatic recovery options:**
1. **Use soft mounts with shorter timeouts** in `/etc/fstab`:
```
127.0.0.1:/k3svolumes /data/nfs/k3svolumes nfs4 port=2323,_netdev,soft,timeo=10,retrans=2,intr 0 0
```
2. **Create a monitoring script** that detects and fixes stale mounts (run it from cron, e.g. once per minute):
```sh
#!/bin/bash
# /usr/local/bin/check-nfs-mount.sh
# Crontab entry: * * * * * /usr/local/bin/check-nfs-mount.sh
LOG=/var/log/nfs-mount-check.log
ls /data/nfs/k3svolumes >/dev/null 2>&1
RC=$?
if [ $RC -ne 0 ]; then
    echo "$(date): Stale NFS mount detected (exit code: $RC), remounting..." >> "$LOG"
    umount -f /data/nfs/k3svolumes
    mount /data/nfs/k3svolumes
    echo "$(date): NFS remounted successfully" >> "$LOG"
    ls /data/nfs/k3svolumes >/dev/null 2>&1 && \
        echo "$(date): Mount verified as working" >> "$LOG"
fi
```
3. **For Kubernetes**, use liveness probes that restart pods when NFS becomes stale
**Note**: Stale file handles are inherent to NFS failover because file handles are server-specific. The best approach depends on your application's tolerance for brief disruptions.
### Complete Failover Test
Here's a comprehensive test of the failover behavior with all optimizations in place:
```sh
# 1. Check initial state
paul@f0:~ % ifconfig re0 | grep carp
carp: MASTER vhid 1 advbase 1 advskew 0
paul@f1:~ % ifconfig re0 | grep carp
carp: BACKUP vhid 1 advbase 1 advskew 0
# 2. Create a test file from a client
[root@r0 ~]# echo "test before failover" > /data/nfs/k3svolumes/test-before.txt
# 3. Trigger failover (f0 → f1)
paul@f0:~ % doas ifconfig re0 vhid 1 state backup
# 4. Monitor client behavior
[root@r0 ~]# ls /data/nfs/k3svolumes/
ls: cannot access '/data/nfs/k3svolumes/': Stale file handle
# 5. Check automatic recovery (within 1 minute)
[root@r0 ~]# tail -f /var/log/nfs-mount-check.log
Sat 5 Jul 13:56:02 EEST 2025: Stale NFS mount detected (exit code: 2), remounting...
Sat 5 Jul 13:56:02 EEST 2025: NFS remounted successfully
Sat 5 Jul 13:56:02 EEST 2025: Mount verified as working
```
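The cron log above records exit codes and remount verification that the minimal script shown earlier doesn't emit. A sketch of a fuller logging variant that would produce such lines (paths match this setup; `MNT`/`LOG` are overridable via the environment for testing):

```sh
#!/bin/bash
# /usr/local/bin/check-nfs-mount.sh -- logging variant (sketch)
MNT="${MNT:-/data/nfs/k3svolumes}"
LOG="${LOG:-/var/log/nfs-mount-check.log}"

ls "$MNT" >/dev/null 2>&1
rc=$?
if [ "$rc" -ne 0 ]; then
    echo "$(date): Stale NFS mount detected (exit code: $rc), remounting..." >> "$LOG"
    umount -f "$MNT" 2>/dev/null
    if mount "$MNT" 2>/dev/null; then
        echo "$(date): NFS remounted successfully" >> "$LOG"
    fi
    if ls "$MNT" >/dev/null 2>&1; then
        echo "$(date): Mount verified as working" >> "$LOG"
    fi
fi
```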
**Failover Timeline:**
- **0 seconds**: CARP failover triggered
- **0-2 seconds**: Clients get "Stale file handle" errors (not hanging)
- **3-60 seconds**: Soft mounts ensure quick failure of operations
- **Within 60 seconds**: Automatic recovery via cron job
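The one-minute recovery window comes from running the check script out of cron; for example, as a root crontab entry:

```
# root's crontab (crontab -e): run the stale-mount check every minute
* * * * * /usr/local/bin/check-nfs-mount.sh >> /var/log/nfs-mount-check.log 2>&1
```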
**Benefits of the Optimized Setup:**
1. **No hanging processes** - Soft mounts fail quickly
2. **Clean failover** - Old server stops serving immediately
3. **Automatic recovery** - No manual intervention needed
4. **Predictable timing** - Recovery within 1 minute maximum
**Important Considerations:**
- Recent writes (within 5 minutes) may not be visible after failover due to replication lag
- Applications should handle brief NFS errors gracefully
- For zero-downtime requirements, consider synchronous replication or distributed storage
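For the "handle brief NFS errors gracefully" point, a small retry wrapper is often enough when the application can be driven from the shell. A sketch in POSIX sh (the retried command, attempt count, and delay are illustrative):

```sh
#!/bin/sh
# Retry a command a few times to ride out a brief NFS failover window
retry() {
    n=0
    until "$@"; do
        n=$((n + 1))
        [ "$n" -ge 5 ] && return 1   # give up after 5 attempts
        sleep 2
    done
}

# Example: a read that would otherwise fail once with "Stale file handle"
retry cat /data/nfs/k3svolumes/app-config.txt
```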
### Verifying Replication Status
To check if replication is working correctly:
```sh
# Check replication status
paul@f0:~ % doas zrepl status
# Check recent snapshots on source
paul@f0:~ % doas zfs list -t snapshot -o name,creation zdata/enc/nfsdata | tail -5
# Check recent snapshots on destination
paul@f1:~ % doas zfs list -t snapshot -o name,creation zdata/sink/f0/zdata/enc/nfsdata | tail -5
# Verify data appears on f1 (should be read-only)
paul@f1:~ % ls -la /data/nfs/k3svolumes/
```
**Important**: If you see "connection refused" errors in zrepl logs, ensure:
- Both servers have zrepl running (`doas service zrepl status`)
- No firewall or hosts.allow rules are blocking port 8888
- WireGuard is up if using WireGuard IPs for replication
### Post-Reboot Verification
After rebooting the FreeBSD servers, verify the complete stack:
```sh
# Check CARP status on all servers
paul@f0:~ % ifconfig re0 | grep carp
paul@f1:~ % ifconfig re0 | grep carp
# Verify stunnel is running on the MASTER
paul@f0:~ % doas sockstat -l | grep 2323
# Check NFS is exported
paul@f0:~ % doas showmount -e localhost
# Verify all r servers have NFS mounted
[root@r0 ~]# mount | grep nfs
[root@r1 ~]# mount | grep nfs
[root@r2 ~]# mount | grep nfs
# Test write access
[root@r0 ~]# echo "Test after reboot $(date)" > /data/nfs/k3svolumes/test-reboot.txt
# Verify zrepl is running and replicating
paul@f0:~ % doas service zrepl status
paul@f1:~ % doas service zrepl status
```
### Integration with Kubernetes
In your Kubernetes manifests, you can now create PersistentVolumes using the NFS servers:
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: nfs-pv
spec:
capacity:
storage: 100Gi
accessModes:
- ReadWriteMany
nfs:
server: 192.168.1.138 # f3s-storage-ha.lan (CARP virtual IP)
path: /data/nfs/k3svolumes
mountOptions:
- nfsvers=4
- tcp
- hard
- intr
```
Using the CARP virtual IP (192.168.1.138) instead of direct server IPs ensures that Kubernetes workloads continue to access storage even if the primary NFS server fails. For encryption, configure stunnel on the Kubernetes nodes.
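A PersistentVolumeClaim that binds to this volume might look like the following (the claim name and namespace defaults are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""   # bind to the pre-provisioned PV, not a dynamic class
  volumeName: nfs-pv     # explicitly target the PV defined above
  resources:
    requests:
      storage: 100Gi
```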
### Security Benefits of Stunnel with Client Certificates
Using stunnel with client certificate authentication for NFS encryption provides several advantages:
* **Compatibility**: Works with any NFS version and between different operating systems
* **Strong encryption**: Uses TLS/SSL with configurable cipher suites
* **Transparent**: Applications don't need modification, encryption happens at transport layer
* **Performance**: Minimal overhead (~2% in benchmarks)
* **Flexibility**: Can encrypt any TCP-based protocol, not just NFS
* **Strong Authentication**: Client certificates provide cryptographic proof of identity
* **Access Control**: Only clients with valid certificates signed by your CA can connect
* **Certificate Revocation**: You can revoke access by removing certificates from the CA
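Issuing a certificate for an additional client follows the usual OpenSSL CA flow. A sketch, assuming the CA material is stored as `ca-cert.pem`/`ca-key.pem` (file and host names are illustrative):

```sh
# Hypothetical file names -- adjust to where your stunnel CA lives.
# 1. Key and CSR for a new client (say, a future node "r3")
openssl genrsa -out r3-key.pem 4096
openssl req -new -key r3-key.pem -subj "/CN=r3" -out r3.csr
# 2. Sign the CSR with the CA that the servers' CAfile points at
openssl x509 -req -in r3.csr -CA ca-cert.pem -CAkey ca-key.pem \
    -CAcreateserial -days 3650 -out r3-cert.pem
# 3. stunnel wants key and certificate concatenated in one PEM
cat r3-key.pem r3-cert.pem > r3-stunnel.pem
# 4. Sanity check: the new certificate must verify against the CA
openssl verify -CAfile ca-cert.pem r3-cert.pem
```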
### Laptop/Workstation Access
For development workstations like "earth" (laptop), the same stunnel configuration works, but there's an important caveat with NFSv4:
```sh
# Install stunnel
sudo dnf install stunnel
# Configure stunnel (/etc/stunnel/stunnel.conf)
cert = /etc/stunnel/earth-stunnel.pem
CAfile = /etc/stunnel/ca-cert.pem
client = yes
verify = 2
[nfs-ha]
accept = 127.0.0.1:2323
connect = 192.168.1.138:2323
# Enable and start stunnel
sudo systemctl enable --now stunnel
# Mount NFS through stunnel
sudo mount -t nfs4 -o port=2323 127.0.0.1:/ /data/nfs
# Make persistent in /etc/fstab
127.0.0.1:/ /data/nfs nfs4 port=2323,hard,intr,_netdev 0 0
```
#### Important: NFSv4 and Stunnel on Newer Linux Clients
On newer Linux distributions (such as Fedora 42+), NFSv4 uses the specified port only for the initial mount negotiation and then establishes data connections directly to port 2049, bypassing stunnel. This doesn't happen on the Rocky Linux 9 VMs, which route all their traffic through the specified port.
To ensure all NFS traffic goes through the encrypted tunnel on affected systems, you need to use iptables:
```sh
# Redirect all NFS traffic to the CARP VIP through stunnel
sudo iptables -t nat -A OUTPUT -d 192.168.1.138 -p tcp --dport 2049 -j DNAT --to-destination 127.0.0.1:2323
# Make it persistent (example for Fedora)
sudo dnf install iptables-services
sudo service iptables save
sudo systemctl enable iptables
# Or create a startup script
cat > ~/setup-nfs-stunnel.sh << 'EOF'
#!/bin/bash
# Ensure NFSv4 data connections go through stunnel
sudo iptables -t nat -D OUTPUT -d 192.168.1.138 -p tcp --dport 2049 -j DNAT --to-destination 127.0.0.1:2323 2>/dev/null
sudo iptables -t nat -A OUTPUT -d 192.168.1.138 -p tcp --dport 2049 -j DNAT --to-destination 127.0.0.1:2323
EOF
chmod +x ~/setup-nfs-stunnel.sh
```
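If the startup-script route is chosen, a small systemd oneshot unit can apply the rule at boot. A sketch, assuming the script is moved to `/usr/local/bin` and the `sudo` prefixes are dropped (the unit runs as root); the unit name is illustrative:

```
# /etc/systemd/system/nfs-stunnel-nat.service
[Unit]
Description=Redirect NFS traffic to the CARP VIP through stunnel
After=network-online.target stunnel.service
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/setup-nfs-stunnel.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```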
To verify all traffic is encrypted:
```sh
# Check active connections
sudo ss -tnp | grep -E ":2049|:2323"
# You should see connections to localhost:2323 (stunnel), not direct to the CARP VIP
# Monitor stunnel logs
journalctl -u stunnel -f
# You should see connection logs for all NFS operations
```
**Note**: The laptop has full access to `/data/nfs` with the `-alldirs` export option, while Kubernetes nodes are restricted to `/data/nfs/k3svolumes`.
The client certificate requirement ensures that:
- Only authorized clients (r0, r1, r2, and earth) can establish stunnel connections
- Each client has a unique identity that can be individually managed
- Stolen IP addresses alone cannot grant access without the corresponding certificate
- Access can be revoked without changing the server configuration
The combination of ZFS encryption at rest and stunnel in transit ensures data is protected throughout its lifecycle.
This configuration provides a solid foundation for shared storage in the f3s Kubernetes cluster, with automatic replication and encrypted transport.
## Mounting NFS on Rocky Linux 9
### Installing and Configuring NFS Clients on r0, r1, and r2
First, install the necessary packages on all three Rocky Linux nodes:
```sh
# On r0, r1, and r2
dnf install -y nfs-utils stunnel
```
### Configuring Stunnel Client on All Nodes
Copy the certificate and configure stunnel on each Rocky Linux node:
```sh
# On r0
scp f0:/usr/local/etc/stunnel/stunnel.pem /etc/stunnel/
tee /etc/stunnel/stunnel.conf <<'EOF'
cert = /etc/stunnel/stunnel.pem
client = yes
[nfs-ha]
accept = 127.0.0.1:2323
connect = 192.168.1.138:2323
EOF
systemctl enable --now stunnel
# Repeat the same configuration on r1 and r2
```
### Setting Up NFS Mounts
Create mount points and configure persistent mounts on all nodes:
```sh
# On r0, r1, and r2
mkdir -p /data/nfs/k3svolumes
# Add to /etc/fstab for persistent mount (note the NFSv4 relative path)
echo '127.0.0.1:/k3svolumes /data/nfs/k3svolumes nfs4 port=2323,hard,intr,_netdev 0 0' >> /etc/fstab
# Mount the share
mount /data/nfs/k3svolumes
```
### Comprehensive NFS Mount Testing
Here's a detailed test plan to verify NFS mounts are working correctly on all nodes:
#### Test 1: Verify Mount Status on All Nodes
```sh
# On r0
[root@r0 ~]# mount | grep k3svolumes
# Expected output (note the NFSv4 relative export path):
# 127.0.0.1:/k3svolumes on /data/nfs/k3svolumes type nfs4 (rw,relatime,vers=4.2,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=2323,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1)
# On r1
[root@r1 ~]# mount | grep k3svolumes
# Should show similar output
# On r2
[root@r2 ~]# mount | grep k3svolumes
# Should show similar output
```
#### Test 2: Verify Stunnel Connectivity
```sh
# On r0
[root@r0 ~]# systemctl status stunnel
# Should show: Active: active (running)
[root@r0 ~]# ss -tnl | grep 2323
# Should show: LISTEN 0 128 127.0.0.1:2323 0.0.0.0:*
# Test connection to CARP VIP
[root@r0 ~]# nc -zv 192.168.1.138 2323
# Should show: Connection to 192.168.1.138 2323 port [tcp/*] succeeded!
# Repeat on r1 and r2
```
#### Test 3: File Creation and Visibility Test
```sh
# On r0 - Create test file
[root@r0 ~]# echo "Test from r0 - $(date)" > /data/nfs/k3svolumes/test-r0.txt
[root@r0 ~]# ls -la /data/nfs/k3svolumes/test-r0.txt
# Should show the file with timestamp
# On r1 - Create test file and check r0's file
[root@r1 ~]# echo "Test from r1 - $(date)" > /data/nfs/k3svolumes/test-r1.txt
[root@r1 ~]# ls -la /data/nfs/k3svolumes/
# Should show both test-r0.txt and test-r1.txt
# On r2 - Create test file and check all files
[root@r2 ~]# echo "Test from r2 - $(date)" > /data/nfs/k3svolumes/test-r2.txt
[root@r2 ~]# ls -la /data/nfs/k3svolumes/
# Should show all three files: test-r0.txt, test-r1.txt, test-r2.txt
```
#### Test 4: Verify Files on Storage Servers
```sh
# On f0 (primary storage)
paul@f0:~ % ls -la /data/nfs/k3svolumes/
# Should show all three test files
# Wait 5 minutes for replication, then check on f1
paul@f1:~ % ls -la /data/nfs/k3svolumes/
# Should show all three test files (after replication)
```
#### Test 5: Performance and Concurrent Access Test
```sh
# On r0 - Write large file
[root@r0 ~]# dd if=/dev/zero of=/data/nfs/k3svolumes/test-large-r0.dat bs=1M count=100
# Should complete without errors
# On r1 - Read the file while r2 writes
[root@r1 ~]# dd if=/data/nfs/k3svolumes/test-large-r0.dat of=/dev/null bs=1M &
# Simultaneously on r2
[root@r2 ~]# dd if=/dev/zero of=/data/nfs/k3svolumes/test-large-r2.dat bs=1M count=100
# Check for any errors or performance issues
```
#### Test 6: Directory Operations Test
```sh
# On r0 - Create directory structure
[root@r0 ~]# mkdir -p /data/nfs/k3svolumes/test-dir/subdir1/subdir2
[root@r0 ~]# echo "Deep file" > /data/nfs/k3svolumes/test-dir/subdir1/subdir2/deep.txt
# On r1 - Verify and add files
[root@r1 ~]# ls -la /data/nfs/k3svolumes/test-dir/subdir1/subdir2/
[root@r1 ~]# echo "Another file from r1" > /data/nfs/k3svolumes/test-dir/subdir1/file-r1.txt
# On r2 - Verify complete structure
[root@r2 ~]# find /data/nfs/k3svolumes/test-dir -type f
# Should show both files
```
#### Test 7: Permission and Ownership Test
```sh
# On r0 - Create files with different permissions
[root@r0 ~]# touch /data/nfs/k3svolumes/test-perms-644.txt
[root@r0 ~]# chmod 644 /data/nfs/k3svolumes/test-perms-644.txt
[root@r0 ~]# touch /data/nfs/k3svolumes/test-perms-755.txt
[root@r0 ~]# chmod 755 /data/nfs/k3svolumes/test-perms-755.txt
# On r1 and r2 - Verify permissions are preserved
[root@r1 ~]# ls -l /data/nfs/k3svolumes/test-perms-*.txt
[root@r2 ~]# ls -l /data/nfs/k3svolumes/test-perms-*.txt
# Permissions should match what was set on r0
```
#### Test 8: Failover Test (Optional but Recommended)
```sh
# On f0 - Trigger CARP failover
paul@f0:~ % doas ifconfig re0 vhid 1 state backup
# On all Rocky nodes - Verify mounts still work
[root@r0 ~]# echo "Test during failover from r0 - $(date)" > /data/nfs/k3svolumes/failover-test-r0.txt
[root@r1 ~]# echo "Test during failover from r1 - $(date)" > /data/nfs/k3svolumes/failover-test-r1.txt
[root@r2 ~]# echo "Test during failover from r2 - $(date)" > /data/nfs/k3svolumes/failover-test-r2.txt
# Verify all files are accessible
[root@r0 ~]# ls -la /data/nfs/k3svolumes/failover-test-*.txt
# On f1 - Verify it's now MASTER
paul@f1:~ % ifconfig re0 | grep carp
# Should show: carp: MASTER (f1 now holds the VIP 192.168.1.138)
# Restore f0 as MASTER
paul@f0:~ % doas ifconfig re0 vhid 1 state master
```
### Troubleshooting Common Issues
#### Mount Hangs or Times Out
```sh
# Check stunnel connectivity
systemctl status stunnel
ss -tnl | grep 2323
telnet 127.0.0.1 2323
# Check if you can reach the CARP VIP
ping 192.168.1.138
nc -zv 192.168.1.138 2323
# Check for firewall issues
iptables -L -n | grep 2323
```
#### Permission Denied Errors
```sh
# Verify the export allows your IP
# On f0 or f1
doas showmount -e localhost
# Check if SELinux is blocking (on Rocky Linux)
getenforce
# If enforcing, try:
setenforce 0 # Temporary for testing
# Or enable the relevant SELinux boolean:
setsebool -P use_nfs_home_dirs 1
```
#### Files Not Visible Across Nodes
```sh
# Force NFS cache refresh
# On the affected node
umount /data/nfs/k3svolumes
mount /data/nfs/k3svolumes
# Check NFS version
nfsstat -m
# Should show NFSv4
```
#### I/O Errors When Accessing NFS Mount
I/O errors can have several causes:
1. **Missing localhost in exports** (most common with stunnel):
- Since stunnel proxies connections, the NFS server sees requests from 127.0.0.1
- Ensure your exports include localhost access:
```
/data/nfs/k3svolumes -maproot=root -network 127.0.0.1 -mask 255.255.255.255
```
2. **Stunnel connection issues or CARP failover**:
```sh
# On the affected node (e.g., r0)
# Check stunnel is running
systemctl status stunnel
# Restart stunnel to re-establish connection
systemctl restart stunnel
# Force remount
umount -f -l /data/nfs/k3svolumes
mount -t nfs4 -o port=2323,hard,intr 127.0.0.1:/k3svolumes /data/nfs/k3svolumes
# Check which FreeBSD host is CARP MASTER
# On f0
ssh f0 "ifconfig re0 | grep carp"
# On f1
ssh f1 "ifconfig re0 | grep carp"
# Verify stunnel on MASTER is bound to VIP
# On the MASTER host
ssh <master-host> "sockstat -l | grep 2323"
# Debug stunnel connection
openssl s_client -connect 192.168.1.138:2323 </dev/null
# If persistent I/O errors, check logs
journalctl -u stunnel -n 50
dmesg | tail -20 | grep -i nfs
```
### Comprehensive Production Test Results
After implementing all the improvements (enhanced CARP control script, soft mounts, and automatic recovery), here's a complete test of the setup including reboots and failovers:
#### Test Scenario: Full System Reboot and Failover
```
1. Initial state: Rebooted all servers (f0, f1, f2)
- Result: f1 became CARP MASTER after reboot (not always f0)
- NFS accessible and writable from all clients
2. Created test file from laptop:
paul@earth:~ % echo "Post-reboot test at $(date)" > /data/nfs/k3svolumes/reboot-test.txt
3. Verified 1-minute replication to f1:
- File appeared on f1 within 70 seconds
- Content identical on both servers
4. Performed failover from f0 to f1:
paul@f0:~ % doas ifconfig re0 vhid 1 state backup
- f1 immediately became MASTER
- Clients experienced "Stale file handle" errors
- With soft mounts: No hanging, immediate error response
5. Recovery time:
- Manual recovery: Immediate with umount/mount
- Automatic recovery: Within 60 seconds via cron job
- No data loss during failover
6. Failback to f0:
paul@f1:~ % doas ifconfig re0 vhid 1 state backup
- f0 reclaimed MASTER status
- Similar stale handle behavior
- Recovery within 60 seconds
```
#### Key Findings
1. **CARP Master Selection**: After reboot, either f0 or f1 can become MASTER. This is normal CARP behavior and doesn't affect functionality.
2. **Stale File Handles**: Despite all optimizations, NFS clients still experience stale file handles during failover. This is inherent to NFS protocol design. However:
- Soft mounts prevent hanging
- Automatic recovery works reliably
- No data loss occurs
3. **Replication Timing**: The 1-minute replication interval for NFS data ensures minimal data loss window during unplanned failovers. The Fedora VM replication runs every 10 minutes, which is sufficient for less critical VM data.
4. **Service Management**: The enhanced carpcontrol.sh script successfully stops services on BACKUP nodes, preventing split-brain scenarios.
### Cleanup After Testing
```sh
# Remove test files (run on any node)
rm -f /data/nfs/k3svolumes/test-*.txt
rm -f /data/nfs/k3svolumes/test-large-*.dat
rm -f /data/nfs/k3svolumes/failover-test-*.txt
rm -f /data/nfs/k3svolumes/test-perms-*.txt
rm -rf /data/nfs/k3svolumes/test-dir
```
This comprehensive testing ensures that:
- All nodes can mount the NFS share
- Files created on one node are visible on all others
- The encrypted stunnel connection is working
- Permissions and ownership are preserved
- The setup can handle concurrent access
- Failover works correctly (if tested)
Other *BSD-related posts:
<< template::inline::rindex bsd
E-Mail your comments to `paul@nospam.buetow.org`
=> ../ Back to the main site
=> https://forums.freebsd.org/threads/hast-and-zfs-with-carp-failover.29639/ HAST and ZFS with CARP failover (FreeBSD forums)