Diffstat (limited to 'docs/clickhouse-streaming-plan.md')
| -rw-r--r-- | docs/clickhouse-streaming-plan.md | 1623 |
1 file changed, 1623 insertions, 0 deletions
diff --git a/docs/clickhouse-streaming-plan.md b/docs/clickhouse-streaming-plan.md new file mode 100644 index 0000000..785d8f4 --- /dev/null +++ b/docs/clickhouse-streaming-plan.md @@ -0,0 +1,1623 @@ +# ClickHouse Integration Plan for IOR + +This document outlines the implementation plan for integrating ClickHouse database output into IOR, supporting both real-time streaming and batch file export. + +## Overview + +| Mode | Format | Use Case | Expected Throughput | +|------|--------|----------|---------------------| +| **Streaming** | Native TCP protocol | Real-time ingestion to ClickHouse | 100K-1M events/sec | +| **File Dump** | Parquet | Batch export for later import | Same (offline processing) | + +## Architecture + +### Current Data Flow +``` +BPF → Ring Buffer → eventLoop → event.Pair → [flamegraph workers | console output] +``` + +### Proposed Data Flow +``` +BPF → Ring Buffer → eventLoop → event.Pair → [clickhouse-stream | parquet-writer | flamegraph | console] +``` + +--- + +## Part 1: ClickHouse Streaming (Native TCP Protocol) + +### 1.1 Dependencies + +Add to `go.mod`: +```go +require ( + github.com/ClickHouse/clickhouse-go/v2 v2.23.0 +) +``` + +### 1.2 New Package Structure + +``` +internal/ +├── clickhouse/ +│ ├── client.go # Connection management, connection pooling +│ ├── schema.go # Table schema definitions and DDL +│ ├── writer.go # Batch writer with buffering +│ ├── config.go # Configuration (host, port, database, table) +│ └── client_test.go # Unit tests +``` + +### 1.3 ClickHouse Table Schema + +```sql +CREATE TABLE ior_events ( + timestamp_ns UInt64, -- Event timestamp (nanoseconds) + pid UInt32, -- Process ID (high cardinality, no LowCardinality) + tid UInt32, -- Thread ID (high cardinality, no LowCardinality) + comm LowCardinality(String), + syscall_name LowCardinality(String), + trace_id UInt32, + event_type UInt8, -- ENTER_OPEN_EVENT, EXIT_OPEN_EVENT, etc. 
+ + -- Result + ret_value Int64, -- Syscall return value + ret_type UInt32, -- Return type classification + + -- File information + fd Int32, + filename String, + pathname String, + oldname String, + newname String, + + -- Flags and metadata + flags Int32, + + -- Calculated fields (from event.Pair) + duration_ns UInt64, -- Duration of syscall + duration_to_prev_ns UInt64, -- Time since previous syscall + + -- Additional context + hostname LowCardinality(String), + collection_id UUID, -- Groups events from same collection run + ingested_at DateTime64(3) DEFAULT now64(3), + + -- Secondary indices for high-cardinality fields + INDEX idx_pid pid TYPE bloom_filter(0.01) GRANULARITY 4, + INDEX idx_tid tid TYPE bloom_filter(0.01) GRANULARITY 4 +) +ENGINE = MergeTree() +PARTITION BY toYYYYMMDD(fromUnixTimestamp64Nano(timestamp_ns)) +ORDER BY (timestamp_ns, pid, tid) +SETTINGS index_granularity = 8192; +``` + +**Cardinality Considerations:** +| Field | Cardinality | Encoding | Reason | +|-------|-------------|----------|--------| +| `pid` | High (thousands) | Plain UInt32 | PIDs can range into thousands per server; bloom filter index for point lookups | +| `tid` | Very High (tens of thousands) | Plain UInt32 | TIDs are numerous in threaded workloads; bloom filter index for point lookups | +| `comm` | Low (hundreds) | LowCardinality | Limited number of unique process names | +| `syscall_name` | Very Low (~100) | LowCardinality | Fixed set of syscalls | +| `hostname` | Very Low | LowCardinality | Usually single host per collection | +| `filename` | Medium-High | Plain String | Depends on workload; could use token bloom filter | + +**Optimization Notes:** +- **Bloom filter indices** on `pid` and `tid` enable efficient point lookups on these high-cardinality fields without bloating storage +- `LowCardinality` only for truly low-cardinality fields (`comm`, `syscall_name`, `hostname`) +- Partitioning by day for efficient time-based queries and TTL +- Ordering by `(timestamp_ns, pid, tid)` for time-range queries and per-process/thread analysis +- `collection_id` UUID to group events from the same tracing session + +### 1.4 Implementation Details + +#### 1.4.1 Configuration (`internal/clickhouse/config.go`) + +```go +package clickhouse + +type Config struct { + Host string // ClickHouse host (default: localhost) + Port int // ClickHouse port (default: 9000) + Database string // Database name (default: ior) + Table string // Table name (default: ior_events) + User string // Username + Password string // Password + BatchSize int // Events per batch (default: 10000) + FlushTimeout time.Duration // Max time before flush (default: 1s) + MaxOpenConns int // Connection pool size (default: 4) + Async bool // Enable async inserts (default: true for high throughput) +} + +func DefaultConfig() Config { + return Config{ + Host: "localhost", + Port: 9000, + Database: "ior", + Table: "ior_events", + BatchSize: 10000, + FlushTimeout: time.Second, + MaxOpenConns: 4, + Async: true, + } +} + +func ConfigFromFlags() Config { + // Read from command-line flags +} +``` + +#### 1.4.2 Client (`internal/clickhouse/client.go`) + +```go +package clickhouse + +import ( + "context" + "github.com/ClickHouse/clickhouse-go/v2" + "github.com/ClickHouse/clickhouse-go/v2/lib/driver" +) + +type Client struct { + conn driver.Conn + config Config +} + +func NewClient(ctx context.Context, config Config) (*Client, error) { + conn, err := clickhouse.Open(&clickhouse.Options{ + Addr: []string{fmt.Sprintf("%s:%d", config.Host, config.Port)}, + 
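+		// Addr accepts more than one "host:port" entry for multi-node setups;
+		// a single local node is assumed throughout this plan.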
Auth: clickhouse.Auth{ + Database: config.Database, + Username: config.User, + Password: config.Password, + }, + MaxOpenConns: config.MaxOpenConns, + MaxIdleConns: config.MaxOpenConns, + ConnMaxLifetime: time.Hour, + DialTimeout: time.Second * 10, + Settings: clickhouse.Settings{ + "max_execution_time": 60, + }, + Compression: &clickhouse.Compression{ + Method: clickhouse.CompressionLZ4, + }, + BlockBufferSize: 10, + }) + if err != nil { + return nil, err + } + return &Client{conn: conn, config: config}, nil +} + +func (c *Client) Close() error { + return c.conn.Close() +} +``` + +#### 1.4.3 Batch Writer (`internal/clickhouse/writer.go`) + +Key design for 100K-1M events/sec: +- **Double buffering**: While one buffer is being sent, the other is being filled +- **Sync.Pool for event rows**: Reduce GC pressure +- **Async inserts**: ClickHouse async mode for lower latency +- **Connection pooling**: Multiple connections for parallel inserts + +```go +package clickhouse + +import ( + "context" + "sync" + "time" + + "ior/internal/event" +) + +type EventRow struct { + TimestampNs uint64 + Pid uint32 + Tid uint32 + Comm string + SyscallName string + TraceId uint32 + EventType uint8 + RetValue int64 + RetType uint32 + Fd int32 + Filename string + Pathname string + Oldname string + Newname string + Flags int32 + DurationNs uint64 + DurationToPrev uint64 + Hostname string + CollectionId string +} + +type Writer struct { + client *Client + config Config + + bufferMu sync.Mutex + buffer []*EventRow + bufferSize int + + flushTimer *time.Timer + flushCh chan struct{} + + ctx context.Context + cancel context.CancelFunc + wg sync.WaitGroup + + pool sync.Pool +} + +func NewWriter(ctx context.Context, client *Client, config Config) *Writer { + ctx, cancel := context.WithCancel(ctx) + w := &Writer{ + client: client, + config: config, + buffer: make([]*EventRow, 0, config.BatchSize), + flushCh: make(chan struct{}, 1), + ctx: ctx, + cancel: cancel, + pool: sync.Pool{ + New: func() interface{} { return &EventRow{} }, + }, + } + w.flushTimer = time.AfterFunc(config.FlushTimeout, w.triggerFlush) + go w.flushLoop() + return w +} + +func (w *Writer) Write(ep *event.Pair) error { + row := w.pool.Get().(*EventRow) + w.populateRow(row, ep) + + w.bufferMu.Lock() + w.buffer = append(w.buffer, row) + shouldFlush := len(w.buffer) >= w.config.BatchSize + w.bufferMu.Unlock() + + if shouldFlush { + w.triggerFlush() + } + return nil +} + +func (w *Writer) triggerFlush() { + select { + case w.flushCh <- struct{}{}: + default: + } +} + +func (w *Writer) flushLoop() { + for { + select { + case <-w.flushCh: + w.flush() + case <-w.ctx.Done(): + w.flush() + return + } + } +} + +func (w *Writer) flush() { + w.bufferMu.Lock() + if len(w.buffer) == 0 { + w.bufferMu.Unlock() + return + } + buffer := w.buffer + w.buffer = make([]*EventRow, 0, w.config.BatchSize) + w.bufferMu.Unlock() + + ctx := clickhouse.Context(w.ctx, clickhouse.WithSettings(clickhouse.Settings{ + "async_insert": 1, + "wait_for_async_insert": 0, + })) + + batch, err := w.client.conn.PrepareBatch(ctx, + "INSERT INTO ior_events (timestamp_ns, pid, tid, comm, syscall_name, ...)") + if err != nil { + return + } + + for _, row := range buffer { + batch.Append( + row.TimestampNs, row.Pid, row.Tid, row.Comm, + row.SyscallName, row.TraceId, row.EventType, + row.RetValue, row.RetType, row.Fd, row.Filename, + row.Pathname, row.Oldname, row.Newname, row.Flags, + row.DurationNs, row.DurationToPrev, row.Hostname, + row.CollectionId, + ) + w.pool.Put(row) + } + + batch.Send() +} + 
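+// populateRow is called from Write above but is not spelled out in this plan;
+// a minimal sketch follows, reusing the event.Pair accessors shown by the
+// shared converter in Part 3.1 (GetTime, GetPid, GetTid, GetTraceId). Only the
+// core fields are filled here.
+func (w *Writer) populateRow(row *EventRow, ep *event.Pair) {
+	// Rows come from a sync.Pool, so reset them first to avoid leaking stale
+	// values from a previous event into the batch.
+	*row = EventRow{}
+	row.TimestampNs = ep.EnterEv.GetTime()
+	row.Pid = ep.EnterEv.GetPid()
+	row.Tid = ep.EnterEv.GetTid()
+	row.Comm = ep.Comm
+	row.SyscallName = ep.EnterEv.GetTraceId().Name()
+	row.TraceId = uint32(ep.EnterEv.GetTraceId())
+	row.DurationNs = ep.Duration
+	row.DurationToPrev = ep.DurationToPrev
+}
+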
+func (w *Writer) Close() { + w.cancel() + w.wg.Wait() + w.flushTimer.Stop() +} +``` + +#### 1.4.4 Schema Management (`internal/clickhouse/schema.go`) + +```go +package clickhouse + +func (c *Client) CreateTableIfNotExists(ctx context.Context) error { + ddl := `CREATE TABLE IF NOT EXISTS ior_events (...)` + return c.conn.Exec(ctx, ddl) +} +``` + +### 1.5 Integration with Event Loop + +Modify `internal/eventloop.go`: + +```go +func (e *eventLoop) run(ctx context.Context, rawCh <-chan []byte) { + var chWriter *clickhouse.Writer + if flags.Get().ClickHouseEnable { + chClient, err := clickhouse.NewClient(ctx, clickhouse.ConfigFromFlags()) + if err != nil { + panic(err) + } + defer chClient.Close() + chWriter = clickhouse.NewWriter(ctx, chClient, clickhouse.ConfigFromFlags()) + defer chWriter.Close() + } + + for ep := range e.events(ctx, rawCh) { + switch { + case flags.Get().ClickHouseEnable: + chWriter.Write(ep) + ep.Recycle() + case flags.Get().FlamegraphEnable: + e.flamegraph.Ch <- ep + // ... rest + } + } +} +``` + +### 1.6 New Command-Line Flags + +Add to `internal/flags/flags.go`: + +```go +type Flags struct { + // ... existing fields ... + + // ClickHouse streaming + ClickHouseEnable bool + ClickHouseHost string + ClickHousePort int + ClickHouseDatabase string + ClickHouseTable string + ClickHouseUser string + ClickHouseBatchSize int + ClickHouseAsync bool +} + +func parse() { + // ... existing flags ... + + flag.BoolVar(&singleton.ClickHouseEnable, "clickhouse", false, + "Enable ClickHouse streaming output") + flag.StringVar(&singleton.ClickHouseHost, "ch-host", "localhost", + "ClickHouse host") + flag.IntVar(&singleton.ClickHousePort, "ch-port", 9000, + "ClickHouse native port") + flag.StringVar(&singleton.ClickHouseDatabase, "ch-db", "ior", + "ClickHouse database") + flag.StringVar(&singleton.ClickHouseTable, "ch-table", "ior_events", + "ClickHouse table") + flag.StringVar(&singleton.ClickHouseUser, "ch-user", "", + "ClickHouse user") + flag.IntVar(&singleton.ClickHouseBatchSize, "ch-batch", 10000, + "ClickHouse batch size") +} +``` + +--- + +## Part 2: Parquet File Export + +### 2.1 Dependencies + +Add to `go.mod`: +```go +require ( + github.com/parquet-go/parquet-go v0.23.0 +) +``` + +### 2.2 New Package Structure + +``` +internal/ +├── parquet/ +│ ├── writer.go # Parquet file writer +│ ├── schema.go # Parquet schema definition +│ └── writer_test.go # Unit tests +``` + +### 2.3 Parquet Schema + +```go +package parquet + +import "github.com/parquet-go/parquet-go" + +type EventRow struct { + TimestampNs uint64 `parquet:"timestamp_ns"` + Pid uint32 `parquet:"pid"` + Tid uint32 `parquet:"tid"` + Comm string `parquet:"comm,dict"` + SyscallName string `parquet:"syscall_name,dict"` + TraceId uint32 `parquet:"trace_id"` + EventType uint8 `parquet:"event_type"` + RetValue int64 `parquet:"ret_value"` + RetType uint32 `parquet:"ret_type"` + Fd int32 `parquet:"fd"` + Filename string `parquet:"filename"` + Pathname string `parquet:"pathname"` + Oldname string `parquet:"oldname"` + Newname string `parquet:"newname"` + Flags int32 `parquet:"flags"` + DurationNs uint64 `parquet:"duration_ns"` + DurationToPrev uint64 `parquet:"duration_to_prev_ns"` + Hostname string `parquet:"hostname,dict"` + CollectionId string `parquet:"collection_id"` +} +``` + +**Parquet Optimizations:** +- `dict` encoding for low-cardinality strings (`comm`, `syscall_name`, `hostname`) +- Zstd compression (same as current GOB files) +- Row groups of ~128MB for efficient querying +- Column pruning - ClickHouse only reads 
needed columns + +### 2.4 Implementation + +#### 2.4.1 Writer (`internal/parquet/writer.go`) + +```go +package parquet + +import ( + "context" + "fmt" + "os" + "sync" + "time" + + "github.com/parquet-go/parquet-go" + "github.com/parquet-go/parquet-go/compress/zstd" + "ior/internal/event" +) + +type Writer struct { + file *os.File + writer *parquet.GenericWriter[EventRow] + + bufferMu sync.Mutex + buffer []EventRow + bufferSize int + + rowGroupSize int + + ctx context.Context + cancel context.CancelFunc +} + +type WriterConfig struct { + Filename string + BatchSize int + RowGroupSize int + Compression parquet.Compression +} + +func DefaultWriterConfig() WriterConfig { + hostname, _ := os.Hostname() + return WriterConfig{ + Filename: fmt.Sprintf("%s-%s.parquet", hostname, time.Now().Format("2006-01-02_15:04:05")), + BatchSize: 10000, + RowGroupSize: 100000, + Compression: parquet.Zstd, + } +} + +func NewWriter(ctx context.Context, config WriterConfig) (*Writer, error) { + file, err := os.Create(config.Filename) + if err != nil { + return nil, err + } + + writer := parquet.NewGenericWriter[EventRow](file, + parquet.Compression(&zstd.Codec{Level: zstd.DefaultLevel}), + parquet.RowGroupSize(config.RowGroupSize), + ) + + ctx, cancel := context.WithCancel(ctx) + return &Writer{ + file: file, + writer: writer, + buffer: make([]EventRow, 0, config.BatchSize), + bufferSize: config.BatchSize, + rowGroupSize: config.RowGroupSize, + ctx: ctx, + cancel: cancel, + }, nil +} + +func (w *Writer) Write(ep *event.Pair) error { + row := EventRow{ + TimestampNs: ep.EnterEv.GetTime(), + Pid: ep.EnterEv.GetPid(), + Tid: ep.EnterEv.GetTid(), + Comm: ep.Comm, + SyscallName: ep.EnterEv.GetTraceId().Name(), + DurationNs: ep.Duration, + DurationToPrev: ep.DurationToPrev, + } + + w.bufferMu.Lock() + w.buffer = append(w.buffer, row) + shouldFlush := len(w.buffer) >= w.bufferSize + w.bufferMu.Unlock() + + if shouldFlush { + return w.flush() + } + return nil +} + +func (w *Writer) flush() error { + w.bufferMu.Lock() + defer w.bufferMu.Unlock() + + if len(w.buffer) == 0 { + return nil + } + + _, err := w.writer.Write(w.buffer) + w.buffer = w.buffer[:0] + return err +} + +func (w *Writer) Close() error { + w.cancel() + if err := w.flush(); err != nil { + return err + } + if err := w.writer.Close(); err != nil { + return err + } + return w.file.Close() +} +``` + +### 2.5 ClickHouse Import Command + +After generating a Parquet file: + +```bash +# Local file import +clickhouse-client --query " + INSERT INTO ior.ior_events + FROM INFILE 'hostname-2024-01-15_10:30:00.parquet' + FORMAT Parquet" + +# Or via HTTP (remote server) +curl -X POST 'http://clickhouse:8123/?query=INSERT+INTO+ior.ior_events+FORMAT+Parquet' \ + --data-binary @hostname-2024-01-15_10:30:00.parquet +``` + +### 2.6 New Command-Line Flags + +Add to `internal/flags/flags.go`: + +```go +// Parquet file output +ParquetEnable bool +ParquetFilename string +ParquetBatchSize int +ParquetRowGroupSize int +``` + +--- + +## Part 3: Shared Components + +### 3.1 Event-to-Row Converter + +Both ClickHouse and Parquet need to convert `event.Pair` to a row format: + +```go +// internal/export/convert.go + +package export + +import ( + "ior/internal/event" + "ior/internal/types" +) + +type EventRow struct { + TimestampNs uint64 + Pid uint32 + Tid uint32 + Comm string + SyscallName string + TraceId uint32 + EventType uint8 + RetValue int64 + RetType uint32 + Fd int32 + Filename string + Pathname string + Oldname string + Newname string + Flags int32 + DurationNs uint64 + 
DurationToPrev uint64 + Hostname string + CollectionId string +} + +func PairToRow(ep *event.Pair, hostname, collectionId string) EventRow { + row := EventRow{ + TimestampNs: ep.EnterEv.GetTime(), + Pid: ep.EnterEv.GetPid(), + Tid: ep.EnterEv.GetTid(), + Comm: ep.Comm, + SyscallName: ep.EnterEv.GetTraceId().Name(), + TraceId: uint32(ep.EnterEv.GetTraceId()), + DurationNs: ep.Duration, + DurationToPrev: ep.DurationToPrev, + Hostname: hostname, + CollectionId: collectionId, + } + + switch enter := ep.EnterEv.(type) { + case *types.OpenEvent: + row.EventType = types.ENTER_OPEN_EVENT + row.Filename = types.StringValue(enter.Filename[:]) + row.Flags = enter.Flags + case *types.FdEvent: + row.EventType = types.ENTER_FD_EVENT + row.Fd = enter.Fd + } + + if ret, ok := ep.ExitEv.(*types.RetEvent); ok { + row.RetValue = ret.Ret + row.RetType = ret.RetType + } + + return row +} +``` + +### 3.2 Output Mode Selector + +```go +// internal/output/output.go + +package output + +type Output interface { + Write(ep *event.Pair) error + Close() error +} + +func NewOutput(ctx context.Context, flags flags.Flags) (Output, error) { + switch { + case flags.ClickHouseEnable: + return clickhouse.NewWriter(ctx, ...) + case flags.ParquetEnable: + return parquet.NewWriter(ctx, ...) + case flags.FlamegraphEnable: + return flamegraph.NewCollector(ctx, ...) + default: + return &consoleOutput{}, nil + } +} +``` + +--- + +## Part 4: Performance Considerations + +### 4.1 Throughput Targets: 100K-1M events/sec + +| Component | Strategy | +|-----------|----------| +| **Memory** | `sync.Pool` for EventRow reuse, avoid allocations in hot path | +| **Buffering** | Double buffering: fill one buffer while sending another | +| **Batching** | Batch inserts: 10K-100K rows per batch | +| **Compression** | LZ4 for streaming (fast), Zstd for files (compact) | +| **Concurrency** | Multiple writer goroutines with separate connections | +| **Backpressure** | Drop events if buffer full (configurable), report stats | + +### 4.2 Memory Budget + +For 1M events/sec with 10KB per event row: +- Raw: 10GB/sec (too high) +- With batching and buffering: ~100MB buffer is sufficient + +Buffer sizing: +- 100K events × ~200 bytes per row = ~20MB per buffer +- Double buffering = ~40MB total + +### 4.3 ClickHouse Server-Side Settings + +```sql +SET max_insert_block_size = 1048576; +SET max_block_size = 65536; +SET async_insert_max_data_size = 10000000; +SET async_insert_busy_timeout_ms = 1000; +``` + +--- + +## Part 5: Testing Strategy + +### 5.1 Unit Tests + +``` +internal/clickhouse/ +├── client_test.go # Mock server tests +├── writer_test.go # Buffer management, batch logic +└── schema_test.go # DDL generation + +internal/parquet/ +├── writer_test.go # File writing, schema validation +└── convert_test.go # Event to row conversion +``` + +### 5.2 Integration Tests + +```bash +docker run -d --name clickhouse -p 9000:9000 clickhouse/clickhouse-server +make test-integration-clickhouse +make test-integration-parquet +``` + +### 5.3 Benchmark Tests + +```go +// internal/bench_test.go +func BenchmarkClickHouseWriter(b *testing.B) {} + +func BenchmarkParquetWriter(b *testing.B) {} +``` + +--- + +## Part 6: Implementation Order + +### Phase 1: Foundation +1. Add dependencies to `go.mod` +2. Create `internal/export/convert.go` - shared row conversion +3. Create `internal/output/output.go` - output interface + +### Phase 2: Parquet Export +1. Create `internal/parquet/` package +2. Add parquet flags to `internal/flags/` +3. 
Integrate with `internal/eventloop.go` +4. Add unit tests +5. Test ClickHouse import + +### Phase 3: ClickHouse Streaming +1. Create `internal/clickhouse/` package +2. Add ClickHouse flags to `internal/flags/` +3. Implement double-buffering writer +4. Integrate with `internal/eventloop.go` +5. Add unit tests and integration tests + +### Phase 4: Polish +1. Performance benchmarks and optimization +2. Documentation +3. Error handling and recovery +4. Metrics/monitoring integration + +--- + +## Part 7: Usage Examples + +### Parquet Export + +```bash +# Capture to Parquet file +ior -parquet -name my_trace -duration 60 + +# Import to ClickHouse +clickhouse-client --query " + INSERT INTO ior.ior_events + FROM INFILE 'myhost-2024-01-15_10:30:00.parquet' + FORMAT Parquet" +``` + +### Real-time Streaming + +```bash +# Stream to ClickHouse +ior -clickhouse -ch-host ch-server.example.com -ch-db ior -duration 300 + +# With authentication +ior -clickhouse -ch-host ch.example.com -ch-user app -ch-password secret +``` + +### Query Examples + +```sql +-- Top 10 slowest syscalls by average duration +SELECT + syscall_name, + count() as total, + avg(duration_ns) as avg_duration, + quantile(0.99)(duration_ns) as p99_duration +FROM ior_events +WHERE timestamp_ns > now() - INTERVAL 1 HOUR +GROUP BY syscall_name +ORDER BY avg_duration DESC +LIMIT 10; + +-- Events per process +SELECT + pid, + comm, + count() as total_events, + sum(duration_ns) / 1e9 as total_duration_sec +FROM ior_events +WHERE collection_id = 'uuid-here' +GROUP BY pid, comm +ORDER BY total_events DESC; + +-- Per-thread I/O activity (leverages bloom filter on tid) +SELECT + pid, + tid, + comm, + count() as total_events, + sum(duration_ns) / 1e9 as total_duration_sec, + uniqExact(syscall_name) as unique_syscalls +FROM ior_events +WHERE collection_id = 'uuid-here' +GROUP BY pid, tid, comm +ORDER BY total_events DESC +LIMIT 50; + +-- Thread with most I/O latency (bloom filter helps for specific tid lookup) +SELECT + pid, tid, comm, + sum(duration_ns) / 1e6 as total_latency_ms, + avg(duration_ns) as avg_latency_ns, + count() as event_count +FROM ior_events +WHERE tid = 12345 -- bloom filter index used here + AND collection_id = 'uuid-here' +GROUP BY pid, tid, comm; + +-- Most accessed files +SELECT + filename, + count() as access_count, + sum(duration_ns) / 1e9 as total_duration_sec +FROM ior_events +WHERE filename != '' +GROUP BY filename +ORDER BY access_count DESC +LIMIT 20; +``` + +--- + +## Appendix A: File Sizes Estimation + +For 1M events: +| Format | Size (estimated) | +|--------|------------------| +| GOB + zstd (current) | ~50-80 MB | +| Parquet + zstd | ~40-60 MB | +| RowBinary | ~60-80 MB | +| JSON (not recommended) | ~200-300 MB | + +--- + +## Appendix C: ClickHouse Optimization Strategies + +### C.1 High Cardinality Optimization + +**Problem**: Fields like `tid` (thread ID) can have tens of thousands of unique values, making standard indexing inefficient. 
+ +| Technique | Description | Best For | +|-----------|-------------|----------| +| **Bloom Filter Index** | Probabilistic index for point lookups (~1% storage overhead) | `WHERE tid = 12345` | +| **Minmax Index** | Stores min/max values per granule | Range queries on numeric fields | +| **Set Index** | Stores unique values per granule (limited to ~N values) | Low-medium cardinality | +| **Token Bloom Filter** | Bloom filter on tokens in string | `WHERE hasToken(filename, 'log')` | + +```sql +-- Bloom filter for point lookups on high-cardinality fields +INDEX idx_tid tid TYPE bloom_filter(0.01) GRANULARITY 4, +INDEX idx_pid pid TYPE bloom_filter(0.01) GRANULARITY 4, + +-- Minmax for range queries on timestamps (already default, but explicit) +INDEX idx_duration duration_ns TYPE minmax GRANULARITY 1, + +-- Token bloom filter for filename substring searches +INDEX idx_filename_tokens filename TYPE tokenbf_v1(512, 3, 0) GRANULARITY 4 +``` + +**Bloom Filter Parameters**: +- `0.01` = 1% false positive rate (lower = more accurate, larger index) +- `GRANULARITY 4` = index covers 4 granules (8192 × 4 = 32768 rows) + +#### C.1.1 Are Bloom Filter Results Inexact? + +**No - query results are ALWAYS exact.** Bloom filters only affect performance, not correctness. + +``` +How Bloom Filter Indices Work: ++---------------------------------------------------------------+ +| QUERY: SELECT * FROM ior_events WHERE tid = 12345 | ++---------------------------------------------------------------+ + | + v ++---------------------------------------------------------------+ +| STEP 1: Check bloom filter index for each granule | +| | +| Granule 1 (rows 1-8192): Bloom says "MAYBE" -> READ IT | +| Granule 2 (rows 8193-16384): Bloom says "DEFINITELY NOT" | +| -> SKIP | +| Granule 3 (rows 16385-24576): Bloom says "MAYBE" -> READ IT | +| ... | ++---------------------------------------------------------------+ + | + v ++---------------------------------------------------------------+ +| STEP 2: Read the "MAYBE" granules and filter EXACTLY | +| | +| Result: ALL rows where tid = 12345 (no false positives) | ++---------------------------------------------------------------+ +``` + +**False Positive Impact**: The bloom filter might say "MAYBE contains tid=12345" for a granule that doesn't actually have it. This causes ClickHouse to read that granule unnecessarily - wasting I/O but NOT affecting result correctness. + +**False Negatives**: Impossible. Bloom filters never produce false negatives. If the value exists, the bloom filter will always say "MAYBE". + +**Trade-off**: +| Bloom Filter Size | False Positive Rate | Storage Overhead | Unnecessary Reads | +|-------------------|---------------------|------------------|-------------------| +| Smaller | Higher (~5%) | ~0.5% | More wasted I/O | +| Larger | Lower (~0.01%) | ~2% | Less wasted I/O | + +**Recommendation**: Use `bloom_filter(0.01)` for 1% false positive rate - good balance between index size and read efficiency. 
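+
+To sanity-check that the bloom filter index really prunes granules (rather than
+just taking up space), run `EXPLAIN indexes = 1` against a point lookup. A
+minimal sketch using the clickhouse-go client, assuming the `ior_events` table
+and `idx_tid` index from section 1.3 already exist on a local server:
+
+```go
+package main
+
+import (
+	"context"
+	"fmt"
+	"log"
+
+	"github.com/ClickHouse/clickhouse-go/v2"
+)
+
+func main() {
+	// Connection details are placeholders; reuse the flags from section 1.6
+	// for a real check.
+	conn, err := clickhouse.Open(&clickhouse.Options{
+		Addr: []string{"localhost:9000"},
+		Auth: clickhouse.Auth{Database: "ior"},
+	})
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer conn.Close()
+
+	// EXPLAIN indexes = 1 reports how many granules each skipping index
+	// dropped for this query; the query result itself is always exact.
+	rows, err := conn.Query(context.Background(),
+		"EXPLAIN indexes = 1 SELECT count() FROM ior_events WHERE tid = 12345")
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer rows.Close()
+	for rows.Next() {
+		var line string
+		if err := rows.Scan(&line); err != nil {
+			log.Fatal(err)
+		}
+		fmt.Println(line) // look for the idx_tid entry and its granule counts
+	}
+}
+```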
+ +#### C.1.2 Projections vs Materialized Views + +Both optimize query performance, but work differently: + +| Aspect | Projection | Materialized View | +|--------|------------|-------------------| +| **Data storage** | Same table, different physical order | Separate table | +| **Automatic sync** | Yes - part of the same table | Yes - triggered on INSERT | +| **Query rewrite** | Automatic - ClickHouse picks best projection | Must query MV explicitly | +| **Aggregation** | Can pre-aggregate | Often used for aggregation | +| **Space overhead** | Copies ALL columns (unless aggregated) | Only stores aggregated result | +| **Best for** | Different sort orders, same columns | Pre-computed aggregations | + +--- + +### C.1.3 Projections Explained + +**What is a Projection?** + +A projection is an alternative physical representation of the SAME data within the SAME table. Think of it as "same data, sorted differently for different queries." + +**Example Problem**: Your main table is ordered by `(timestamp_ns, pid, tid)` for time-range queries. But you also need fast queries by thread: + +```sql +-- This query is SLOW because tid is last in ORDER BY +SELECT * FROM ior_events WHERE tid = 12345; +-- Must scan almost all data! +``` + +**Solution - Add a Projection**: + +```sql +CREATE TABLE ior_events ( + timestamp_ns UInt64, + pid UInt32, + tid UInt32, + comm String, + syscall_name String, + duration_ns UInt64, + -- ... other columns ... +) +ENGINE = MergeTree() +ORDER BY (timestamp_ns, pid, tid) -- Primary order: time-first +PARTITION BY toYYYYMMDD(fromUnixTimestamp64Nano(timestamp_ns)); + +-- Add projection for thread-centric queries +ALTER TABLE ior_events ADD PROJECTION proj_by_thread ( + SELECT * ORDER BY (pid, tid, timestamp_ns) +); + +-- Materialize the projection (build it for existing data) +ALTER TABLE ior_events MATERIALIZE PROJECTION proj_by_thread; +``` + +**How it works**: + +``` +INSERT INTO ior_events VALUES (1000, 1, 100, 'app', 'read', 50); +INSERT INTO ior_events VALUES (2000, 1, 100, 'app', 'write', 30); +INSERT INTO ior_events VALUES (1500, 2, 200, 'db', 'read', 40); + +Main storage (ORDER BY timestamp_ns, pid, tid): ++--------------+-----+-----+------+-------------+-------------+ +| timestamp_ns | pid | tid | comm | syscall_name| duration_ns | ++--------------+-----+-----+------+-------------+-------------+ +| 1000 | 1 | 100 | app | read | 50 | +| 1500 | 2 | 200 | db | read | 40 | +| 2000 | 1 | 100 | app | write | 30 | ++--------------+-----+-----+------+-------------+-------------+ + +Projection proj_by_thread (ORDER BY pid, tid, timestamp_ns): ++--------------+-----+-----+------+-------------+-------------+ +| timestamp_ns | pid | tid | comm | syscall_name| duration_ns | ++--------------+-----+-----+------+-------------+-------------+ +| 1000 | 1 | 100 | app | read | 50 | +| 2000 | 1 | 100 | app | write | 30 | +| 1500 | 2 | 200 | db | read | 40 | ++--------------+-----+-----+------+-------------+-------------+ +``` + +**Query optimization**: + +```sql +-- Query 1: Time range - uses main storage +SELECT * FROM ior_events +WHERE timestamp_ns BETWEEN 1000 AND 1800; +-- Reads rows in order: (1000,1,100), (1500,2,200) - efficient! + +-- Query 2: Thread lookup - uses projection AUTOMATICALLY +SELECT * FROM ior_events WHERE tid = 100; +-- ClickHouse sees projection has better ORDER BY for this query +-- Uses proj_by_thread: rows (1,100,1000), (1,100,2000) are adjacent! 
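+
+-- To confirm which projection actually served a recent query, check the query
+-- log afterwards (assumption: the server exposes the `projections` column in
+-- system.query_log):
+--   SELECT query, projections FROM system.query_log
+--   WHERE type = 'QueryFinish' ORDER BY event_time DESC LIMIT 10;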
+``` + +**Aggregating Projection** (smaller storage): + +```sql +-- Pre-aggregated projection - stores only aggregated data +ALTER TABLE ior_events ADD PROJECTION proj_hourly_stats ( + SELECT + toStartOfHour(fromUnixTimestamp64Nano(timestamp_ns)) as hour, + syscall_name, + count() as event_count, + sum(duration_ns) as total_duration, + avg(duration_ns) as avg_duration + GROUP BY hour, syscall_name +); + +ALTER TABLE ior_events MATERIALIZE PROJECTION proj_hourly_stats; +``` + +**Storage comparison**: +``` +Main table: 1,000,000,000 rows × ~200 bytes = ~200 GB +proj_by_thread: 1,000,000,000 rows × ~200 bytes = ~200 GB (full copy) +proj_hourly_stats: ~100,000 rows × ~50 bytes = ~5 MB (aggregated!) +``` + +**When to use projections**: +- Different access patterns on the same raw data +- Query patterns known upfront +- Can afford storage overhead (projections copy data) + +--- + +### C.1.4 Materialized Views Explained + +**What is a Materialized View?** + +A materialized view is a SEPARATE table that is automatically populated and maintained when data is inserted into the source table. + +**Example Problem**: You frequently query per-thread statistics: + +```sql +-- This is slow - scans billions of rows every time +SELECT pid, tid, comm, count(), sum(duration_ns) +FROM ior_events +WHERE timestamp_ns > now() - INTERVAL 1 HOUR +GROUP BY pid, tid, comm; +``` + +**Solution - Create a Materialized View**: + +```sql +-- Step 1: Create the target table (stores the aggregated data) +CREATE TABLE ior_thread_stats ( + day Date, + hour DateTime, + pid UInt32, + tid UInt32, + comm LowCardinality(String), + syscall_name LowCardinality(String), + event_count UInt64, + total_duration_ns UInt64 +) +ENGINE = SummingMergeTree() -- Automatically sums duplicates +PARTITION BY day +ORDER BY (day, hour, pid, tid, syscall_name); + +-- Step 2: Create the materialized view (transforms INSERTs) +CREATE MATERIALIZED VIEW ior_thread_stats_mv TO ior_thread_stats AS +SELECT + toDate(fromUnixTimestamp64Nano(timestamp_ns)) as day, + toStartOfHour(fromUnixTimestamp64Nano(timestamp_ns)) as hour, + pid, + tid, + comm, + syscall_name, + count() as event_count, + sum(duration_ns) as total_duration_ns +FROM ior_events +GROUP BY day, hour, pid, tid, comm, syscall_name; +``` + +**How it works**: + +``` +INSERT INTO ior_events (timestamp_ns, pid, tid, comm, syscall_name, duration_ns, ...) +VALUES (1704067200000000000, 1, 100, 'app', 'read', 50, ...); + + | + v ++---------------------------------------------------------------+ +| ior_events (main table) | +| Receives the full row | ++---------------------------------------------------------------+ + | + | TRIGGER: materialized view + v ++---------------------------------------------------------------+ +| ior_thread_stats_mv transformation | +| Groups and aggregates: | +| day=2024-01-01, hour=2024-01-01 00:00:00 | +| pid=1, tid=100, comm='app', syscall_name='read' | +| event_count=1, total_duration_ns=50 | ++---------------------------------------------------------------+ + | + v ++---------------------------------------------------------------+ +| ior_thread_stats (target table) | +| Stores only the aggregated row | ++---------------------------------------------------------------+ +``` + +**Query the materialized view**: + +```sql +-- Query the aggregated table (MUCH faster!) 
+SELECT + pid, tid, comm, + sum(event_count) as total_events, + sum(total_duration_ns) / 1e9 as total_seconds +FROM ior_thread_stats +WHERE day = today() AND hour > now() - INTERVAL 6 HOUR +GROUP BY pid, tid, comm +ORDER BY total_events DESC +LIMIT 10; +``` + +**Real-time aggregation example**: + +```sql +-- Per-minute syscall latency tracking +CREATE TABLE ior_latency_minute ( + minute DateTime, + syscall_name LowCardinality(String), + p50_ns UInt64, + p90_ns UInt64, + p99_ns UInt64, + count UInt64 +) +ENGINE = SummingMergeTree() +ORDER BY (minute, syscall_name); + +-- Materialized view with quantile aggregation +CREATE MATERIALIZED VIEW ior_latency_minute_mv TO ior_latency_minute AS +SELECT + toStartOfMinute(fromUnixTimestamp64Nano(timestamp_ns)) as minute, + syscall_name, + quantile(0.50)(duration_ns) as p50_ns, + quantile(0.90)(duration_ns) as p90_ns, + quantile(0.99)(duration_ns) as p99_ns, + count() as count +FROM ior_events +GROUP BY minute, syscall_name; + +-- Query for real-time monitoring +SELECT minute, syscall_name, p99_ns / 1e6 as p99_ms, count +FROM ior_latency_minute +WHERE minute > now() - INTERVAL 1 HOUR +ORDER BY minute, p99_ns DESC; +``` + +**When to use materialized views**: +- Pre-computed aggregations (counts, sums, quantiles) +- Different granularity (hourly, daily summaries) +- Dashboards and real-time monitoring +- When you can afford slight delay (async update) + +**Important considerations**: +```sql +-- MVs are populated only for NEW data, not historical +-- To backfill existing data: +INSERT INTO ior_thread_stats +SELECT ... FROM ior_events WHERE ... GROUP BY ...; + +-- MVs can be chained (MV feeding another MV) +-- MVs can be suspended/resumed: +SYSTEM STOP MERGES ior_thread_stats; +SYSTEM START MERGES ior_thread_stats; +``` + +--- + +### C.1.5 Comparison Summary + +``` +PROJECTION: Same table, different order ++--------------------------------------------------------------+ +| ior_events (main table) | +| ORDER BY (timestamp_ns, pid, tid) | +| +-- Part 1: [rows sorted by time...] | +| +-- Part 2: [rows sorted by time...] | ++--------------------------------------------------------------+ + | + +-- PROJECTION proj_by_thread + +--------------------------------------------------+ + | Same data, ORDER BY (pid, tid, timestamp_ns) | + | +-- Part 1: [rows sorted by thread...] | + | +-- Part 2: [rows sorted by thread...] 
| + +--------------------------------------------------+ + +MATERIALIZED VIEW: Separate table, transformed data ++--------------------------------------------------------------+ +| ior_events (source table) | +| 1,000,000,000 rows x 200 bytes = 200 GB | ++--------------------------------------------------------------+ + | + | INSERT triggers transformation + v ++--------------------------------------------------------------+ +| ior_thread_stats (target table) | +| 100,000 rows x 50 bytes = 5 MB | +| Aggregated: count, sum, avg per thread per hour | ++--------------------------------------------------------------+ +``` + +**Choose Projection when**: +- Need different sort orders for same raw data +- Query patterns vary (time-range vs thread-lookup) +- Storage overhead is acceptable + +**Choose Materialized View when**: +- Need pre-aggregated results +- Query same aggregations repeatedly +- Want to reduce data volume significantly +- Building dashboards/monitoring + +**Avoid for High Cardinality**: +- `LowCardinality()` - only for fields with <10k unique values +- First position in `ORDER BY` - kills compression + +### C.2 Large Data Volume Optimization + +**Problem**: Billions of rows, terabytes of data need efficient storage and query. + +| Technique | Description | Impact | +|-----------|-------------|--------| +| **Partitioning** | Split data by time (day/month) | Faster deletes, pruning, TTL | +| **Compression Codecs** | ZSTD, LZ4, Delta, Gorilla | 5-10x storage reduction | +| **TTL** | Automatic data expiration | Storage management | +| **Projections** | Pre-computed alternative ORDER BY | Multiple query patterns | +| **Materialized Views** | Pre-aggregations | Faster analytics | + +```sql +CREATE TABLE ior_events ( + -- ... columns ... + + -- Compression codecs per column type + timestamp_ns UInt64 CODEC(Delta, ZSTD(3)), + pid UInt32 CODEC(ZSTD(3)), + tid UInt32 CODEC(ZSTD(3)), + comm LowCardinality(String) CODEC(ZSTD(3)), + syscall_name LowCardinality(String) CODEC(ZSTD(3)), + duration_ns UInt64 CODEC(Delta, ZSTD(3)), + filename String CODEC(ZSTD(3)), + + -- Indices + INDEX idx_tid tid TYPE bloom_filter(0.01) GRANULARITY 4, + INDEX idx_pid pid TYPE bloom_filter(0.01) GRANULARITY 4 +) +ENGINE = MergeTree() +PARTITION BY toYYYYMMDD(fromUnixTimestamp64Nano(timestamp_ns)) +ORDER BY (timestamp_ns, pid, tid) +TTL timestamp_ns + INTERVAL 30 DAY DELETE -- Auto-delete after 30 days +SETTINGS + index_granularity = 8192, + min_bytes_for_wide_part = '10M'; -- Compact parts until 10MB +``` + +**Compression Codec Selection**: + +| Column Type | Recommended Codec | Reason | +|-------------|-------------------|--------| +| Timestamps | `Delta(8), ZSTD(3)` | Delta encoding + compression | +| IDs (pid, tid) | `ZSTD(3)` | General compression | +| Low-cardinality strings | `LowCardinality, ZSTD(3)` | Dictionary + compression | +| Durations | `Delta, ZSTD(3)` | Sequential values compress well | +| Filenames | `ZSTD(3)` or `ZSTD(1)` | High compression if CPU-bound | +| Flags/enums | `ZSTD(3)` | Small value range | + +### C.3 Combined: High Cardinality + Large Data Volume + +**The Challenge**: Billions of rows with high-cardinality fields (tid, filename) require both efficient storage AND fast point lookups. 
+ +**Strategy 1: Optimal ORDER BY** + +```sql +-- Rule: Low cardinality first, high cardinality last +-- BAD: ORDER BY (tid, timestamp_ns) -- Kills compression +-- GOOD: ORDER BY (timestamp_ns, pid, tid) -- Time-based locality + +ORDER BY (timestamp_ns, pid, tid) +``` + +This enables: +- Time-range queries: `WHERE timestamp_ns BETWEEN x AND y` (primary key efficiency) +- Per-process queries: `WHERE timestamp_ns BETWEEN x AND y AND pid = 123` +- Per-thread queries: `WHERE timestamp_ns BETWEEN x AND y AND pid = 123 AND tid = 456` + +**Strategy 2: Projections for Different Access Patterns** + +```sql +CREATE TABLE ior_events ( + -- ... columns ... +) +ENGINE = MergeTree() +PARTITION BY toYYYYMMDD(fromUnixTimestamp64Nano(timestamp_ns)) +ORDER BY (timestamp_ns, pid, tid) + +-- Projection 1: Thread-centric queries (fast tid lookup) +PROJECTION proj_by_thread +( + SELECT * + ORDER BY (pid, tid, timestamp_ns) +) + +-- Projection 2: Syscall analytics (aggregation-heavy) +PROJECTION proj_by_syscall +( + SELECT + syscall_name, + toStartOfHour(fromUnixTimestamp64Nano(timestamp_ns)) as hour, + pid, + count() as event_count, + sum(duration_ns) as total_duration, + avg(duration_ns) as avg_duration, + quantile(0.99)(duration_ns) as p99_duration + GROUP BY syscall_name, hour, pid +) +; +``` + +**Strategy 3: Materialized Views for Pre-Aggregation** + +```sql +-- Real-time per-thread stats (much smaller table) +CREATE MATERIALIZED VIEW ior_thread_stats_mv +ENGINE = SummingMergeTree() +PARTITION BY toYYYYMM(day) +ORDER BY (day, pid, tid, syscall_name) +AS SELECT + toDate(fromUnixTimestamp64Nano(timestamp_ns)) as day, + pid, + tid, + comm, + syscall_name, + count() as event_count, + sum(duration_ns) as total_duration_ns +FROM ior_events +GROUP BY day, pid, tid, comm, syscall_name; + +-- Query the materialized view instead (100-1000x faster) +SELECT pid, tid, sum(event_count), sum(total_duration_ns)/1e9 as sec +FROM ior_thread_stats_mv +WHERE day = today() AND tid = 12345 +GROUP BY pid, tid; +``` + +**Strategy 4: Sampling for Exploratory Queries** + +```sql +-- Sample 1% of data for quick exploration +SELECT syscall_name, count() as cnt, avg(duration_ns) as avg_dur +FROM ior_events +SAMPLE 0.01 -- Only scan 1% of rows +WHERE timestamp_ns > now() - INTERVAL 1 DAY +GROUP BY syscall_name +ORDER BY cnt DESC; +``` + +### C.4 Complete Optimized Schema + +```sql +CREATE TABLE ior_events ( + -- Core fields + timestamp_ns UInt64 CODEC(Delta(8), ZSTD(3)), + pid UInt32 CODEC(ZSTD(3)), + tid UInt32 CODEC(ZSTD(3)), + comm LowCardinality(String) CODEC(ZSTD(3)), + syscall_name LowCardinality(String) CODEC(ZSTD(3)), + trace_id UInt16 CODEC(ZSTD(3)), + event_type UInt8 CODEC(T64, ZSTD(3)), + + -- Result + ret_value Int64 CODEC(ZSTD(3)), + ret_type UInt8 CODEC(T64, ZSTD(3)), + + -- File information + fd Int32 CODEC(ZSTD(3)), + filename String CODEC(ZSTD(3)), + pathname String CODEC(ZSTD(3)), + oldname String CODEC(ZSTD(3)), + newname String CODEC(ZSTD(3)), + + -- Flags + flags Int32 CODEC(ZSTD(3)), + + -- Calculated + duration_ns UInt64 CODEC(Delta, ZSTD(3)), + duration_to_prev_ns UInt64 CODEC(Delta, ZSTD(3)), + + -- Context + hostname LowCardinality(String) CODEC(ZSTD(3)), + collection_id UUID CODEC(ZSTD(3)), + ingested_at DateTime64(3) DEFAULT now64(3) CODEC(Delta, ZSTD(3)), + + -- Data skipping indices + INDEX idx_pid pid TYPE bloom_filter(0.01) GRANULARITY 4, + INDEX idx_tid tid TYPE bloom_filter(0.01) GRANULARITY 4, + INDEX idx_duration duration_ns TYPE minmax GRANULARITY 2, + INDEX idx_syscall syscall_name TYPE set(100) 
GRANULARITY 4, + INDEX idx_filename_token filename TYPE tokenbf_v1(256, 2, 0) GRANULARITY 4 +) +ENGINE = MergeTree() +PARTITION BY toYYYYMMDD(fromUnixTimestamp64Nano(timestamp_ns)) +ORDER BY (timestamp_ns, pid, tid) +TTL timestamp_ns + INTERVAL 90 DAY DELETE +SETTINGS + index_granularity = 8192, + min_bytes_for_wide_part = '10M', + min_rows_for_wide_part = 100000; + +-- Projections for different access patterns +ALTER TABLE ior_events ADD PROJECTION proj_by_pid_tid ( + SELECT * ORDER BY (pid, tid, timestamp_ns) +); + +ALTER TABLE ior_events ADD PROJECTION proj_by_syscall_hour ( + SELECT + syscall_name, + toStartOfHour(fromUnixTimestamp64Nano(timestamp_ns)) as hour, + pid, + count() as cnt, + sum(duration_ns) as total_dur, + avg(duration_ns) as avg_dur + GROUP BY syscall_name, hour, pid +); + +ALTER TABLE ior_events MATERIALIZE PROJECTION proj_by_pid_tid; +ALTER TABLE ior_events MATERIALIZE PROJECTION proj_by_syscall_hour; +``` + +### C.5 Query Optimization Tips + +```sql +-- 1. Always include time range (partition pruning) +-- GOOD +SELECT * FROM ior_events +WHERE timestamp_ns > now() - INTERVAL 1 HOUR AND tid = 12345; + +-- BAD (scans all partitions) +SELECT * FROM ior_events WHERE tid = 12345; + +-- 2. Use PREWHERE for filter pushdown on large columns +SELECT count() +FROM ior_events +PREWHERE syscall_name = 'open' -- Filter before reading other columns +WHERE timestamp_ns > now() - INTERVAL 1 HOUR; + +-- 3. Leverage projections explicitly +SET allow_experimental_projection_optimization = 1; +SELECT syscall_name, sum(cnt) +FROM ior_events +WHERE timestamp_ns > now() - INTERVAL 1 DAY +GROUP BY syscall_name; -- Will use proj_by_syscall_hour + +-- 4. Use materialized views for frequent aggregations +SELECT * FROM ior_thread_stats_mv WHERE day = today(); + +-- 5. Parallel replica reads for large scans +SET max_parallel_replicas = 3; +SELECT count() FROM ior_events WHERE timestamp_ns > now() - INTERVAL 7 DAY; +``` + +### C.6 Storage Estimation + +For 1 billion events (1 day at ~11.5K events/sec): + +| Metric | Estimate | +|--------|----------| +| Raw row size | ~200 bytes | +| Uncompressed | ~200 GB | +| With ZSTD compression | ~30-50 GB | +| With projections | +20-30% additional | +| Bloom filter indices | +1-2% | +| Total storage per day | ~40-65 GB | + +With 90-day TTL: ~3.6-6 TB total storage. 
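+
+The figures above are back-of-envelope arithmetic; the small helper below makes
+the assumptions explicit (event rate, raw row size, an assumed ~4.5x on-disk
+compression ratio consistent with the 200 GB -> 30-50 GB ZSTD estimate, and the
+retention window). Projections and skipping indices add their overhead on top.
+
+```go
+package main
+
+import "fmt"
+
+// estimateGB models daily and total storage: raw bytes per day scaled by an
+// assumed compression ratio and multiplied by the retention period.
+func estimateGB(eventsPerSec, bytesPerRow, compressionRatio float64, retentionDays int) (perDayGB, totalGB float64) {
+	rawGBPerDay := eventsPerSec * 86400 * bytesPerRow / 1e9
+	perDayGB = rawGBPerDay / compressionRatio
+	totalGB = perDayGB * float64(retentionDays)
+	return perDayGB, totalGB
+}
+
+func main() {
+	perDay, total := estimateGB(11500, 200, 4.5, 90)
+	fmt.Printf("~%.0f GB/day, ~%.1f TB over 90 days\n", perDay, total/1000)
+	// Prints: ~44 GB/day, ~4.0 TB over 90 days
+}
+```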
+ +### C.7 Server Configuration + +```xml +<!-- /etc/clickhouse-server/config.d/ior.xml --> +<clickhouse> + <!-- For high-cardinality + large data --> + <merge_tree> + <index_granularity>8192</index_granularity> + <min_bytes_for_wide_part>10485760</min_bytes_for_wide_part> + <min_rows_for_wide_part>100000</min_rows_for_wide_part> + <max_suspicious_broken_parts>5</max_suspicious_broken_parts> + </merge_tree> + + <!-- Memory for queries --> + <max_memory_usage>16000000000</max_memory_usage> <!-- 16GB --> + <max_memory_usage_for_all_queries>20000000000</max_memory_usage_for_all_queries> + + <!-- Insert performance --> + <max_insert_block_size>1048576</max_insert_block_size> + <min_insert_block_size_rows>100000</min_insert_block_size_rows> + + <!-- Background merges --> + <background_pool_size>16</background_pool_size> + <background_merges_mutations_concurrency_ratio>4</background_merges_mutations_concurrency_ratio> +</clickhouse> +``` + +--- + +## Appendix B: Alternative Approaches Considered + +| Approach | Pros | Cons | Verdict | +|----------|------|------|---------| +| Arrow IPC | Standard, fast | Larger files | Good for streaming, not file storage | +| Protobuf | Compact, schema evolution | Requires CH parsing | Overkill | +| CSV | Simple | No compression, no types | Not suitable | +| Native format | Most efficient for CH | Not portable | Consider for streaming |
\ No newline at end of file
