Throughput Benchmark
No rate limit: 16 producers push 1 KB records as fast as the broker can flush them to disk, while 16 consumers read everything back in real time. 1 billion records in ~31 minutes: ~553K events/sec, ~540 MB/s sustained.
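The headline figures are internally consistent if MB/s is read as MiB/s; a quick sanity check on the arithmetic (all values taken from the text above, nothing measured here):

```python
# Sanity check on the headline numbers from the run above.
records = 1_000_000_000
events_per_sec = 553_000
record_size = 1024  # bytes, 1 KB records

runtime_min = records / events_per_sec / 60          # ~30 minutes
mib_per_sec = events_per_sec * record_size / 2**20   # ~540 MiB/s

assert 30 <= runtime_min <= 31
assert round(mib_per_sec) == 540
```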
System metrics
What limits throughput
The bottleneck is EBS flush latency. Every durable flush is a network round-trip to the EBS volume, averaging 2.6 ms on gp3. The disk writes 365 MB/s while the volume can sustain 625 MB/s, so it sits idle roughly 40% of the time waiting between flushes. Only 1,800 of the 16,000 provisioned IOPS are used, and the CPU is 83% idle.
Application-layer throughput (540 MB/s) is higher than disk throughput (365 MB/s) because Snappy compresses records to ~68% of their original size on the wire.
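The utilization and compression figures in the two paragraphs above fall out of simple division (all inputs are the numbers quoted in the text):

```python
# Back-of-envelope arithmetic for the disk-idle and compression claims.
disk_write_mb_s = 365      # what the volume actually absorbs
disk_capacity_mb_s = 625   # what gp3 can sustain at this config
app_mb_s = 540             # application-layer throughput, pre-compression
flush_latency_s = 0.0026   # avg EBS flush round-trip

busy_fraction = disk_write_mb_s / disk_capacity_mb_s   # ~0.58 -> ~40% idle
compression_ratio = disk_write_mb_s / app_mb_s         # ~0.68 with Snappy
max_sync_flushes = 1 / flush_latency_s                 # ~385 flushes/sec per writer

assert abs((1 - busy_fraction) - 0.42) < 0.01
assert abs(compression_ratio - 0.68) < 0.01
```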
Local NVMe SSDs would reduce flush latency from 2.6 ms to 10-100 µs and push throughput significantly higher. This test used standard gp3, the same storage most workloads run on.
When producers outrun the disk, klite back-pressures gracefully: writes queue up, producers wait, throughput stabilizes at the hardware limit. No crashes, no dropped data.
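The back-pressure behavior described above can be sketched with a bounded queue between producers and a slow writer: `put()` blocks when the buffer is full, so producers naturally slow to the writer's pace instead of dropping records. The names here are illustrative, not klite's actual API.

```python
import queue
import threading
import time

buf = queue.Queue(maxsize=64)   # bounded buffer: this is the back-pressure
written = []

def slow_writer():
    # Drains the buffer at a fixed pace, standing in for EBS flushes.
    while True:
        rec = buf.get()
        if rec is None:
            break
        time.sleep(0.0001)      # simulated flush latency
        written.append(rec)

t = threading.Thread(target=slow_writer)
t.start()
for i in range(1000):
    buf.put(i)                  # blocks whenever the buffer is full
buf.put(None)                   # sentinel: shut the writer down
t.join()

# Nothing dropped, order preserved: throughput settled at the writer's pace.
assert written == list(range(1000))
```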
Reproduce
./scripts/bench-aws.py up \
  --klite-instance m7g.xlarge --bench-instance m7g.xlarge \
  --ebs-throughput 1000 --ebs-iops 16000 --ebs-size 500
./scripts/bench-aws.py run \
  --mode produce-consume \
  --partitions 16 --producers 16 --consumers 16 \
  --num-records 1000000000 --record-size 1024 \
  --throughput -1 --acks 1 \
  --max-buffered-records 1024 --warmup-records 50000 \
  --wal-max-disk-size 53687091200
./scripts/bench-aws.py down