Training robot learning models is bottlenecked not just by compute, but by how fast data reaches the GPU. These benchmarks measure whether Knonik keeps the GPU fed under realistic training conditions, comparing it directly against PyTorch DataLoaders and HuggingFace datasets. GPU compute is simulated at six step times (50 to 300 ms), representing everything from fast inference loops to large models. Each batch iteration is split into three phases: wait (GPU idle time), H2D transfer (CPU to GPU), and compute (simulated forward/backward pass).
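The three-phase split described above can be sketched as a timing loop. This is a minimal illustration, not the benchmark's actual harness: `fake_loader`, `run_epoch`, and `StepTiming` are hypothetical names, the H2D copy is replaced by a host-side byte copy, and GPU compute is simulated with `time.sleep`.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class StepTiming:
    wait_ms: float     # time blocked on the loader (GPU idle)
    h2d_ms: float      # host-to-device copy (placeholder here)
    compute_ms: float  # simulated forward/backward pass


def fake_loader(num_batches: int, load_ms: float) -> Iterator[bytes]:
    """Hypothetical stand-in loader: sleeps to imitate decode/IO, then yields a batch."""
    for _ in range(num_batches):
        time.sleep(load_ms / 1000.0)
        yield b"\x00" * 1024


def run_epoch(loader: Iterable[bytes], step_ms: float) -> List[StepTiming]:
    """Time each iteration as wait, H2D, and compute phases."""
    timings: List[StepTiming] = []
    batches = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(batches)        # wait phase: blocked until a batch arrives
        except StopIteration:
            break
        t1 = time.perf_counter()
        _device_batch = bytearray(batch)  # placeholder for the H2D copy
        t2 = time.perf_counter()
        time.sleep(step_ms / 1000.0)      # simulated GPU compute
        t3 = time.perf_counter()
        timings.append(
            StepTiming((t1 - t0) * 1e3, (t2 - t1) * 1e3, (t3 - t2) * 1e3)
        )
    return timings


if __name__ == "__main__":
    for step in run_epoch(fake_loader(num_batches=3, load_ms=5.0), step_ms=10.0):
        print(step)
```

Because this stand-in loader does no prefetching, its full load time shows up as wait time; a loader that prepares batches in the background during the compute phase would report a much smaller wait.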
Each test is designed to answer a specific question about dataloader performance across different dataset formats and toolchains.
Test 1 (Baseline vs Knonik vs PyTorch vs HuggingFace). Runs all loader variants over the same robot dataset: Knonik (Batched, Pipelined, OnDemand, with and without mmap), a naive PyTorch DataLoader with per-chunk video seeking, a cached PyTorch DataLoader with LRU episode caching, and a HuggingFace Arrow-backed loader. Establishes the performance envelope of each approach under identical conditions.
Test 2 (HDF5 uncompressed vs Knonik compressed). Compares a standard PyTorch DataLoader reading raw HDF5 files against Knonik loading the same dataset in Knonik-compressed format. HDF5 is the most common format for robot datasets (used in ACT, Diffusion Policy, etc.), so this test directly quantifies the cost of staying uncompressed.
Test 3 (LeRobot v3 vs Knonik). Compares LeRobot/PyTorch-delta10, LeRobotV3/HuggingFace Arrow, and Knonik v3 OnDemand on the same LeRobot v3 format dataset under identical conditions.
Epoch Avg: Total wall-clock time for one full pass through the dataset, averaged across all epochs. This is the headline number: lower means the GPU finishes training faster. A loader with fast individual batches can still have a poor epoch average if it stalls between batches.
Epoch 1 and Epoch Last: Time for the first and final epoch separately. The gap between these reveals warm-up behavior: loaders that cache decoded data or prefetch aggressively often see large improvements after the first epoch, while loaders doing redundant work stay flat.
Wait Avg: Average time the training loop sits idle between batches, from when the GPU finished the last step to when the next batch is delivered. This is the most direct measure of data pipeline efficiency. A well-prefetched loader should deliver batches faster than the GPU can consume them, keeping wait near zero.
Wait P95: The 95th-percentile batch wait time. An average can look healthy when many fast batches mask occasional long stalls; P95 exposes that tail latency, the worst 1-in-20 batch delays. High P95 values indicate unpredictable spikes, typically from cold video decoding, I/O contention, or cache misses.
Compute Avg: Average time spent on the simulated GPU forward/backward pass per step, controlled by the target GPU latency setting. It is the reference for judging how much of the step time is compute-bound versus data-bound.
GPU Util: Estimated fraction of epoch time the GPU spends doing compute rather than waiting for data, computed as 1 - (wait_avg / total_step_avg). A loader at 99% GPU utilization adds virtually no data pipeline overhead. At 50%, the GPU is idle half the time waiting for batches.
Slow Steps: Percentage of batch steps where the wait time exceeded 150 ms, a threshold above which data latency visibly impacts GPU throughput. High slow-step rates indicate reliability problems in the loader, not just average-latency problems.
Cold Start: The wait time for the very first batch of the run. This reflects loader initialization cost: spawning workers, opening files, warming up prefetch queues. A high cold start is acceptable if subsequent batches are fast, but it matters for short jobs or repeated experiments.
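The metric definitions above can be sketched as a small summary function. This is an illustrative reading of the definitions, not the benchmark's own code: `summarize` and `percentile` are hypothetical names, and the nearest-rank percentile method is an assumption, since the text does not say how P95 is computed.

```python
import math
from typing import Dict, Sequence

SLOW_STEP_THRESHOLD_MS = 150.0  # slow-step threshold stated in the text


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile (assumed method; the benchmark's is unspecified)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(wait_ms: Sequence[float], step_ms: Sequence[float]) -> Dict[str, float]:
    """wait_ms[i] is the idle time before step i; step_ms[i] is the whole step."""
    wait_avg = sum(wait_ms) / len(wait_ms)
    total_step_avg = sum(step_ms) / len(step_ms)
    slow_steps = sum(1 for w in wait_ms if w > SLOW_STEP_THRESHOLD_MS)
    return {
        "wait_avg_ms": wait_avg,
        "wait_p95_ms": percentile(wait_ms, 95),
        "gpu_util": 1.0 - wait_avg / total_step_avg,
        "slow_step_rate": slow_steps / len(wait_ms),
        "cold_start_ms": wait_ms[0],
    }


if __name__ == "__main__":
    # Made-up run: one 500 ms cold start, then nineteen 1 ms waits, 100 ms compute.
    waits = [500.0] + [1.0] * 19
    steps = [w + 100.0 for w in waits]
    for name, value in summarize(waits, steps).items():
        print(f"{name}: {value:.3f}")
```

In this made-up run, a single slow first batch raises the wait average to about 26 ms and pulls the utilization estimate down to roughly 79%, while wait P95 stays at 1 ms. This is why the tables report the average, the tail, and the cold start separately.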
All loader variants head-to-head on the same robot dataset: Knonik (Batched, Pipelined, OnDemand, with and without mmap disk cache), a PyTorch DataLoader with per-chunk video seeking, a PyTorch DataLoader with LRU episode caching, and a HuggingFace Arrow-backed loader. All loaders see identical data, epoch count, batch size, and worker count.
Test 1 detailed results at the 80 ms target GPU step time:
| Loader | Epoch Avg | Epoch 1 | Epoch Last | Wait Avg | Wait P95 | Compute Avg | GPU Util | Slow Steps | Cold Start | Steps |
|---|---|---|---|---|---|---|---|---|---|---|
| Knonik/Pipelined | 11124.87 ms | 10865.29 ms | 11353.01 ms | 2.53 ms | 0.88 ms | 97.88 ms | 97.48% | 0.30% | 561.03 ms | 332 |
| Knonik/Pipelined+mmap | 11867.04 ms | 11878.98 ms | 11875.12 ms | 2.88 ms | 0.89 ms | 103.58 ms | 97.30% | 0.30% | 676.16 ms | 334 |
| Knonik/OnDemand | 12028.46 ms | 11786.71 ms | 12216.72 ms | 3.17 ms | 0.59 ms | 103.17 ms | 97.02% | 0.88% | 353.38 ms | 339 |
| Knonik/Batched | 13147.49 ms | 15755.60 ms | 11889.15 ms | 17.11 ms | 0.94 ms | 102.66 ms | 85.72% | 0.61% | 4652.14 ms | 329 |
| Knonik/Batched+mmap | 14271.71 ms | 19401.26 ms | 11807.67 ms | 29.06 ms | 1.13 ms | 100.95 ms | 77.66% | 0.91% | 5954.36 ms | 329 |
| PyTorch/naive | 17602.33 ms | 17736.56 ms | 17611.55 ms | 28.37 ms | 69.53 ms | 95.20 ms | 80.17% | 2.71% | 1224.06 ms | 369 |
| HuggingFace/Arrow | 17875.77 ms | 17960.78 ms | 18032.91 ms | 40.69 ms | 111.62 ms | 104.51 ms | 72.00% | 3.25% | 996.33 ms | 369 |
| PyTorch/cached | 113238.12 ms | 112545.50 ms | 114966.89 ms | 816.03 ms | 5796.79 ms | 83.92 ms | 11.36% | 26.83% | 7394.00 ms | 369 |
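As a worked check of the GPU Util definition, the sketch below recomputes utilization from the Wait Avg and Compute Avg columns of a few rows in the table above. It assumes that the total step time is approximately wait plus compute, ignoring H2D transfer time, which the table does not report; `approx_gpu_util` is a hypothetical helper.

```python
# (loader, wait_avg_ms, compute_avg_ms, reported_gpu_util_pct) copied from the table
ROWS = [
    ("Knonik/Pipelined", 2.53, 97.88, 97.48),
    ("Knonik/Batched", 17.11, 102.66, 85.72),
    ("HuggingFace/Arrow", 40.69, 104.51, 72.00),
    ("PyTorch/naive", 28.37, 95.20, 80.17),
]


def approx_gpu_util(wait_avg_ms: float, compute_avg_ms: float) -> float:
    """1 - wait/total, with total approximated as wait + compute (H2D ignored)."""
    total_step_ms = wait_avg_ms + compute_avg_ms
    return 100.0 * (1.0 - wait_avg_ms / total_step_ms)


if __name__ == "__main__":
    for name, wait, compute, reported in ROWS:
        estimate = approx_gpu_util(wait, compute)
        print(f"{name:18s} estimated {estimate:6.2f}%  reported {reported:6.2f}%")
```

Under this approximation the first three rows land within about 0.02 percentage points of the reported figures, while PyTorch/naive comes out roughly 3 points lower. A gap in that direction would be consistent with per-step time that the approximation leaves out, though the table alone does not confirm the cause.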
Epoch average by target GPU step time (lower is better):
| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| Knonik/OnDemand | 6947.52 ms | 12028.46 ms | 14344.75 ms | 21005.55 ms | 31178.94 ms | 49677.34 ms |
| Knonik/Pipelined | 7109.17 ms | 11124.87 ms | 14434.05 ms | 21408.47 ms | 30776.18 ms | 46518.91 ms |
| Knonik/Pipelined+mmap | 7220.33 ms | 11867.04 ms | 15025.73 ms | 21812.50 ms | 30615.88 ms | 48338.39 ms |
| Knonik/Batched | 10081.72 ms | 13147.49 ms | 14838.57 ms | 22386.07 ms | 32749.01 ms | 47793.26 ms |
| Knonik/Batched+mmap | 10673.37 ms | 14271.71 ms | 16260.71 ms | 23661.97 ms | 32029.54 ms | 49448.97 ms |
| PyTorch/naive | 17161.19 ms | 17602.33 ms | 21087.95 ms | 26657.59 ms | 36226.32 ms | 50790.25 ms |
| HuggingFace/Arrow | 17212.71 ms | 17875.77 ms | 19132.83 ms | 27207.12 ms | 36939.51 ms | 53552.00 ms |
| PyTorch/cached | 110949.90 ms | 113238.12 ms | 110640.48 ms | 114023.91 ms | 115950.91 ms | 101944.83 ms |
Estimated GPU utilization by target GPU step time (higher is better):
| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| Knonik/OnDemand | 94.86 % | 97.02 % | 97.52 % | 98.26 % | 98.87 % | 99.26 % |
| Knonik/Pipelined | 96.00 % | 97.48 % | 98.10 % | 98.66 % | 99.09 % | 99.40 % |
| Knonik/Pipelined+mmap | 94.89 % | 97.30 % | 97.91 % | 98.53 % | 98.94 % | 99.34 % |
| Knonik/Batched | 71.62 % | 85.72 % | 88.96 % | 92.62 % | 94.87 % | 96.49 % |
| Knonik/Batched+mmap | 60.35 % | 77.66 % | 84.35 % | 91.18 % | 93.34 % | 95.73 % |
| PyTorch/naive | 53.59 % | 80.17 % | 87.76 % | 90.92 % | 93.60 % | 95.25 % |
| HuggingFace/Arrow | 38.13 % | 72.00 % | 76.86 % | 85.56 % | 89.67 % | 92.99 % |
| PyTorch/cached | 8.35 % | 11.36 % | 14.03 % | 18.85 % | 24.72 % | 44.13 % |
Wait P95 by target GPU step time (lower is better):
| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| Knonik/OnDemand | 0.62 ms | 0.59 ms | 0.59 ms | 0.64 ms | 0.63 ms | 0.69 ms |
| Knonik/Pipelined | 0.93 ms | 0.88 ms | 0.95 ms | 0.76 ms | 0.74 ms | 0.63 ms |
| Knonik/Pipelined+mmap | 0.95 ms | 0.89 ms | 0.75 ms | 0.73 ms | 0.66 ms | 0.58 ms |
| Knonik/Batched | 1.03 ms | 0.94 ms | 0.91 ms | 0.79 ms | 0.72 ms | 0.75 ms |
| Knonik/Batched+mmap | 0.84 ms | 1.13 ms | 0.88 ms | 0.69 ms | 0.67 ms | 0.62 ms |
| PyTorch/naive | 350.48 ms | 69.53 ms | 16.49 ms | 15.72 ms | 14.51 ms | 15.36 ms |
| HuggingFace/Arrow | 480.37 ms | 111.62 ms | 82.25 ms | 64.04 ms | 73.38 ms | 64.19 ms |
| PyTorch/cached | 5849.99 ms | 5796.79 ms | 3486.36 ms | 5500.18 ms | 5276.42 ms | 2448.90 ms |
A PyTorch DataLoader reading raw HDF5 files compared against Knonik loading the same data in Knonik-compressed format. HDF5 is the dominant format in robot learning codebases (ACT, Diffusion Policy, ALOHA), so this test quantifies the cost of staying uncompressed.
Test 2 detailed results at the 80 ms target GPU step time:
| Loader | Epoch Avg | Epoch 1 | Epoch Last | Wait Avg | Wait P95 | Compute Avg | GPU Util | Slow Steps | Cold Start | Steps |
|---|---|---|---|---|---|---|---|---|---|---|
| Knonik/test-Pipelined | 11139.66 ms | 11044.93 ms | 11157.82 ms | 2.46 ms | 0.89 ms | 97.48 ms | 97.54% | 0.30% | 549.17 ms | 334 |
| Knonik/test-OnDemand | 11458.38 ms | 11320.43 ms | 11506.53 ms | 3.12 ms | 0.55 ms | 98.16 ms | 96.92% | 0.88% | 378.62 ms | 339 |
| Knonik/test-Batched | 13715.38 ms | 15985.30 ms | 12719.59 ms | 15.35 ms | 0.91 ms | 109.60 ms | 87.72% | 0.61% | 4421.64 ms | 329 |
| HDF5/PyTorch | 14331.42 ms | 13995.98 ms | 14666.33 ms | 10.08 ms | 10.82 ms | 104.45 ms | 91.20% | 0.80% | 360.84 ms | 375 |
Epoch average by target GPU step time (lower is better):
| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| Knonik/test-OnDemand | 7052.02 ms | 11458.38 ms | 14620.52 ms | 19156.38 ms | 28739.06 ms | 46602.07 ms |
| Knonik/test-Pipelined | 8016.29 ms | 11139.66 ms | 14434.32 ms | 20468.30 ms | 30698.24 ms | 49449.93 ms |
| Knonik/test-Batched | 9410.90 ms | 13715.38 ms | 14982.71 ms | 21862.60 ms | 31887.81 ms | 47644.03 ms |
| HDF5/PyTorch | 9818.50 ms | 14331.42 ms | 17334.10 ms | 25212.11 ms | 37186.92 ms | 51858.74 ms |
Estimated GPU utilization by target GPU step time (higher is better):
| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| Knonik/test-OnDemand | 94.87 % | 96.92 % | 97.48 % | 98.14 % | 98.77 % | 99.25 % |
| Knonik/test-Pipelined | 96.57 % | 97.54 % | 98.03 % | 98.55 % | 99.08 % | 99.43 % |
| Knonik/test-Batched | 70.19 % | 87.72 % | 89.53 % | 92.70 % | 94.95 % | 96.64 % |
| HDF5/PyTorch | 87.02 % | 91.20 % | 92.93 % | 95.37 % | 96.58 % | 97.73 % |
Wait P95 by target GPU step time (lower is better):
| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| Knonik/test-OnDemand | 0.56 ms | 0.55 ms | 0.64 ms | 0.60 ms | 0.61 ms | 0.61 ms |
| Knonik/test-Pipelined | 1.03 ms | 0.89 ms | 0.81 ms | 0.80 ms | 0.70 ms | 0.66 ms |
| Knonik/test-Batched | 1.14 ms | 0.91 ms | 1.09 ms | 0.87 ms | 0.63 ms | 0.65 ms |
| HDF5/PyTorch | 10.24 ms | 10.82 ms | 10.49 ms | 9.75 ms | 9.85 ms | 9.36 ms |
A LeRobot v3 format dataset with 4 camera streams plus action and state. Tests three loaders: the standard LeRobot/PyTorch-delta10 loader (10-step delta chunks), LeRobotV3/HuggingFace Arrow, and Knonik v3 OnDemand. All loaders process the same dataset at the same batch size and worker count.
Test 3 detailed results at the 80 ms target GPU step time:
| Loader | Epoch Avg | Epoch 1 | Epoch Last | Wait Avg | Wait P95 | Compute Avg | GPU Util | Slow Steps | Cold Start | Steps |
|---|---|---|---|---|---|---|---|---|---|---|
| LeRobot/PyTorch-delta10 | 1049219.96 ms | 1052626.55 ms | 1044428.89 ms | 294.65 ms | 1101.69 ms | 82.96 ms | 39.79% | 90.97% | 4237.11 ms | 6432 |
| LeRobotV3/HuggingFace | 47396.87 ms | 47345.75 ms | 47517.88 ms | 127.80 ms | 717.45 ms | 92.52 ms | 42.02% | 13.49% | 1899.48 ms | 645 |
| Knonik/v3-OnDemand | 23651.99 ms | 23393.36 ms | 23806.98 ms | 6.14 ms | 1.66 ms | 115.86 ms | 94.97% | 0.52% | 1107.15 ms | 581 |
Epoch average by target GPU step time (lower is better):
| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| LeRobot/PyTorch-delta10 | 1038784.91 ms | 1049219.96 ms | 1042843.79 ms | 1055643.57 ms | 1113293.16 ms | 1332737.62 ms |
| LeRobotV3/HuggingFace | 46600.29 ms | 47396.87 ms | 47454.97 ms | 53355.45 ms | 65495.01 ms | 94582.25 ms |
| Knonik/v3-OnDemand | 19351.68 ms | 23651.99 ms | 27227.39 ms | 46508.11 ms | 61666.45 ms | 80102.92 ms |
Estimated GPU utilization by target GPU step time (higher is better):
| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| LeRobot/PyTorch-delta10 | 34.05 % | 39.79 % | 45.38 % | 56.86 % | 67.15 % | 75.19 % |
| LeRobotV3/HuggingFace | 24.49 % | 42.02 % | 55.37 % | 80.96 % | 86.06 % | 90.94 % |
| Knonik/v3-OnDemand | 65.69 % | 94.97 % | 95.79 % | 97.54 % | 98.25 % | 98.54 % |
Wait P95 by target GPU step time (lower is better):
| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| LeRobot/PyTorch-delta10 | 1327.28 ms | 1101.69 ms | 937.33 ms | 529.83 ms | 206.53 ms | 225.98 ms |
| LeRobotV3/HuggingFace | 1000.13 ms | 717.45 ms | 507.53 ms | 59.45 ms | 55.39 ms | 53.60 ms |
| Knonik/v3-OnDemand | 135.78 ms | 1.66 ms | 1.61 ms | 1.83 ms | 1.55 ms | 1.58 ms |
Shows the fastest epoch-average loader for each test at each GPU latency.
| Test | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| Baseline vs Knonik vs PyTorch vs HuggingFace | Knonik/OnDemand (6947.52 ms) | Knonik/Pipelined (11124.87 ms) | Knonik/OnDemand (14344.75 ms) | Knonik/OnDemand (21005.55 ms) | Knonik/Pipelined+mmap (30615.88 ms) | Knonik/Pipelined (46518.91 ms) |
| HDF5 (uncompressed) vs Knonik (compressed) | Knonik/test-OnDemand (7052.02 ms) | Knonik/test-Pipelined (11139.66 ms) | Knonik/test-Pipelined (14434.32 ms) | Knonik/test-OnDemand (19156.38 ms) | Knonik/test-OnDemand (28739.06 ms) | Knonik/test-OnDemand (46602.07 ms) |
| LeRobot v3 vs Knonik | Knonik/v3-OnDemand (19351.68 ms) | Knonik/v3-OnDemand (23651.99 ms) | Knonik/v3-OnDemand (27227.39 ms) | Knonik/v3-OnDemand (46508.11 ms) | Knonik/v3-OnDemand (61666.45 ms) | Knonik/v3-OnDemand (80102.92 ms) |
Key findings from Tests 1 and 2, based on measured benchmark data.
The starkest result in Test 1 is the PyTorch/cached loader. Despite using a per-worker LRU cache to eliminate redundant video decoding, it achieves only 8 to 44% estimated GPU utilization across the six latency targets. Wait P95 values range from about 2,400 ms to over 5,800 ms, meaning that in the worst 1-in-20 steps the GPU sits idle for up to nearly six seconds waiting for a single batch. The benchmark attributes this not to configuration but to the overhead of multiprocess worker coordination, memory sharing, and collation in PyTorch DataLoader when cached video episodes are involved. Even at 300 ms GPU step time, the epoch average barely improves compared to 50 ms (about 102,000 ms versus 111,000 ms): the loader, not the compute, is the bottleneck.
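For readers unfamiliar with the pattern, a per-worker LRU episode cache might look roughly like the sketch below. It is a generic illustration, not the code used in the benchmark: `EpisodeLRUCache` and the `decode` callback are hypothetical, and a list of integers stands in for decoded frames. With `num_workers > 0`, each PyTorch DataLoader worker process holds its own copy of the dataset object, so a cache stored on the dataset is not shared between workers.

```python
from collections import OrderedDict
from typing import Callable, List


class EpisodeLRUCache:
    """Keeps the most recently used decoded episodes, evicting the oldest when full."""

    def __init__(self, decode: Callable[[int], List[int]], capacity: int) -> None:
        self._decode = decode      # hypothetical decoder: episode id -> frames
        self._capacity = capacity  # maximum number of episodes held in memory
        self._store: "OrderedDict[int, List[int]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, episode_id: int) -> List[int]:
        if episode_id in self._store:
            self.hits += 1
            self._store.move_to_end(episode_id)  # mark as most recently used
            return self._store[episode_id]
        self.misses += 1
        frames = self._decode(episode_id)         # expensive path: decode the episode
        self._store[episode_id] = frames
        if len(self._store) > self._capacity:
            self._store.popitem(last=False)       # evict the least recently used
        return frames


if __name__ == "__main__":
    cache = EpisodeLRUCache(decode=lambda ep: [ep] * 4, capacity=2)
    for episode in [0, 1, 0, 2, 1]:
        cache.get(episode)
    print(cache.hits, cache.misses)  # prints "1 4"
```

In the usage example, only the second request for episode 0 is a hit. Episode 1 is evicted when episode 2 arrives, so the final request for it has to decode again.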
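All Knonik loader variants (Batched, Pipelined, and OnDemand, with and without mmap) keep wait P95 between 0.58 ms and 1.13 ms at every tested latency. For the Pipelined and OnDemand variants, estimated GPU utilization stays between 94% and 99%, so the data pipeline largely disappears from the training loop. The Batched variants are the exception: their utilization ranges from 60% to 96%, and at 80 ms their multi-second cold start (4,652 to 5,954 ms) and slower first epoch suggest that start-up cost, rather than steady-state stalls, accounts for the lower figure. The difference from the PyTorch and HuggingFace loaders is most pronounced at low GPU latencies (50 to 100 ms), where data pipeline efficiency matters most. At 50 ms, Knonik epoch times range from roughly 6,900 to 10,700 ms, while PyTorch/naive takes about 17,200 ms and PyTorch/cached takes over 110,000 ms.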
Knonik's mmap variants (Batched+mmap, Pipelined+mmap) show small performance differences versus their in-memory counterparts, sometimes slightly faster, sometimes slightly slower depending on latency. The baseline dataset fits well within the OS page cache, meaning mmap's benefit of avoiding repeated decode work is already captured by the kernel's file cache. The mmap variants become more meaningful on larger datasets where decoded tensors exceed available RAM.
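As a generic illustration of memory-mapping a file of decoded data, and not a description of Knonik's cache layout, the sketch below writes fixed-size records to a file and reads one back through Python's standard `mmap` module. The record size, file name, and helper functions are hypothetical. Pages read this way are served from the OS page cache once they have been touched, which is the effect the paragraph above refers to.

```python
import mmap
import os
import tempfile

FRAME_BYTES = 16  # hypothetical size of one decoded frame record


def write_cache(path: str, num_frames: int) -> None:
    """Write num_frames fixed-size records; record i is filled with byte value i."""
    with open(path, "wb") as f:
        for i in range(num_frames):
            f.write(bytes([i % 256]) * FRAME_BYTES)


def read_frame(mm: mmap.mmap, index: int) -> bytes:
    """Slice one record out of the mapping; only the touched pages are faulted in."""
    start = index * FRAME_BYTES
    return mm[start:start + FRAME_BYTES]


def demo() -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "frames.bin")
        write_cache(path, num_frames=8)
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                return read_frame(mm, 3)


if __name__ == "__main__":
    print(demo())
```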
In Test 2, the HDF5/PyTorch loader reads uncompressed HDF5 files, the most common format in robot learning codebases. Despite HDF5 being an established format with chunked I/O support, its wait P95 sits between 9.4 and 10.8 ms at every latency, against 0.55 to 1.14 ms for the Knonik variants loading the same data in Knonik-compressed format, and its epoch average is the slowest of the four loaders at every latency. This cost compounds across epochs: because HDF5 files are shared across workers via memory-mapped file handles, the loader struggles to hide I/O latency even with prefetching. Knonik's compressed format is designed specifically for chunk-sequential random access across episodes, which maps efficiently onto the access pattern of trajectory training.
Across both tests, the performance gap between Knonik and alternatives narrows as GPU step time increases. At 300 ms, even the PyTorch/naive loader reaches 95.25% estimated GPU utilization in Test 1, because the GPU is slow enough that any pipeline has time to prefetch the next batch. As model training gets faster through hardware improvements and efficient architectures, the dataloader becomes an increasingly critical bottleneck.
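The narrowing gap can be illustrated with a deliberately simplified model, assuming a loader that overlaps loading with compute perfectly and needs a fixed time per batch. Both the model and the 120 ms load time are assumptions made for illustration; the measured results above also include tail stalls and start-up costs that this model ignores.

```python
def modeled_utilization(compute_ms: float, load_ms: float) -> float:
    """Idealised overlapped pipeline: each step takes max(compute, load)."""
    step_ms = max(compute_ms, load_ms)
    return compute_ms / step_ms


if __name__ == "__main__":
    LOAD_MS = 120.0  # hypothetical per-batch load time of a slow loader
    for compute_ms in (50, 80, 100, 150, 200, 300):
        util = modeled_utilization(compute_ms, LOAD_MS)
        print(f"{compute_ms:3d} ms step: modeled utilization {util:.0%}")
```

In this model, a loader that needs 120 ms per batch caps utilization well below 100% at the 50 to 100 ms step times. Once compute takes longer than loading, it stops being the limiting factor. That matches the qualitative trend in the tables.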