Four independent experiments measuring how Knonik loads data compared to PyTorch, HuggingFace datasets, and HDF5 on real robot episode datasets. All benchmarks run on identical hardware with calibrated GPU simulation workloads. The fourth experiment isolates GPU utilisation across model sizes - showing how Knonik scales as training steps get heavier.
DataLoader throughput is the most under-measured variable in robotics training. Teams optimise model architecture for hours and then train on a loader that stalls the GPU 30% of the time. These experiments measure the real cost of that choice across three independent setups.
The benchmark script runs experiments 1–3 in a single pass using the same GPU simulation, batch size, chunk length, and epoch count. Experiment 4 re-runs the three Knonik loader modes at three different GPU step durations (50 ms, 150 ms, 300 ms) to measure how utilisation scales with model weight. Results are written to JSON after each run. All numbers on this page come directly from those outputs.
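In outline, the per-step measurement looks roughly like the sketch below - a simplified stand-in, not the actual benchmark script. Loader construction, the calibrated matmul workload, and most of the recorded metrics are elided, and the function name is illustrative.

```python
import json
import time

def run_loader_benchmark(loader, gpu_step_ms=50, epochs=3, out_path="results.json"):
    """Simplified sketch of the measurement loop: time how long the training
    loop waits for each batch, simulate a fixed-length GPU step, and derive
    epoch times and utilisation from those two quantities."""
    epoch_times_ms, waits_ms = [], []
    for _ in range(epochs):
        epoch_start = time.perf_counter()
        it = iter(loader)
        while True:
            t0 = time.perf_counter()
            try:
                batch = next(it)                 # data-loader wait happens here
            except StopIteration:
                break
            waits_ms.append((time.perf_counter() - t0) * 1000)
            time.sleep(gpu_step_ms / 1000)       # stand-in for the calibrated GPU matmul step
        epoch_times_ms.append((time.perf_counter() - epoch_start) * 1000)

    gpu_ms = gpu_step_ms * len(waits_ms)
    results = {
        "epoch_ms": epoch_times_ms,
        "wait_avg_ms": sum(waits_ms) / len(waits_ms),
        "gpu_util": gpu_ms / (gpu_ms + sum(waits_ms)),
    }
    with open(out_path, "w") as f:               # results dumped to JSON after each run
        json.dump(results, f, indent=2)
    return results
```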
Same 50-episode dataset. One in raw HDF5 format, one compressed by Knonik to 126 MB. Does HDF5's mmap advantage outweigh a 146× size penalty?
Same 49-episode LeRobot v3 dataset. LeRobot v3 is the format teams are adopting. This is the comparison most relevant to teams building on the HuggingFace robotics ecosystem.
Knonik Batched, OnDemand, and Pipelined vs PyTorch naive, PyTorch cached, and HuggingFace Arrow - all on the same VP9 dataset. Isolates loader architecture from data format.
Three Knonik loader modes benchmarked at 50 ms, 150 ms, and 300 ms GPU step times. Shows how loader efficiency translates to GPU utilisation as model size increases.
50 episodes × 400 timesteps. HDF5 uncompressed at 18.4 GB vs Knonik compressed at 126 MB. Same data, same GPU sim, 3 epochs.
| Loader | Data Size | Cold Start | Wait Avg | Wait p50 | Wait p95 | Slow Steps | Epoch 1 | Epoch 3 | Epoch Avg | GPU Util |
|---|---|---|---|---|---|---|---|---|---|---|
| HDF5/PyTorch | 18.4 GB | 308 ms | 32.0 ms | 9.1 ms | 128.4 ms | 2.7% | 6,304 ms | 3,845 ms | 4,663 ms | 61.0% |
| Knonik/OnDemand | 126 MB | 308 ms | 27.6 ms | 17.7 ms | 84.3 ms | 2.1% | 3,618 ms | 3,612 ms | 3,618 ms | 64.4% |
| Knonik/Batched | 126 MB | 4,019 ms | 39.2 ms | 1.2 ms | 22.7 ms | 1.0% | 11,659 ms | 1,156 ms | 4,649 ms | 56.1% |
| Knonik/Pipelined | 126 MB | 2,827 ms | 45.6 ms | 1.4 ms | 26.6 ms | 1.6% | 13,164 ms | 1,336 ms | 5,320 ms | 52.3% |
With all three Knonik modes, the picture is clear: Knonik/OnDemand beats HDF5 outright in epoch average (3,618 ms vs 4,663 ms) with no warm-up cost. Knonik/Batched ties HDF5 in the 3-epoch average (4,649 ms vs 4,663 ms) but drops to 1,156 ms by epoch 3 - a 3.3× advantage. All Knonik modes have dramatically lower p95 tail latency (22–84 ms vs 128 ms), meaning fewer GPU stalls every run. HDF5's mmap speed advantage only shows up in Pipelined's 3-epoch average, which pays a heavy epoch-1 cold start. And all of this at 146× less storage.
49 episodes × 700 timesteps. LeRobot v3 in HuggingFace Arrow format (800 MB) vs Knonik compressed (189 MB). The most relevant comparison for teams building on the HuggingFace ecosystem.
| Loader | Data Size | Cold Start | Wait Avg | Wait p50 | Wait p95 | Slow Steps | Epoch 1 | Epoch 3 | Epoch Avg | GPU Util |
|---|---|---|---|---|---|---|---|---|---|---|
| LeRobotV3/HuggingFace | 800 MB | 559 ms | 87.4 ms | 17.9 ms | 291.5 ms | 26.8% | 19,985 ms | 19,790 ms | 19,934 ms | 36.4% |
| LeRobot/PyTorch | 800 MB | 1,200 ms | 121.1 ms | 13.8 ms | 839.9 ms | 18.2% | 287,485 ms | 288,268 ms | 287,608 ms | 29.2% |
| Knonik/OnDemand | 189 MB | 244 ms | 18.4 ms | 13.4 ms | 54.5 ms | 0.5% | 4,651 ms | 4,552 ms | 4,602 ms | 73.1% |
| Knonik/Batched | 189 MB | 4,983 ms | 28.4 ms | 1.3 ms | 24.1 ms | 0.5% | 14,925 ms | 2,009 ms | 6,314 ms | 63.8% |
| Knonik/Pipelined | 189 MB | 3,556 ms | 51.0 ms | 3.2 ms | 38.3 ms | 1.2% | 26,645 ms | 2,815 ms | 10,619 ms | 49.5% |
All three approaches - HuggingFace, PyTorch, and Knonik - run over the same LeRobot v3 dataset on the same hardware.
Every Knonik mode beats HuggingFace. Knonik/OnDemand is 4.3× faster with a 244 ms cold start and 73.1% GPU utilisation - the highest of any loader across the three loading experiments. Knonik/Batched drops to 2,009 ms by epoch 3 (10× faster than HF's flat 19,790 ms). Even Knonik/Pipelined, which pays a large epoch-1 cold start, reaches 2,815 ms by epoch 3. LeRobot v3 with HuggingFace shows zero warm-up benefit across all 3 epochs - video decodes live every epoch with no caching. Slow steps at 26.8% mean the GPU stalls more than one batch in four.
Knonik Batched, OnDemand, and Pipelined vs PyTorch naive, PyTorch cached, and HuggingFace Arrow - all on the same VP9 dataset. 49 episodes, 3 epochs.
Knonik/Batched - Pre-decodes all episodes serially before the epoch starts. Cross-epoch RAM cache.
Knonik/OnDemand - Streams chunks on-the-fly via EpisodeStreamPool. No caching. Lowest cold start.
Knonik/Pipelined - Parallel decode via ProcessPoolExecutor + POSIX SharedMemory IPC. Cross-epoch RAM cache.
PyTorch/naive - Standard DataLoader. Each __getitem__ seeks to the VP9 frame position and decodes chunk_len frames.
PyTorch/cached - Worker-local LRU episode cache (8 episodes/worker). Full episode decoded on first access.
HuggingFace/Arrow - Arrow on disk for numerics. Video decoded per-chunk per-epoch via set_transform().
Average time per epoch across 3 epochs for all six loaders.
| Loader | Cold Start | Wait Avg | Wait p50 | Wait p95 | Slow Steps | Epoch 1 | Epoch 3 | Epoch Avg | GPU Util |
|---|---|---|---|---|---|---|---|---|---|
| Knonik/Pipelined | 509 ms | 21.5 ms | 1.7 ms | 145.5 ms | 4.7% | 5,824 ms | 1,325 ms | 2,831 ms | 69.9% |
| Knonik/OnDemand | 346 ms | 29.4 ms | 18.1 ms | 94.6 ms | 1.8% | 3,886 ms | 3,804 ms | 3,815 ms | 63.0% |
| Knonik/Batched | 4,390 ms | 41.4 ms | 1.2 ms | 27.4 ms | 1.0% | 12,346 ms | 1,168 ms | 4,890 ms | 54.7% |
| PyTorch/naive | 1,547 ms | 102.2 ms | 10.8 ms | 659.2 ms | 17.1% | 15,686 ms | 15,394 ms | 15,434 ms | 32.9% |
| HuggingFace/Arrow | 1,698 ms | 121.7 ms | 15.4 ms | 719.3 ms | 19.0% | 15,879 ms | 15,425 ms | 15,627 ms | 29.1% |
| PyTorch/cached | 5,322 ms | 1,873 ms | 1,798 ms | 4,195 ms | 54.2% | 232,644 ms | 232,133 ms | 232,685 ms | 2.6% |
Higher is better. Measures how much of the training loop the GPU is actually working vs waiting for data.
Knonik's cross-epoch cache makes it faster the longer you train. PyTorch and HuggingFace stay flat.
| Epochs | Knonik/Batched | Knonik/OnDemand | Knonik/Pipelined | PyTorch/naive | HF/Arrow |
|---|---|---|---|---|---|
| 1 | 12,346 ms | 3,886 ms | 5,824 ms | 15,686 ms | 15,879 ms |
| 3 | 4,890 ms | 3,815 ms | 2,831 ms | 15,434 ms | 15,627 ms |
| 10 | 2,286 ms | 3,812 ms | 1,775 ms | 15,434 ms | 15,627 ms |
| 50 | 1,392 ms | 3,806 ms | 1,415 ms | 15,434 ms | 15,627 ms |
| 100 | 1,280 ms | 3,805 ms | 1,370 ms | 15,434 ms | 15,627 ms |
The same three Knonik loader modes run with simulated GPU steps of 50 ms, 150 ms, and 300 ms - representing light, medium, and heavy model workloads. As the model takes longer per step, the dataloader's wait time becomes relatively smaller and GPU utilisation climbs.
| GPU Step | Knonik/Batched | Knonik/OnDemand | Knonik/Pipelined |
|---|---|---|---|
| 50 ms | 55.5% | 63.7% | 61.0% |
| 150 ms | 78.6% | 84.1% | 89.8% |
| 300 ms | 87.9% | 91.4% | 94.5% |
Knonik's dataloader wait times are essentially constant regardless of GPU step duration - the loader doesn't know or care how long the model takes. What changes is the ratio of wait time to compute time. At 50 ms GPU steps, the ~40 ms average wait represents a meaningful fraction of each cycle. At 300 ms, that same 40 ms wait is a small overhead. Pipelined reaches 94.5% GPU utilisation at 300 ms - meaning the GPU is working 94.5% of the time and waiting on data only 5.5%. The takeaway: Knonik's loader performance naturally scales with model size. The heavier your model, the less the loader matters - and it was already fast.
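Since the wait averages barely move with step length, a back-of-the-envelope ratio reproduces the table above fairly closely. A quick sketch using the Experiment 3 wait averages - an approximation that ignores tail stalls, not benchmark output (Experiment 4 is a separate run, which is why the measured Pipelined value at 50 ms differs most):

```python
# Back-of-the-envelope model: util ≈ gpu_step / (gpu_step + avg_wait).
wait_avg_ms = {"Knonik/Batched": 41.4, "Knonik/OnDemand": 29.4, "Knonik/Pipelined": 21.5}

for mode, wait in wait_avg_ms.items():
    utils = [step / (step + wait) for step in (50, 150, 300)]
    print(mode, [f"{u:.1%}" for u in utils])
# Knonik/Batched    ['54.7%', '78.4%', '87.9%']   measured: 55.5% / 78.6% / 87.9%
# Knonik/OnDemand   ['63.0%', '83.6%', '91.1%']   measured: 63.7% / 84.1% / 91.4%
# Knonik/Pipelined  ['69.9%', '87.5%', '93.3%']   measured: 61.0% / 89.8% / 94.5%
```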
What the four experiments tell us together - about format choice, caching strategy, model scale, and where GPU time is actually going.
HDF5 has fast mmap random access, but it only shows in the cold-start window: Knonik/OnDemand already beats it on the 3-epoch average (3,618 ms vs 4,663 ms), and by epoch 3 the cross-epoch cache flips the result decisively (1,156 ms for Batched vs 3,845 ms). The real question is whether 18.4 GB of storage is worth winning only epoch 1 against the cached modes.
LeRobot v3 is the format robotics teams are adopting. Knonik/OnDemand loads it 4.3× faster per epoch, reaching 73.1% GPU utilisation vs 36.4%. Knonik/Batched reaches 12.1× better p95 tail latency. Slow steps drop from 26.8% to 0.5–1.2% across the Knonik modes.
PyTorch and HuggingFace re-decode video every epoch. Knonik decodes once and reuses from RAM. This is the main source of the warm-epoch speedups across experiments 1, 2, and 3 once training extends past 2–3 epochs.
Smaller data loads faster because more fits in RAM and CPU cache. 18.4 GB to 126 MB is not just cheaper to store - it is the reason Knonik's epoch 3 beats HDF5 despite HDF5 winning epoch 1 against the cached modes.
PyTorch/cached is 82× slower than Knonik/Pipelined - worse than naive seeking. Shuffled access destroys LRU hit rates. Each worker holds a bounded cache, but the training loop invalidates it faster than it fills.
A 10 ms median wait can coexist with a 719 ms p95. Every slow step stalls the GPU completely. Knonik's p95 stays under 150 ms across all modes. HuggingFace Arrow hits 719 ms p95 - a full stall on 19% of batches.
True GIL-free parallelism via OS processes. SharedMemory IPC reduces transfer cost from ~100 ms (pickle) to ~18 ms per 370 MB episode, enabling Knonik/Pipelined to reach 69.9% GPU utilisation. A minimal sketch of the hand-off pattern is shown below.
OnDemand wins at 1–2 epochs (instant cold start). From roughly epoch 3 onward, Pipelined's cache amortises and takes the lead. Batched overtakes Pipelined at around 50 epochs but pays a large epoch-1 penalty.
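The SharedMemory hand-off can be sketched in a few lines. This is a generic ProcessPoolExecutor + multiprocessing.shared_memory illustration under assumed episode shapes, with the decode stubbed out - not Knonik's actual implementation.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import shared_memory

def decode_episode_to_shm(ep_idx):
    """Runs in a worker process: decode an episode (stubbed here as zeros),
    copy it into a SharedMemory block, and return only the block's name plus
    array metadata - avoiding a large pickle transfer back to the parent."""
    frames = np.zeros((400, 3, 224, 224), dtype=np.uint8)   # stand-in for a real VP9 decode
    shm = shared_memory.SharedMemory(create=True, size=frames.nbytes)
    np.ndarray(frames.shape, frames.dtype, buffer=shm.buf)[:] = frames
    shm.close()                                   # parent re-opens the block by name
    return shm.name, frames.shape, str(frames.dtype)

def attach_episode(name, shape, dtype):
    """Runs in the training process: map the worker's block, copy it into the
    cross-epoch RAM cache, then release the shared segment."""
    shm = shared_memory.SharedMemory(name=name)
    episode = np.ndarray(shape, dtype=np.dtype(dtype), buffer=shm.buf).copy()
    shm.close()
    shm.unlink()                                  # free the shared segment
    return episode

if __name__ == "__main__":
    cache = {}
    with ProcessPoolExecutor(max_workers=2) as pool:
        for ep_idx, meta in enumerate(pool.map(decode_episode_to_shm, range(4))):
            cache[ep_idx] = attach_episode(*meta)
    print(len(cache), "episodes decoded into RAM")
```

Returning only the block name keeps the IPC payload tiny compared with pickling the array itself; the trade-off is that some process has to own each segment's lifetime (close vs unlink), which a production loader has to manage carefully.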
A close reading of the benchmark numbers produces questions that aren't obvious from the summary tables. This section works through eight of them in detail - covering cache behaviour, tail latency, GPU utilisation mechanics, and how to pick the right loader for a given training run.
PyTorch/cached decodes a full episode on first access and stores it in a per-worker LRU cache (8 episodes per worker). The theory: pay the decode cost once, reuse the result for all future batches drawn from that episode. In practice, this assumption collapses under shuffled sampling.
With 8 parallel workers, each worker maintains its own independent LRU. There is no shared cache between workers. If worker 0 caches episode 47, worker 1 has no visibility of that - it will decode episode 47 again from scratch the next time it receives a batch request for that episode. Across 8 workers pulling random samples from a large dataset, the probability that the same worker gets the same episode before it falls out of an 8-slot LRU is very low.
The result is near-zero cache hit rate, combined with a higher per-miss cost than naive seeking. PyTorch/naive decodes only chunk_len = 10 frames per __getitem__ call. PyTorch/cached decodes the entire episode on every miss. With a median wait of 1,798 ms and 54.2% of steps classified as slow, the cache is paying the maximum possible cost on every request.
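To make the failure mode concrete, here is a minimal sketch of a worker-local LRU episode dataset. The decode helper is a stand-in (a real loader would seek into the VP9 file with PyAV or similar); the cache structure is the point.

```python
import time
from collections import OrderedDict

import numpy as np
from torch.utils.data import Dataset

def fake_decode_full_episode(ep_idx, n_frames=400):
    """Stand-in for decoding a whole episode from VP9 - the expensive miss
    path. The sleep mimics the decode cost."""
    time.sleep(0.5)
    return np.zeros((n_frames, 224, 224, 3), dtype=np.uint8)

class CachedEpisodeDataset(Dataset):
    """Worker-local LRU over full episodes, as described above. With
    num_workers > 0 every DataLoader worker holds its own copy of
    `self.cache`, so under shuffle=True the hit rate collapses."""

    def __init__(self, n_episodes=49, ep_len=400, chunk_len=10, cache_size=8):
        self.n_episodes, self.ep_len = n_episodes, ep_len
        self.chunk_len, self.cache_size = chunk_len, cache_size
        self.cache = OrderedDict()

    def __len__(self):
        return self.n_episodes * (self.ep_len // self.chunk_len)

    def __getitem__(self, idx):
        ep, chunk = divmod(idx, self.ep_len // self.chunk_len)
        if ep not in self.cache:                   # almost always a miss when shuffled
            if len(self.cache) >= self.cache_size:
                self.cache.popitem(last=False)     # evict the least recently used episode
            self.cache[ep] = fake_decode_full_episode(ep, self.ep_len)
        self.cache.move_to_end(ep)                 # mark as most recently used
        start = chunk * self.chunk_len
        return self.cache[ep][start:start + self.chunk_len]
```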
Rule of thumb: Worker-local LRU caches are only beneficial when the access pattern is sequential and worker-stable - i.e. the same worker consistently gets batches from the same episode. Shuffled training is the opposite. If your DataLoader uses shuffle=True, episode-level LRU caching per worker will almost always hurt.
These two metrics measure different things, and the tension between them is real. Epoch average measures total wall-clock time per epoch - it is dominated by the warm epochs once the cross-epoch cache is populated. p95 wait latency measures the 95th-percentile per-batch data fetch time, capturing outlier stalls within a single epoch.
Pipelined's architecture involves parallel decode processes writing into shared memory, with the training loop consuming from that buffer. Most steps receive data with near-zero wait (p50: 1.7 ms). But occasionally the prefetch pipeline stalls - a decode process falls behind, the buffer drains, and the training loop has to wait. That stall shows up as a high p95 (145.5 ms) and a 4.7% slow-step rate.
OnDemand, by contrast, streams data on-the-fly without the overhead of managing a parallel decode pipeline. Its p50 is 18.1 ms and p95 is 94.6 ms - more consistent, but with no cross-epoch cache benefit. By epoch 3, Pipelined is at 1,325 ms/epoch vs OnDemand's 3,804 ms - the cache advantage overwhelms the occasional stall penalty.
When p95 matters more than epoch avg: if your training step is very short (small model, fast GPU), a 145 ms stall is a large relative cost. For longer GPU compute steps (50+ ms per batch), p95 latency is less impactful and epoch average is the right metric to optimise.
OnDemand has essentially no cold-start penalty (346 ms) because it does not pre-decode or pre-fill anything. It fetches and decodes data as each batch is requested. This makes it the fastest loader at epoch 1 by a large margin - 3,886 ms vs 5,824 ms for Pipelined and 12,346 ms for Batched.
However, OnDemand pays the same decoding cost on every epoch because it does not maintain a cross-epoch cache. Each epoch is a fresh pass. Pipelined and Batched both build a RAM cache during their first epoch. From epoch 2 onward, data is served from memory - not decoded from storage. The warm epoch time for Pipelined drops to ~1,325 ms and for Batched to ~1,168 ms. OnDemand stays flat at ~3,804 ms every epoch.
Projected epoch averages for longer runs (the 1- and 3-epoch rows are measured; the longer runs are projected using cold-start + warm-epoch amortisation):
| Epochs | Batched avg | OnDemand avg | Pipelined avg |
|---|---|---|---|
| 1 | 12,346 ms | 3,886 ms ✓ | 5,824 ms |
| 3 | 4,890 ms | 3,815 ms | 2,831 ms ✓ |
| 10 | 2,286 ms | 3,812 ms | 1,775 ms ✓ |
| 50 | 1,392 ms ✓ | 3,806 ms | 1,415 ms |
| 100 | 1,280 ms ✓ | 3,805 ms | 1,370 ms |
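The projected rows follow directly from the amortisation: one measured cold epoch plus (N − 1) measured warm epochs, divided by N. A quick sketch using the Experiment 3 numbers (the 1- and 3-epoch rows above are measured, so they are not reproduced here):

```python
def projected_epoch_avg(epoch1_ms, warm_ms, n_epochs):
    """Amortised per-epoch time: one cold epoch, then (n - 1) warm epochs."""
    return (epoch1_ms + (n_epochs - 1) * warm_ms) / n_epochs

# (epoch-1 time, warm-epoch time) from the Experiment 3 results
measured = {
    "Knonik/Batched":   (12_346, 1_168),
    "Knonik/OnDemand":  (3_886, 3_804),
    "Knonik/Pipelined": (5_824, 1_325),
}
for name, (epoch1, warm) in measured.items():
    print(name, [round(projected_epoch_avg(epoch1, warm, n)) for n in (10, 50, 100)])
# Knonik/Batched   [2286, 1392, 1280]
# Knonik/OnDemand  [3812, 3806, 3805]
# Knonik/Pipelined [1775, 1415, 1370]
```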
The crossover happens around epoch 2–3 for Pipelined vs OnDemand, and around epoch 50 for Batched overtaking Pipelined (Batched's lower warm-epoch time eventually wins over Pipelined's lower cold-start). OnDemand never wins past epoch 1 once caching loaders warm up.
HuggingFace's LeRobot v3 format stores numeric streams (actions, proprioception) in Apache Arrow files - memory-mappable and fast to access. Video frames are stored in compressed formats (h264, webp) and decoded per-chunk via a set_transform() hook at batch time.
In Experiment 3, all loaders are operating on the same VP9 dataset - so the Arrow advantage for numerics doesn't apply; the bottleneck in all cases is video decode. What hurts HuggingFace/Arrow specifically is the set_transform() call overhead: every batch triggers Python-level dispatch into a transform pipeline that decodes, converts, and stacks frames. This adds latency on top of the raw decode cost.
Additionally, HuggingFace's dataset infrastructure is optimised for high-throughput sequential or sliced access patterns - the Arrow columnar layout shines when you read contiguous rows. Random chunk sampling from shuffled episodes produces scattered seeks through the Arrow files, negating the columnar advantage. The result: 15,627 ms epoch average with 19.0% slow steps and a 719 ms p95 - comparable to PyTorch/naive, not better.
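A rough sketch of that access pattern is below. The column names, path, and decode helper are illustrative assumptions, not the exact LeRobot v3 schema; set_transform() itself is the real HuggingFace datasets hook.

```python
import numpy as np
import torch
from datasets import load_from_disk  # HuggingFace datasets

def decode_chunk(video_path, start_frame, chunk_len=10):
    """Placeholder for a real VP9/h264 decoder (PyAV, torchvision.io, ...).
    This runs at batch time, every epoch - nothing is cached across epochs."""
    return np.zeros((chunk_len, 224, 224, 3), dtype=np.uint8)

def video_transform(batch):
    # Invoked by `datasets` inside __getitem__ via set_transform():
    # Python-level dispatch, then decode + convert + stack for every batch.
    frames = [decode_chunk(p, s) for p, s in zip(batch["video_path"], batch["frame_index"])]
    batch["pixels"] = torch.from_numpy(np.stack(frames))
    return batch

ds = load_from_disk("path/to/lerobot_v3_arrow")  # Arrow-backed numeric columns, memory-mapped
ds.set_transform(video_transform)                # video decoded lazily, per chunk, per epoch
```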
GPU utilisation in this benchmark is a synthetic metric derived from the ratio of simulated GPU compute time to total wall-clock step time. Each step simulates a fixed 50 ms matmul workload. GPU utilisation is computed as:

gpu_util = gpu_compute_time / (gpu_compute_time + data_wait_time)
So 69.9% means that for every second of wall-clock training time, approximately 699 ms is the GPU executing the simulated forward/backward pass, and 301 ms is the CPU waiting for the data loader to produce the next batch. It is a measure of the data-loading overhead fraction, not GPU hardware saturation (SM occupancy, memory bandwidth, etc.).
In real training, the GPU compute step is not a fixed 50 ms - it depends on model size, batch size, and hardware. The 50 ms target was chosen to approximate a mid-scale robotics policy (e.g. ACT, Diffusion Policy). For smaller models with faster forward passes, the data-loading fraction becomes even more dominant. For very large models, it matters less. The benchmark is most representative of training workloads where a single step takes 20–100 ms of GPU compute.
wait_avg aggregates all per-batch wait times into a single mean. A loader can have a low wait_avg while still regularly stalling the GPU - if most batches are fast and a few are catastrophically slow, the average stays acceptable but training throughput suffers on every slow batch.
slow_pct measures the fraction of steps where the wait time exceeded some threshold (in this benchmark, a multiple of the median). A 19.0% slow_pct for HuggingFace/Arrow means nearly 1 in 5 batches causes a full GPU stall. During those stalls, the GPU sits completely idle - not partially utilised, but at zero throughput. No amount of fast average batches compensates for the idle time during a stall.
This is why PyTorch/naive (wait_avg: 102 ms, slow_pct: 17.1%) and HuggingFace/Arrow (wait_avg: 121 ms, slow_pct: 19.0%) produce almost identical epoch averages (15,434 ms vs 15,627 ms) despite a 19 ms average wait difference - the epoch runtime is dominated by how many times the GPU fully stalls, not by the average wait across all steps.
In practice: if you profile your training loop and find that mean data load time looks acceptable, always also check your p95 and stall count. A loader with a 10 ms mean and 600 ms p95 is a bigger problem than one with 80 ms mean and 100 ms p95.
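If you already log per-batch wait times, the summary is a few lines of numpy. The slow-step threshold below is an assumption - the benchmark uses a multiple of the median, but the exact factor isn't stated on this page.

```python
import numpy as np

def wait_stats(waits_ms, slow_factor=5.0):
    """Summarise per-batch wait times the way the tables above do.
    `slow_factor` is an assumed threshold relative to the median."""
    waits = np.asarray(waits_ms, dtype=float)
    p50 = np.percentile(waits, 50)
    return {
        "wait_avg_ms": waits.mean(),
        "p50_ms": p50,
        "p95_ms": np.percentile(waits, 95),
        "slow_pct": 100 * np.mean(waits > slow_factor * p50),
    }
```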
Batched pre-decodes all episodes serially before the first epoch begins. The 4,390 ms cold start is the time spent building a complete decoded representation of the dataset in RAM. This is the entire upfront cost paid once, after which every subsequent epoch reads directly from that in-memory structure.
The consequence is that epoch 1 looks expensive (12,346 ms: 4,390 ms cold start + first epoch pass), but epoch 3 drops to 1,168 ms - the lowest warm-epoch time of all six loaders. Batched's p50 wait is just 1.2 ms at warm state, because fetching a batch is essentially a memory copy from a pre-indexed buffer.
Whether this trade-off is worth it depends entirely on how many epochs you are training. At 3 epochs, Batched's average (4,890 ms) is worse than OnDemand (3,815 ms) and Pipelined (2,831 ms). At 50 epochs, Batched becomes the fastest of the three (1,392 ms avg). The cold start amortises completely once you train long enough.
One practical implication: if you are running short experimental loops (1–5 epochs) to validate a model change, Batched is the wrong choice. Switch to OnDemand. If you are running a full training run of 30+ epochs, Batched pays back its cold start many times over.
The right choice depends on three variables: number of epochs, dataset size relative to available RAM, and whether cold-start latency matters for your workflow.
OnDemand - when to use: Short experimental runs (1–5 epochs). Iterating on model architecture. Situations where you need the loop to start immediately.
OnDemand - trade-off: No cross-epoch cache. You pay the full decode cost every epoch. Does not improve with longer training.
Pipelined - when to use: Standard training runs (5–50 epochs). The best balanced choice across most robotics training workloads. Wins the most common training lengths.
Pipelined - trade-off: Highest p95 of the three Knonik modes (145 ms). Occasional decode stalls. Moderate cold start (~509 ms).
Batched - when to use: Long production runs (50+ epochs). Dataset fits in RAM. Lowest possible warm-epoch latency is the priority.
Batched - trade-off: Large cold start (4,390 ms) hurts short runs. RAM requirement scales with dataset size - not viable if the dataset exceeds available memory.
If you are uncertain, start with OnDemand. It has the lowest downside risk - no large cold start, no RAM commitment - and gives you an accurate baseline. Switch to Pipelined once you confirm you are training for more than a handful of epochs.
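That guidance boils down to a small decision rule. The sketch below is a toy encoding of the advice in this section, not a Knonik API; the thresholds are the ones quoted above.

```python
def pick_knonik_mode(n_epochs: int, dataset_fits_in_ram: bool = True) -> str:
    """Toy loader-mode selector following the guidance above."""
    if n_epochs <= 5:
        return "OnDemand"   # instant start, no RAM commitment, accurate baseline
    if n_epochs >= 50 and dataset_fits_in_ram:
        return "Batched"    # cold start amortises; lowest warm-epoch latency
    return "Pipelined"      # balanced default for standard 5-50 epoch runs
```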