DataLoader Benchmark

Dataloader Performance Across 6 GPU Latency Steps

Training robot learning models is bottlenecked not just by compute, but by how fast data reaches the GPU. These benchmarks measure whether Knonik keeps the GPU fed under realistic training conditions, comparing it directly against PyTorch DataLoaders and HuggingFace datasets. GPU compute is simulated at six step times (50, 80, 100, 150, 200, and 300 ms), representing everything from fast inference loops to large models. Each batch iteration is split into three phases: wait (GPU idle time while the next batch is not yet available), H2D transfer (CPU to GPU copy), and compute (simulated forward/backward pass).
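The three-phase split described above can be sketched as a small timing loop. This is a minimal illustration, not the benchmark's actual harness: `run_epoch`, `to_device`, and `compute` are hypothetical names, and the stand-ins below use an in-memory range and `time.sleep` instead of a real loader and GPU. On a real CUDA device, kernel launches and non-blocking copies return before the work finishes, so wall-clock timings like these would need a synchronization point (for example `torch.cuda.synchronize()`) before each timestamp.

```python
import time


def run_epoch(batches, to_device, compute):
    """Time one epoch, splitting every step into wait, H2D, and compute (ms)."""
    records = []
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)      # wait: blocked until the loader delivers
        except StopIteration:
            break
        t1 = time.perf_counter()
        on_device = to_device(batch)    # H2D: host-to-device copy
        t2 = time.perf_counter()
        compute(on_device)              # compute: simulated forward/backward
        t3 = time.perf_counter()
        records.append({
            "wait_ms": (t1 - t0) * 1e3,
            "h2d_ms": (t2 - t1) * 1e3,
            "compute_ms": (t3 - t2) * 1e3,
        })
    return records


# Stand-ins: an in-memory "loader", a no-op copy, and a 10 ms sleep as compute.
records = run_epoch(range(3), to_device=lambda b: b,
                    compute=lambda b: time.sleep(0.010))
print(len(records), "steps timed")
```

The per-step records are the raw material for every metric defined later on this page: averages and percentiles of `wait_ms`, the average of `compute_ms`, and the first record's `wait_ms` as the cold start.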

GPU: NVIDIA GeForce RTX 4070 Ti
Device: cuda
Batch Size: 16
Chunk Length: 10
Epochs: 3
Workers: 8

Test Overview

Three Benchmark Tests

Each test is designed to answer a specific question about dataloader performance across different dataset formats and toolchains.

Test 1
Baseline: Knonik vs PyTorch vs HuggingFace

Runs all loader variants over the same robot dataset: Knonik (Batched, Pipelined, OnDemand, with and without mmap), a naive PyTorch DataLoader with per-chunk video seeking, a cached PyTorch DataLoader with LRU episode caching, and a HuggingFace Arrow-backed loader. Establishes the performance envelope of each approach under identical conditions.

Test 2
HDF5 (PyTorch) vs Knonik Compressed

Compares a standard PyTorch DataLoader reading raw HDF5 files against Knonik loading the same dataset in Knonik-compressed format. HDF5 is the most common format for robot datasets (used in ACT, Diffusion Policy, etc.), so this test directly quantifies the cost of staying uncompressed.

HDF5: 18.4 GB, Knonik: 126.2 MB

Test 3
LeRobot v3 (4 cameras) vs Knonik

Compares LeRobot/PyTorch-delta10, LeRobotV3/HuggingFace Arrow, and Knonik v3 OnDemand on the same LeRobot v3 format dataset under identical conditions.

Metrics Explained

What Each Metric Measures

Epoch Avg (ms)

Total wall-clock time for one full pass through the dataset, averaged across all epochs. This is the headline number: lower means the GPU finishes training faster. A loader with fast individual batches can still have a poor epoch average if it stalls between batches.

Epoch 1 vs Epoch Last (ms each)

Time for the first and final epoch separately. The gap between these reveals warm-up behavior: loaders that cache decoded data or prefetch aggressively often see large improvements after the first epoch, while loaders doing redundant work stay flat.

Wait Avg (ms)

Average time the training loop sits idle between batches, from when the GPU finished the last step to when the next batch is delivered. This is the most direct measure of data pipeline efficiency. A well-prefetched loader should deliver batches faster than the GPU can consume them, keeping wait near zero.

Wait P95 (ms)

The 95th-percentile batch wait time. An average can hide occasional long stalls behind many fast batches; P95 exposes that tail latency, the threshold that the slowest 1 in 20 batches exceed. High P95 values indicate unpredictable spikes, typically from cold video decoding, I/O contention, or cache misses.
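As a small worked example of why the average and P95 tell different stories, the sketch below computes both for a made-up list of wait times. The nearest-rank percentile definition and the function name `percentile_nearest_rank` are assumptions for illustration; the benchmark does not state which percentile method it uses.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data <= it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical waits: 18 fast batches and 2 long stalls (all in ms).
waits_ms = [1.0] * 18 + [400.0, 500.0]

mean_ms = sum(waits_ms) / len(waits_ms)
p95_ms = percentile_nearest_rank(waits_ms, 95)
print(f"mean = {mean_ms:.1f} ms, P95 = {p95_ms:.1f} ms")  # mean = 45.9 ms, P95 = 400.0 ms
```

Here the mean suggests a moderately slow loader, while P95 shows that the stalls themselves are hundreds of milliseconds long.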

Compute Avg (ms)

Average time spent on the simulated GPU forward/backward pass per step. Controlled by the target GPU latency setting. It is the reference point for judging how much of each step is compute-bound versus data-bound.

GPU Util (est.) (%)

Estimated fraction of epoch time the GPU spends doing compute rather than waiting for data. Computed as 1 minus (wait_avg / total_step_avg). A loader at 99% GPU utilization means the data pipeline adds virtually no overhead. At 50%, the GPU is idle half the time waiting for batches.
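The formula above can be checked against a row of the results. The sketch below assumes that total_step_avg is the sum of the average wait, H2D, and compute times, and it sets H2D to zero because the tables do not report it; both are assumptions, and `gpu_util_estimate` is a hypothetical helper name. Under those assumptions the Knonik/Pipelined row at 80 ms (wait 2.53 ms, compute 97.88 ms) reproduces the reported 97.48%. Not every row matches this simplified form, presumably because the real total includes H2D time.

```python
def gpu_util_estimate(wait_avg_ms, compute_avg_ms, h2d_avg_ms=0.0):
    """Estimated GPU utilization: 1 - wait_avg / total_step_avg.

    Assumes total_step_avg = wait + H2D + compute (not stated in the source).
    """
    total_step_avg_ms = wait_avg_ms + h2d_avg_ms + compute_avg_ms
    return 1.0 - wait_avg_ms / total_step_avg_ms


# Knonik/Pipelined at 80 ms, values taken from the Test 1 detailed metrics.
util = gpu_util_estimate(wait_avg_ms=2.53, compute_avg_ms=97.88)
print(f"{util * 100:.2f}%")  # 97.48%
```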

Slow Steps (%)

Percentage of batch steps where the wait time exceeded 150 ms, a threshold above which data latency visibly impacts GPU throughput. High slow-step rates indicate reliability problems in the loader, not just average-latency problems.
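The slow-step rate is a simple threshold count. A minimal sketch, assuming "exceeded" means strictly greater than 150 ms and using a hypothetical helper name `slow_step_pct`:

```python
def slow_step_pct(waits_ms, threshold_ms=150.0):
    """Percentage of steps whose wait time exceeded threshold_ms."""
    if not waits_ms:
        return 0.0
    slow = sum(1 for w in waits_ms if w > threshold_ms)
    return 100.0 * slow / len(waits_ms)


# Hypothetical run: 332 steps, exactly one of which stalls for 200 ms.
waits_ms = [1.0] * 331 + [200.0]
print(f"{slow_step_pct(waits_ms):.2f}%")  # 0.30%
```

For scale, one slow step out of 332 gives 0.30%, which is consistent with the Knonik/Pipelined row (332 steps, 0.30% slow steps) in the Test 1 results.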

Cold Start (ms)

The wait time for the very first batch of the run. This reflects loader initialization cost: spawning workers, opening files, warming up prefetch queues. A high cold start is acceptable if subsequent batches are fast, but for short jobs or repeated experiments, it matters.

Test 1
Baseline: Knonik vs PyTorch vs HuggingFace

All loader variants head-to-head on the same robot dataset: Knonik (Batched, Pipelined, OnDemand, with and without mmap disk cache), a PyTorch DataLoader with per-chunk video seeking, a PyTorch DataLoader with LRU episode caching, and a HuggingFace Arrow-backed loader. All loaders see identical data, epoch count, batch size, and worker count.

Fastest Epoch Avg: Knonik/Pipelined, 11124.87 ms
Highest GPU Util: Knonik/Pipelined, 97.48%
Best Wait P95: Knonik/OnDemand, 0.59 ms
Lowest Cold Start: Knonik/OnDemand, 353.38 ms

Epoch Avg at 80 ms (lower is better)
Knonik/Pipelined: 11125 ms
Knonik/Pipelined+mmap: 11867 ms
Knonik/OnDemand: 12028 ms
Knonik/Batched: 13147 ms
Knonik/Batched+mmap: 14272 ms
PyTorch/naive: 17602 ms
HuggingFace/Arrow: 17876 ms
PyTorch/cached: 113238 ms

GPU Utilization at 80 ms (higher is better)
Knonik/Pipelined: 97.5%
Knonik/Pipelined+mmap: 97.3%
Knonik/OnDemand: 97.0%
Knonik/Batched: 85.7%
PyTorch/naive: 80.2%
Knonik/Batched+mmap: 77.7%
HuggingFace/Arrow: 72.0%
PyTorch/cached: 11.4%


Detailed Metrics at 80 ms

| Loader | Epoch Avg (ms) | Epoch 1 (ms) | Epoch Last (ms) | Wait Avg (ms) | Wait P95 (ms) | Compute Avg (ms) | GPU Util (%) | Slow Steps (%) | Cold Start (ms) | Steps |
|---|---|---|---|---|---|---|---|---|---|---|
| Knonik/Pipelined | 11124.87 | 10865.29 | 11353.01 | 2.53 | 0.88 | 97.88 | 97.48 | 0.30 | 561.03 | 332 |
| Knonik/Pipelined+mmap | 11867.04 | 11878.98 | 11875.12 | 2.88 | 0.89 | 103.58 | 97.30 | 0.30 | 676.16 | 334 |
| Knonik/OnDemand | 12028.46 | 11786.71 | 12216.72 | 3.17 | 0.59 | 103.17 | 97.02 | 0.88 | 353.38 | 339 |
| Knonik/Batched | 13147.49 | 15755.60 | 11889.15 | 17.11 | 0.94 | 102.66 | 85.72 | 0.61 | 4652.14 | 329 |
| Knonik/Batched+mmap | 14271.71 | 19401.26 | 11807.67 | 29.06 | 1.13 | 100.95 | 77.66 | 0.91 | 5954.36 | 329 |
| PyTorch/naive | 17602.33 | 17736.56 | 17611.55 | 28.37 | 69.53 | 95.20 | 80.17 | 2.71 | 1224.06 | 369 |
| HuggingFace/Arrow | 17875.77 | 17960.78 | 18032.91 | 40.69 | 111.62 | 104.51 | 72.00 | 3.25 | 996.33 | 369 |
| PyTorch/cached | 113238.12 | 112545.50 | 114966.89 | 816.03 | 5796.79 | 83.92 | 11.36 | 26.83 | 7394.00 | 369 |

Results Across Multiple GPU Latency Steps


Epoch Avg (ms) — lower is better


| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| Knonik/OnDemand | 6947.52 | 12028.46 | 14344.75 | 21005.55 | 31178.94 | 49677.34 |
| Knonik/Pipelined | 7109.17 | 11124.87 | 14434.05 | 21408.47 | 30776.18 | 46518.91 |
| Knonik/Pipelined+mmap | 7220.33 | 11867.04 | 15025.73 | 21812.50 | 30615.88 | 48338.39 |
| Knonik/Batched | 10081.72 | 13147.49 | 14838.57 | 22386.07 | 32749.01 | 47793.26 |
| Knonik/Batched+mmap | 10673.37 | 14271.71 | 16260.71 | 23661.97 | 32029.54 | 49448.97 |
| PyTorch/naive | 17161.19 | 17602.33 | 21087.95 | 26657.59 | 36226.32 | 50790.25 |
| HuggingFace/Arrow | 17212.71 | 17875.77 | 19132.83 | 27207.12 | 36939.51 | 53552.00 |
| PyTorch/cached | 110949.90 | 113238.12 | 110640.48 | 114023.91 | 115950.91 | 101944.83 |

GPU Utilization (%) — higher is better


| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| Knonik/OnDemand | 94.86 | 97.02 | 97.52 | 98.26 | 98.87 | 99.26 |
| Knonik/Pipelined | 96.00 | 97.48 | 98.10 | 98.66 | 99.09 | 99.40 |
| Knonik/Pipelined+mmap | 94.89 | 97.30 | 97.91 | 98.53 | 98.94 | 99.34 |
| Knonik/Batched | 71.62 | 85.72 | 88.96 | 92.62 | 94.87 | 96.49 |
| Knonik/Batched+mmap | 60.35 | 77.66 | 84.35 | 91.18 | 93.34 | 95.73 |
| PyTorch/naive | 53.59 | 80.17 | 87.76 | 90.92 | 93.60 | 95.25 |
| HuggingFace/Arrow | 38.13 | 72.00 | 76.86 | 85.56 | 89.67 | 92.99 |
| PyTorch/cached | 8.35 | 11.36 | 14.03 | 18.85 | 24.72 | 44.13 |

Wait P95 (ms) — lower is better


| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| Knonik/OnDemand | 0.62 | 0.59 | 0.59 | 0.64 | 0.63 | 0.69 |
| Knonik/Pipelined | 0.93 | 0.88 | 0.95 | 0.76 | 0.74 | 0.63 |
| Knonik/Pipelined+mmap | 0.95 | 0.89 | 0.75 | 0.73 | 0.66 | 0.58 |
| Knonik/Batched | 1.03 | 0.94 | 0.91 | 0.79 | 0.72 | 0.75 |
| Knonik/Batched+mmap | 0.84 | 1.13 | 0.88 | 0.69 | 0.67 | 0.62 |
| PyTorch/naive | 350.48 | 69.53 | 16.49 | 15.72 | 14.51 | 15.36 |
| HuggingFace/Arrow | 480.37 | 111.62 | 82.25 | 64.04 | 73.38 | 64.19 |
| PyTorch/cached | 5849.99 | 5796.79 | 3486.36 | 5500.18 | 5276.42 | 2448.90 |

Test 2
HDF5 (uncompressed) vs Knonik (compressed)

A PyTorch DataLoader reading raw HDF5 files compared against Knonik loading the same data in Knonik-compressed format. HDF5 is the dominant format in robot learning codebases (ACT, Diffusion Policy, ALOHA), so this test quantifies the cost of staying uncompressed.

Fastest Epoch Avg: Knonik/test-Pipelined, 11139.66 ms
Highest GPU Util: Knonik/test-Pipelined, 97.54%
Best Wait P95: Knonik/test-OnDemand, 0.55 ms
Lowest Cold Start: HDF5/PyTorch, 360.84 ms

Epoch Avg at 80 ms (lower is better)
Knonik/test-Pipelined: 11140 ms
Knonik/test-OnDemand: 11458 ms
Knonik/test-Batched: 13715 ms
HDF5/PyTorch: 14331 ms

GPU Utilization at 80 ms (higher is better)
Knonik/test-Pipelined: 97.5%
Knonik/test-OnDemand: 96.9%
HDF5/PyTorch: 91.2%
Knonik/test-Batched: 87.7%


Detailed Metrics at 80 ms

| Loader | Epoch Avg (ms) | Epoch 1 (ms) | Epoch Last (ms) | Wait Avg (ms) | Wait P95 (ms) | Compute Avg (ms) | GPU Util (%) | Slow Steps (%) | Cold Start (ms) | Steps |
|---|---|---|---|---|---|---|---|---|---|---|
| Knonik/test-Pipelined | 11139.66 | 11044.93 | 11157.82 | 2.46 | 0.89 | 97.48 | 97.54 | 0.30 | 549.17 | 334 |
| Knonik/test-OnDemand | 11458.38 | 11320.43 | 11506.53 | 3.12 | 0.55 | 98.16 | 96.92 | 0.88 | 378.62 | 339 |
| Knonik/test-Batched | 13715.38 | 15985.30 | 12719.59 | 15.35 | 0.91 | 109.60 | 87.72 | 0.61 | 4421.64 | 329 |
| HDF5/PyTorch | 14331.42 | 13995.98 | 14666.33 | 10.08 | 10.82 | 104.45 | 91.20 | 0.80 | 360.84 | 375 |

Results Across Multiple GPU Latency Steps


Epoch Avg (ms) — lower is better


| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| Knonik/test-OnDemand | 7052.02 | 11458.38 | 14620.52 | 19156.38 | 28739.06 | 46602.07 |
| Knonik/test-Pipelined | 8016.29 | 11139.66 | 14434.32 | 20468.30 | 30698.24 | 49449.93 |
| Knonik/test-Batched | 9410.90 | 13715.38 | 14982.71 | 21862.60 | 31887.81 | 47644.03 |
| HDF5/PyTorch | 9818.50 | 14331.42 | 17334.10 | 25212.11 | 37186.92 | 51858.74 |

GPU Utilization (%) — higher is better


| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| Knonik/test-OnDemand | 94.87 | 96.92 | 97.48 | 98.14 | 98.77 | 99.25 |
| Knonik/test-Pipelined | 96.57 | 97.54 | 98.03 | 98.55 | 99.08 | 99.43 |
| Knonik/test-Batched | 70.19 | 87.72 | 89.53 | 92.70 | 94.95 | 96.64 |
| HDF5/PyTorch | 87.02 | 91.20 | 92.93 | 95.37 | 96.58 | 97.73 |

Wait P95 (ms) — lower is better


| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| Knonik/test-OnDemand | 0.56 | 0.55 | 0.64 | 0.60 | 0.61 | 0.61 |
| Knonik/test-Pipelined | 1.03 | 0.89 | 0.81 | 0.80 | 0.70 | 0.66 |
| Knonik/test-Batched | 1.14 | 0.91 | 1.09 | 0.87 | 0.63 | 0.65 |
| HDF5/PyTorch | 10.24 | 10.82 | 10.49 | 9.75 | 9.85 | 9.36 |

Test 3
LeRobot v3 vs Knonik

A LeRobot v3 format dataset with 4 camera streams plus action and state. Tests three loaders: the standard LeRobot/PyTorch-delta10 loader (10-step delta chunks), LeRobotV3/HuggingFace Arrow, and Knonik v3 OnDemand. All loaders process the same dataset at the same batch size and worker count.

Fastest Epoch Avg: Knonik/v3-OnDemand, 23651.99 ms
Highest GPU Util: Knonik/v3-OnDemand, 94.97%
Best Wait P95: Knonik/v3-OnDemand, 1.66 ms
Lowest Cold Start: Knonik/v3-OnDemand, 1107.15 ms

Epoch Avg at 80 ms (lower is better)
Knonik/v3-OnDemand: 23652 ms
LeRobotV3/HuggingFace: 47397 ms
LeRobot/PyTorch-delta10: 1049220 ms

GPU Utilization at 80 ms (higher is better)
Knonik/v3-OnDemand: 95.0%
LeRobotV3/HuggingFace: 42.0%
LeRobot/PyTorch-delta10: 39.8%


Detailed Metrics at 80 ms

| Loader | Epoch Avg (ms) | Epoch 1 (ms) | Epoch Last (ms) | Wait Avg (ms) | Wait P95 (ms) | Compute Avg (ms) | GPU Util (%) | Slow Steps (%) | Cold Start (ms) | Steps |
|---|---|---|---|---|---|---|---|---|---|---|
| LeRobot/PyTorch-delta10 | 1049219.96 | 1052626.55 | 1044428.89 | 294.65 | 1101.69 | 82.96 | 39.79 | 90.97 | 4237.11 | 6,432 |
| LeRobotV3/HuggingFace | 47396.87 | 47345.75 | 47517.88 | 127.80 | 717.45 | 92.52 | 42.02 | 13.49 | 1899.48 | 645 |
| Knonik/v3-OnDemand | 23651.99 | 23393.36 | 23806.98 | 6.14 | 1.66 | 115.86 | 94.97 | 0.52 | 1107.15 | 581 |

Results Across Multiple GPU Latency Steps


Epoch Avg (ms) — lower is better


| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| LeRobot/PyTorch-delta10 | 1038784.91 | 1049219.96 | 1042843.79 | 1055643.57 | 1113293.16 | 1332737.62 |
| LeRobotV3/HuggingFace | 46600.29 | 47396.87 | 47454.97 | 53355.45 | 65495.01 | 94582.25 |
| Knonik/v3-OnDemand | 19351.68 | 23651.99 | 27227.39 | 46508.11 | 61666.45 | 80102.92 |

GPU Utilization (%) — higher is better


| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| LeRobot/PyTorch-delta10 | 34.05 | 39.79 | 45.38 | 56.86 | 67.15 | 75.19 |
| LeRobotV3/HuggingFace | 24.49 | 42.02 | 55.37 | 80.96 | 86.06 | 90.94 |
| Knonik/v3-OnDemand | 65.69 | 94.97 | 95.79 | 97.54 | 98.25 | 98.54 |

Wait P95 (ms) — lower is better


| Loader | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| LeRobot/PyTorch-delta10 | 1327.28 | 1101.69 | 937.33 | 529.83 | 206.53 | 225.98 |
| LeRobotV3/HuggingFace | 1000.13 | 717.45 | 507.53 | 59.45 | 55.39 | 53.60 |
| Knonik/v3-OnDemand | 135.78 | 1.66 | 1.61 | 1.83 | 1.55 | 1.58 |

Cross-Section Winners

Fastest Loader by Test Across Selected Latencies

Shows the fastest epoch-average loader for each test at each GPU latency.

| Test | 50 ms | 80 ms | 100 ms | 150 ms | 200 ms | 300 ms |
|---|---|---|---|---|---|---|
| Baseline: Knonik vs PyTorch vs HuggingFace | Knonik/OnDemand, 6947.52 ms | Knonik/Pipelined, 11124.87 ms | Knonik/OnDemand, 14344.75 ms | Knonik/OnDemand, 21005.55 ms | Knonik/Pipelined+mmap, 30615.88 ms | Knonik/Pipelined, 46518.91 ms |
| HDF5 (uncompressed) vs Knonik (compressed) | Knonik/test-OnDemand, 7052.02 ms | Knonik/test-Pipelined, 11139.66 ms | Knonik/test-Pipelined, 14434.32 ms | Knonik/test-OnDemand, 19156.38 ms | Knonik/test-OnDemand, 28739.06 ms | Knonik/test-OnDemand, 46602.07 ms |
| LeRobot v3 vs Knonik | Knonik/v3-OnDemand, 19351.68 ms | Knonik/v3-OnDemand, 23651.99 ms | Knonik/v3-OnDemand, 27227.39 ms | Knonik/v3-OnDemand, 46508.11 ms | Knonik/v3-OnDemand, 61666.45 ms | Knonik/v3-OnDemand, 80102.92 ms |

Analysis

What the Results Show

Key findings from all three tests, based on measured benchmark data.

GPU Starvation Is the Default State for PyTorch Loaders

The starkest result in Test 1 is the PyTorch/cached loader. Despite using a per-worker LRU cache intended to eliminate redundant video decoding, it reaches only 8 to 44% estimated GPU utilization across the six latency targets. Wait P95 ranges from about 2,400 ms to over 5,800 ms, so the slowest 5% of batches each leave the GPU idle for up to nearly six seconds. The epoch average is almost independent of GPU step time (about 111,000 ms at 50 ms versus about 102,000 ms at 300 ms), which shows that the loader, not the compute, is the bottleneck. The benchmark attributes this to the overhead of multiprocess worker coordination, memory sharing, and collation in the PyTorch DataLoader when video data is involved, rather than to a configuration problem. PyTorch/naive fares much better but still falls to 54% utilization at 50 ms.

Knonik Runs the GPU at Near-Saturation Across All Latencies

In Test 1, all Knonik loader variants (Batched, Pipelined, and OnDemand, with and without mmap) keep wait P95 between 0.58 ms and 1.13 ms at every tested latency. The Pipelined and OnDemand variants hold GPU utilization at roughly 95 to 99% throughout; the Batched variants are lower at short step times (60 to 72% at 50 ms) and reach about 96% at 300 ms. For the Pipelined and OnDemand variants, the data pipeline adds almost no measurable wait to the training loop. The difference from the alternatives is most pronounced at low GPU latencies (50 to 100 ms), where data pipeline efficiency matters most. At 50 ms, Knonik epoch times range from about 6,900 ms (OnDemand) to 10,700 ms (Batched+mmap), while PyTorch/naive takes about 17,000 ms and PyTorch/cached takes over 110,000 ms.

mmap Disk Caching Offers No Clear Gain in the Baseline Test

Knonik's mmap variants (Batched+mmap, Pipelined+mmap) differ only slightly from their in-memory counterparts: their epoch averages are somewhat slower at five of the six latencies and slightly faster only at 200 ms. The baseline dataset fits well within the OS page cache, meaning mmap's benefit of avoiding repeated decode work is already captured by the kernel's file cache. The mmap variants are expected to matter more on larger datasets where decoded tensors exceed available RAM, a case this benchmark does not cover.

HDF5 Is a Significant Bottleneck Compared to Knonik (Test 2)

In Test 2, the HDF5/PyTorch loader reads uncompressed HDF5 files, the most common format in robot learning codebases. Despite HDF5 being an established format with chunked I/O support, it adds measurable wait latency compared to Knonik loading the same data in Knonik-compressed format: its wait P95 stays around 9 to 11 ms at every latency, versus 0.55 to 1.14 ms for the Knonik variants, and its epoch average is 11 to 39% longer than that of the fastest Knonik variant at each latency. The HDF5 dataset also occupies 18.4 GB on disk against 126.2 MB for the Knonik version. This cost compounds across epochs: because HDF5 files are shared across workers via memory-mapped file handles, the loader struggles to hide I/O latency even with prefetching. Knonik's compressed format is designed specifically for chunk-sequential random access across episodes, which maps efficiently onto the access pattern of trajectory training.

The Efficiency Gap Is Largest at Low GPU Step Times

Across all three tests, the performance gap between Knonik and the alternatives narrows as GPU step time increases. At 300 ms, even the PyTorch/naive loader keeps up reasonably well (about 50,800 ms per epoch versus about 46,500 ms for the fastest Knonik variant), because the GPU is slow enough that almost any pipeline has time to prefetch the next batch. As model training gets faster through hardware improvements and efficient architectures, the dataloader becomes an increasingly critical bottleneck.
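The narrowing gap can be illustrated with a deliberately simple model. It assumes a loader that needs a fixed time per batch and overlaps that work perfectly with GPU compute, so the GPU only waits for the part of the load time that exceeds the step time. This is a toy model for intuition, not something fitted to the benchmark data, and `modeled_utilization` and the 100 ms load time are hypothetical.

```python
def modeled_utilization(compute_ms, load_ms):
    """Toy model: utilization when loading overlaps fully with compute.

    The GPU waits only for the portion of load time not hidden by compute.
    """
    wait_ms = max(0.0, load_ms - compute_ms)
    return compute_ms / (compute_ms + wait_ms)


# A hypothetical loader that needs 100 ms per batch, at the six step times.
for compute_ms in (50, 80, 100, 150, 200, 300):
    util = modeled_utilization(compute_ms, load_ms=100.0)
    print(f"{compute_ms:>3} ms step: {util:.0%} utilization")
```

In this model a slow loader is fully hidden once the step time reaches its per-batch load time, which is why differences between loaders show up mainly at the short step times.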