Four independent experiments measuring how Knonik loads data compared to PyTorch, HuggingFace datasets, and HDF5 on real robot episode datasets. All benchmarks run on identical hardware with calibrated GPU simulation workloads. The fourth experiment isolates GPU utilisation across model sizes - showing how Knonik scales as training steps get heavier.
DataLoader throughput is the most under-measured variable in robotics training. Teams optimise model architecture for hours and then train on a loader that stalls the GPU 30% of the time. These experiments measure the real cost of that choice across three independent setups.
The benchmark script runs experiments 1–3 in a single pass using the same GPU simulation, batch size, chunk length, and epoch count. Experiment 4 re-runs the three Knonik loader modes at three different GPU step durations (50 ms, 150 ms, 300 ms) to measure how utilisation scales with model weight. Results are written to JSON after each run. All numbers on this page come directly from those outputs.
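In outline, the per-step measurement looks roughly like the sketch below - a simplified stand-in, not the actual benchmark script. Loader construction, the calibrated matmul workload, and most of the recorded metrics are elided, and the function name is illustrative.

```python
import json
import time

def run_loader_benchmark(loader, gpu_step_ms=50, epochs=3, out_path="results.json"):
    """Simplified sketch of the measurement loop: time how long the training
    loop waits for each batch, simulate a fixed-length GPU step, and derive
    epoch times and utilisation from those two quantities."""
    epoch_times_ms, waits_ms = [], []
    for _ in range(epochs):
        epoch_start = time.perf_counter()
        it = iter(loader)
        while True:
            t0 = time.perf_counter()
            try:
                batch = next(it)                 # data-loader wait happens here
            except StopIteration:
                break
            waits_ms.append((time.perf_counter() - t0) * 1000)
            time.sleep(gpu_step_ms / 1000)       # stand-in for the calibrated GPU matmul step
        epoch_times_ms.append((time.perf_counter() - epoch_start) * 1000)

    gpu_ms = gpu_step_ms * len(waits_ms)
    results = {
        "epoch_ms": epoch_times_ms,
        "wait_avg_ms": sum(waits_ms) / len(waits_ms),
        "gpu_util": gpu_ms / (gpu_ms + sum(waits_ms)),
    }
    with open(out_path, "w") as f:               # results dumped to JSON after each run
        json.dump(results, f, indent=2)
    return results
```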
Same 50-episode dataset. One in raw HDF5 format, one compressed by Knonik to 126 MB. Does HDF5's mmap advantage outweigh a 146× size penalty?
Same 49-episode LeRobot v3 dataset. LeRobot v3 is the format teams are adopting. This is the comparison most relevant to teams building on the HuggingFace robotics ecosystem.
Knonik Batched, OnDemand, and Pipelined vs PyTorch naive, PyTorch cached, and HuggingFace Arrow - all on the same VP9 dataset. Isolates loader architecture from data format.
Three Knonik loader modes benchmarked at 50 ms, 150 ms, and 300 ms GPU step times. Shows how loader efficiency translates to GPU utilisation as model size increases.
50 episodes × 400 timesteps. HDF5 uncompressed at 18.4 GB vs Knonik compressed at 126 MB. Same data, same GPU sim, 3 epochs.
| Loader | Data Size | Cold Start | Wait Avg | Wait p50 | Wait p95 | Slow Steps | Epoch 1 | Epoch 3 | Epoch Avg | GPU Util |
|---|---|---|---|---|---|---|---|---|---|---|
| HDF5/PyTorch | 18.4 GB | 308 ms | 32.0 ms | 9.1 ms | 128.4 ms | 2.7% | 6,304 ms | 3,845 ms | 4,663 ms | 61.0% |
| Knonik/OnDemand | 126 MB | 308 ms | 27.6 ms | 17.7 ms | 84.3 ms | 2.1% | 3,618 ms | 3,612 ms | 3,618 ms | 64.4% |
| Knonik/Batched | 126 MB | 4,019 ms | 39.2 ms | 1.2 ms | 22.7 ms | 1.0% | 11,659 ms | 1,156 ms | 4,649 ms | 56.1% |
| Knonik/Pipelined | 126 MB | 2,827 ms | 45.6 ms | 1.4 ms | 26.6 ms | 1.6% | 13,164 ms | 1,336 ms | 5,320 ms | 52.3% |
With all three Knonik modes, the picture is clear: Knonik/OnDemand beats HDF5 outright in epoch average (3,618 ms vs 4,663 ms) with no warm-up cost. Knonik/Batched ties HDF5 in the 3-epoch average (4,649 ms vs 4,663 ms) but drops to 1,156 ms by epoch 3 - a 3.3× advantage. All Knonik modes have dramatically lower p95 tail latency (22–84 ms vs 128 ms), meaning fewer GPU stalls every run. HDF5's mmap speed advantage only shows up in Pipelined's 3-epoch average, which pays a heavy epoch-1 cold start. And all of this at 146× less storage.
49 episodes × 700 timesteps. LeRobot v3 in HuggingFace Arrow format (800 MB) vs Knonik compressed (189 MB). The most relevant comparison for teams building on the HuggingFace ecosystem.
| Loader | Data Size | Cold Start | Wait Avg | Wait p50 | Wait p95 | Slow Steps | Epoch 1 | Epoch 3 | Epoch Avg | GPU Util |
|---|---|---|---|---|---|---|---|---|---|---|
| LeRobotV3/HuggingFace | 800 MB | 559 ms | 87.4 ms | 17.9 ms | 291.5 ms | 26.8% | 19,985 ms | 19,790 ms | 19,934 ms | 36.4% |
| LeRobot/PyTorch | 800 MB | 1,200 ms | 121.1 ms | 13.8 ms | 839.9 ms | 18.2% | 287,485 ms | 288,268 ms | 287,608 ms | 29.2% |
| Knonik/OnDemand | 189 MB | 244 ms | 18.4 ms | 13.4 ms | 54.5 ms | 0.5% | 4,651 ms | 4,552 ms | 4,602 ms | 73.1% |
| Knonik/Batched | 189 MB | 4,983 ms | 28.4 ms | 1.3 ms | 24.1 ms | 0.5% | 14,925 ms | 2,009 ms | 6,314 ms | 63.8% |
| Knonik/Pipelined | 189 MB | 3,556 ms | 51.0 ms | 3.2 ms | 38.3 ms | 1.2% | 26,645 ms | 2,815 ms | 10,619 ms | 49.5% |
All three approaches - HuggingFace, PyTorch, and Knonik - run over the same LeRobot v3 dataset on the same hardware.
Every Knonik mode beats HuggingFace. Knonik/OnDemand is 4.3× faster with a 244 ms cold start and 73.1% GPU utilisation - the highest of any loader across the three loading experiments. Knonik/Batched drops to 2,009 ms by epoch 3 (10× faster than HF's flat 19,790 ms). Even Knonik/Pipelined, which pays a large epoch-1 cold start, reaches 2,815 ms by epoch 3. LeRobot v3 with HuggingFace shows zero warm-up benefit across all 3 epochs - video decodes live every epoch with no caching. Slow steps at 26.8% mean the GPU stalls more than one batch in four.
Knonik Batched, OnDemand, and Pipelined vs PyTorch naive, PyTorch cached, and HuggingFace Arrow - all on the same VP9 dataset. 49 episodes, 3 epochs.
Knonik/Batched - Pre-decodes all episodes serially before the epoch starts. Cross-epoch RAM cache.
Knonik/OnDemand - Streams chunks on-the-fly via EpisodeStreamPool. No caching. Lowest cold start.
Knonik/Pipelined - Parallel decode via ProcessPoolExecutor + POSIX SharedMemory IPC. Cross-epoch RAM cache.
PyTorch/naive - Standard DataLoader. Each __getitem__ seeks to the VP9 frame position and decodes chunk_len frames.
PyTorch/cached - Worker-local LRU episode cache (8 episodes/worker). Full episode decoded on first access.
HuggingFace/Arrow - Arrow on disk for numerics. Video decoded per-chunk per-epoch via set_transform().
Average time per epoch across 3 epochs for all six loaders.
| Loader | Cold Start | Wait Avg | Wait p50 | Wait p95 | Slow Steps | Epoch 1 | Epoch 3 | Epoch Avg | GPU Util |
|---|---|---|---|---|---|---|---|---|---|
| Knonik/Pipelined | 509 ms | 21.5 ms | 1.7 ms | 145.5 ms | 4.7% | 5,824 ms | 1,325 ms | 2,831 ms | 69.9% |
| Knonik/OnDemand | 346 ms | 29.4 ms | 18.1 ms | 94.6 ms | 1.8% | 3,886 ms | 3,804 ms | 3,815 ms | 63.0% |
| Knonik/Batched | 4,390 ms | 41.4 ms | 1.2 ms | 27.4 ms | 1.0% | 12,346 ms | 1,168 ms | 4,890 ms | 54.7% |
| PyTorch/naive | 1,547 ms | 102.2 ms | 10.8 ms | 659.2 ms | 17.1% | 15,686 ms | 15,394 ms | 15,434 ms | 32.9% |
| HuggingFace/Arrow | 1,698 ms | 121.7 ms | 15.4 ms | 719.3 ms | 19.0% | 15,879 ms | 15,425 ms | 15,627 ms | 29.1% |
| PyTorch/cached | 5,322 ms | 1,873 ms | 1,798 ms | 4,195 ms | 54.2% | 232,644 ms | 232,133 ms | 232,685 ms | 2.6% |
Higher is better. Measures how much of the training loop the GPU is actually working vs waiting for data.
Knonik's cross-epoch cache makes it faster the longer you train. PyTorch and HuggingFace stay flat.
| Epochs | Knonik/Batched | Knonik/OnDemand | Knonik/Pipelined | PyTorch/naive | HF/Arrow |
|---|---|---|---|---|---|
| 1 | 12,346 ms | 3,886 ms | 5,824 ms | 15,686 ms | 15,879 ms |
| 3 | 4,890 ms | 3,815 ms | 2,831 ms | 15,434 ms | 15,627 ms |
| 10 | 2,286 ms | 3,812 ms | 1,775 ms | 15,434 ms | 15,627 ms |
| 50 | 1,392 ms | 3,806 ms | 1,415 ms | 15,434 ms | 15,627 ms |
| 100 | 1,280 ms | 3,805 ms | 1,370 ms | 15,434 ms | 15,627 ms |
The same three Knonik loader modes run with simulated GPU steps of 50 ms, 150 ms, and 300 ms - representing light, medium, and heavy model workloads. As the model takes longer per step, the dataloader's wait time becomes relatively smaller and GPU utilisation climbs.
| GPU Step | Knonik/Batched | Knonik/OnDemand | Knonik/Pipelined |
|---|---|---|---|
| 50 ms | 55.5% | 63.7% | 61.0% |
| 150 ms | 78.6% | 84.1% | 89.8% |
| 300 ms | 87.9% | 91.4% | 94.5% |
Knonik's dataloader wait times are essentially constant regardless of GPU step duration - the loader doesn't know or care how long the model takes. What changes is the ratio of wait time to compute time. At 50 ms GPU steps, the ~40 ms average wait represents a meaningful fraction of each cycle. At 300 ms, that same 40 ms wait is a small overhead. Pipelined reaches 94.5% GPU utilisation at 300 ms - meaning the GPU is working 94.5% of the time and waiting on data only 5.5%. The takeaway: Knonik's loader performance naturally scales with model size. The heavier your model, the less the loader matters - and it was already fast.
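Since the wait averages barely move with step length, a back-of-the-envelope ratio reproduces the table above fairly closely. A quick sketch using the Experiment 3 wait averages - an approximation that ignores tail stalls, not benchmark output (Experiment 4 is a separate run, which is why the measured Pipelined value at 50 ms differs most):

```python
# Back-of-the-envelope model: util ≈ gpu_step / (gpu_step + avg_wait).
wait_avg_ms = {"Knonik/Batched": 41.4, "Knonik/OnDemand": 29.4, "Knonik/Pipelined": 21.5}

for mode, wait in wait_avg_ms.items():
    utils = [step / (step + wait) for step in (50, 150, 300)]
    print(mode, [f"{u:.1%}" for u in utils])
# Knonik/Batched    ['54.7%', '78.4%', '87.9%']   measured: 55.5% / 78.6% / 87.9%
# Knonik/OnDemand   ['63.0%', '83.6%', '91.1%']   measured: 63.7% / 84.1% / 91.4%
# Knonik/Pipelined  ['69.9%', '87.5%', '93.3%']   measured: 61.0% / 89.8% / 94.5%
```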
What the four experiments tell us together - about format choice, caching strategy, model scale, and where GPU time is actually going.
HDF5 has fast mmap random access, but it only shows in the cold-start window: Knonik/OnDemand already beats it on the 3-epoch average (3,618 ms vs 4,663 ms), and by epoch 3 the cross-epoch cache flips the result decisively (1,156 ms for Batched vs 3,845 ms). The real question is whether 18.4 GB of storage is worth winning only epoch 1 against the cached modes.
LeRobot v3 is the format robotics teams are adopting. Knonik/OnDemand loads it 4.3× faster per epoch, reaching 73.1% GPU utilisation vs 36.4%. Knonik/Batched reaches 12.1× better p95 tail latency. Slow steps drop from 26.8% to 0.5–1.2% across the Knonik modes.
PyTorch and HuggingFace re-decode video every epoch. Knonik decodes once and reuses from RAM. This is the main source of the warm-epoch speedups across experiments 1, 2, and 3 once training extends past 2–3 epochs.
Smaller data loads faster because more fits in RAM and CPU cache. 18.4 GB to 126 MB is not just cheaper to store - it is the reason Knonik's epoch 3 beats HDF5 despite HDF5 winning epoch 1 against the cached modes.
PyTorch/cached is 82× slower than Knonik/Pipelined - worse than naive seeking. Shuffled access destroys LRU hit rates. Each worker holds a bounded cache, but the training loop invalidates it faster than it fills.
A 10 ms median wait can coexist with a 719 ms p95. Every slow step stalls the GPU completely. Knonik's p95 stays under 150 ms across all modes. HuggingFace Arrow hits 719 ms p95 - a full stall on 19% of batches.
True GIL-free parallelism via OS processes. SharedMemory IPC reduces transfer cost from ~100 ms (pickle) to ~18 ms per 370 MB episode, enabling Knonik/Pipelined to reach 69.9% GPU utilisation. A minimal sketch of the hand-off pattern is shown below.
OnDemand wins at 1–2 epochs (instant cold start). From roughly epoch 3 onward, Pipelined's cache amortises and takes the lead. Batched overtakes Pipelined at around 50 epochs but pays a large epoch-1 penalty.
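The SharedMemory hand-off can be sketched in a few lines. This is a generic ProcessPoolExecutor + multiprocessing.shared_memory illustration under assumed episode shapes, with the decode stubbed out - not Knonik's actual implementation.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import shared_memory

def decode_episode_to_shm(ep_idx):
    """Runs in a worker process: decode an episode (stubbed here as zeros),
    copy it into a SharedMemory block, and return only the block's name plus
    array metadata - avoiding a large pickle transfer back to the parent."""
    frames = np.zeros((400, 3, 224, 224), dtype=np.uint8)   # stand-in for a real VP9 decode
    shm = shared_memory.SharedMemory(create=True, size=frames.nbytes)
    np.ndarray(frames.shape, frames.dtype, buffer=shm.buf)[:] = frames
    shm.close()                                   # parent re-opens the block by name
    return shm.name, frames.shape, str(frames.dtype)

def attach_episode(name, shape, dtype):
    """Runs in the training process: map the worker's block, copy it into the
    cross-epoch RAM cache, then release the shared segment."""
    shm = shared_memory.SharedMemory(name=name)
    episode = np.ndarray(shape, dtype=np.dtype(dtype), buffer=shm.buf).copy()
    shm.close()
    shm.unlink()                                  # free the shared segment
    return episode

if __name__ == "__main__":
    cache = {}
    with ProcessPoolExecutor(max_workers=2) as pool:
        for ep_idx, meta in enumerate(pool.map(decode_episode_to_shm, range(4))):
            cache[ep_idx] = attach_episode(*meta)
    print(len(cache), "episodes decoded into RAM")
```

Returning only the block name keeps the IPC payload tiny compared with pickling the array itself; the trade-off is that some process has to own each segment's lifetime (close vs unlink), which a production loader has to manage carefully.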
A close reading of the benchmark numbers produces questions that aren't obvious from the summary tables. This section works through eight of them in detail - covering cache behaviour, tail latency, GPU utilisation mechanics, and how to pick the right loader for a given training run.
PyTorch/cached decodes a full episode on first access and stores it in a per-worker LRU cache (8 episodes per worker). The theory: pay the decode cost once, reuse the result for all future batches drawn from that episode. In practice, this assumption collapses under shuffled sampling.
With 8 parallel workers, each worker maintains its own independent LRU. There is no shared cache between workers. If worker 0 caches episode 47, worker 1 has no visibility of that - it will decode episode 47 again from scratch the next time it receives a batch request for that episode. Across 8 workers pulling random samples from a large dataset, the probability that the same worker gets the same episode before it falls out of an 8-slot LRU is very low.
The result is near-zero cache hit rate, combined with a higher per-miss cost than naive seeking. PyTorch/naive decodes only chunk_len = 10 frames per __getitem__ call. PyTorch/cached decodes the entire episode on every miss. With a median wait of 1,798 ms and 54.2% of steps classified as slow, the cache is paying the maximum possible cost on every request.
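To make the failure mode concrete, here is a minimal sketch of a worker-local LRU episode dataset. The decode helper is a stand-in (a real loader would seek into the VP9 file with PyAV or similar); the cache structure is the point.

```python
import time
from collections import OrderedDict

import numpy as np
from torch.utils.data import Dataset

def fake_decode_full_episode(ep_idx, n_frames=400):
    """Stand-in for decoding a whole episode from VP9 - the expensive miss
    path. The sleep mimics the decode cost."""
    time.sleep(0.5)
    return np.zeros((n_frames, 224, 224, 3), dtype=np.uint8)

class CachedEpisodeDataset(Dataset):
    """Worker-local LRU over full episodes, as described above. With
    num_workers > 0 every DataLoader worker holds its own copy of
    `self.cache`, so under shuffle=True the hit rate collapses."""

    def __init__(self, n_episodes=49, ep_len=400, chunk_len=10, cache_size=8):
        self.n_episodes, self.ep_len = n_episodes, ep_len
        self.chunk_len, self.cache_size = chunk_len, cache_size
        self.cache = OrderedDict()

    def __len__(self):
        return self.n_episodes * (self.ep_len // self.chunk_len)

    def __getitem__(self, idx):
        ep, chunk = divmod(idx, self.ep_len // self.chunk_len)
        if ep not in self.cache:                   # almost always a miss when shuffled
            if len(self.cache) >= self.cache_size:
                self.cache.popitem(last=False)     # evict the least recently used episode
            self.cache[ep] = fake_decode_full_episode(ep, self.ep_len)
        self.cache.move_to_end(ep)                 # mark as most recently used
        start = chunk * self.chunk_len
        return self.cache[ep][start:start + self.chunk_len]
```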
Rule of thumb: Worker-local LRU caches are only beneficial when the access pattern is sequential and worker-stable - i.e. the same worker consistently gets batches from the same episode. Shuffled training is the opposite. If your DataLoader uses shuffle=True, episode-level LRU caching per worker will almost always hurt.
These two metrics measure different things, and the tension between them is real. Epoch average measures total wall-clock time per epoch - it is dominated by the warm epochs once the cross-epoch cache is populated. p95 wait latency measures the 95th-percentile per-batch data fetch time, capturing outlier stalls within a single epoch.
Pipelined's architecture involves parallel decode processes writing into shared memory, with the training loop consuming from that buffer. Most steps receive data with near-zero wait (p50: 1.7 ms). But occasionally the prefetch pipeline stalls - a decode process falls behind, the buffer drains, and the training loop has to wait. That stall shows up as a high p95 (145.5 ms) and a 4.7% slow-step rate.
OnDemand, by contrast, streams data on-the-fly without the overhead of managing a parallel decode pipeline. Its p50 is 18.1 ms and p95 is 94.6 ms - more consistent, but with no cross-epoch cache benefit. By epoch 3, Pipelined is at 1,325 ms/epoch vs OnDemand's 3,804 ms - the cache advantage overwhelms the occasional stall penalty.
When p95 matters more than epoch avg: if your training step is very short (small model, fast GPU), a 145 ms stall is a large relative cost. For longer GPU compute steps (50+ ms per batch), p95 latency is less impactful and epoch average is the right metric to optimise.
OnDemand has essentially no cold-start penalty (346 ms) because it does not pre-decode or pre-fill anything. It fetches and decodes data as each batch is requested. This makes it the fastest loader at epoch 1 by a large margin - 3,886 ms vs 5,824 ms for Pipelined and 12,346 ms for Batched.
However, OnDemand pays the same decoding cost on every epoch because it does not maintain a cross-epoch cache. Each epoch is a fresh pass. Pipelined and Batched both build a RAM cache during their first epoch. From epoch 2 onward, data is served from memory - not decoded from storage. The warm epoch time for Pipelined drops to ~1,325 ms and for Batched to ~1,168 ms. OnDemand stays flat at ~3,804 ms every epoch.
Projected epoch averages for longer runs (the 1- and 3-epoch rows are measured; the longer runs are projected using cold-start + warm-epoch amortisation):
| Epochs | Batched avg | OnDemand avg | Pipelined avg |
|---|---|---|---|
| 1 | 12,346 ms | 3,886 ms ✓ | 5,824 ms |
| 3 | 4,890 ms | 3,815 ms | 2,831 ms ✓ |
| 10 | 2,286 ms | 3,812 ms | 1,775 ms ✓ |
| 50 | 1,392 ms ✓ | 3,806 ms | 1,415 ms |
| 100 | 1,280 ms ✓ | 3,805 ms | 1,370 ms |
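The projected rows follow directly from the amortisation: one measured cold epoch plus (N − 1) measured warm epochs, divided by N. A quick sketch using the Experiment 3 numbers (the 1- and 3-epoch rows above are measured, so they are not reproduced here):

```python
def projected_epoch_avg(epoch1_ms, warm_ms, n_epochs):
    """Amortised per-epoch time: one cold epoch, then (n - 1) warm epochs."""
    return (epoch1_ms + (n_epochs - 1) * warm_ms) / n_epochs

# (epoch-1 time, warm-epoch time) from the Experiment 3 results
measured = {
    "Knonik/Batched":   (12_346, 1_168),
    "Knonik/OnDemand":  (3_886, 3_804),
    "Knonik/Pipelined": (5_824, 1_325),
}
for name, (epoch1, warm) in measured.items():
    print(name, [round(projected_epoch_avg(epoch1, warm, n)) for n in (10, 50, 100)])
# Knonik/Batched   [2286, 1392, 1280]
# Knonik/OnDemand  [3812, 3806, 3805]
# Knonik/Pipelined [1775, 1415, 1370]
```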
The crossover happens around epoch 2–3 for Pipelined vs OnDemand, and around epoch 50 for Batched overtaking Pipelined (Batched's lower warm-epoch time eventually wins over Pipelined's lower cold-start). OnDemand never wins past epoch 1 once caching loaders warm up.
HuggingFace's LeRobot v3 format stores numeric streams (actions, proprioception) in Apache Arrow files - memory-mappable and fast to access. Video frames are stored in compressed formats (h264, webp) and decoded per-chunk via a set_transform() hook at batch time.
In Experiment 3, all loaders are operating on the same VP9 dataset - so the Arrow advantage for numerics doesn't apply; the bottleneck in all cases is video decode. What hurts HuggingFace/Arrow specifically is the set_transform() call overhead: every batch triggers Python-level dispatch into a transform pipeline that decodes, converts, and stacks frames. This adds latency on top of the raw decode cost.
Additionally, HuggingFace's dataset infrastructure is optimised for high-throughput sequential or sliced access patterns - the Arrow columnar layout shines when you read contiguous rows. Random chunk sampling from shuffled episodes produces scattered seeks through the Arrow files, negating the columnar advantage. The result: 15,627 ms epoch average with 19.0% slow steps and a 719 ms p95 - comparable to PyTorch/naive, not better.
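A rough sketch of that access pattern is below. The column names, path, and decode helper are illustrative assumptions, not the exact LeRobot v3 schema; set_transform() itself is the real HuggingFace datasets hook.

```python
import numpy as np
import torch
from datasets import load_from_disk  # HuggingFace datasets

def decode_chunk(video_path, start_frame, chunk_len=10):
    """Placeholder for a real VP9/h264 decoder (PyAV, torchvision.io, ...).
    This runs at batch time, every epoch - nothing is cached across epochs."""
    return np.zeros((chunk_len, 224, 224, 3), dtype=np.uint8)

def video_transform(batch):
    # Invoked by `datasets` inside __getitem__ via set_transform():
    # Python-level dispatch, then decode + convert + stack for every batch.
    frames = [decode_chunk(p, s) for p, s in zip(batch["video_path"], batch["frame_index"])]
    batch["pixels"] = torch.from_numpy(np.stack(frames))
    return batch

ds = load_from_disk("path/to/lerobot_v3_arrow")  # Arrow-backed numeric columns, memory-mapped
ds.set_transform(video_transform)                # video decoded lazily, per chunk, per epoch
```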
GPU utilisation in this benchmark is a synthetic metric derived from the ratio of simulated GPU compute time to total wall-clock step time. Each step simulates a fixed 50 ms matmul workload. GPU utilisation is computed as:

gpu_util = gpu_compute_time / (gpu_compute_time + data_wait_time)
So 69.9% means that for every second of wall-clock training time, approximately 699 ms is the GPU executing the simulated forward/backward pass, and 301 ms is the CPU waiting for the data loader to produce the next batch. It is a measure of the data-loading overhead fraction, not GPU hardware saturation (SM occupancy, memory bandwidth, etc.).
In real training, the GPU compute step is not a fixed 50 ms - it depends on model size, batch size, and hardware. The 50 ms target was chosen to approximate a mid-scale robotics policy (e.g. ACT, Diffusion Policy). For smaller models with faster forward passes, the data-loading fraction becomes even more dominant. For very large models, it matters less. The benchmark is most representative of training workloads where a single step takes 20–100 ms of GPU compute.
wait_avg aggregates all per-batch wait times into a single mean. A loader can have a low wait_avg while still regularly stalling the GPU - if most batches are fast and a few are catastrophically slow, the average stays acceptable but training throughput suffers on every slow batch.
slow_pct measures the fraction of steps where the wait time exceeded some threshold (in this benchmark, a multiple of the median). A 19.0% slow_pct for HuggingFace/Arrow means nearly 1 in 5 batches causes a full GPU stall. During those stalls, the GPU sits completely idle - not partially utilised, but at zero throughput. No amount of fast average batches compensates for the idle time during a stall.
This is why PyTorch/naive (wait_avg: 102 ms, slow_pct: 17.1%) and HuggingFace/Arrow (wait_avg: 121 ms, slow_pct: 19.0%) produce almost identical epoch averages (15,434 ms vs 15,627 ms) despite a 19 ms average wait difference - the epoch runtime is dominated by how many times the GPU fully stalls, not by the average wait across all steps.
In practice: if you profile your training loop and find that mean data load time looks acceptable, always also check your p95 and stall count. A loader with a 10 ms mean and 600 ms p95 is a bigger problem than one with 80 ms mean and 100 ms p95.
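If you already log per-batch wait times, the summary is a few lines of numpy. The slow-step threshold below is an assumption - the benchmark uses a multiple of the median, but the exact factor isn't stated on this page.

```python
import numpy as np

def wait_stats(waits_ms, slow_factor=5.0):
    """Summarise per-batch wait times the way the tables above do.
    `slow_factor` is an assumed threshold relative to the median."""
    waits = np.asarray(waits_ms, dtype=float)
    p50 = np.percentile(waits, 50)
    return {
        "wait_avg_ms": waits.mean(),
        "p50_ms": p50,
        "p95_ms": np.percentile(waits, 95),
        "slow_pct": 100 * np.mean(waits > slow_factor * p50),
    }
```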
Batched pre-decodes all episodes serially before the first epoch begins. The 4,390 ms cold start is the time spent building a complete decoded representation of the dataset in RAM. This is the entire upfront cost paid once, after which every subsequent epoch reads directly from that in-memory structure.
The consequence is that epoch 1 looks expensive (12,346 ms: 4,390 ms cold start + first epoch pass), but epoch 3 drops to 1,168 ms - the lowest warm-epoch time of all six loaders. Batched's p50 wait is just 1.2 ms at warm state, because fetching a batch is essentially a memory copy from a pre-indexed buffer.
Whether this trade-off is worth it depends entirely on how many epochs you are training. At 3 epochs, Batched's average (4,890 ms) is worse than OnDemand (3,815 ms) and Pipelined (2,831 ms). At 50 epochs, Batched becomes the fastest of the three (1,392 ms avg). The cold start amortises completely once you train long enough.
One practical implication: if you are running short experimental loops (1–5 epochs) to validate a model change, Batched is the wrong choice. Switch to OnDemand. If you are running a full training run of 30+ epochs, Batched pays back its cold start many times over.
The right choice depends on three variables: number of epochs, dataset size relative to available RAM, and whether cold-start latency matters for your workflow.
OnDemand - when to use: Short experimental runs (1–5 epochs). Iterating on model architecture. Situations where you need the loop to start immediately.
OnDemand - trade-off: No cross-epoch cache. You pay the full decode cost every epoch. Does not improve with longer training.
Pipelined - when to use: Standard training runs (5–50 epochs). The best balanced choice across most robotics training workloads. Wins the most common training lengths.
Pipelined - trade-off: Highest p95 of the three Knonik modes (145 ms). Occasional decode stalls. Moderate cold start (~509 ms).
Batched - when to use: Long production runs (50+ epochs). Dataset fits in RAM. Lowest possible warm-epoch latency is the priority.
Batched - trade-off: Large cold start (4,390 ms) hurts short runs. RAM requirement scales with dataset size - not viable if the dataset exceeds available memory.
If you are uncertain, start with OnDemand. It has the lowest downside risk - no large cold start, no RAM commitment - and gives you an accurate baseline. Switch to Pipelined once you confirm you are training for more than a handful of epochs.
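That guidance boils down to a small decision rule. The sketch below is a toy encoding of the advice in this section, not a Knonik API; the thresholds are the ones quoted above.

```python
def pick_knonik_mode(n_epochs: int, dataset_fits_in_ram: bool = True) -> str:
    """Toy loader-mode selector following the guidance above."""
    if n_epochs <= 5:
        return "OnDemand"   # instant start, no RAM commitment, accurate baseline
    if n_epochs >= 50 and dataset_fits_in_ram:
        return "Batched"    # cold start amortises; lowest warm-epoch latency
    return "Pipelined"      # balanced default for standard 5-50 epoch runs
```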