DataLoader Benchmark

DataLoader Performance:
4 Experiments, Real Robotics Data

Four independent experiments measuring how Knonik loads data compared to PyTorch, HuggingFace datasets, and HDF5 on real robot episode datasets. All benchmarks run on identical hardware with calibrated GPU simulation workloads. The fourth experiment isolates GPU utilisation across model sizes - showing how Knonik scales as training steps get heavier.

GPU: NVIDIA RTX 4070 Ti
RAM: 62 GB
Batch size: 16
Chunk length: 10
Epochs: 3
Workers: 8
GPU sim target: 50 ms matmul
Introduction

What Was Tested and Why

DataLoader throughput is the most under-measured variable in robotics training. Teams optimise model architecture for hours and then train on a loader that stalls the GPU 30% of the time. These experiments measure the real cost of that choice across three independent setups.

The benchmark script runs experiments 1–3 in a single pass using the same GPU simulation, batch size, chunk length, and epoch count. Experiment 4 re-runs the three Knonik loader modes at three different GPU step durations (50 ms, 150 ms, 300 ms) to measure how utilisation scales with model weight. Results are written to JSON after each run. All numbers on this page come directly from those outputs.

Experiment 1
HDF5 (18.4 GB) vs Knonik (126 MB)

Same 50-episode dataset. One in raw HDF5 format, one compressed by Knonik to 126 MB. Does HDF5's mmap advantage outweigh a 146× size penalty?

Experiment 2
LeRobotV3/HuggingFace (800 MB) vs Knonik (189 MB)

Same 49-episode LeRobot v3 dataset. LeRobot v3 is the format teams are adopting. This is the comparison most relevant to teams building on the HuggingFace robotics ecosystem.

Experiment 3
6-Loader Architecture Comparison

Knonik Batched, OnDemand, and Pipelined vs PyTorch naive, PyTorch cached, and HuggingFace Arrow - all on the same VP9 dataset. Isolates loader architecture from data format.

Experiment 4
GPU Utilisation vs Training Step Duration

Three Knonik loader modes benchmarked at 50 ms, 150 ms, and 300 ms GPU step times. Shows how loader efficiency translates to GPU utilisation as model size increases.

Experiment 1

HDF5/PyTorch vs Knonik Compressed

50 episodes × 400 timesteps. HDF5 uncompressed at 18.4 GB vs Knonik compressed at 126 MB. Same data, same GPU sim, 3 epochs.

Episodes: 50 × 400 timesteps
Uncompressed: 18.4 GB (HDF5)
Knonik compressed: 126 MB
Compression ratio: 146×
Streams: action, qpos, qvel, rgb_top
Loader | Data Size | Cold Start | Wait Avg | Wait p50 | Wait p95 | Slow Steps | Epoch 1 | Epoch 3 | Epoch Avg | GPU Util
HDF5/PyTorch | 18.4 GB | 308 ms | 32.0 ms | 9.1 ms | 128.4 ms | 2.7% | 6,304 ms | 3,845 ms | 4,663 ms | 61.0%
Knonik/OnDemand | 126 MB | 308 ms | 27.6 ms | 17.7 ms | 84.3 ms | 2.1% | 3,618 ms | 3,612 ms | 3,618 ms | 64.4%
Knonik/Batched | 126 MB | 4,019 ms | 39.2 ms | 1.2 ms | 22.7 ms | 1.0% | 11,659 ms | 1,156 ms | 4,649 ms | 56.1%
Knonik/Pipelined | 126 MB | 2,827 ms | 45.6 ms | 1.4 ms | 26.6 ms | 1.6% | 13,164 ms | 1,336 ms | 5,320 ms | 52.3%
Epoch avg winner: Knonik/OnDemand - 3,618 ms vs 4,663 ms HDF5
Epoch 3 (Batched): 3.3× faster - 1,156 ms vs 3,845 ms
p95 tail latency: 5.6× better - 22.7 ms vs 128.4 ms (Batched)
Storage: 146× smaller - 126 MB vs 18.4 GB

Reading the result

With all three Knonik modes, the picture is clear: Knonik/OnDemand beats HDF5 outright in epoch average (3,618 ms vs 4,663 ms) with no warm-up cost. Knonik/Batched ties HDF5 in the 3-epoch average (4,649 ms vs 4,663 ms) but drops to 1,156 ms by epoch 3 - a 3.3× advantage. All Knonik modes have dramatically lower p95 tail latency (22–84 ms vs 128 ms), meaning fewer GPU stalls every run. HDF5's mmap speed advantage only shows up in Pipelined's 3-epoch average, which pays a heavy epoch-1 cold start. And all of this at 146× less storage.

Experiment 2

LeRobotV3 / HuggingFace vs Knonik Compressed

49 episodes × 700 timesteps. LeRobot v3 in HuggingFace Arrow format (800 MB) vs Knonik compressed (189 MB). The most relevant comparison for teams building on the HuggingFace ecosystem.

Episodes: 49 × 700 timesteps
HuggingFace (LeRobot v3): 800 MB (h264/webp)
Knonik compressed: 189 MB
Compression ratio: 4.2×
Streams: action, state, cam_high
Loader | Data Size | Cold Start | Wait Avg | Wait p50 | Wait p95 | Slow Steps | Epoch 1 | Epoch 3 | Epoch Avg | GPU Util
LeRobotV3/HuggingFace | 800 MB | 559 ms | 87.4 ms | 17.9 ms | 291.5 ms | 26.8% | 19,985 ms | 19,790 ms | 19,934 ms | 36.4%
LeRobot/PyTorch | 800 MB | 1,200 ms | 121.1 ms | 13.8 ms | 839.9 ms | 18.2% | 287,485 ms | 288,268 ms | 287,608 ms | 29.2%
Knonik/OnDemand | 189 MB | 244 ms | 18.4 ms | 13.4 ms | 54.5 ms | 0.5% | 4,651 ms | 4,552 ms | 4,602 ms | 73.1%
Knonik/Batched | 189 MB | 4,983 ms | 28.4 ms | 1.3 ms | 24.1 ms | 0.5% | 14,925 ms | 2,009 ms | 6,314 ms | 63.8%
Knonik/Pipelined | 189 MB | 3,556 ms | 51.0 ms | 3.2 ms | 38.3 ms | 1.2% | 26,645 ms | 2,815 ms | 10,619 ms | 49.5%
Note on LeRobot/PyTorch epoch times: LeRobotDataset returns individual frames, so with batch_size=16 it produces ~2,144 batches/epoch vs ~215 for chunk-based loaders (chunk_len=10). The epoch times are not directly comparable - LeRobot processes roughly 10× more batches over the same data. Its per-batch wait (121 ms avg) is in the same range as HuggingFace's (87 ms avg), but the total epoch wall-time (287,608 ms) is dominated by batch count.
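The batch-count arithmetic behind that note is easy to verify. A quick sketch using this page's parameters:

```python
# Batch counts per epoch for frame-based vs chunk-based loaders
# (values from this benchmark: 49 episodes x 700 timesteps).
episodes, timesteps = 49, 700
batch_size, chunk_len = 16, 10

# LeRobotDataset yields one frame per __getitem__, so an epoch covers every frame.
frames = episodes * timesteps                 # 34,300 samples
frame_batches = -(-frames // batch_size)      # ceiling division -> 2,144 batches

# Chunk-based loaders yield chunk_len consecutive frames per sample.
chunks = episodes * (timesteps // chunk_len)  # 3,430 samples
chunk_batches = -(-chunks // batch_size)      # ceiling division -> 215 batches

print(frame_batches, chunk_batches, frame_batches / chunk_batches)
```

The ~10× batch-count gap, not per-batch wait, is what drives the 287,608 ms epoch wall-time.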

How Each Loader Was Implemented

All three approaches - HuggingFace's LeRobotDataset pipeline, a plain PyTorch DataLoader, and Knonik's loader - run over the same LeRobot v3 dataset on the same hardware.

Best epoch avg: 4.3× faster - OnDemand: 4,602 ms vs 19,934 ms
Best GPU util: 73.1% - OnDemand vs 36.4% HuggingFace
p95 tail latency: 12.1× better - Batched: 24.1 ms vs 291.5 ms
Slow steps: 53× fewer - 0.5% vs 26.8%

Reading the result

Every Knonik mode beats HuggingFace. Knonik/OnDemand is 4.3× faster with a 244 ms cold start and 73.1% GPU utilisation - the highest across all experiments. Knonik/Batched drops to 2,009 ms by epoch 3 (10× faster than HF's flat 19,790 ms). Even Knonik/Pipelined, which pays a large epoch-1 cold start, reaches 2,815 ms by epoch 3. LeRobot v3 with HuggingFace shows zero warm-up benefit across all 3 epochs - video decodes live every epoch with no caching. Slow steps at 26.8% mean the GPU stalls more than one batch in four.

Experiment 3

6-Loader Architecture Comparison

Knonik Batched, OnDemand, and Pipelined vs PyTorch naive, PyTorch cached, and HuggingFace Arrow - all on the same VP9 dataset. 49 episodes, 3 epochs.

Knonik/Batched

Pre-decodes all episodes serially before epoch starts. Cross-epoch RAM cache.

Knonik/OnDemand

Streams chunks on-the-fly via EpisodeStreamPool. No caching. Lowest cold-start.

Knonik/Pipelined

Parallel decode via ProcessPoolExecutor + POSIX SharedMemory IPC. Cross-epoch RAM cache.

PyTorch/naive

Standard DataLoader. Each __getitem__ seeks to VP9 frame position and decodes chunk_len frames.

PyTorch/cached

Worker-local LRU episode cache (8 episodes/worker). Full episode decoded on first access.

HuggingFace/Arrow

Arrow on disk for numerics. Video decoded per-chunk per-epoch via set_transform().
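As a concrete reference point, the PyTorch/naive access pattern described above can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual code: decode_frames is a stub standing in for a real VP9 decoder (PyAV, OpenCV, or similar), and the class names are hypothetical.

```python
import random

def decode_frames(video_path, start, count):
    """Stub for a VP9 decoder: seek to frame `start`, decode `count` frames.
    A real implementation would use PyAV/OpenCV; placeholders stand in here."""
    return [f"{video_path}@{start + i}" for i in range(count)]

class NaiveChunkDataset:
    """Each __getitem__ seeks into the episode video and decodes chunk_len
    frames from scratch - no caching, mirroring the PyTorch/naive loader."""
    def __init__(self, episodes, timesteps, chunk_len=10):
        self.chunk_len = chunk_len
        # One (episode, start_frame) index entry per chunk.
        self.index = [(ep, t) for ep in range(episodes)
                      for t in range(0, timesteps, chunk_len)]

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        ep, start = self.index[i]
        return decode_frames(f"episode_{ep:03d}.mkv", start, self.chunk_len)

ds = NaiveChunkDataset(episodes=49, timesteps=700)
chunk = ds[random.randrange(len(ds))]
print(len(ds), len(chunk))   # 3430 chunks, 10 frames per chunk
```

Every access pays a fresh seek-and-decode, which is exactly why shuffled sampling gives this loader its 659 ms p95.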

Epoch Average (ms) - Lower is Faster

Average time per epoch across 3 epochs. Knonik modes in green, alternatives in red.

Knonik/Pipelined: 2,831 ms
Knonik/OnDemand: 3,815 ms
Knonik/Batched: 4,890 ms
PyTorch/naive: 15,434 ms
HuggingFace/Arrow: 15,627 ms
PyTorch/cached: 232,685 ms

Full Performance Numbers

Loader | Cold Start | Wait Avg | Wait p50 | Wait p95 | Slow Steps | Epoch 1 | Epoch 3 | Epoch Avg | GPU Util
Knonik/Pipelined | 509 ms | 21.5 ms | 1.7 ms | 145.5 ms | 4.7% | 5,824 ms | 1,325 ms | 2,831 ms | 69.9%
Knonik/OnDemand | 346 ms | 29.4 ms | 18.1 ms | 94.6 ms | 1.8% | 3,886 ms | 3,804 ms | 3,815 ms | 63.0%
Knonik/Batched | 4,390 ms | 41.4 ms | 1.2 ms | 27.4 ms | 1.0% | 12,346 ms | 1,168 ms | 4,890 ms | 54.7%
PyTorch/naive | 1,547 ms | 102.2 ms | 10.8 ms | 659.2 ms | 17.1% | 15,686 ms | 15,394 ms | 15,434 ms | 32.9%
HuggingFace/Arrow | 1,698 ms | 121.7 ms | 15.4 ms | 719.3 ms | 19.0% | 15,879 ms | 15,425 ms | 15,627 ms | 29.1%
PyTorch/cached | 5,322 ms | 1,873 ms | 1,798 ms | 4,195 ms | 54.2% | 232,644 ms | 232,133 ms | 232,685 ms | 2.6%

Estimated GPU Utilisation

Higher is better. Measures how much of the training loop the GPU is actually working vs waiting for data.

Knonik/Pipelined: 69.9%
Knonik/OnDemand: 63.0%
Knonik/Batched: 54.7%
PyTorch/naive: 32.9%
HuggingFace/Arrow: 29.1%
PyTorch/cached: 2.6%

Projected Performance at Different Training Lengths

Knonik's cross-epoch cache makes it faster the longer you train. PyTorch and HuggingFace stay flat.

Knonik/Batched: 10.6× faster - epoch 1 to epoch 3
Knonik/Pipelined: 4.4× faster - epoch 1 to epoch 3
PyTorch/naive: 1.0× - no warm-up benefit
Epochs | Knonik/Batched | Knonik/OnDemand | Knonik/Pipelined | PyTorch/naive | HF/Arrow
1 | 12,346 ms | 3,886 ms | 5,824 ms | 15,686 ms | 15,879 ms
3 | 4,890 ms | 3,815 ms | 2,831 ms | 15,434 ms | 15,627 ms
10 | 2,286 ms | 3,812 ms | 1,775 ms | 15,434 ms | 15,627 ms
50 | 1,392 ms | 3,806 ms | 1,415 ms | 15,434 ms | 15,627 ms
100 | 1,280 ms | 3,805 ms | 1,370 ms | 15,434 ms | 15,627 ms
Experiment 4

GPU Utilisation vs Training Step Duration

The same three Knonik loader modes run with simulated GPU steps of 50 ms, 150 ms, and 300 ms - representing light, medium, and heavy model workloads. As the model takes longer per step, the dataloader's wait time becomes relatively smaller and GPU utilisation climbs.

GPU step = 50 ms
Knonik/Pipelined: 61.0% (epoch avg 4,014 ms)
Knonik/OnDemand: 63.7% (epoch avg 3,736 ms)
Knonik/Batched: 55.5% (epoch avg 4,746 ms)

GPU step = 150 ms
Knonik/Pipelined: 89.8% (epoch avg 2,382 ms)
Knonik/OnDemand: 84.1% (epoch avg 3,692 ms)
Knonik/Batched: 78.6% (epoch avg 4,818 ms)

GPU step = 300 ms
Knonik/Pipelined: 94.5% (epoch avg 2,427 ms)
Knonik/OnDemand: 91.4% (epoch avg 3,688 ms)
Knonik/Batched: 87.9% (epoch avg 4,889 ms)

GPU Utilisation - All Modes × All Step Durations

GPU Step | Knonik/Batched | Knonik/OnDemand | Knonik/Pipelined
50 ms | 55.5% | 63.7% | 61.0%
150 ms | 78.6% | 84.1% | 89.8%
300 ms | 87.9% | 91.4% | 94.5%
Pipelined at 300 ms step: 94.5% GPU utilisation - near-maximum throughput
OnDemand at 300 ms step: 91.4% - no cold-start penalty, instant-on
Utilisation gain (50→300 ms): +33.5pp - Pipelined climbs from 61.0% to 94.5% as the model grows heavier

Reading the result

Knonik's dataloader wait times are essentially constant regardless of GPU step duration - the loader doesn't know or care how long the model takes. What changes is the ratio of wait time to compute time. At 50 ms GPU steps, the ~40 ms average wait represents a meaningful fraction of each cycle. At 300 ms, that same 40 ms wait is a small overhead. Pipelined reaches 94.5% GPU utilisation at 300 ms - meaning the GPU is working 94.5% of the time and waiting on data only 5.5%. The takeaway: Knonik's loader performance naturally scales with model size. The heavier your model, the less the loader matters - and it was already fast.
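The scaling described above follows directly from the utilisation ratio. A sketch assuming a constant ~40 ms per-step wait (roughly what Knonik/Batched shows) reproduces the measured curve to within a fraction of a percentage point:

```python
def gpu_util(step_ms, wait_ms):
    """Fraction of each training cycle spent computing vs waiting on the loader."""
    return step_ms / (step_ms + wait_ms)

wait = 40.0  # loader wait is roughly constant regardless of model size
for step in (50, 150, 300):
    print(f"{step} ms step -> {gpu_util(step, wait):.1%} utilisation")
```

With the same 40 ms wait, utilisation climbs from ~55.6% at 50 ms steps to ~88.2% at 300 ms - close to the 55.5% → 87.9% measured for Batched.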

Analysis

Cross-Experiment Analysis

What the four experiments tell us together - about format choice, caching strategy, model scale, and where GPU time is actually going.

GPU Utilisation Across All Experiments

Knonik/OnDemand - Exp 2 (vs HuggingFace): 73.1%
Knonik/Pipelined - Exp 3 (baseline): 69.9%
Knonik/OnDemand - Exp 1 (vs HDF5): 64.4%
Knonik/Batched - Exp 2 (vs HuggingFace): 63.8%
Knonik/OnDemand - Exp 3 (baseline): 63.0%
HDF5/PyTorch - Exp 1: 61.0%
Knonik/Batched - Exp 1 (vs HDF5): 56.1%
Knonik/Batched - Exp 3 (baseline): 54.7%
Knonik/Pipelined - Exp 1 (vs HDF5): 52.3%
Knonik/Pipelined - Exp 2 (vs HuggingFace): 49.5%
LeRobotV3/HuggingFace - Exp 2: 36.4%
PyTorch/naive - Exp 3 (baseline): 32.9%
HuggingFace/Arrow - Exp 3 (baseline): 29.1%
PyTorch/cached - Exp 3 (baseline): 2.6%

1. HDF5 is fast - but still loses after warm-up

HDF5's fast mmap random access keeps it close in the 3-epoch average (4,663 ms vs 4,649 ms for Knonik/Batched), but the cross-epoch cache flips the result at epoch 3: 1,156 ms vs 3,845 ms - and Knonik/OnDemand beats HDF5 outright with no warm-up at all (3,618 ms epoch average). The real question is whether 18.4 GB of storage is worth winning only the cold-start window.

2. LeRobot v3 / HuggingFace is up to 4.3× slower

LeRobot v3 is the format robotics teams are adopting. Knonik/OnDemand loads it 4.3× faster per epoch, reaching 73.1% GPU utilisation vs 36.4%. Knonik/Batched reaches 12.1× better p95 tail latency. Slow steps drop from 26.8% to 0.5% across modes.

3. Cross-epoch caching is the decisive variable

PyTorch and HuggingFace re-decode video every epoch. Knonik decodes once and reuses from RAM. This is the source of every speedup across experiments 1, 2, and 3 once training extends past 2–3 epochs.

4. Compression reduces storage and improves loading

Smaller data loads faster because more fits in RAM and CPU cache. 18.4 GB to 126 MB is not just cheaper to store - it is the reason Knonik's epoch 3 beats HDF5 despite HDF5 winning epoch 1.

5. Per-worker caching is a trap

PyTorch/cached is 82× slower than Knonik/Pipelined - worse than naive seeking. Shuffled access destroys LRU hit rates. Each worker holds a bounded cache, but the training loop invalidates it faster than it fills.

6. p95 tail latency is the real GPU killer

A 10 ms median wait can coexist with a 719 ms p95. Every slow step stalls the GPU completely. Knonik's p95 stays under 150 ms across all modes. HuggingFace Arrow hits 719 ms p95 - a full stall on 19% of batches.

7. Parallel decode + SharedMemory

True GIL-free parallelism via OS processes. SharedMemory IPC reduces transfer cost from ~100 ms (pickle) to ~18 ms per 370 MB episode, enabling Knonik/Pipelined to reach 69.9% GPU utilisation.

8. OnDemand for short runs, Pipelined for long runs

OnDemand wins at 1–3 epochs (instant cold start). Beyond ~4 epochs, Pipelined's cache amortises and wins permanently. Batched wins at 50+ epochs but pays a large epoch-1 penalty.
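The SharedMemory handoff in point 7 can be illustrated with Python's standard library. This is a mechanism sketch, not Knonik's implementation: in the real pipeline a decode process writes and the training loop attaches, shown here in a single process for brevity.

```python
from multiprocessing import shared_memory

# Producer side: allocate a named segment and write "decoded episode" bytes into it.
payload = bytes(range(256)) * 4          # stand-in for decoded frame data
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# Consumer side: attach by name - no pickling, no copy of the payload
# through a pipe; both handles view the same OS-level memory segment.
view = shared_memory.SharedMemory(name=shm.name)
received = bytes(view.buf[:len(payload)])
assert received == payload

view.close()
shm.close()
shm.unlink()   # producer releases the segment once training is done with it
```

Because only the segment name crosses the process boundary, transfer cost is decoupled from episode size - which is what makes the ~18 ms handoff per large episode plausible where pickling costs ~100 ms.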

Deep-Dive Analysis

Questions These Results Raise

A close reading of the benchmark numbers produces questions that aren't obvious from the summary tables. This section works through eight of them in detail - covering cache behaviour, tail latency, GPU utilisation mechanics, and how to pick the right loader for a given training run.

Q1

Why does PyTorch/cached perform worse than PyTorch/naive, despite caching entire decoded episodes?

PyTorch/cached decodes a full episode on first access and stores it in a per-worker LRU cache (8 episodes per worker). The theory: pay the decode cost once, reuse the result for all future batches drawn from that episode. In practice, this assumption collapses under shuffled sampling.

With 8 parallel workers, each worker maintains its own independent LRU. There is no shared cache between workers. If worker 0 caches episode 47, worker 1 has no visibility of that - it will decode episode 47 again from scratch the next time it receives a batch request for that episode. Across 8 workers pulling random samples from a large dataset, the probability that the same worker gets the same episode before it falls out of an 8-slot LRU is very low.

The result is near-zero cache hit rate, combined with a higher per-miss cost than naive seeking. PyTorch/naive decodes only chunk_len = 10 frames per __getitem__ call. PyTorch/cached decodes the entire episode on every miss. With a median wait of 1,798 ms and 54.2% of steps classified as slow, the cache is paying the maximum possible cost on every request.
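A back-of-envelope model makes the cost asymmetry concrete. Even granting the worker-local LRU a generous steady-state hit rate (cache slots / episodes, assuming uniform random episode requests - an assumption of this sketch, not a benchmark measurement), a miss decodes an entire 700-frame episode while naive decodes only 10 frames:

```python
# Expected decode work per request: naive chunk decode vs per-worker LRU cache.
episodes, timesteps, chunk_len = 49, 700, 10
cache_slots = 8   # episodes held per worker

# Crude steady-state approximation: a worker's 8-slot LRU holds 8 of 49
# episodes, so P(hit) is at most about 8/49 under uniform random requests.
p_hit = cache_slots / episodes                 # ~16%

naive_cost = chunk_len                         # naive: always 10 frames
cached_cost = (1 - p_hit) * timesteps          # cached: full episode per miss

print(f"hit rate <= {p_hit:.1%}")
print(f"naive: {naive_cost} frames/request, cached: ~{cached_cost:.0f} frames/request")
print(f"cached decodes ~{cached_cost / naive_cost:.0f}x more frames per request")
```

Even this optimistic model has the cached loader decoding roughly 59× more frames per request than naive seeking - before accounting for cache-eviction churn across 8 independent workers.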

Rule of thumb: Worker-local LRU caches are only beneficial when the access pattern is sequential and worker-stable - i.e. the same worker consistently gets batches from the same episode. Shuffled training is the opposite. If your DataLoader uses shuffle=True, episode-level LRU caching per worker will almost always hurt.

Q2

Knonik/Pipelined has the best epoch average (2,831 ms) but the highest p95 tail latency among Knonik modes (145.5 ms). How can both be true?

These two metrics measure different things, and the tension between them is real. Epoch average measures total wall-clock time per epoch - it is dominated by the warm epochs once the cross-epoch cache is populated. p95 wait latency measures the 95th-percentile per-batch data fetch time, capturing outlier stalls within a single epoch.

Pipelined's architecture involves parallel decode processes writing into shared memory, with the training loop consuming from that buffer. Most steps receive data with near-zero wait (p50: 1.7 ms). But occasionally the prefetch pipeline stalls - a decode process falls behind, the buffer drains, and the training loop has to wait. That stall shows up as a high p95 (145.5 ms) and a 4.7% slow-step rate.

OnDemand, by contrast, streams data on-the-fly without the overhead of managing a parallel decode pipeline. Its p50 is 18.1 ms and p95 is 94.6 ms - more consistent, but with no cross-epoch cache benefit. By epoch 3, Pipelined is at 1,325 ms/epoch vs OnDemand's 3,804 ms - the cache advantage overwhelms the occasional stall penalty.

When p95 matters more than epoch avg: if your training step is very short (small model, fast GPU), a 145 ms stall is a large relative cost. For longer GPU compute steps (50+ ms per batch), p95 latency is less impactful and epoch average is the right metric to optimise.

Q3

Why does Knonik/OnDemand win at epoch 1 but lose to Pipelined and eventually Batched at longer training runs?

OnDemand has no cold-start penalty (346 ms) because it does not pre-decode or pre-fill anything. It fetches and decodes data as each batch is requested. This makes it the fastest loader at epoch 1 by a large margin - 3,886 ms vs 5,824 ms for Pipelined and 12,346 ms for Batched.

However, OnDemand pays the same decoding cost on every epoch because it does not maintain a cross-epoch cache. Each epoch is a fresh pass. Pipelined and Batched both build a RAM cache during their first epoch. From epoch 2 onward, data is served from memory - not decoded from storage. The warm epoch time for Pipelined drops to ~1,325 ms and for Batched to ~1,168 ms. OnDemand stays flat at ~3,804 ms every epoch.

Projected epoch averages for longer runs (using cold-start + warm-epoch amortisation):

Epochs | Batched avg | OnDemand avg | Pipelined avg
1 | 12,346 ms | 3,886 ms ✓ | 5,824 ms
3 | 4,890 ms | 3,815 ms | 2,831 ms ✓
10 | 2,286 ms | 3,812 ms | 1,775 ms ✓
50 | 1,392 ms ✓ | 3,806 ms | 1,415 ms
100 | 1,280 ms ✓ | 3,805 ms | 1,370 ms

The crossover happens around epoch 2–3 for Pipelined vs OnDemand, and around epoch 50 for Batched overtaking Pipelined (Batched's lower warm-epoch time eventually wins over Pipelined's lower cold-start). OnDemand never wins past epoch 1 once caching loaders warm up.
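The projection itself is a one-line amortisation over the epoch-1 and warm-epoch times from the Experiment 3 table (measured 3-epoch averages differ by a few ms because epoch 2 is not yet fully warm):

```python
def projected_epoch_avg(epoch1_ms, warm_ms, n_epochs):
    """Amortise the expensive first (cache-building) epoch over n epochs;
    epoch1_ms already includes the loader's cold start."""
    return (epoch1_ms + (n_epochs - 1) * warm_ms) / n_epochs

# (epoch 1, warm epoch) times from the Experiment 3 results table.
loaders = {
    "Batched":   (12_346, 1_168),
    "OnDemand":  (3_886, 3_804),
    "Pipelined": (5_824, 1_325),
}
for n in (1, 3, 10, 50, 100):
    row = {name: round(projected_epoch_avg(e1, warm, n))
           for name, (e1, warm) in loaders.items()}
    print(n, row)
```

At 50 and 100 epochs this reproduces the table's 1,392 ms / 1,280 ms for Batched and 1,415 ms / 1,370 ms for Pipelined exactly.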

Q4

HuggingFace/Arrow uses Arrow format for numerics and only decodes video on-demand. Why is it still slower than PyTorch/naive in Experiment 3?

HuggingFace's LeRobot v3 format stores numeric streams (actions, proprioception) in Apache Arrow files - memory-mappable and fast to access. Video frames are stored in compressed formats (h264, webp) and decoded per-chunk via a set_transform() hook at batch time.

In Experiment 3, all loaders are operating on the same VP9 dataset - so the Arrow advantage for numerics doesn't apply; the bottleneck in all cases is video decode. What hurts HuggingFace/Arrow specifically is the set_transform() call overhead: every batch triggers Python-level dispatch into a transform pipeline that decodes, converts, and stacks frames. This adds latency on top of the raw decode cost.

Additionally, HuggingFace's dataset infrastructure is optimised for high-throughput sequential or sliced access patterns - the Arrow columnar layout shines when you read contiguous rows. Random chunk sampling from shuffled episodes produces scattered seeks through the Arrow files, negating the columnar advantage. The result: 15,627 ms epoch average with 19.0% slow steps and a 719 ms p95 - comparable to PyTorch/naive, not better.

Q5

What does "GPU utilisation" actually measure here, and why does 69.9% not mean 30.1% of the GPU is idle?

GPU utilisation in this benchmark is a synthetic metric derived from the ratio of simulated GPU compute time to total wall-clock step time. Each step simulates a fixed 50 ms matmul workload. GPU utilisation is computed as:

gpu_util = gpu_sim_ms / (gpu_sim_ms + wait_ms_per_step)

So 69.9% means that for every second of wall-clock training time, approximately 699 ms is the GPU executing the simulated forward/backward pass, and 301 ms is the CPU waiting for the data loader to produce the next batch. It is a measure of the data-loading overhead fraction, not GPU hardware saturation (SM occupancy, memory bandwidth, etc.).

In real training, the GPU compute step is not a fixed 50 ms - it depends on model size, batch size, and hardware. The 50 ms target was chosen to approximate a mid-scale robotics policy (e.g. ACT, Diffusion Policy). For smaller models with faster forward passes, the data-loading fraction becomes even more dominant. For very large models, it matters less. The benchmark is most representative of training workloads where a single step takes 20–100 ms of GPU compute.

Q6

Why does slow_pct (fraction of slow batches) matter as much as or more than wait_avg?

wait_avg aggregates all per-batch wait times into a single mean. A loader can have a low wait_avg while still regularly stalling the GPU - if most batches are fast and a few are catastrophically slow, the average stays acceptable but training throughput suffers on every slow batch.

slow_pct measures the fraction of steps where the wait time exceeded some threshold (in this benchmark, a multiple of the median). A 19.0% slow_pct for HuggingFace/Arrow means nearly 1 in 5 batches causes a full GPU stall. During those stalls, the GPU sits completely idle - not partially utilised, but at zero throughput. No amount of fast average batches compensates for the idle time during a stall.

This is why PyTorch/naive (wait_avg: 102 ms, slow_pct: 17.1%) and HuggingFace/Arrow (wait_avg: 121 ms, slow_pct: 19.0%) produce almost identical epoch averages (15,434 ms vs 15,627 ms) despite a 19 ms average wait difference - the epoch runtime is dominated by how many times the GPU fully stalls, not by the average wait across all steps.

In practice: if you profile your training loop and find that mean data load time looks acceptable, always also check your p95 and stall count. A loader with a 10 ms mean and 600 ms p95 is a bigger problem than one with 80 ms mean and 100 ms p95.
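A toy example shows why the mean misleads. Two synthetic wait profiles (illustrative numbers, not benchmark data): the "spiky" loader has the lower mean but a catastrophic p95, computed here with a simple nearest-rank percentile.

```python
import statistics

# Synthetic per-batch wait times (ms): similar-looking means, very different tails.
steady = [80.0] * 100                    # 80 ms on every batch
spiky  = [10.0] * 90 + [600.0] * 10      # fast mostly, full stall on 10% of batches

def p95(xs):
    """Nearest-rank 95th percentile."""
    xs = sorted(xs)
    return xs[int(0.95 * len(xs)) - 1]

for name, waits in (("steady", steady), ("spiky", spiky)):
    print(name, "mean:", statistics.mean(waits), "p95:", p95(waits))
```

The spiky profile wins on mean (69 ms vs 80 ms) yet stalls the GPU for 600 ms on one batch in ten - exactly the failure mode a mean-only profile hides.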

Q7

Knonik/Batched has a cold start of 4,390 ms - longer than some full epochs. Why pay that upfront cost?

Batched pre-decodes all episodes serially before the first epoch begins. The 4,390 ms cold start is the time spent building a complete decoded representation of the dataset in RAM. This is the entire upfront cost paid once, after which every subsequent epoch reads directly from that in-memory structure.

The consequence is that epoch 1 looks expensive (12,346 ms: 4,390 ms cold start + first epoch pass), but epoch 3 drops to 1,168 ms - the lowest warm-epoch time of all six loaders. Batched's p50 wait is just 1.2 ms at warm state, because fetching a batch is essentially a memory copy from a pre-indexed buffer.

Whether this trade-off is worth it depends entirely on how many epochs you are training. At 3 epochs, Batched's average (4,890 ms) is worse than OnDemand (3,815 ms) and Pipelined (2,831 ms). At 50 epochs, Batched becomes the fastest of the three (1,392 ms avg). The cold start amortises completely once you train long enough.

One practical implication: if you are running short experimental loops (1–5 epochs) to validate a model change, Batched is the wrong choice. Switch to OnDemand. If you are running a full training run of 30+ epochs, Batched pays back its cold start many times over.

Q8

How should a team actually choose between the three Knonik loader modes for their training workload?

The right choice depends on three variables: number of epochs, dataset size relative to available RAM, and whether cold-start latency matters for your workflow.

Knonik/OnDemand
Use when

Short experimental runs (1–5 epochs). Iterating on model architecture. Situations where you need the loop to start immediately.

Trade-off

No cross-epoch cache. You pay full decode cost every epoch. Does not improve with longer training.

Knonik/Pipelined
Use when

Standard training runs (5–50 epochs). Best balanced choice across most robotics training workloads. Wins the most common training lengths.

Trade-off

Highest p95 of the three Knonik modes (145 ms). Occasional decode stalls. Moderate cold start (~509 ms).

Knonik/Batched
Use when

Long production runs (50+ epochs). Dataset fits in RAM. Lowest possible warm-epoch latency is the priority.

Trade-off

Large cold start (4,390 ms) hurts short runs. RAM requirement scales with dataset size - not viable if dataset exceeds available memory.

If you are uncertain, start with OnDemand. It has the lowest downside risk - no large cold start, no RAM commitment - and gives you an accurate baseline. Switch to Pipelined once you confirm you are training for more than a handful of epochs.
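That decision logic condenses into a small helper. The thresholds are illustrative, taken from the crossovers observed in Experiment 3, and pick_knonik_mode is a hypothetical name for this sketch, not part of Knonik's API:

```python
def pick_knonik_mode(epochs, dataset_fits_in_ram=True):
    """Heuristic loader-mode selection distilled from the trade-offs above.
    Thresholds are illustrative, based on the Experiment 3 crossovers."""
    if epochs <= 5 or not dataset_fits_in_ram:
        return "OnDemand"      # instant start, no RAM commitment
    if epochs >= 50:
        return "Batched"       # cold start fully amortised, lowest warm epochs
    return "Pipelined"         # best balance for typical 5-50 epoch runs

print(pick_knonik_mode(3))        # short experimental loop
print(pick_knonik_mode(20))       # standard training run
print(pick_knonik_mode(100))      # long production run
print(pick_knonik_mode(100, dataset_fits_in_ram=False))  # RAM-bound fallback
```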