Does training on Knonik-compressed data produce models that learn as well as training on raw uncompressed data? We benchmarked three production policy architectures on a 14-DOF dual-arm manipulation task, comparing learning quality and deployment transfer across both formats.
This benchmark answers two questions simultaneously. First, learning quality: when both the compressed and uncompressed training runs start from the same random initialisation and train for the same number of steps, does the compressed training signal produce a model of equivalent quality? Second, domain transfer: does a model trained exclusively on compressed data generalise to clean uncompressed observations - the typical real-world deployment scenario?
Both questions are answered using surrogate models that replicate the architecture and training dynamics of production robotics policy networks without requiring the full training infrastructure or massive compute budgets. The surrogate approach is standard practice for data pipeline benchmarking: it isolates the effect of the data source from the effect of model scale. All three models were trained on the act_14dof dataset - 50 episodes of 14-DOF dual-arm manipulation, each 400 timesteps, with RGB observations at 480×640 and full joint state (position, velocity, action) at 14 dimensions.
For each model, two runs are conducted from an identical random initialisation. Run A trains on Knonik-compressed data using the Knonik loader. Run B trains on raw uncompressed HDF5 files using a standard PyTorch DataLoader. In both cases validation is always performed on the uncompressed val set, so neither format has an advantage in evaluation. Periodic validation during training is capped at 32 batches for speed; after training completes, each compressed run receives a full post-training evaluation over the entire uncompressed dataset with no batch cap, and this full evaluation yields the domain transfer score.
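A minimal sketch of this paired-run protocol, assuming placeholder `make_model`, `train_fn`, and loader objects (none of these names come from Knonik's actual API):

```python
import random

def run_pair(make_model, train_fn, loaders, seed=0):
    """Train one model per data source from the same random initialisation.

    `loaders` maps a run label ("compressed" / "uncompressed") to a data
    loader; `make_model` builds a fresh model and `train_fn` trains it.
    All of these are placeholders for the benchmark's real components.
    """
    results = {}
    for label, loader in loaders.items():
        random.seed(seed)     # reset the RNG so both runs share one init
        model = make_model()  # weights drawn from the shared seed
        results[label] = train_fn(model, loader)
    return results
```

Validation would then run on the uncompressed val set for both entries, matching the no-format-advantage rule above.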
| Model | Training objective | Key hyperparameters |
|---|---|---|
| Diffusion Policy | Single-step denoising MSE on a noisy action sample | hidden_dim=256 · obs_history=2 · lr=1e-4 · grad_clip=1.0 |
| ACT | L1 reconstruction loss + KL divergence (weight 10.0) via CVAE | hidden_dim=512 · history_len=16 · kl_weight=10.0 · lr=1e-4 |
| Flow Matching Policy | Flow-matching MSE on the velocity field u_t = noise − action | hidden_dim=256 · num_layers=4 · action_horizon=16 · lr=1e-4 |
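The flow-matching objective can be sketched in a few lines, assuming the standard linear interpolation path from action (t=0) to noise (t=1); the helper names here are illustrative, not taken from the benchmark code:

```python
def flow_matching_pair(action, noise, t):
    """Point on the linear path from action (t=0) to noise (t=1), plus target.

    x_t = (1 - t) * action + t * noise
    u_t = noise - action   (the constant velocity of the linear path)
    """
    x_t = [(1.0 - t) * a + t * n for a, n in zip(action, noise)]
    u_t = [n - a for a, n in zip(action, noise)]
    return x_t, u_t

def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)
```

The policy network predicts the velocity from x_t and the observation; the training loss is `mse(prediction, u_t)`.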
Final validation MAE, best loss checkpoints, training time, and domain transfer gap for all six runs.
| Model | Training | Final Val MAE | MAE Ratio | Best Val Loss | Best Step | Train Time | Full Uncomp. MAE | Domain Gap |
|---|---|---|---|---|---|---|---|---|
| Diffusion Policy | compressed | 0.0604 | 0.774× | 0.00918 | 500 | 107.4 s | 0.0637 | +0.0033 |
| Diffusion Policy | uncompressed | 0.0780 | - | 0.01344 | 500 | 98.8 s | - | - |
| ACT | compressed | 0.0911 | 0.991× | 0.09243 | 450 | 105.8 s | 0.0931 | +0.0020 |
| ACT | uncompressed | 0.0919 | - | 0.09489 | 500 | 96.3 s | - | - |
| Flow Matching Policy | compressed | 0.0917 | 0.945× | 0.12391 | 500 | 110.8 s | 0.0930 | +0.0012 |
| Flow Matching Policy | uncompressed | 0.0970 | - | 0.13166 | 500 | 100.2 s | - | - |
A MAE ratio (compressed final MAE ÷ uncompressed final MAE) below 1.0 means compressed training outperforms uncompressed.
Do models trained on compressed data generalise to clean uncompressed observations at deployment time?
| Model | Periodic Val MAE | Full Uncompressed MAE | Gap | Gap % | Verdict |
|---|---|---|---|---|---|
| Diffusion Policy | 0.0604 | 0.0637 | +0.0033 | +5.5% | Excellent |
| ACT | 0.0911 | 0.0931 | +0.0020 | +2.2% | Excellent |
| Flow Matching Policy | 0.0917 | 0.0930 | +0.0012 | +1.3% | Excellent |
All domain transfer gaps are within 0.004 MAE absolute across all three models. Models trained on Knonik-compressed data generalise cleanly to uncompressed deployment data - the periodic val scores recorded during training closely match the post-training full-dataset evaluation, confirming the compressed training signal is genuine and not a measurement artefact.
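The gap columns above follow directly from the two MAE measurements; as a sanity check, this is the arithmetic (function name illustrative):

```python
def domain_gap(periodic_mae, full_mae):
    """Absolute and relative domain transfer gap (full minus periodic)."""
    gap = full_mae - periodic_mae
    return gap, 100.0 * gap / periodic_mae
```

For Diffusion Policy, `domain_gap(0.0604, 0.0637)` reproduces the reported +0.0033 absolute and +5.5% relative gap.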
Training loss, validation loss, and validation action MAE curves for each architecture.
The most striking result is that all three models achieve lower validation MAE when trained on Knonik-compressed data. Diffusion Policy shows the largest effect - compressed training reaches a final MAE of 0.0604, compared to 0.0780 with raw uncompressed data, a 22.6% reduction. ACT and the Flow Matching Policy show smaller but consistent advantages (0.991× and 0.945× respectively). The most plausible mechanism is that video codec compression introduces subtle temporal smoothing in the RGB stream, acting as implicit data augmentation that reduces overfitting on fine visual texture. This effect is consistent across architectures with very different inductive biases - from denoising diffusion to CVAE-based action chunking to flow matching - suggesting it is a property of the data format rather than any specific model.
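The ratios quoted above are simply compressed-over-uncompressed final validation MAE; reproducing them from the results table (values copied from the table, rounded to three decimals):

```python
# (compressed, uncompressed) final validation MAE per model, from the table
final_mae = {
    "Diffusion Policy": (0.0604, 0.0780),
    "ACT": (0.0911, 0.0919),
    "Flow Matching Policy": (0.0917, 0.0970),
}
ratios = {model: round(comp / uncomp, 3)
          for model, (comp, uncomp) in final_mae.items()}
```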
The central practical question is whether models trained on compressed data generalise to uncompressed observations at deployment. All three models show small, bounded domain transfer gaps: +5.5% for Diffusion Policy (0.0637 vs 0.0604), +2.2% for ACT (0.0931 vs 0.0911), and +1.3% for the Flow Matching Policy (0.0930 vs 0.0917). In absolute MAE terms the largest gap is 0.0033. These are well within acceptable bounds for continuous action prediction - the model has learned the underlying manipulation task geometry, not the specific artefacts of the encoding format. The periodic val evaluator, which uses uncompressed data throughout training, closely tracked the post-training full-dataset evaluation in all three cases, validating that the compressed-data training signal is genuine.
None of the six runs meets the automatic convergence criterion at 500 steps - all models retain headroom. Diffusion Policy and the Flow Matching Policy are both still descending at step 500; ACT compressed is the only run that shows early saturation, reaching its best validation loss at step 450. Training time overhead for compressed data is modest across all three architectures - roughly 9% to 11% - consistent with Knonik's video decode running in background worker processes overlapped with GPU forward passes.