Compression and dataloading are usually treated as separate problems. One team member picks a compression format. Another writes the dataloader. They rarely sit in the same room and talk about the tradeoff they are jointly creating. But these two components are locked in a dance, and the choreography between them shapes your training throughput in ways most teams never measure.
The relationship is simple to state and deceptively hard to get right. Compression reduces the amount of data you need to read from disk or transfer over the network. Good. But compressed data must be decoded before your model can train on it, and that decoding takes CPU time. As your compression strategy gets more aggressive, the decode cost climbs. At some point the CPU spends so long decompressing each batch that the GPU finishes its forward and backward pass and sits idle, waiting. You have traded one bottleneck for another.
Go the other direction and store everything raw or lightly compressed: now your decode time is negligible, but your I/O bandwidth becomes the constraint. The disk or network cannot shovel bytes to the CPU fast enough, and the GPU starves for the same reason, just from the opposite end.
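The two failure modes above can be captured in a back-of-envelope model. This is a sketch, not a benchmark: every bandwidth and batch size below is a hypothetical placeholder, and the assumption that reads and decodes overlap perfectly is optimistic. Plug in numbers measured on your own hardware.

```python
def batch_data_time(raw_bytes, ratio, io_bw, decode_bw):
    """Seconds to deliver one batch, assuming reads and decodes overlap.

    raw_bytes : uncompressed bytes per batch
    ratio     : compression ratio (1.0 = raw storage)
    io_bw     : sustained read bandwidth, bytes/sec
    decode_bw : decompressed-output bandwidth of the CPU, bytes/sec
    """
    read_time = (raw_bytes / ratio) / io_bw            # fewer bytes to move
    decode_time = 0.0 if ratio == 1.0 else raw_bytes / decode_bw
    return max(read_time, decode_time)                 # pipelined stages

MB = 1024 * 1024
batch = 512 * MB                 # hypothetical raw bytes per batch
nfs = 200 * MB                   # hypothetical slow network mount, bytes/sec

raw_t   = batch_data_time(batch, 1.0,  nfs, decode_bw=1)   # pure I/O bound
light_t = batch_data_time(batch, 3.0,  nfs, 2000 * MB)     # fast codec
heavy_t = batch_data_time(batch, 10.0, nfs, 300 * MB)      # aggressive codec
```

With these particular numbers the light-compression option wins: raw storage is throttled by the mount, and the aggressive codec is throttled by decode. Swap the network mount for fast local NVMe and the ordering can flip, which is exactly the point made below about the sweet spot moving with hardware.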
The sweet spot exists somewhere in the middle. And in robotics, finding it is significantly harder than in vision or NLP, because robotics data is not a neat pile of images or a token stream. It is multimodal, variable frequency, temporally coupled, and structurally heterogeneous. The compression strategy that works for your camera frames is wrong for your joint states. The decode latency that is acceptable for proprioceptive data at 100Hz is catastrophic when applied to stereo image pairs at 30fps. This is the kind of problem that only reveals itself when you instrument your pipeline and actually measure where the time goes.
Why this matters more in robotics than anywhere else
In standard vision training, your data is homogeneous. Every sample is an image, maybe with a label. The dataloader reads it, decodes it, applies augmentations, and hands a tensor to the GPU. The compression story is simple. The decode story is also simple. The entire loop is a solved problem.
Robotics data is a fundamentally different beast. A single demonstration episode from a manipulation task might contain two or three camera streams (each at different resolutions and frame rates), joint position and velocity at 100 to 500Hz, gripper state, force torque readings, end effector poses, and sometimes language annotations or task labels. These modalities are sampled at different rates, stored in different formats, and must be temporally aligned before your model sees them.
Now multiply this by the fact that episodes are variable length. Some demonstrations are 3 seconds. Some are 45 seconds. A batch of 32 episodes might have wildly different temporal extents, and each one has a different number of camera frames, joint readings, and force samples that need to be padded or truncated and collated into a single tensor.
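The pad-and-collate step for variable length episodes looks roughly like the following sketch. The function name and the mask convention are illustrative, not any particular library's API; it handles one modality of shape (time, features) and returns a padding mask so the model can ignore the filler timesteps.

```python
import numpy as np

def collate_episodes(episodes, pad_value=0.0):
    """Pad variable-length episodes to a common length and stack them.

    episodes : list of (T_i, D) float arrays, one per episode
    returns  : (batch, mask) where batch is (N, T_max, D) and mask is
               (N, T_max), True for real timesteps, False for padding.
    """
    t_max = max(ep.shape[0] for ep in episodes)
    d = episodes[0].shape[1]
    batch = np.full((len(episodes), t_max, d), pad_value, dtype=np.float32)
    mask = np.zeros((len(episodes), t_max), dtype=bool)
    for i, ep in enumerate(episodes):
        batch[i, : ep.shape[0]] = ep     # copy the real data
        mask[i, : ep.shape[0]] = True    # mark it as valid
    return batch, mask
```

Note the cost implication: a batch mixing 3-second and 45-second episodes is mostly padding, so the decoded, collated tensor can be many times larger than the useful data it carries.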
This is the environment in which the compression/dataloading tradeoff plays out. And it is why getting it right requires domain awareness that generic ML tooling simply does not have.
The anatomy of a compression decision
When you compress robotics data, you are making a bet about where your bottleneck will be at training time. The spectrum looks roughly like this.
No compression (raw storage). Your camera frames are stored as raw pixel arrays. Your joint states are full precision floats. Decode time: effectively zero. Storage cost: enormous. A single teleoperation session with two stereo cameras at 30fps, plus proprioceptive data at hundreds of Hz, can produce several gigabytes per episode. Across thousands of episodes, you are looking at tens of terabytes. Your storage groans. Your network mount slows to a crawl during training because multiple workers are all trying to read simultaneously and the bandwidth is saturated.
Light compression. You apply a fast, general purpose compressor. Decode is quick, ratios are modest — maybe 2x to 3x. For proprioceptive data that has a lot of temporal redundancy (joint angles change smoothly), this works well. For camera frames, the ratio is unimpressive because pixel data does not compress well with generic byte level schemes. You have cut your storage and I/O roughly in half and added a small CPU overhead that is usually hidden behind the I/O wait. This is where most teams start, and it is a reasonable default.
Medium compression. Better ratios, maybe 3x to 5x overall. But decode slows noticeably. You are starting to trade meaningful CPU time for storage savings. With enough dataloader workers, the CPU decode can usually be parallelized enough that the GPU stays fed. But you are now burning a significant chunk of your host CPU budget just on decompression, which competes with augmentation, collation, and everything else happening on the host.
Aggressive compression. Ratios of 10x or higher. These are the schemes that understand the structure of the data: that camera frames have spatial redundancy, that sequential frames in an episode are temporally correlated, that proprioceptive signals are smooth and low bandwidth. The compression results can be remarkable. But decode time is no longer negligible. If your GPU processes a batch faster than your CPU can decompress the next one, the dataloader cannot keep up no matter how many workers you throw at it.
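You can see the "understand the structure of the data" effect with nothing but the standard library. The experiment below compresses a synthetic smooth joint trajectory two ways: generically, and after delta encoding, which exploits exactly the temporal redundancy described above. The signal, the quantization scale, and the resulting ratios are all illustrative; timings are machine dependent.

```python
import time
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a smooth joint trajectory, quantized to int16.
joints = np.round(np.cumsum(rng.normal(0, 1e-3, 500_000)) * 1000).astype(np.int16)

# Generic byte-level compression of the raw samples.
plain = zlib.compress(joints.tobytes(), level=6)

# Structure-aware preprocessing: delta-encode first. A smooth signal has
# tiny step-to-step differences, which compress far better than raw values.
deltas = np.empty_like(joints)
deltas[0] = joints[0]
deltas[1:] = np.diff(joints)
delta_blob = zlib.compress(deltas.tobytes(), level=6)

t0 = time.perf_counter()
zlib.decompress(delta_blob)
decode_s = time.perf_counter() - t0

print(f"raw: {joints.nbytes} B, generic: {len(plain)} B, "
      f"delta+generic: {len(delta_blob)} B, decode: {decode_s*1e3:.1f} ms")
```

The delta-encoded stream compresses substantially better than the naive one at the same codec setting — the same data, encoded with knowledge of its structure. Real pipelines apply the analogous idea per modality with far more sophisticated codecs.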
The sweet spot is not a fixed point. It shifts depending on your hardware. If you have fast local storage, you can afford less compression because I/O is not the constraint. If you are reading from a network file system or cloud storage, the I/O penalty for uncompressed data is much steeper, and you need more aggressive compression to keep the pipe full. This is why a proper robotics data infrastructure solution, like the Knonik data infrastructure platform, treats compression as an adaptive property of the pipeline rather than a one time format decision.
Dataloading: the invisible cost center
Your GPU costs somewhere between $2 and $30 per hour depending on the card. Every second it spends idle because the dataloader cannot keep up is money burned with zero training progress. This is not a theoretical concern. It is the default state of most robotics training pipelines.
The math is straightforward. Your model processes a batch in some number of milliseconds — that is the budget your dataloader has to produce the next batch. In that window, it needs to read compressed data from disk, decompress each modality stream, temporally align everything, handle variable length sequences, collate into a batched tensor, and transfer to GPU. If any step in that chain takes longer than the model's compute time, the GPU sits idle.
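That budget arithmetic can be written down directly. The sketch below uses hypothetical stage timings; the interesting part is the `pipelined` distinction — when stages overlap across prefetching workers, throughput is limited by the slowest single stage, not by the sum.

```python
def gpu_utilization(step_s, stage_s, pipelined=True):
    """Fraction of wall time the GPU computes, given per-batch stage times.

    step_s   : GPU forward+backward time per batch, seconds
    stage_s  : dict of per-batch dataloader stage times (read, decode,
               align, collate, transfer), seconds
    pipelined: if stages overlap across workers, only the slowest stage
               limits throughput; otherwise their sum does.
    """
    data_s = max(stage_s.values()) if pipelined else sum(stage_s.values())
    return step_s / max(step_s, data_s)

# Hypothetical numbers: decode dominates the pipeline.
stages = {"read": 0.04, "decode": 0.09, "align": 0.02,
          "collate": 0.01, "transfer": 0.01}
util = gpu_utilization(step_s=0.06, stage_s=stages)   # ~0.67: GPU waits on decode
```

In this toy configuration the GPU idles a third of the time even with perfect pipelining, and the only lever that helps is the decode stage itself — throwing workers at the other stages changes nothing.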
Generic dataloaders attempt to solve this with parallelism: multiple worker processes prefetching batches concurrently. This helps, up to a point. But robotics data creates specific pathologies that generic parallelism does not address well.
There is the variable decode time problem. Different episodes have different lengths. A short episode decodes quickly; a long one takes much longer. If a single slow episode lands in a batch, the entire batch is delayed by its slowest constituent. This is the "straggler" problem, analogous to head of line blocking in networking. The dataloader finishes most episodes and then waits for the one outlier. The GPU waits for the dataloader. Everyone waits.
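The straggler effect is easy to quantify with a simulation. The decode-time distribution below is invented (log-normal, i.e. mostly fast episodes with a heavy tail of slow ones), but the qualitative result holds for any heavy-tailed distribution: the batch waits for its maximum, and the maximum of 32 draws is far larger than the typical draw.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-episode decode times in ms: median around e^2 ~ 7.4 ms,
# with a heavy tail of long episodes. 10,000 simulated batches of 32.
decode_ms = rng.lognormal(mean=2.0, sigma=0.8, size=(10_000, 32))

episode_median = float(np.median(decode_ms))             # typical episode
batch_median = float(np.median(decode_ms.max(axis=1)))   # batch waits for max

print(f"median episode decode: {episode_median:.1f} ms")
print(f"median batch decode:   {batch_median:.1f} ms  "
      f"({batch_median / episode_median:.1f}x inflation)")
```

Under these assumed parameters the batch-level decode time inflates severalfold over the per-episode median, which is why formats with uniform per-sample decode cost pay off out of proportion to their compression ratio.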
There is the multimodal alignment overhead. Camera frames, joint states, and force torque data all arrive at different frequencies. Aligning these to a common temporal grid is not free. It involves timestamp reconciliation and interpolation, and in a naive implementation, the cost adds up fast across a full batch.
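The core alignment operation is interpolation onto a shared grid. A minimal sketch, assuming linear interpolation is acceptable for the signal in question (it usually is for smooth proprioception; it is not for discrete events like gripper open/close, which need nearest-sample or hold-last semantics):

```python
import numpy as np

def align_to_grid(ts, values, grid_ts):
    """Linearly interpolate a 1-D signal onto a common time grid."""
    return np.interp(grid_ts, ts, values)

# Hypothetical streams: joint positions at 100 Hz, camera frames at 30 fps.
joint_ts = np.arange(0.0, 2.0, 0.01)
joint_pos = np.sin(joint_ts)               # stand-in for a smooth joint angle
cam_ts = np.arange(0.0, 2.0, 1 / 30)

# One value per camera frame, resampled from the 100 Hz stream.
joint_at_frames = align_to_grid(joint_ts, joint_pos, cam_ts)
```

Done naively, this runs once per modality per episode per batch, every step, for the entire training run — which is why precomputing alignment at encode time, when the compression format is being written, is often the right trade.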
And there is the memory pressure from prefetching. To hide decode latency, dataloaders prefetch multiple batches. But decoded multimodal robotics data is large, and prefetching several batches means holding significant host memory per GPU just for the data queue. On a multi GPU node, this crowds out memory available for everything else running on the host.
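The memory pressure is worth computing before it bites. A rough sizing rule, under the assumption that each worker holds a fixed number of fully decoded batches in its queue (the parameter names here are illustrative, not any specific loader's API):

```python
def prefetch_host_gb(batch_gb, prefetch_per_worker, workers, gpus_per_node):
    """Rough host RAM held by decoded prefetch queues alone.

    Hypothetical sizing rule: every worker keeps `prefetch_per_worker`
    fully decoded batches in flight, per GPU process on the node.
    """
    return batch_gb * prefetch_per_worker * workers * gpus_per_node

# A decoded multimodal batch can be large even when the files are small:
queue_gb = prefetch_host_gb(batch_gb=1.5, prefetch_per_worker=2,
                            workers=4, gpus_per_node=8)   # 96 GB of host RAM
```

With these placeholder numbers, prefetch queues alone pin 96 GB of host memory before a single byte of augmentation scratch space or filesystem cache is accounted for.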
The symphony: how compression and dataloading must harmonize
Here is the core insight that connects everything above. Compression and dataloading are not independent design choices. They are two halves of a single optimization problem, and the objective function is GPU utilization.
The goal is not maximum compression. The goal is not maximum decode speed. The goal is: the GPU never waits for data. Everything else is a means to that end.
This means the compression strategy must be chosen with full knowledge of the dataloading architecture, and vice versa. Concretely, this plays out in several interconnected ways.
Per modality compression. There is no reason to apply the same compression scheme to every data stream in an episode. Different modalities have different statistical properties, different redundancy profiles, and different decode cost characteristics. Treating them uniformly leaves performance on the table.
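Structurally, per modality compression is just a codec table keyed by stream name. The sketch below uses zlib for both entries purely to stay self-contained; the modality names are invented, and a real pipeline would use an image or video codec (JPEG, AV1, and so on) for camera streams rather than a byte compressor.

```python
import zlib
import numpy as np

def encode_proprio(arr):
    # Smooth, low-bandwidth signal: a fast byte compressor is enough.
    return zlib.compress(np.ascontiguousarray(arr).tobytes(), level=1)

def encode_frames(arr):
    # Placeholder for a spatial/temporal image codec.
    return zlib.compress(np.ascontiguousarray(arr).tobytes(), level=6)

# Hypothetical modality names mapped to their codecs.
CODECS = {"joint_pos": encode_proprio, "wrist_cam": encode_frames}

def encode_episode(episode):
    """episode: dict mapping modality name -> ndarray."""
    return {name: CODECS[name](arr) for name, arr in episode.items()}
```

The decode side mirrors this with a matching table, which is also where decode cost becomes a per-modality tuning knob rather than a single global choice.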
Decode parallelism must match I/O parallelism. If your compression scheme requires heavy CPU work to decode, your dataloader needs more parallel capacity. But more parallelism means more memory, more context switches, and more contention on the I/O bus. The two must be co-tuned. A pipeline where the decode side is starved because the I/O side cannot feed it fast enough, or vice versa, is wasting resources on both ends.
Prefetch depth is a function of decode variance. The straggler problem means you need enough buffer to absorb the variance in decode time across episodes. If episode lengths vary by 15x (common in robotics), you need deeper buffering to ensure the GPU always has a batch ready. A format that produces uniform decode time per sample dramatically reduces the required buffer depth.
Random access matters. Training loops shuffle episodes across epochs. If your compression format requires sequential decompression, random access to arbitrary episodes means wasteful seeking and redundant work. A format that supports independent episode level access allows the dataloader to pull any episode from the dataset without penalty. This is a compression design choice with massive dataloading implications.
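The mechanism behind independent episode level access is an offset index over independently compressed chunks. The toy format below is a sketch, not a real on-disk spec, but it shows the essential property: reading episode i touches only episode i's bytes.

```python
import zlib
import numpy as np

def pack(episodes):
    """Concatenate independently compressed episodes plus an offset index.

    Because each episode is compressed on its own, a reader can seek
    straight to episode i and decode only that episode.
    """
    parts, index, offset = [], [], 0
    for ep in episodes:
        blob = zlib.compress(np.ascontiguousarray(ep, np.float32).tobytes())
        index.append((offset, len(blob), ep.shape))   # where it lives
        parts.append(blob)
        offset += len(blob)
    return b"".join(parts), index

def read_episode(packed, index, i):
    """Random access: decode only episode i, touching no other bytes."""
    offset, size, shape = index[i]
    raw = zlib.decompress(packed[offset : offset + size])
    return np.frombuffer(raw, np.float32).reshape(shape)
```

Contrast this with a format that compresses the whole shard as one stream: fetching the last episode then means decompressing everything before it, and a shuffled training loop does that over and over.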
What this looks like in practice: a concrete example
Let us trace through a real scenario. You have thousands of demonstration episodes of a manipulation task collected via teleoperation.
Uncompressed: the dataset is massive. Stored on a shared network volume, your dataloader is constantly waiting for I/O. Your model finishes each batch in tens of milliseconds, but the next batch takes hundreds of milliseconds to arrive. GPU utilization: single digits. Your expensive hardware is doing almost nothing useful.
Naive compression with a generic scheme: the dataset shrinks significantly. Read time drops. But now the decode step is the bottleneck. The CPU is spending more time decompressing than the GPU spends training. GPU utilization improves, but it is still well under 25%. You have moved the bottleneck, not removed it.
Modality aware compression with a co-designed dataloader: each data stream is compressed with the scheme best suited to its structure and decode cost profile. The dataloader is built to exploit this, overlapping I/O with decode and keeping the GPU fed. GPU utilization: above 90%.
That third scenario is the difference between a training run that finishes in days and one that would take weeks to produce the same result. Same data. Same model. Same GPU. The only difference is how compression and dataloading were designed together instead of separately.
Why most teams get this wrong
The reason this tradeoff is so often mishandled is organizational, not technical. Compression is a data engineering decision, usually made when the collection pipeline is built. Dataloading is an ML engineering decision, usually made when the training script is written. These are often different people, working at different times, with different priorities.
The data engineer picks whatever format is the default in the robotics community because it is familiar and the tooling makes it easy. The ML engineer writes a dataset class that opens each file and lets the library decompress on the fly. Neither one profiles the interaction. Neither one measures where the time goes. The result is a training loop that runs at a fraction of potential GPU utilization and everyone assumes "that is just how robotics training works."
It is not how it has to work. It is how it works when compression and dataloading are treated as separate problems instead of what they actually are: two halves of a single system that must be co-designed.
This is one of the core things the Knonik robotics data infrastructure gets right. The compression format and the dataloader are not separate components glued together after the fact. They are designed as a single system, where the encoding decisions are made with decode performance in mind, and the dataloader is built to exploit the specific structure of the compressed format.
The learning signal question
There is one more dimension to this that deserves attention. Not all compression is created equal in terms of what it preserves.
In robotics, the learning signal is distributed across modalities in ways that are often subtle. A diffusion policy might rely on fine grained visual texture to infer contact state. An ACT model might be sensitive to exact gripper aperture values at the moment of grasp. A VLA might use subtle spatial relationships between objects that are only distinguishable in the high frequency components of the image.
Standard lossy compression — the kind that works fine for image classification — can destroy these signals. Compression tuned for "is this a cat or a dog?" is not the same as compression tuned for "at what exact pixel coordinate does the gripper contact the edge of this object?" Aggressive quantization of joint states saves on storage but can truncate precision that your policy actually needs for fine manipulation tasks.
This is why the Knonik approach to data infrastructure compression is learning signal aware. The compression does not just minimize file size. It minimizes file size under the constraint that information relevant to policy training is preserved. What counts as "relevant" depends on the model architecture, the task, and the modality — which is why this cannot be a static format choice. It has to be an intelligent component of the pipeline that understands what it is compressing and why.
Practical takeaways
If you are building or maintaining a robotics training pipeline and you have not profiled the interaction between your compression format and your dataloader, you are almost certainly leaving GPU utilization on the table. The first step is measurement: instrument your training loop and actually look at where time is spent. Measure the time between when your model finishes a step and when the next batch is available. If there is a gap, your data pipeline is the bottleneck.
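That measurement takes about fifteen lines. The sketch below wraps any iterable loader and any step callable; the function name and return values are illustrative. One caveat is baked into the docstring: with an async GPU framework, the step must synchronize (e.g. `torch.cuda.synchronize()`) or compute time will be undermeasured and the data wait overstated.

```python
import time

def profile_loader(loader, train_step, warmup=5, steps=50):
    """Split each training step into data-wait time vs compute time.

    Works for any iterable loader and any callable train_step. With an
    async GPU framework, synchronize inside train_step (for example
    torch.cuda.synchronize()) or compute time will be undermeasured.
    """
    it = iter(loader)
    wait, compute = [], []
    for i in range(warmup + steps):
        t0 = time.perf_counter()
        batch = next(it)          # gap after the previous step = data wait
        t1 = time.perf_counter()
        train_step(batch)
        t2 = time.perf_counter()
        if i >= warmup:           # discard warmup: caches, worker spin-up
            wait.append(t1 - t0)
            compute.append(t2 - t1)
    ratio = sum(compute) / (sum(compute) + sum(wait))
    return ratio, sum(wait) / steps, sum(compute) / steps
```

If the returned ratio is well below 1.0, the average wait tells you how much GPU time each step burns waiting on data, and you can start attributing it to read, decode, alignment, or collation.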
Second, stop compressing all modalities the same way. Your camera data and your joint data have completely different statistical properties and completely different decode cost profiles. Treat them separately.
Third, if you are using whatever the community default is "because that is what everyone uses," question it. Measure your actual training throughput and ask whether a different compression strategy would change the picture. You might be surprised by how much performance you are leaving on the floor.
The most important number in your training pipeline is not your loss. It is the ratio of GPU compute time to total step time. If that ratio is below 0.85, you have a data problem, and the answer is almost always in the space between compression and dataloading.