Data Quality · Teleoperation · Imitation Learning · Robotics Data Infrastructure · Behavior Cloning

Your Robot is Only as Good as Your Worst Demonstration

In robot learning, data quality is not a nice problem to have. It is the problem. Bad demonstrations do not dilute your dataset. They poison it.

Knonik · March 11, 2026

Language models had the internet. Vision models had ImageNet. Robotics has a person holding a controller, trying not to bump into things, for the forty-seventh time today. The quality of that person's demonstrations is the single biggest lever you have. Most teams treat it as an afterthought.

There is a belief in robotics right now that data quantity solves quality problems. Collect enough demonstrations, the thinking goes, and the noise washes out. The good signal survives. The model learns the right thing.

This is wrong, and it is wrong in a way that costs teams months.

In imitation learning, bad demonstrations do not average out. They compound. A fumbled grasp does not become a neutral data point when you add a hundred successful grasps around it. It becomes a region of the state space where your policy learned contradictory behavior, and at test time, when the robot enters that region, it does the thing the bad demonstration taught it: hesitate, jitter, or collide. The policy does not know the demonstration was bad. It only knows it was in the training set.

This is the fundamental asymmetry of data quality in robot learning. Good demonstrations contribute linearly. Bad demonstrations contribute nonlinearly, because they create failure modes your model would not have had otherwise.

Why robotics data quality is a different beast

In most of machine learning, a mislabeled example is a small tax on your model's accuracy. If 1% of your ImageNet labels are wrong, your classifier is roughly 1% worse. The damage is proportional and bounded. You can afford some noise because the task is tolerant.

Robot learning does not work this way. A manipulation policy is not classifying static images. It is generating a sequence of continuous actions over time, and each action depends on the state created by the previous action. This is what makes the quality problem so pernicious: errors compound temporally. A policy that learned a slightly wrong approach angle from a sloppy demonstration will enter a state it has never seen before, which produces an action the model was never trained on, which creates an even more unfamiliar state. This is covariate shift, and it is the defining failure mode of behavior cloning. The literature calls it "compounding errors." Practitioners call it "the robot goes off the rails after two seconds."

The critical insight is that compounding errors are not caused by insufficient data. They are caused by insufficient data quality at the decision boundaries — the moment of grasp, the transition from approach to contact, the handoff between motion primitives. These are exactly the moments where teleoperators are most likely to be inconsistent.

The taxonomy of bad demonstrations

Not all bad data is created equal. The failure modes fall into distinct categories, each with different consequences for policy learning.

Operator fatigue degradation. A teleoperator does their best work in the first 30 to 45 minutes of a session. After that, reaction times slow, trajectories become less smooth, and grasps get sloppier. The operator does not notice because the degradation is gradual. But the data notices. Episodes collected in hour three of a session have measurably worse kinematic smoothness, longer task completion times, and more corrective submovements than episodes from hour one. If your pipeline treats all episodes equally, your model is learning two different styles of manipulation and trying to reconcile them.
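Fatigue degradation of this kind is measurable directly from the joint stream. Here is a minimal sketch, assuming episodes are fixed-rate `(T, n_joints)` joint position arrays; the baseline size and 2x threshold are illustrative, not a recommendation. A mean-squared-jerk score flags late-session episodes that are much jerkier than the session's opening ones:

```python
import numpy as np

def smoothness_score(joint_positions: np.ndarray, dt: float = 0.02) -> float:
    """Mean squared jerk of a joint trajectory (lower is smoother).

    joint_positions: (T, n_joints) array sampled at a fixed rate 1/dt.
    """
    # Third finite difference approximates jerk (third derivative of position).
    jerk = np.diff(joint_positions, n=3, axis=0) / dt**3
    return float(np.mean(jerk**2))

def fatigue_flags(episodes, baseline_n: int = 5, threshold: float = 2.0):
    """Flag episodes much jerkier than the session's first baseline_n episodes."""
    scores = [smoothness_score(ep) for ep in episodes]
    baseline = np.mean(scores[:baseline_n])
    return [s > threshold * baseline for s in scores]
```

Run over a session in collection order, this surfaces the gradual degradation the operator cannot feel.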

Teleoperator skill variance. If you have multiple people collecting data, you have multiple implicit policies in your dataset. Operator A approaches from the left. Operator B approaches from the right. Operator C takes a wide arc. All three succeed at the task, but the demonstrated strategies are different enough that a behavior cloning model trained on the mixture learns an averaged policy that does not match any of them and fails in ways none of them would have. This is the multimodality problem, and it does not go away with more data. It gets worse.

Silent sensor corruption. A camera mount shifts slightly. A joint encoder starts reporting with a small offset. A force torque sensor develops a bias. These are not dramatic failures. The data "looks" fine on visual inspection. But the proprioceptive stream no longer matches the visual stream in the way the model expects, and the policy learns a subtle misalignment that only shows up as degraded performance across the board, impossible to diagnose from aggregate metrics.
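Timestamp drift between streams is one of the few silent corruptions that is cheap to check. A hedged sketch, assuming both streams record host timestamps in seconds and that the offset and drift tolerances are placeholders you would tune per rig:

```python
import numpy as np

def stream_desync(cam_ts: np.ndarray, joint_ts: np.ndarray,
                  offset_tol: float = 0.005, drift_tol: float = 1e-4):
    """Estimate constant offset and per-second drift between two nominally
    synchronized timestamp streams. Returns (offset, drift, flagged)."""
    n = min(len(cam_ts), len(joint_ts))
    delta = cam_ts[:n] - joint_ts[:n]
    t = cam_ts[:n] - cam_ts[0]
    # Linear fit: delta ≈ offset + drift * t
    drift, offset = np.polyfit(t, delta, 1)
    flagged = abs(offset) > offset_tol or abs(drift) > drift_tol
    return float(offset), float(drift), bool(flagged)
```

A nonzero drift term is the signature of two clocks slowly walking apart, which is exactly the failure aggregate metrics cannot see.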

Recovered failures that read as successes. The operator fumbles the grasp, catches the object mid drop, and completes the task. From a task completion standpoint, this is a successful demonstration. From a learning standpoint, it is toxic. The recovery behavior introduces a trajectory through state space that the policy should never reproduce, but because the episode ends in success, it is never flagged. The model learns the fumble as a valid strategy.
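One cheap heuristic for catching recovered fumbles, assuming the gripper width is logged (the closed-width threshold here is illustrative): count how many distinct times the gripper closes during an episode. A clean pick closes once; a fumble-and-recover closes at least twice:

```python
import numpy as np

def grasp_attempts(gripper_width: np.ndarray, closed_thresh: float = 0.01) -> int:
    """Count distinct close events in a gripper width signal (meters)."""
    closed = gripper_width < closed_thresh
    # Rising edges of the "closed" signal mark new grasp attempts.
    edges = np.diff(closed.astype(int))
    return int(np.sum(edges == 1) + (1 if closed[0] else 0))

def flag_recovered_failures(episodes, max_attempts: int = 1):
    """Successful episodes with multiple grasp attempts are suspect."""
    return [grasp_attempts(ep) > max_attempts for ep in episodes]
```

This will not catch every recovery, but it catches the common one: the drop-and-regrasp that ends in task success and therefore never gets reviewed.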

Kinematic implausibility from latency. Network delays between the operator interface and the robot create a mismatch between intended and executed actions. The operator sends a smooth trajectory, but what arrives at the robot is a stuttered version with variable delays. The recorded data captures the stuttered reality, not the smooth intent. The policy learns to reproduce the stutter.
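Stutter of this kind shows up as variance in the inter-arrival times of executed commands. A minimal check, assuming command timestamps are recorded on the robot side:

```python
import numpy as np

def stutter_score(command_ts: np.ndarray) -> float:
    """Coefficient of variation of inter-command intervals.

    Near zero for a steady control rate; rises when network latency
    makes command arrival jittery."""
    intervals = np.diff(command_ts)
    return float(np.std(intervals) / np.mean(intervals))
```

A per-episode score like this lets you reject stuttered episodes at ingestion instead of discovering the stutter in your policy's output.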

Quantity does not fix this. It makes it worse.

The instinct when policies underperform is to collect more data. It feels right. More examples means more coverage means better generalization. And in some cases it works — when the problem is genuinely about state space coverage.

But if your existing data has quality problems, collecting more data with the same process gives you more data with the same quality problems. You are scaling your noise alongside your signal. And because the noise in robotics data is structured — operator fatigue follows a predictable pattern, teleoperator A always approaches from the left, that one camera mount is always slightly loose — scaling up does not wash it out. It reinforces it.

This is something the NLP and vision communities never had to reckon with at this level. When GPT was trained on internet text, the sheer volume of good text overwhelmed the garbage. Robotics does not have internet scale data. Teams are working with hundreds or thousands of demonstrations, not millions. At that scale, every bad episode has a measurable impact on policy performance.

The question is never "do we have enough data?" The question is "do we know which of our data is actually good?" If you cannot answer the second question, the first question does not matter.

What "good" actually means in robotics demonstrations

Defining data quality in robotics is harder than it sounds. In supervised learning, quality is simple: the label is either correct or incorrect. In robotics demonstrations, quality is a spectrum with multiple axes.

Task completion is necessary but wildly insufficient. A demonstration that completes the task but takes an erratic path, bumps into obstacles, or grasps the object in an unstable way is a "successful" episode that teaches bad habits.

Kinematic smoothness matters because it correlates with how learnable a trajectory is. Smooth demonstrations produce policies with smoother, more predictable outputs. Jerky demonstrations — especially those caused by teleoperator corrections — introduce high frequency noise into the action labels that makes the regression problem harder.

Strategic consistency is about whether the demonstration matches the implicit strategy you want the policy to learn. If you want a top down grasp, a demonstration that approaches from the side is not wrong in the task completion sense, but it is wrong for your policy's behavior distribution. Mixed strategies in the training set produce policies that commit to neither.

Sensor fidelity means that all recorded modalities actually reflect what happened. Camera frames are not dropped. Joint states are not stale. Timestamps are accurate. Force readings are not saturated. This is the boring, mechanical dimension of quality, and it is the one most often violated because nobody checks.
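These mechanical checks are the easiest to automate. A sketch of a per-episode fidelity report, with illustrative defaults (a 30 Hz camera, a 50 N force range):

```python
import numpy as np

def fidelity_report(frame_ts, joint_states, force,
                    expected_dt: float = 0.033, force_limit: float = 50.0):
    """Cheap mechanical checks on one episode's recorded streams."""
    gaps = np.diff(frame_ts)
    dropped = int(np.sum(gaps > 1.5 * expected_dt))        # missing frames
    # Consecutive identical joint readings indicate a stale publisher.
    stale = int(np.sum(np.all(np.diff(joint_states, axis=0) == 0, axis=1)))
    saturated = int(np.sum(np.abs(force) >= force_limit))  # clipped readings
    return {"dropped_frames": dropped, "stale_states": stale,
            "saturated_force": saturated}
```

None of these checks require a human to watch anything, which is exactly why they belong in the ingestion path rather than in a review queue.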

Informational contribution is the subtlest dimension. A demonstration that is perfectly executed but identical to fifty other demonstrations in the dataset is not adding value. It is reinforcing a region of the state space that is already well covered. The marginal demonstration should expand coverage, not duplicate it.
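One simple proxy for marginal contribution, assuming each episode is summarized by a feature vector (mean joint configuration, grasp pose, object position, or whatever you already log), is nearest-neighbor distance within the dataset. Near-zero novelty means the episode duplicates something you already have:

```python
import numpy as np

def novelty_scores(features: np.ndarray) -> np.ndarray:
    """Distance from each episode's feature vector to its nearest
    neighbor in the dataset (features: (n_episodes, d))."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # an episode is not its own neighbor
    return d.min(axis=1)
```

For a few thousand episodes the brute-force pairwise distance above is fine; past that, a KD-tree does the same job without the quadratic memory.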

Evaluating all of these axes manually does not scale. This is precisely where robotics data infrastructure earns its keep. A Knonik quality scoring pipeline can assess demonstrations across these dimensions automatically — flagging kinematic outliers, detecting sensor anomalies, scoring strategic consistency against the dataset's existing distribution, and ranking episodes by their marginal informational contribution.
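In the simplest case, a scoring pipeline like this reduces to a weighted aggregate of per-dimension scores. The dimension names and weights below are illustrative placeholders, not Knonik's actual scoring:

```python
def episode_quality(scores: dict, weights: dict = None) -> float:
    """Weighted aggregate of per-dimension quality scores, each in [0, 1].

    Dimensions and weights are hypothetical examples; a real pipeline
    would calibrate them against downstream policy performance."""
    weights = weights or {"completion": 0.30, "smoothness": 0.25,
                          "consistency": 0.20, "fidelity": 0.15,
                          "novelty": 0.10}
    return sum(weights[k] * scores[k] for k in weights)
```

The aggregate matters less than the components: a single scalar lets you rank and threshold episodes, but the per-dimension scores are what tell you which part of your collection process to fix.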

The collection process is the product

Most teams think of data collection as a means to an end. The collection itself is just a chore to get through.

This framing is backwards. In robotics, the data collection process is the core product of your team's effort. Every decision you make about collection directly shapes what your model can learn. How you set up the workspace. How you select and train teleoperators. How you structure sessions to manage fatigue. How you define what counts as a successful demonstration. These are not operational details. They are research decisions with first order effects on policy performance.

Consider the difference between a team that collects data in long, unstructured sessions (operator teleoperates for three hours, someone labels episodes as good or bad at the end of the day) and a team that collects in structured blocks (30 minute sessions with mandatory breaks, automated quality checks after every 10 episodes, real time feedback on kinematic smoothness). The second team will have a smaller dataset. It will also have a better policy. Every time.

The best robotics teams treat data collection like a manufacturing process, with quality control at every stage. They define acceptance criteria before collection starts. They instrument the pipeline to catch problems in real time, not after the fact. They version their datasets and track which collection sessions contributed to which policy checkpoints.

The hidden cost of curation by hand

The current industry standard for data quality in robotics is manual review. Someone watches a sample of episodes — maybe 10 to 20% — and flags the obviously bad ones. The rest are assumed to be fine.

This approach has three fatal problems.

First, it is slow. Reviewing even a few hundred episodes takes hours of human time. At thousands of episodes across multiple tasks, it becomes a full time job.

Second, it catches the wrong things. Humans are good at spotting dramatic failures: the robot drops the object, the arm collides with the table. Humans are terrible at spotting the subtle quality issues that actually matter most for policy learning — slight sensor desync, gradual kinematic degradation from fatigue, an approach angle that is 15 degrees off from the rest of the dataset. These are invisible to the eye but measurable in the data.

Third, it does not scale. When you go from one task to five tasks, from one robot to three robots, from one teleoperator to eight, manual curation collapses. You end up reviewing less of the dataset, catching less of the noise, and shipping worse policies as a result.

Automated quality assessment — the kind built into the Knonik robotics data infrastructure pipeline — inverts this. Every episode is evaluated. Outliers are flagged instantly, not days later. Trends are surfaced (operator B's quality degrades after 25 minutes; camera 2 is developing a timestamp drift) before they contaminate enough data to matter. And all of this happens as part of the ingestion pipeline, not as a separate manual step.

Data quality as competitive advantage

Here is the thing that most teams in this space have not fully internalized yet: at the current state of robotics, data quality is a stronger lever than model architecture.

Diffusion Policy, ACT, VLAs — all the architectures getting attention right now converge to similar performance given the same dataset. The research papers show marginal differences on standardized benchmarks. In practice, the thing that actually differentiates whether a team ships a working policy or spends three more months debugging is the data.

A smaller, clean dataset will outperform a larger, messy one. This has been demonstrated repeatedly in the research. And it has been demonstrated even more convincingly in the trenches by teams who cut their dataset in half by removing low quality episodes and saw their policy success rate improve.

This is not intuitive. It feels like throwing away data should make things worse. But in robot learning, data is not a commodity measured by volume. It is a signal measured by fidelity. Less data with higher fidelity trains better policies than more data with lower fidelity. Every time.

The unsexy truth about robot learning in 2026

Everyone wants to talk about model architectures. Foundation models for robotics. Cross embodiment transfer. Scaling laws for manipulation. These are real and important research directions.

But the teams that are actually deploying reliable manipulation policies today are not the ones with the best models. They are the ones that have solved the data problem. They collect with discipline. They assess quality automatically. They know exactly what is in their dataset and why. And when their policy fails, they can trace the failure back to a data issue and fix it, rather than retrain from scratch and hope for the best.

This is what Knonik is built for. Not another model. Not another simulator. Robotics data infrastructure that treats data quality as an engineering problem with measurable inputs and measurable outputs, not a vague aspiration. Because in robot learning, the data is the model. Everything else is just parameterization.

You do not have a model problem. You have a data problem. And you will keep having a data problem until you build, or adopt, the infrastructure to measure it.

Stop debugging your data pipeline. Start training.

Knonik is robotics data infrastructure built for teams that ship.