Classical vs. learned robotics

Perceive

Classical perception produces a state estimate. Each perception modality runs its own module — SLAM for visual localization, ICP for depth registration — and a fusion stage combines them into the state vector the planner consumes.

Learned perception produces a set of token embeddings in a shared latent space, one or more per sensor or modality. Each modality runs its own encoder (vision, tactile, proprio), and you likely have learned modules for combining data across modalities (cross-attention for vision, an MLP for proprio, and so on), but the end result is a set of token embeddings that the planner can consume.

The trade-off here is engineering vs. data. In a classical system, adding a new sensor means hand-wiring its fusion with the existing state and tracing how the change percolates through the rest of the stack, with strong tests, verifiability, and provenance for each change. In contrast, the learned path means dropping in a new encoder and either fine-tuning or retraining from scratch with (hopefully) enough data for the new encoder.

Classical

Input raw sensor data
Output state estimate (pose, velocities, object poses)

SLAM (simultaneous localization and mapping)
- Builds a map and tracks pose inside it.
- Example: lidar-equipped mobile base in an office.
Extended Kalman filter (EKF)
- Fuses motion-model predictions with sensor measurements.
- Example: IMU + wheel odometry on a differential-drive robot.
Iterative closest point (ICP)
- Aligns two point clouds by rigid transform.
- Example: depth scan registered to a CAD model before grasping.
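
The predict/correct cycle behind the EKF entry above can be sketched with its simplest special case. This is a one-dimensional linear Kalman filter, not a full EKF (no Jacobians, a single state variable), fusing an odometry-style motion prediction with a noisy position measurement. The function names, noise variances, and sample values are all illustrative assumptions.

```python
def kf_predict(x, p, u, q):
    """Propagate the estimate through the motion model x' = x + u."""
    return x + u, p + q          # uncertainty grows by the process noise q


def kf_update(x, p, z, r):
    """Correct the prediction with a direct position measurement z."""
    k = p / (p + r)              # Kalman gain: trust in measurement vs. prediction
    return x + k * (z - x), (1.0 - k) * p


def run_filter(controls, measurements, x0=0.0, p0=1.0, q=0.01, r=0.25):
    x, p = x0, p0
    history = []
    for u, z in zip(controls, measurements):
        x, p = kf_predict(x, p, u, q)
        x, p = kf_update(x, p, z, r)
        history.append((x, p))
    return history


# Assumed scenario: the base is commanded 0.1 m per step; measurements are noisy positions.
controls = [0.1] * 5
measurements = [0.12, 0.18, 0.33, 0.41, 0.48]
estimates = run_filter(controls, measurements)
```

Each step widens the variance during prediction and shrinks it during the update, which is the "fusion" the classical stack relies on.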

Learned

Input raw sensor data
Output token embeddings (one or more per modality)

Vision encoder
- Image patches → token embeddings.
- Example: DINOv2, PaliGemma vision tower.
Tactile encoder
- Force / torque readings → embedding.
- Example: small MLP over a 6-DOF F/T sensor.
Proprio projection
- Joint state → token.
- Example: linear projection over joint angles + velocities.
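
To make "raw sensor data in, token embeddings out" concrete, here is a minimal sketch of the two simplest encoders in the list: a patch projection for vision and a linear projection for proprio. The weights are random stand-ins for trained parameters, and the embedding width, image size, and helper names are assumptions chosen for illustration.

```python
import random

D_MODEL = 8  # embedding width, kept small for illustration


def make_linear(n_in, n_out, rng):
    """Random weights standing in for a trained projection."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def project(weights, x):
    """Apply y = W x, mapping one feature vector to one token embedding."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def patchify(image, patch):
    """Split an H x W grayscale image into flattened patch x patch tiles."""
    h, w = len(image), len(image[0])
    tiles = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            tiles.append([image[r + i][c + j] for i in range(patch) for j in range(patch)])
    return tiles


rng = random.Random(0)
image = [[(r * 4 + c) / 15.0 for c in range(4)] for r in range(4)]  # 4x4 toy image
joints = [0.1, -0.4, 0.7, 0.0, 0.2, -0.1]                            # 3 angles + 3 velocities

w_vision = make_linear(4, D_MODEL, rng)    # one 2x2 patch -> one token
w_proprio = make_linear(6, D_MODEL, rng)   # joint state -> one token

vision_tokens = [project(w_vision, tile) for tile in patchify(image, 2)]
proprio_token = project(w_proprio, joints)
tokens = vision_tokens + [proprio_token]   # the sequence handed to the policy
```

Adding a sensor in this picture means adding one more encoder and appending its tokens; the cost moves to training data rather than fusion code.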

Plan

Classical planning produces a reference trajectory. Each planner module does its work over the state estimate the perception stage produced and the output is the trajectory the controller consumes.

Learned planning lives in the policy's forward pass. The token embeddings from the encoders flow through the transformer, attention layers mix them, and the action head produces either the next action or a short chunk of future actions. The end result is the action sequence the controller consumes.

The trade-off here is the amount of engineering needed to reason over the state estimate explicitly vs. implicitly. Explicitly, methods like RRT sample collision-free paths, TrajOpt smooths under dynamics and torque constraints, and MPC re-solves a window of the trajectory at each step. Implicit action sequences from the forward pass feel like less work, but you've done that work elsewhere: data collection, imitation learning, collecting out-of-distribution data. Moreover, it's not clear that we know how to provide guarantees for the action sequences generated by learned planners, at least not to the same extent as for classical planners. There is some research on learning a critic that can provide safety guarantees, but it's not a solved problem.

A combined approach is to plan inside a learned model of the world and use a policy to generate the sequence of actions, but I'm still researching the efficacy and practicality of such systems.

Classical

Input state estimate and goal
Output reference trajectory (positions, velocities, timings)

RRT (rapidly-exploring random tree)
- Samples collision-free paths in configuration space.
- Example: 7-DoF arm reaching around a table obstacle.
Trajectory optimization (TrajOpt)
- Smooths a path under dynamics, torque, and collision constraints.
- Example: time-optimal joint motion for a pick cycle.
Model predictive control (MPC)
- Solves a constrained optimization each step.
- Example: legged robot adjusting footholds while walking.
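
As an illustration of the explicit style of planning, here is a minimal 2D RRT sketch. It assumes a point robot, a single circular obstacle, and a simplified collision check (the new node and the edge midpoint only); the constants, the fixed seed, and the function names are illustrative, not taken from any particular planner library.

```python
import math
import random

OBSTACLE = (5.0, 5.0, 2.0)   # circle: center x, center y, radius (assumed scene)
BOUNDS = (0.0, 10.0)
STEP = 0.5


def is_free(p):
    cx, cy, r = OBSTACLE
    return math.dist(p, (cx, cy)) > r


def steer(a, b, step=STEP):
    """Move from a toward b by at most `step`."""
    d = math.dist(a, b)
    if d <= step:
        return b
    t = step / d
    return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))


def rrt(start, goal, rng, max_iters=2000, goal_bias=0.1, tol=0.5):
    nodes, parent = [start], {0: None}
    for _ in range(max_iters):
        # Occasionally sample the goal itself to pull the tree toward it.
        target = goal if rng.random() < goal_bias else (
            rng.uniform(*BOUNDS), rng.uniform(*BOUNDS))
        near = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], target))
        new = steer(nodes[near], target)
        mid = ((nodes[near][0] + new[0]) / 2, (nodes[near][1] + new[1]) / 2)
        if not (is_free(new) and is_free(mid)):
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = near
        if math.dist(new, goal) <= tol:
            path, i = [], len(nodes) - 1
            while i is not None:          # walk back to the root
                path.append(nodes[i])
                i = parent[i]
            return path[::-1]
    return None


path = rrt((1.0, 1.0), (9.0, 9.0), random.Random(0))
```

Every waypoint in the result can be checked against the obstacle model after the fact, which is the kind of inspection a learned planner's action sequence does not offer directly.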

Learned

Input token embeddings (+ goal / instruction tokens)
Output action sequence (next action or a chunk)

Action chunking (ACT)
- Predicts a short sequence of future actions per forward pass.
- Example: ALOHA bimanual teleop policies.
Diffusion / flow-matching head
- Denoise (or flow) over the action distribution.
- Example: Diffusion Policy, π-0.
Tokenized autoregressive actions
- Discretize continuous actions, predict as LM tokens.
- Example: RT-2, OpenVLA.
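
The tokenized autoregressive entry above can be sketched as a pair of functions that map continuous actions to integer tokens and back. Uniform binning, the 256-bin count, and the per-dimension ranges are assumptions for illustration; real systems may choose bin edges differently.

```python
N_BINS = 256  # assumed vocabulary size per action dimension


def tokenize(action, low, high, n_bins=N_BINS):
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)                    # clip to the valid range
        idx = int((a - lo) / (hi - lo) * n_bins)
        tokens.append(min(idx, n_bins - 1))        # a == hi falls in the last bin
    return tokens


def detokenize(tokens, low, high, n_bins=N_BINS):
    """Recover the bin-center value for each token."""
    return [lo + (t + 0.5) * (hi - lo) / n_bins for t, lo, hi in zip(tokens, low, high)]


low, high = [-1.0] * 3, [1.0] * 3
action = [0.25, -1.0, 1.0]
tokens = tokenize(action, low, high)
recovered = detokenize(tokens, low, high)
```

The round trip loses at most half a bin width per dimension, which is the resolution cost of treating actions as language-model tokens.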

Execute

Classical execution produces motor commands. The controller takes the reference trajectory the planning stage produced, closes the loop on it at high frequency (500Hz to 1kHz), and drives the actuators directly.

Learned execution is usually still classical. The policy outputs a desired joint position or velocity at 4-50Hz, and a classical low-level controller (PID, impedance) drives the motors at 1kHz. The "learned" part stops at the action; from there a hand-written controller closes the inner loop.

Classical

Input reference trajectory and state
Output motor torques or velocities

PID control
- Commands from position/velocity error.
- Example: joint-level tracking at 500 Hz.
Computed torque control
- Inverse-dynamics feedforward plus feedback.
- Example: fast trajectory tracking on an industrial arm.
Impedance control
- Regulates end-effector force–motion relation.
- Example: peg-in-hole insertion with target contact stiffness.
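
A minimal sketch of the PID entry above, closing the loop at 500 Hz on a toy joint. The plant is an assumption made for simplicity (the velocity command integrates directly into position), and the gains are illustrative rather than tuned for any real actuator.

```python
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, target, measured):
        error = target - measured
        self.integral += error * self.dt
        # Skip the derivative on the first sample to avoid a startup kick.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(target=1.0, seconds=2.0, rate_hz=500):
    dt = 1.0 / rate_hz
    pid = PID(kp=10.0, ki=0.5, kd=0.05, dt=dt)
    position = 0.0
    for _ in range(int(seconds * rate_hz)):
        velocity_cmd = pid.step(target, position)
        position += velocity_cmd * dt   # toy joint: velocity integrates to position
    return position


final_position = simulate()
```

After two simulated seconds the joint sits close to the target; the whole controller is a few lines of inspectable arithmetic, which is why this layer tends to stay classical.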

Learned

Input action / action chunk from the policy
Output motor commands (usually via a classical inner loop)

Defer to classical
- Policy at 10-50Hz; classical PID/impedance closes the inner loop at 1kHz.
- Example: most deployed VLAs (RT-2, OpenVLA, π-0).
Residual policy
- Learned correction added on top of a classical controller's output.
- Example: industrial assembly tasks with a known nominal motion.
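
The residual-policy pattern above can be sketched as a classical command plus a bounded learned correction. The `residual_policy` function is a hypothetical placeholder for a trained network, and clipping the residual is an assumed design choice, not something the pattern requires.

```python
def nominal_pd(target, position, velocity, kp=20.0, kd=2.0):
    """Classical PD law that tracks the known nominal motion."""
    return kp * (target - position) - kd * velocity


def residual_policy(observation):
    """Hypothetical placeholder for a trained network; returns a fixed correction."""
    return 3.0


def command(target, position, velocity, max_residual=1.0):
    base = nominal_pd(target, position, velocity)
    residual = residual_policy((target, position, velocity))
    # Bound the learned term so the classical controller stays dominant.
    residual = max(-max_residual, min(max_residual, residual))
    return base + residual


u = command(target=0.5, position=0.4, velocity=0.1)
```

With the bound in place, the worst the learned part can do is shift the command by `max_residual`, so the classical controller's behavior remains the baseline.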

Kinds of learned systems

These don't slot into a single stage — each spans the whole pipeline. They're also not mutually exclusive; most contemporary systems combine more than one.

Reinforcement learning

In observation
Out action

Policy gradient (PPO, SAC)
- Reward-maximizing policy trained in sim.
Reward shaping
- Dense rewards train but bake in assumptions; sparse rewards rarely train end-to-end.
Sim-to-real
- Domain randomization narrows the gap; doesn't close it.
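
The reward-shaping point can be illustrated with two reward functions for a reaching task. Both are sketches under assumed names and tolerances: the sparse one only signals success, while the dense one encodes the assumption that reducing straight-line distance is always progress.

```python
import math


def sparse_reward(gripper, goal, tol=0.02):
    """1 only on success; says nothing about progress before that."""
    return 1.0 if math.dist(gripper, goal) <= tol else 0.0


def dense_reward(gripper, goal):
    """Negative distance: assumes straight-line progress is always good."""
    return -math.dist(gripper, goal)


goal = (0.5, 0.0, 0.2)
far, near, done = (0.0, 0.0, 0.2), (0.4, 0.0, 0.2), (0.5, 0.01, 0.2)

sparse = [sparse_reward(p, goal) for p in (far, near, done)]
dense = [dense_reward(p, goal) for p in (far, near, done)]
```

The sparse signal is identical for `far` and `near`, so it gives the learner nothing to climb; the dense signal separates them, but it would also reward pushing straight through an obstacle that sits between the gripper and the goal.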

Behavioural cloning

In observation
Out action chunk

Naive BC
- Observation→action regression; compounding error drifts off-distribution.
Action chunking (ACT)
- Predicts a short sequence of future actions per forward pass.
Diffusion policies
- Denoising over the action distribution; robust to multi-modal demos.
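
One way to execute overlapping action chunks is to query the policy every step and blend every prediction that covers the current timestep. The sketch below uses exponential weights exp(-m * i) with the oldest prediction weighted most, which is my reading of the temporal-ensembling scheme described for ACT; treat the exact weighting, the value of `m`, and the toy policy as assumptions.

```python
import math


def ensemble_action(predictions, m=0.1):
    """Weighted mean of the predictions for one timestep, ordered oldest first."""
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total


def rollout(policy, steps, chunk):
    """Query the policy every step and blend all chunks that cover each timestep."""
    buffer = {}                               # timestep -> predictions, oldest first
    executed = []
    for t in range(steps):
        for offset, action in enumerate(policy(t, chunk)):
            buffer.setdefault(t + offset, []).append(action)
        executed.append(ensemble_action(buffer.pop(t)))
    return executed


def toy_policy(t, chunk):
    """Stand-in for a trained model: a ramp with a query-dependent offset."""
    bias = 0.1 if t % 2 else -0.1
    return [0.5 * (t + i) + bias for i in range(chunk)]


actions = rollout(toy_policy, steps=6, chunk=3)
```

Because each executed action averages predictions made at different times, the alternating offsets of the toy policy partly cancel, which is the smoothing effect chunking with ensembling is meant to provide.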

Vision-language-action models

In image + language
Out action sequence

Co-training
- Robot trajectories + internet vision-language data; transfers for perception, debated for actions.
RT-2 / OpenVLA
- VLM backbone + tokenized autoregressive actions.
π-0 / π-0.5
- VLM backbone + flow-matching continuous-action head.
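
A flow-matching action head samples by integrating a velocity field from noise to an action. The sketch below shows only that sampling loop, with a hand-written velocity field standing in for the learned network; the time convention (noise at t=0, action at t=1), the step count, and the `target` vector are assumptions for illustration.

```python
import random


def velocity_field(action, t, target):
    """Toy stand-in for the learned network: the straight-line conditional flow."""
    return [(g - a) / (1.0 - t) for a, g in zip(action, target)]


def sample_action(target, rng, n_steps=10):
    """Euler-integrate da/dt = v(a, t) from noise at t=0 to an action at t=1."""
    action = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt                            # t never reaches 1, so no division by zero
        v = velocity_field(action, t, target)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action


# `target` plays the role of the action a trained model would produce for this observation.
target = [0.3, -0.2, 0.05]
action = sample_action(target, random.Random(0))
```

In a real model the velocity field is a network conditioned on the VLM's token embeddings, and the output stays continuous, which is the contrast with the tokenized heads of RT-2 and OpenVLA.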

Hybrid systems

In varies (image, language, observation)
Out action sequence

SFT + RL post-training
- Imitation-trained policy fine-tuned with online RL.
- Example: π-0 → π-0.6 via πRL.
Dual-system
- Slow deliberative model + fast reactive controller.
- Example: Helix S1/S2.
Learned + classical control
- Learned policy outputs to a classical inner-loop controller.
- Standard for deployment.

Contact, deployment, distribution

Out-of-distribution

Both classical and learned systems break when the state is out of distribution. The remedies differ: classical methods counter OOD states with code changes to the planner or controller, while learned methods counter them with new training data and a new model deployment. The real question is whether your task is OOD-heavy or not; some level of generalization can be achieved with both architectures.

Multi-tier deployment

The pattern emerging on the deployment side is a multi-tier architecture organized by time scale, not by paradigm. Figure's published Helix architecture is the cleanest example: a slow deliberative system (S2) at a few Hz running alongside a fast reactive system (S1) at ~200Hz, both learned, both on-device. In actual deployments, that two-tier structure tends to grow a third — a classical inner-loop controller — giving roughly:

Slow tier (seconds)
- Cloud-hosted or large on-device model.
- Strategic reasoning, replanning after failures, language-grounded task decomposition.
Middle tier (10-100ms)
- On-device policy generating actions.
- The "what should I do next" decision, whether VLA, BC policy, or classical planner.
Fast tier (<1ms) for safety
- Inner-loop controller closing the loop on actuators.
- Almost always classical — PID, impedance, hard joint/velocity limits, watchdog timers.
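
The three tiers above can be sketched as one loop ticking at the fastest rate, with the slower tiers running every Nth tick. The rates, the fixed subgoal, the proportional "policy", and the velocity limit are all illustrative assumptions; a real system would run the tiers in separate processes or on separate hardware.

```python
def run_tiers(duration_s=1.0, fast_hz=1000, mid_hz=50, slow_hz=1, max_speed=1.0):
    ticks = int(duration_s * fast_hz)
    mid_every, slow_every = fast_hz // mid_hz, fast_hz // slow_hz
    counts = {"slow": 0, "middle": 0, "fast": 0}
    subgoal, action, position = 0.0, 0.0, 0.0
    dt = 1.0 / fast_hz

    for tick in range(ticks):
        if tick % slow_every == 0:            # slow tier: choose a subgoal
            subgoal = 0.5
            counts["slow"] += 1
        if tick % mid_every == 0:             # middle tier: policy picks a velocity
            action = 4.0 * (subgoal - position)
            counts["middle"] += 1
        # Fast tier: enforce a hard velocity limit, then drive the toy actuator.
        safe = max(-max_speed, min(max_speed, action))
        position += safe * dt
        counts["fast"] += 1
    return counts, position


counts, position = run_tiers()
```

The fast tier holds the last action between middle-tier updates and clips it regardless of what the upper tiers asked for, which is the sense in which the safety-critical layer stays classical.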

Also, one underappreciated question is how you go about reconciling these action sequences across the slow and middle tiers.

References

Vision encoders: DINOv2, PaliGemma
RL algorithms: PPO, SAC
Imitation learning: ACT / ALOHA, Diffusion Policy
VLAs: RT-2, OpenVLA, π-0, π-0.5, Helix (Figure)
Benchmarks / sims: LIBERO, Isaac Gym, MuJoCo, Drake
Tooling: LeRobot
Related on this site: Physical AI Foundation Models, explorer