Quick Summary

ACT (Action Chunking with Transformers, Zhao et al. 2023) uses a CVAE to produce action chunks — sequences of 20-100 future actions predicted at once. It is fast (50ms inference on a single GPU), deterministic given a latent sample, and works well with 50-500 demonstrations. Diffusion Policy (Chi et al. 2023) uses a conditional denoising diffusion model over the action space. It is slower (200-500ms with DDPM, 50ms with consistency distillation), explicitly multi-modal, and handles precision tasks and multi-step tasks better with more data.

Inference Latency: The Practical Constraint

For real-time robot control, inference latency is a hard constraint that often determines algorithm choice before any other consideration.

| Algorithm | Inference Time | Control Rate | Notes |
| --- | --- | --- | --- |
| ACT | 50 ms | 20 Hz | Executes 20-step chunks; inference runs once per chunk, well below the control rate |
| Diffusion Policy (DDPM) | 500 ms | 2 Hz | Too slow for reactive tasks without chunking |
| Diffusion Policy (DDIM, 10 steps) | 200 ms | 5 Hz | Acceptable for slow manipulation |
| Consistency Policy | 50 ms | 20 Hz | Single denoising step; matches ACT latency |
| ACT + temporal ensemble | 50 ms | 20 Hz | Smooths across overlapping chunk predictions |

If your task requires reactive high-frequency control (catching, pouring, high-speed assembly), standard Diffusion Policy DDPM is simply too slow. Use ACT or Consistency Policy. If your task involves slow, precision manipulation where 200ms latency is acceptable, DDIM Diffusion Policy is viable and may produce better results.
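As a back-of-the-envelope check, the latency figures in the table translate directly into a bound on the replanning rate. A minimal sketch (the helper names here are hypothetical, not part of any library):

```python
def control_rate_hz(inference_ms: float) -> float:
    """Upper bound on the closed-loop replanning rate if the policy
    is re-queried after every inference call."""
    return 1000.0 / inference_ms

def meets_budget(inference_ms: float, required_hz: float) -> bool:
    """True if the policy can replan at least `required_hz` times per second."""
    return control_rate_hz(inference_ms) >= required_hz

# A reactive task needing 20 Hz replanning rules out 500 ms DDPM inference:
print(meets_budget(50, 20))    # True  (ACT / Consistency Policy)
print(meets_budget(500, 20))   # False (standard DDPM)
```

Chunked execution loosens this bound (the robot keeps executing a chunk while the next inference runs), which is why ACT reaches 20 Hz control despite replanning less often.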

Task Type Recommendations

  • Fast reactive tasks (catching, stirring, high-speed pick-place): ACT or Consistency Policy. The 20Hz control rate allows real-time adaptation. Diffusion DDPM at 2Hz cannot respond to moving objects.
  • Precision tasks with multiple viable solutions (peg-in-hole, USB insertion, cloth folding): Diffusion Policy. Its explicit multi-modality allows it to model the distribution over successful grasp strategies, not just the mean — which is critical for precision tasks where the mean strategy often fails.
  • Long-horizon tasks (more than roughly 10 sequential subtasks): Neither algorithm alone is sufficient. Use a hierarchical policy: a task planner selects subtask sequences, then ACT or Diffusion Policy executes individual subtasks.
  • Tasks where object position varies significantly: Both work, but Diffusion Policy tends to generalize better to novel object positions when trained with sufficient data.

Training Data Requirements

| Algorithm | Minimum Demos | Recommended | Training Time (single GPU) |
| --- | --- | --- | --- |
| ACT | 50 | 100–500 | 2–4 hours |
| Diffusion Policy (DDPM) | 200 | 500–2000 | 6–12 hours |
| Consistency Policy | 100 | 300–1000 | 4–8 hours |

ACT's lower data requirement is a genuine advantage for new task exploration where collecting 500+ demos is expensive. However, Diffusion Policy often catches up and surpasses ACT when more data is available — particularly for precision tasks. If you have a data budget above 1,000 demonstrations, seriously evaluate Diffusion Policy.

Hyperparameter Sensitivity

ACT's critical hyperparameter is the KL divergence weight in the CVAE loss. Too high, and the posterior collapses onto the prior: z carries no information and the policy regresses to the mean action. Too low, and the posterior drifts far from the prior: the policy leans on z to memorize demonstrations and behaves poorly when z is drawn from the prior at test time. Standard recommendation: start at 10 and sweep 1, 5, 10, 50 on a small dataset before full training.
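Where the weight enters can be sketched in a few lines (numpy; diagonal-Gaussian posterior against a standard-normal prior, L1 reconstruction as in ACT; the function names are illustrative, not LeRobot's API):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

def act_loss(pred_actions, gt_actions, mu, logvar, kl_weight=10.0):
    """L1 reconstruction plus the weighted KL regularizer."""
    recon = np.mean(np.abs(pred_actions - gt_actions))
    return recon + kl_weight * kl_to_standard_normal(mu, logvar)

# With mu = 0 and logvar = 0 the KL term vanishes, leaving pure reconstruction:
mu, logvar = np.zeros(32), np.zeros(32)
pred, gt = np.zeros((50, 7)), np.ones((50, 7))
print(act_loss(pred, gt, mu, logvar))  # 1.0
```

Sweeping `kl_weight` shifts how hard this second term pulls the posterior toward the prior.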

Diffusion Policy is more sensitive to learning rate schedule and noise variance schedule. The original DDPM implementation uses a cosine noise schedule with linear warmup LR; these defaults work well in practice. The most common mistake is using too high a learning rate (>3e-4 with Adam), which causes training instability on contact-rich data.
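Both schedules can be sketched directly (numpy; the alpha-bar form follows the standard cosine-schedule parameterization, and a plain linear warmup stands in for the full warmup-plus-cosine LR schedule; names are illustrative):

```python
import numpy as np

def cosine_alpha_bar(T=100, s=0.008):
    """Cumulative signal fraction alpha_bar_t for a cosine noise schedule."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]  # alpha_bar_0 == 1, decreasing toward ~0 at t == T

def warmup_lr(step, base_lr=1e-4, warmup_steps=500):
    """Linear warmup to base_lr, constant afterwards (cosine decay omitted)."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)

ab = cosine_alpha_bar()
assert ab[0] == 1.0 and np.all(np.diff(ab) < 0)  # strictly decreasing
```

Keeping `base_lr` at or below 1e-4 here reflects the instability warning above: with Adam, rates above roughly 3e-4 tend to destabilize training on contact-rich data.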

Decision Matrix

| Condition | Recommended Algorithm |
| --- | --- |
| < 200 demos available | ACT |
| Reactive/high-speed task | ACT or Consistency Policy |
| Precision with multi-modal solutions | Diffusion Policy |
| > 500 demos, slow task | Diffusion Policy |
| Deployment on low-compute hardware | ACT (lighter model) |
| Want to use HuggingFace LeRobot | Both supported |

Both algorithms are available as reference implementations in the SVRC data platform, with pre-built training configs for OpenArm 101 and common camera setups.

Under the Hood: ACT Architecture

ACT (Action Chunking with Transformers) combines a CVAE (Conditional Variational Autoencoder) with a transformer architecture to predict action chunks: sequences of 20-100 future actions from a single observation.

Encoder: The visual encoder processes camera images through a pre-trained backbone (typically ResNet-18 or a ViT variant). Joint state (proprioception) is concatenated with the visual features. The combined observation is fed to a transformer encoder that produces a context embedding.

CVAE: During training, the CVAE encoder takes the ground-truth action sequence together with the observation and produces a latent variable z. At inference time there is no action sequence to encode, so z comes from the prior; in the original ACT implementation the prior is a standard normal and z is set to its mean (zero), making inference deterministic. The latent z captures the "style" of the action: different z values produce different but valid strategies for completing the task. The KL divergence term in the CVAE loss keeps the posterior close to the prior, which is what makes drawing z from the prior valid at test time.
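A minimal sketch of the two regimes (numpy; names are illustrative; note that the original ACT implementation simply zeroes z at test time, i.e. uses the prior mean):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z_train(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
    so gradients flow through mu and logvar during training."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def sample_z_test(latent_dim):
    """ACT-style deterministic inference: use the prior mean, z = 0."""
    return np.zeros(latent_dim)

z_train = sample_z_train(np.zeros(32), np.zeros(32))  # stochastic
z_test = sample_z_test(32)                            # deterministic
assert np.all(z_test == 0.0)
```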

Decoder: The transformer decoder takes the context embedding and the latent z and generates the entire action chunk in a single forward pass, using a fixed set of learned position queries (DETR-style) rather than autoregressive token-by-token decoding as in language models. Each output position predicts one action (typically 7D: a 6D end-effector delta plus a 1D gripper command). This single-pass decoding is why ACT achieves 50ms inference.

Temporal ensemble: At deployment, ACT runs inference at every timestep rather than once per chunk, so each action is covered by several overlapping chunk predictions. These are averaged with exponential weights exp(-m * i); in the original implementation i = 0 indexes the oldest prediction, so older predictions are weighted more heavily, and the decay parameter m (typically 0.01) controls how quickly newer observations are incorporated. This smooths transitions between chunks and reduces jitter at chunk boundaries, trading smoothness against responsiveness.
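The averaging can be sketched as follows (numpy; this follows the original ACT convention w_i = exp(-m * i) with i = 0 the oldest prediction, and `temporal_ensemble` is an illustrative name):

```python
import numpy as np

def temporal_ensemble(preds, m=0.01):
    """Blend overlapping predictions of the same action step.

    `preds` is ordered oldest first; weight exp(-m * i) gives the
    oldest prediction (i = 0) the highest weight. m = 0 reduces to a
    plain average; large m trusts the oldest prediction almost entirely."""
    preds = np.asarray(preds, dtype=float)
    w = np.exp(-m * np.arange(len(preds)))
    w /= w.sum()
    return (w[:, None] * preds).sum(axis=0)

# With m = 0 the two predictions are averaged equally:
assert np.allclose(temporal_ensemble([[0.0, 0.0], [2.0, 2.0]], m=0.0), [1.0, 1.0])
```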

Under the Hood: Diffusion Policy Architecture

Diffusion Policy applies the DDPM (Denoising Diffusion Probabilistic Model) framework to robot action prediction. Instead of generating images from noise (as in image diffusion models), it generates action sequences from noise.

Forward process: During training, noise is progressively added to the ground-truth action sequence over T diffusion steps (typically T=100 for DDPM). At step T, the action is pure Gaussian noise. The model learns to reverse this process: given the noisy action at step t and the observation, predict the noise that was added (epsilon prediction) or the clean action directly.
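The forward process has a closed form, a_t = sqrt(alpha_bar_t) * a_0 + sqrt(1 - alpha_bar_t) * eps, so any step t can be sampled directly during training (numpy sketch, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(a0, alpha_bar_t):
    """Sample a_t ~ q(a_t | a_0) in closed form for one diffusion step t."""
    eps = rng.standard_normal(a0.shape)
    a_t = np.sqrt(alpha_bar_t) * a0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return a_t, eps  # the network learns to predict eps from (a_t, t, obs)

a0 = np.zeros((16, 7))                     # a 16-step, 7-DoF action chunk
a_t, eps = add_noise(a0, alpha_bar_t=0.0)  # at alpha_bar = 0: pure noise
assert np.allclose(a_t, eps)
```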

Network architecture: The denoising network can be either a 1D U-Net (convolutional, operating over the temporal dimension of the action sequence) or a transformer. The U-Net variant is the original and most tested; the transformer variant (DP-T, Diffusion Policy Transformer) is gaining adoption for its better scaling properties. Both take as input: the noisy action sequence, the diffusion timestep t (encoded as a sinusoidal embedding), and the observation encoding from the visual backbone.

Inference: Starting from pure noise, the model iteratively denoises over T steps to produce a clean action sequence. DDPM uses T=100 steps (slow). DDIM (Denoising Diffusion Implicit Models) reduces this to 10-20 steps with minimal quality loss. Consistency distillation collapses the entire denoising process to a single step, matching ACT's inference speed at the cost of additional training complexity.
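A deterministic (eta = 0) DDIM sampler looks roughly like this (numpy sketch; a toy "oracle" denoiser stands in for the trained epsilon-prediction network, and all names are illustrative):

```python
import numpy as np

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update (eta = 0): estimate the clean
    sample, then re-noise it to the previous (less noisy) level."""
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred

def ddim_sample(denoise_fn, shape, alpha_bars, rng):
    """Run the reverse process from pure noise down to a clean sample."""
    x = rng.standard_normal(shape)
    for t in range(len(alpha_bars) - 1, 0, -1):
        x = ddim_step(x, denoise_fn(x, t), alpha_bars[t], alpha_bars[t - 1])
    return x

# Sanity check with an oracle denoiser whose clean target is all-zeros:
# given x_t = sqrt(1 - ab_t) * eps, the exact noise is x_t / sqrt(1 - ab_t).
alpha_bars = np.array([1.0, 0.7, 0.3, 0.05])
oracle = lambda x, t: x / np.sqrt(1.0 - alpha_bars[t])
out = ddim_sample(oracle, (16, 7), alpha_bars, np.random.default_rng(0))
assert np.allclose(out, 0.0)
```

Shortening `alpha_bars` from ~100 entries to 10-20 is exactly the DDPM-to-DDIM speedup described above.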

Multi-modality: The core advantage of Diffusion Policy over ACT is explicit multi-modal action distribution modeling. When a task has multiple valid strategies (approach from left vs. right, grasp from top vs. side), the diffusion model can represent all valid modes in the learned distribution. ACT's CVAE can also represent multiple modes through the latent space, but the KL regularization tends to collapse modes in practice, especially with limited training data. Diffusion Policy preserves modes more reliably because the denoising process can converge to different modes from different noise initializations.

Hybrid Approaches: Combining ACT and Diffusion Policy

Several hybrid approaches have emerged that combine the strengths of both algorithms:

  • DP-T (Diffusion Policy Transformer): Replaces the U-Net denoising network with a transformer, gaining better scaling with action sequence length and observation complexity. DP-T with DDIM 10-step inference achieves 100ms latency (between ACT and standard DP) with multi-modal action prediction. This is increasingly the default choice for teams with sufficient compute.
  • Consistency Policy: Distills a trained Diffusion Policy into a single-step generator using consistency training. The result matches ACT's 50ms inference while retaining Diffusion Policy's multi-modal action distribution. The tradeoff: consistency training adds 50-100% to total training time and requires careful hyperparameter tuning of the consistency loss weight.
  • ACT with multi-head prediction: Instead of a single CVAE decoder, use K decoder heads that each predict a different action chunk. At inference, select the head whose predicted chunk has the lowest reconstruction error against the most recent observation. This provides discrete multi-modality (K modes) at ACT-level inference speed. Practical K values are 3-5.
  • Hierarchical DP + ACT: Use Diffusion Policy for high-level skill selection (which grasp strategy, which approach direction) at low frequency (1-2 Hz), and ACT for low-level trajectory execution at high frequency (20 Hz). This combines DP's multi-modal planning with ACT's fast reactive control.
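The multi-head selection step from the third bullet can be sketched generically (numpy; the scoring criterion is left abstract because implementations differ, and all names here are hypothetical):

```python
import numpy as np

def select_head(chunks, score_fn):
    """Pick one of K predicted chunks by minimizing a scoring function.

    `chunks` has shape (K, horizon, action_dim); `score_fn` maps one
    chunk to a scalar, e.g. a reconstruction error against a reference."""
    scores = [score_fn(c) for c in chunks]
    return chunks[int(np.argmin(scores))]

# Toy example: prefer the chunk whose first action is closest to a target.
chunks = np.stack([np.full((50, 7), v) for v in (0.0, 1.0, 2.0)])
target = np.full(7, 0.9)
best = select_head(chunks, lambda c: float(np.sum((c[0] - target) ** 2)))
assert np.allclose(best, 1.0)
```

Each head corresponds to one discrete mode; K stays small (3-5) because inference cost grows with the number of heads evaluated.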

Training Tips for Both Algorithms

```yaml
# Common training configuration patterns for LeRobot

# ACT config (lerobot/configs/policy/act.yaml)
# Key hyperparameters to tune:
#   chunk_size: 50-100 (longer = smoother but less reactive)
#   kl_weight: 10 (sweep: 1, 5, 10, 50)
#   lr: 1e-4 (reduce to 5e-5 if training is unstable)
#   batch_size: 32-64
#   n_epochs: 2000 (for 200 demos; scale proportionally)

# Diffusion Policy config (lerobot/configs/policy/diffusion.yaml)
# Key hyperparameters:
#   n_diffusion_steps: 100 (DDPM) or 10 (DDIM at inference)
#   prediction_type: "epsilon" (noise prediction; more stable than "sample")
#   lr: 1e-4 with cosine schedule
#   batch_size: 64-128 (DP benefits from larger batches more than ACT)
#   n_epochs: 3000 (for 200 demos; DP converges more slowly than ACT)

# Both algorithms:
#   - Use action normalization (zero mean, unit variance per dimension)
#   - Use image augmentation (random crop to 84-96% of the frame, color jitter)
#   - Monitor validation loss every 50 epochs; save the best checkpoint
#   - Use an EMA (exponential moving average) of the weights with 0.9999 decay
```
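The EMA recommendation in the last lines amounts to a one-line update per parameter (numpy sketch; real frameworks keep a shadow copy of every parameter tensor, and the names here are illustrative):

```python
import numpy as np

def ema_update(shadow, params, decay=0.9999):
    """In-place EMA: shadow <- decay * shadow + (1 - decay) * params.
    The shadow weights, not the raw weights, are used at evaluation time."""
    for k in params:
        shadow[k] = decay * shadow[k] + (1.0 - decay) * params[k]
    return shadow

params = {"w": np.ones(3)}
shadow = {"w": np.zeros(3)}
ema_update(shadow, params, decay=0.9)
assert np.allclose(shadow["w"], 0.1)
```

At decay 0.9999 the shadow weights average over roughly the last 10,000 updates, which damps the step-to-step noise of diffusion training in particular.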

Related Reading