Quick Summary

ACT (Action Chunking with Transformers, Zhao et al. 2023) uses a CVAE to produce action chunks — sequences of 20-100 future actions predicted at once. It is fast (50ms inference on a single GPU), deterministic given a latent sample, and works well with 50-500 demonstrations. Diffusion Policy (Chi et al. 2023) uses a conditional denoising diffusion model over the action space. It is slower (200-500ms with DDPM, 50ms with consistency distillation), explicitly multi-modal, and handles precision tasks and multi-step tasks better with more data.

Inference Latency: The Practical Constraint

For real-time robot control, inference latency is a hard constraint that often determines algorithm choice before any other consideration.

| Algorithm | Inference Time | Control Rate | Notes |
| --- | --- | --- | --- |
| ACT | 50 ms | 20 Hz | Executes 20-step chunks; inference runs once per chunk, well below the control rate |
| Diffusion Policy (DDPM) | 500 ms | 2 Hz | Too slow for reactive tasks without chunking |
| Diffusion Policy (DDIM, 10 steps) | 200 ms | 5 Hz | Acceptable for slow manipulation |
| Consistency Policy | 50 ms | 20 Hz | Single denoising step; matches ACT latency |
| ACT + temporal ensemble | 50 ms | 20 Hz | Smooths across overlapping chunk predictions |

If your task requires reactive high-frequency control (catching, pouring, high-speed assembly), standard Diffusion Policy DDPM is simply too slow. Use ACT or Consistency Policy. If your task involves slow, precision manipulation where 200ms latency is acceptable, DDIM Diffusion Policy is viable and may produce better results.
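As a back-of-the-envelope check, the latency figures in the table translate directly into a bound on the replanning rate. A minimal sketch (the helper names here are hypothetical, not part of any library):

```python
def control_rate_hz(inference_ms: float) -> float:
    """Upper bound on the closed-loop replanning rate if the policy
    is re-queried after every inference call."""
    return 1000.0 / inference_ms

def meets_budget(inference_ms: float, required_hz: float) -> bool:
    """True if the policy can replan at least `required_hz` times per second."""
    return control_rate_hz(inference_ms) >= required_hz

# A reactive task needing 20 Hz replanning rules out 500 ms DDPM inference:
print(meets_budget(50, 20))    # True  (ACT / Consistency Policy)
print(meets_budget(500, 20))   # False (standard DDPM)
```

Chunked execution loosens this bound (the robot keeps executing a chunk while the next inference runs), which is why ACT reaches 20 Hz control despite replanning less often.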

Task Type Recommendations

  • Fast reactive tasks (catching, stirring, high-speed pick-place): ACT or Consistency Policy. The 20Hz control rate allows real-time adaptation. Diffusion DDPM at 2Hz cannot respond to moving objects.
  • Precision tasks with multiple viable solutions (peg-in-hole, USB insertion, cloth folding): Diffusion Policy. Its explicit multi-modality allows it to model the distribution over successful grasp strategies, not just the mean — which is critical for precision tasks where the mean strategy often fails.
  • Long-horizon tasks (more than roughly 10 sequential subtasks): Neither algorithm alone is sufficient. Use a hierarchical policy: a task planner selects subtask sequences, then ACT or Diffusion Policy executes individual subtasks.
  • Tasks where object position varies significantly: Both work, but Diffusion Policy tends to generalize better to novel object positions when trained with sufficient data.

Training Data Requirements

| Algorithm | Minimum Demos | Recommended | Training Time (single GPU) |
| --- | --- | --- | --- |
| ACT | 50 | 100–500 | 2–4 hours |
| Diffusion Policy (DDPM) | 200 | 500–2000 | 6–12 hours |
| Consistency Policy | 100 | 300–1000 | 4–8 hours |

ACT's lower data requirement is a genuine advantage for new task exploration where collecting 500+ demos is expensive. However, Diffusion Policy often catches up and surpasses ACT when more data is available — particularly for precision tasks. If you have a data budget above 1,000 demonstrations, seriously evaluate Diffusion Policy.

Hyperparameter Sensitivity

ACT's critical hyperparameter is the KL divergence weight in the CVAE loss. Too high, and the posterior collapses onto the prior: z carries no information and the policy regresses to the mean action. Too low, and the posterior drifts far from the prior: the policy leans on z to memorize demonstrations and behaves poorly when z is drawn from the prior at test time. Standard recommendation: start at 10 and sweep 1, 5, 10, 50 on a small dataset before full training.
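Where the weight enters can be sketched in a few lines (numpy; diagonal-Gaussian posterior against a standard-normal prior, L1 reconstruction as in ACT; the function names are illustrative, not LeRobot's API):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

def act_loss(pred_actions, gt_actions, mu, logvar, kl_weight=10.0):
    """L1 reconstruction plus the weighted KL regularizer."""
    recon = np.mean(np.abs(pred_actions - gt_actions))
    return recon + kl_weight * kl_to_standard_normal(mu, logvar)

# With mu = 0 and logvar = 0 the KL term vanishes, leaving pure reconstruction:
mu, logvar = np.zeros(32), np.zeros(32)
pred, gt = np.zeros((50, 7)), np.ones((50, 7))
print(act_loss(pred, gt, mu, logvar))  # 1.0
```

Sweeping `kl_weight` shifts how hard this second term pulls the posterior toward the prior.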

Diffusion Policy is more sensitive to learning rate schedule and noise variance schedule. The original DDPM implementation uses a cosine noise schedule with linear warmup LR; these defaults work well in practice. The most common mistake is using too high a learning rate (>3e-4 with Adam), which causes training instability on contact-rich data.
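Both schedules can be sketched directly (numpy; the alpha-bar form follows the standard cosine-schedule parameterization, and a plain linear warmup stands in for the full warmup-plus-cosine LR schedule; names are illustrative):

```python
import numpy as np

def cosine_alpha_bar(T=100, s=0.008):
    """Cumulative signal fraction alpha_bar_t for a cosine noise schedule."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]  # alpha_bar_0 == 1, decreasing toward ~0 at t == T

def warmup_lr(step, base_lr=1e-4, warmup_steps=500):
    """Linear warmup to base_lr, constant afterwards (cosine decay omitted)."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)

ab = cosine_alpha_bar()
assert ab[0] == 1.0 and np.all(np.diff(ab) < 0)  # strictly decreasing
```

Keeping `base_lr` at or below 1e-4 here reflects the instability warning above: with Adam, rates above roughly 3e-4 tend to destabilize training on contact-rich data.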

Decision Matrix

| Condition | Recommended Algorithm |
| --- | --- |
| < 200 demos available | ACT |
| Reactive/high-speed task | ACT or Consistency Policy |
| Precision with multi-modal solutions | Diffusion Policy |
| > 500 demos, slow task | Diffusion Policy |
| Deployment on low-compute hardware | ACT (lighter model) |
| Want to use HuggingFace LeRobot | Both supported |

Both algorithms are available as reference implementations in the SVRC data platform, with pre-built training configs for OpenArm 101 and common camera setups.

Under the Hood: ACT Architecture

ACT (Action Chunking with Transformers) combines a CVAE (Conditional Variational Autoencoder) with a transformer architecture to predict action chunks: sequences of 20-100 future actions from a single observation.

Encoder: The visual encoder processes camera images through a pre-trained backbone (typically ResNet-18 or a ViT variant). Joint state (proprioception) is concatenated with the visual features. The combined observation is fed to a transformer encoder that produces a context embedding.

CVAE: During training, the CVAE encoder takes the ground-truth action sequence together with the observation and produces a latent variable z. At inference time there is no action sequence to encode, so z comes from the prior; in the original ACT implementation the prior is a standard normal and z is set to its mean (zero), making inference deterministic. The latent z captures the "style" of the action: different z values produce different but valid strategies for completing the task. The KL divergence term in the CVAE loss keeps the posterior close to the prior, which is what makes drawing z from the prior valid at test time.
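A minimal sketch of the two regimes (numpy; names are illustrative; note that the original ACT implementation simply zeroes z at test time, i.e. uses the prior mean):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z_train(mu, logvar):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
    so gradients flow through mu and logvar during training."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def sample_z_test(latent_dim):
    """ACT-style deterministic inference: use the prior mean, z = 0."""
    return np.zeros(latent_dim)

z_train = sample_z_train(np.zeros(32), np.zeros(32))  # stochastic
z_test = sample_z_test(32)                            # deterministic
assert np.all(z_test == 0.0)
```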

Decoder: The transformer decoder takes the context embedding and the latent z and generates the entire action chunk in a single forward pass, using a fixed set of learned position queries (DETR-style) rather than autoregressive token-by-token decoding as in language models. Each output position predicts one action (typically 7D: a 6D end-effector delta plus a 1D gripper command). This single-pass decoding is why ACT achieves 50ms inference.

Temporal ensemble: At deployment, ACT runs inference at every timestep rather than once per chunk, so each action is covered by several overlapping chunk predictions. These are averaged with exponential weights exp(-m * i); in the original implementation i = 0 indexes the oldest prediction, so older predictions are weighted more heavily, and the decay parameter m (typically 0.01) controls how quickly newer observations are incorporated. This smooths transitions between chunks and reduces jitter at chunk boundaries, trading smoothness against responsiveness.
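The averaging can be sketched as follows (numpy; this follows the original ACT convention w_i = exp(-m * i) with i = 0 the oldest prediction, and `temporal_ensemble` is an illustrative name):

```python
import numpy as np

def temporal_ensemble(preds, m=0.01):
    """Blend overlapping predictions of the same action step.

    `preds` is ordered oldest first; weight exp(-m * i) gives the
    oldest prediction (i = 0) the highest weight. m = 0 reduces to a
    plain average; large m trusts the oldest prediction almost entirely."""
    preds = np.asarray(preds, dtype=float)
    w = np.exp(-m * np.arange(len(preds)))
    w /= w.sum()
    return (w[:, None] * preds).sum(axis=0)

# With m = 0 the two predictions are averaged equally:
assert np.allclose(temporal_ensemble([[0.0, 0.0], [2.0, 2.0]], m=0.0), [1.0, 1.0])
```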

Under the Hood: Diffusion Policy Architecture

Diffusion Policy applies the DDPM (Denoising Diffusion Probabilistic Model) framework to robot action prediction. Instead of generating images from noise (as in image diffusion models), it generates action sequences from noise.

Forward process: During training, noise is progressively added to the ground-truth action sequence over T diffusion steps (typically T=100 for DDPM). At step T, the action is pure Gaussian noise. The model learns to reverse this process: given the noisy action at step t and the observation, predict the noise that was added (epsilon prediction) or the clean action directly.
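The forward process has a closed form, a_t = sqrt(alpha_bar_t) * a_0 + sqrt(1 - alpha_bar_t) * eps, so any step t can be sampled directly during training (numpy sketch, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(a0, alpha_bar_t):
    """Sample a_t ~ q(a_t | a_0) in closed form for one diffusion step t."""
    eps = rng.standard_normal(a0.shape)
    a_t = np.sqrt(alpha_bar_t) * a0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return a_t, eps  # the network learns to predict eps from (a_t, t, obs)

a0 = np.zeros((16, 7))                     # a 16-step, 7-DoF action chunk
a_t, eps = add_noise(a0, alpha_bar_t=0.0)  # at alpha_bar = 0: pure noise
assert np.allclose(a_t, eps)
```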

Network architecture: The denoising network can be either a 1D U-Net (convolutional, operating over the temporal dimension of the action sequence) or a transformer. The U-Net variant is the original and most tested; the transformer variant (DP-T, Diffusion Policy Transformer) is gaining adoption for its better scaling properties. Both take as input: the noisy action sequence, the diffusion timestep t (encoded as a sinusoidal embedding), and the observation encoding from the visual backbone.

Inference: Starting from pure noise, the model iteratively denoises over T steps to produce a clean action sequence. DDPM uses T=100 steps (slow). DDIM (Denoising Diffusion Implicit Models) reduces this to 10-20 steps with minimal quality loss. Consistency distillation collapses the entire denoising process to a single step, matching ACT's inference speed at the cost of additional training complexity.
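A deterministic (eta = 0) DDIM sampler looks roughly like this (numpy sketch; a toy "oracle" denoiser stands in for the trained epsilon-prediction network, and all names are illustrative):

```python
import numpy as np

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update (eta = 0): estimate the clean
    sample, then re-noise it to the previous (less noisy) level."""
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred

def ddim_sample(denoise_fn, shape, alpha_bars, rng):
    """Run the reverse process from pure noise down to a clean sample."""
    x = rng.standard_normal(shape)
    for t in range(len(alpha_bars) - 1, 0, -1):
        x = ddim_step(x, denoise_fn(x, t), alpha_bars[t], alpha_bars[t - 1])
    return x

# Sanity check with an oracle denoiser whose clean target is all-zeros:
# given x_t = sqrt(1 - ab_t) * eps, the exact noise is x_t / sqrt(1 - ab_t).
alpha_bars = np.array([1.0, 0.7, 0.3, 0.05])
oracle = lambda x, t: x / np.sqrt(1.0 - alpha_bars[t])
out = ddim_sample(oracle, (16, 7), alpha_bars, np.random.default_rng(0))
assert np.allclose(out, 0.0)
```

Shortening `alpha_bars` from ~100 entries to 10-20 is exactly the DDPM-to-DDIM speedup described above.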

Multi-modality: The core advantage of Diffusion Policy over ACT is explicit multi-modal action distribution modeling. When a task has multiple valid strategies (approach from left vs. right, grasp from top vs. side), the diffusion model can represent all valid modes in the learned distribution. ACT's CVAE can also represent multiple modes through the latent space, but the KL regularization tends to collapse modes in practice, especially with limited training data. Diffusion Policy preserves modes more reliably because the denoising process can converge to different modes from different noise initializations.

Hybrid Approaches: Combining ACT and Diffusion Policy

Several hybrid approaches have emerged that combine the strengths of both algorithms:

  • DP-T (Diffusion Policy Transformer): Replaces the U-Net denoising network with a transformer, gaining better scaling with action sequence length and observation complexity. DP-T with DDIM 10-step inference achieves 100ms latency (between ACT and standard DP) with multi-modal action prediction. This is increasingly the default choice for teams with sufficient compute.
  • Consistency Policy: Distills a trained Diffusion Policy into a single-step generator using consistency training. The result matches ACT's 50ms inference while retaining Diffusion Policy's multi-modal action distribution. The tradeoff: consistency training adds 50-100% to total training time and requires careful hyperparameter tuning of the consistency loss weight.
  • ACT with multi-head prediction: Instead of a single CVAE decoder, use K decoder heads that each predict a different action chunk. At inference, select the head whose predicted chunk has the lowest reconstruction error against the most recent observation. This provides discrete multi-modality (K modes) at ACT-level inference speed. Practical K values are 3-5.
  • Hierarchical DP + ACT: Use Diffusion Policy for high-level skill selection (which grasp strategy, which approach direction) at low frequency (1-2 Hz), and ACT for low-level trajectory execution at high frequency (20 Hz). This combines DP's multi-modal planning with ACT's fast reactive control.
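The multi-head selection step from the third bullet can be sketched generically (numpy; the scoring criterion is left abstract because implementations differ, and all names here are hypothetical):

```python
import numpy as np

def select_head(chunks, score_fn):
    """Pick one of K predicted chunks by minimizing a scoring function.

    `chunks` has shape (K, horizon, action_dim); `score_fn` maps one
    chunk to a scalar, e.g. a reconstruction error against a reference."""
    scores = [score_fn(c) for c in chunks]
    return chunks[int(np.argmin(scores))]

# Toy example: prefer the chunk whose first action is closest to a target.
chunks = np.stack([np.full((50, 7), v) for v in (0.0, 1.0, 2.0)])
target = np.full(7, 0.9)
best = select_head(chunks, lambda c: float(np.sum((c[0] - target) ** 2)))
assert np.allclose(best, 1.0)
```

Each head corresponds to one discrete mode; K stays small (3-5) because inference cost grows with the number of heads evaluated.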

Training Tips for Both Algorithms

```yaml
# Common training configuration patterns for LeRobot

# ACT config (lerobot/configs/policy/act.yaml)
# Key hyperparameters to tune:
#   chunk_size: 50-100 (longer = smoother but less reactive)
#   kl_weight: 10 (sweep: 1, 5, 10, 50)
#   lr: 1e-4 (reduce to 5e-5 if training is unstable)
#   batch_size: 32-64
#   n_epochs: 2000 (for 200 demos; scale proportionally)

# Diffusion Policy config (lerobot/configs/policy/diffusion.yaml)
# Key hyperparameters:
#   n_diffusion_steps: 100 (DDPM) or 10 (DDIM at inference)
#   prediction_type: "epsilon" (noise prediction; more stable than "sample")
#   lr: 1e-4 with cosine schedule
#   batch_size: 64-128 (DP benefits from larger batches more than ACT)
#   n_epochs: 3000 (for 200 demos; DP converges more slowly than ACT)

# Both algorithms:
#   - Use action normalization (zero mean, unit variance per dimension)
#   - Use image augmentation (random crop to 84-96% of the frame, color jitter)
#   - Monitor validation loss every 50 epochs; save the best checkpoint
#   - Use an EMA (exponential moving average) of the weights with 0.9999 decay
```
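The EMA recommendation in the last lines amounts to a one-line update per parameter (numpy sketch; real frameworks keep a shadow copy of every parameter tensor, and the names here are illustrative):

```python
import numpy as np

def ema_update(shadow, params, decay=0.9999):
    """In-place EMA: shadow <- decay * shadow + (1 - decay) * params.
    The shadow weights, not the raw weights, are used at evaluation time."""
    for k in params:
        shadow[k] = decay * shadow[k] + (1.0 - decay) * params[k]
    return shadow

params = {"w": np.ones(3)}
shadow = {"w": np.zeros(3)}
ema_update(shadow, params, decay=0.9)
assert np.allclose(shadow["w"], 0.1)
```

At decay 0.9999 the shadow weights average over roughly the last 10,000 updates, which damps the step-to-step noise of diffusion training in particular.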

Related Reading