CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion

arXiv ID: 2512.16023

Authors: Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Ziyuan Liu, Abhinav Valada

Organization: University of Freiburg; Ludwig Maximilian University of Munich; MCML; Technical University of Munich; Huawei Heisenberg Research Center (Munich)

Submission time: 2025-12-17

Source: arXiv abs · PDF · Local LaTeX source code analysis

Code/project page: No official public code or project homepage link was found on the arXiv page or in the LaTeX source.

One-sentence summary: CoVAR extends the pre-trained video diffusion model OpenSora-1.2 into a joint "video + action" generator: it retains the video DiT backbone, attaches an Action DiT in parallel, and uses Bridge Attention to let the two modalities exchange information. The goal is not only to generate good-looking future videos but also to generate action sequences that can directly drive a robot.

1. Reading orientation and group meeting guide

Item	What does the paper address?	What to focus on when reading
Research object	Given an initial image, the robot's initial joint state, and a language instruction, the model simultaneously generates a future video and frame-by-frame actions.	Unlike a traditional VLA, it does not regress actions directly; it treats action as a modality co-generated with the video.
Core assumption	The pre-trained video diffusion model already holds a useful prior on visual dynamics, and the action branch should exploit this prior without destroying it.	Check whether the dual-branch design and Bridge Attention are really more reasonable than a single joint DiT.
Main contributions	A parallel Action DiT, Bridge Attention, and an Action Refinement Model for low-resolution scenes.	Judge "architectural novelty" and "experimental gain" separately: which modules are actually supported by ablation?
Potential impact	Provides a route for action-free video data to enter robot policy learning: first learn the video world, then add action generation.	Note that it still relies on expert action data to train the joint model, and it has not been shown that large-scale action-free video directly improves the policy.

Recommended reading order for the group: Read Fig. 1/2 first to clarify the structural differences between the two-stage model, the single joint model, and CoVAR; then read Eq. (1)(2)(3) and the ablation in Tab. IV. Don't get caught up in the details of video diffusion at first; the key question in this paper is how to benefit the action branch while retaining the video prior.

This paper lies at the intersection of embodied video diffusion, world models, and robot policy learning. The tension it addresses is a typical one: a video diffusion model can learn visual dynamics from large amounts of video without action labels, but a robot policy needs actions. If the video is generated first and inverse dynamics is then used to infer the actions, video errors propagate to the actions, and action inference degrades further when the robot arm or end effector is not visible. If video and action are instead packed directly into one joint DiT, the video generation capability already learned may be sacrificed.

2. Problem background: Why video generation does not amount to a robot policy

2.1 Problems with the two-stage approach

Two-stage methods typically generate a future video from the initial observation and the language goal, and then train an inverse dynamics model or policy network to convert the video plan into actions. The idea is intuitive, but it has two bottlenecks:

Error cascade: Slight deviations in object position, hand pose, or contact timing in the generated video are amplified by the subsequent action regression.
Visibility dependence: If the robot body is occluded or cropped in the video, or the camera viewpoint is inadequate, it is difficult to recover the true joint commands from visual changes alone.

2.2 Problems with early fusion of joint models

Another family of methods concatenates video tokens and action tokens and feeds them into the same DiT for joint denoising. The advantage is direct information sharing; the disadvantage is that the modalities are very different: video is a high-dimensional spatiotemporal visual latent, while actions are low-dimensional continuous control vectors. The paper argues that when expert demonstration data is limited, forcing a single DiT to adapt to both modalities at once may interfere with the visual knowledge of the pre-trained video model.

2.3 A middle path for CoVAR

CoVAR's compromise is "dual branches + controlled communication": the video branch reuses the pre-trained OpenSora-1.2 video DiT, and the action branch adds a new Action DiT. The two are not fully isolated; they exchange information through Bridge Attention at each location that requires interaction. This avoids the post-hoc action inference of pure two-stage methods and also avoids a single joint DiT rewriting the video prior too aggressively.

Figure 1: The paper's structural comparison of the two-stage model, joint model and CoVAR. The key to CoVAR is to connect Action DiT in parallel while allowing the action branch to access the information of the video branch through Bridge Attention.

3. Method breakdown: Multi-modal Rectified Flow + Dual DiT + Bridge Attention

3.1 Input and output definition

The model input includes the initial observation image $v_0 \in \mathbb{R}^{3 \times H \times W}$, the initial robot joint state $a_0 \in \mathbb{R}^{L}$ and the language command $c$. The output is the future video $v \in \mathbb{R}^{T \times 3 \times H \times W}$ and the paired action sequence $a \in \mathbb{R}^{T \times L}$. The action here is not inferred after generating the video, but is co-generated with the video during the diffusion/flow process.

$v_0$	Initial RGB observation image.
$a_0$	Initial joint state, dimension is $L$.
$c$	Natural language task instructions.
$v$	The generated future video sequence is of length $T$.
$a$	The generated action sequences are time-aligned with the video frames.

3.2 Multi-modal Rectified Flow

The paper uses rectified flow to model the generation path of the joint modalities. Let the joint data sample be $X_0=(x_0^1, x_0^2)\sim \Pi_{data}$, where $x_0^1$ is the video latent and $x_0^2$ is the action. The noise endpoint is $X_1=(x_1^1, x_1^2)\sim(\mathcal{N}(0, I_{d_1}), \mathcal{N}(0, I_{d_2}))$. The ODE corresponding to the linear path is:

$$\frac{dX_t}{dt}=X_1-X_0, \quad t\in[0, 1].$$

Intuition:

Rectified flow learns a velocity field along straight paths between data and noise; sampling follows that field from noise back to data. The multimodal version simply treats the video latent and the action together as one joint state.
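
For reference, the straight path behind the ODE can be written out explicitly. This is the standard rectified-flow interpolation, stated here as a reading aid rather than quoted from the paper:

$$X_t=(1-t)X_0+tX_1, \quad t\in[0, 1] \quad\Longrightarrow\quad \frac{dX_t}{dt}=X_1-X_0.$$

At $t=0$ the state is a data sample and at $t=1$ it is pure noise, so generation runs the ODE from $t=1$ down to $t=0$.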

A neural network $v_\theta=(v_\theta^1, v_\theta^2)$ predicts the vector fields of the two modalities, and the training loss is:

$$L=\|x_1^1-x_0^1-v_\theta^1\|_2+\|x_1^2-x_0^2-v_\theta^2\|_2.$$

Reading reminder:

The formula in the paper simply sums the errors of the two modalities and does not state whether loss weights, normalization, or time sampling are shared between them. For reproduction, this detail would need to be confirmed from code, but no official code link has been found so far.
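
To make the open questions concrete, here is a minimal NumPy sketch of the summed two-modality loss. It is not the authors' implementation: `v_theta` is a hypothetical callable standing in for the dual-branch network, the shared per-sample `t` is an assumption, and the `w_video`/`w_action` weights are hypothetical knobs that default to the unweighted sum written above.

```python
import numpy as np

def rectified_flow_loss(x0_video, x0_action, v_theta, rng, w_video=1.0, w_action=1.0):
    """Summed rectified-flow loss over video and action for one batch (sketch).

    v_theta: hypothetical callable (xt_video, xt_action, t) -> (pred_video, pred_action).
    """
    batch = x0_video.shape[0]
    # Noise endpoints X_1, drawn independently per modality.
    x1_video = rng.standard_normal(x0_video.shape)
    x1_action = rng.standard_normal(x0_action.shape)
    # Assumption: a single t per sample, shared by both modalities.
    t = rng.uniform(size=batch)
    t_v = t.reshape(batch, *([1] * (x0_video.ndim - 1)))
    t_a = t.reshape(batch, *([1] * (x0_action.ndim - 1)))
    # Linear interpolation between data (t=0) and noise (t=1).
    xt_video = (1.0 - t_v) * x0_video + t_v * x1_video
    xt_action = (1.0 - t_a) * x0_action + t_a * x1_action
    pred_video, pred_action = v_theta(xt_video, xt_action, t)
    # Per-sample L2 norm of (x1 - x0 - prediction), averaged over the batch.
    err_video = (x1_video - x0_video - pred_video).reshape(batch, -1)
    err_action = (x1_action - x0_action - pred_action).reshape(batch, -1)
    loss_video = np.linalg.norm(err_video, axis=1).mean()
    loss_action = np.linalg.norm(err_action, axis=1).mean()
    return w_video * loss_video + w_action * loss_action
```

Because the video latent has far more dimensions than the action vector, the two norms can differ greatly in scale, which is why the weighting and normalization questions matter for reproduction.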

3.3 Model architecture: retain the video backbone and connect Action DiT in parallel

CoVAR is built on OpenSora-1.2. The video branch retains the pre-trained video diffusion backbone; the action branch uses a parallel Action DiT. Because action data is low-dimensional, the authors do not train a VAE for actions and instead use a lightweight MLP encoder to obtain action embeddings. The Action DiT also receives the text instruction $c$ through cross-attention, giving a conditional generation structure symmetric to that of the Video DiT.
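
As a rough illustration of what a lightweight MLP action encoder could look like, the sketch below maps each per-frame action vector to one token. The two-layer layout, the ReLU activation, and all sizes are assumptions chosen for illustration; the paper does not specify them.

```python
import numpy as np

def init_action_encoder(action_dim, hidden_dim, embed_dim, rng):
    """Random parameters for a two-layer MLP (hypothetical sizes)."""
    return {
        "w1": rng.standard_normal((action_dim, hidden_dim)) / np.sqrt(action_dim),
        "b1": np.zeros(hidden_dim),
        "w2": rng.standard_normal((hidden_dim, embed_dim)) / np.sqrt(hidden_dim),
        "b2": np.zeros(embed_dim),
    }

def encode_actions(params, actions):
    """Map actions of shape (B, T, L) to tokens of shape (B, T, C), one per frame."""
    hidden = np.maximum(actions @ params["w1"] + params["b1"], 0.0)  # ReLU
    return hidden @ params["w2"] + params["b2"]
```

With one token per frame, the action sequence stays time-aligned with the video frames, which is what later allows attention between the two token sets.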

Figure 2: CoVAR overview. (A) Pre-trained video DiT + parallel Action DiT; (B) Bridge Attention is responsible for cross-modal communication; (C) Action Refinement Module is used on low-resolution data sets.

3.4 Bridge Attention

The goal of Bridge Attention is to let the two modalities interact while each keeps its own representation space. Let the video features be $f_v \in \mathbb{R}^{B\times N_v\times C}$ and the action features be $f_a \in \mathbb{R}^{B\times N_a\times C}$. Unlike standard self-attention, which applies one shared set of Q/K/V projections to the concatenated tokens, Bridge Attention parameterizes the query, key, and value projections separately for video and action:

$$ \begin{bmatrix}f_v\\f_a\end{bmatrix} = \mathrm{Attention}\left( \begin{bmatrix}q_1 f_v\\q_2 f_a\end{bmatrix}, \begin{bmatrix}k_1 f_v\\k_2 f_a\end{bmatrix}, \begin{bmatrix}v_1 f_v\\v_2 f_a\end{bmatrix} \right). $$

Intuition:

It's like "meeting after each translates into their own Q/K/V language". Projections within modalities remain independent, but the attention matrix still allows video tokens and action tokens to read each other.

The paper compares it with two alternative communication schemes: plain self-attention over all concatenated tokens, and bidirectional cross-attention. The ablation shows that Bridge Attention is better in both video quality and real-world task success rate.
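
The equation above can be sketched in NumPy as follows. This is an illustrative single-head version without output projection, residual connection, or normalization, all of which are simplifying assumptions. The parameter names `q1`, `k1`, `v1`, `q2`, `k2`, `v2` mirror the equation's symbols.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_bridge_params(channels, rng):
    """One (C, C) matrix per projection: index 1 is video, index 2 is action."""
    scale = 1.0 / np.sqrt(channels)
    names = ["q1", "k1", "v1", "q2", "k2", "v2"]
    return {n: rng.standard_normal((channels, channels)) * scale for n in names}

def bridge_attention(f_v, f_a, p):
    """f_v: (B, N_v, C), f_a: (B, N_a, C). Returns the updated (f_v, f_a)."""
    n_v = f_v.shape[1]
    # Modality-specific projections, then concatenation along the token axis.
    q = np.concatenate([f_v @ p["q1"], f_a @ p["q2"]], axis=1)
    k = np.concatenate([f_v @ p["k1"], f_a @ p["k2"]], axis=1)
    v = np.concatenate([f_v @ p["v1"], f_a @ p["v2"]], axis=1)
    # One joint attention matrix: every token can attend to both modalities.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    out = softmax(scores, axis=-1) @ v
    # Split the result back into video and action tokens.
    return out[:, :n_v], out[:, n_v:]
```

Setting `q2 = q1`, `k2 = k1`, and `v2 = v1` recovers plain self-attention over concatenated tokens, so in this sketch the only difference from that baseline is the separate projections.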

3.5 Action decoder and Action Refinement

The authors emphasize that the action decoder is critical to action accuracy and training convergence. CoVAR uses a UNet as the action decoder instead of the more common MLP or ResNet. The paper's explanation is that the UNet's multi-scale processing is better suited to capturing the hierarchical motion structure in temporal action sequences.

For low-resolution datasets such as Libero90, the paper additionally uses an Action Refinement Model. The base CoVAR first generates coarse actions; the refinement module then takes the coarse actions, the initial image tokens, and the text condition, and turns the coarse actions into finer controls. This module is critical on Libero90: the success rate without refinement is far lower than that of the full model (for example, 0.592 vs. 0.873 on pick-and-place).

Figure 3: Visualization of action refinement. Without refinement, the action only captures the general trend; with refinement, the trajectory is better suited to fine grasping and placement.

3.6 Training/inference pseudocode

Training phase:
for each demo (v0, a0, instruction c, future video v, action sequence a):
    encode video into video latent x0_video
    encode action into action embedding x0_action
    sample Gaussian noise x1_video, x1_action
    sample time t
    interpolate joint state Xt between X0 and X1
    Video DiT predicts video flow with text/image conditions
    Action DiT predicts action flow with text/joint-state conditions
    Bridge Attention exchanges video/action information
    optimize video flow loss + action flow loss

Inference phase:
given current observation, current joint state, instruction:
    initialize video/action noise
    integrate learned rectified flow for 30 sampling steps
    decode video frames and actions
    if low-resolution setting: refine coarse actions
    interpolate generated 35-frame open-loop actions to 100 Hz robot control
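
The "integrate learned rectified flow" step can be illustrated with a plain Euler integrator. This is a sketch under assumptions: the paper states 30 sampling steps but not the solver, `velocity_fn` is a hypothetical stand-in for the trained dual-branch network, and `point_target_velocity` is a toy field used only to check the integrator.

```python
import numpy as np

def sample_rectified_flow(velocity_fn, x1, num_steps=30):
    """Integrate dX/dt = v(X, t) from t=1 (noise) down to t=0 (data) with Euler steps."""
    times = np.linspace(1.0, 0.0, num_steps + 1)
    x = np.array(x1, dtype=float)
    for t_cur, t_next in zip(times[:-1], times[1:]):
        # (t_next - t_cur) is negative, so each step moves from noise toward data.
        x = x + (t_next - t_cur) * velocity_fn(x, t_cur)
    return x

def point_target_velocity(target):
    """Exact velocity field when the data distribution is a single point (toy case).

    On the straight path X_t = (1 - t) * target + t * X_1, the velocity
    X_1 - target equals (X_t - target) / t.
    """
    def velocity(x, t):
        return (x - target) / t
    return velocity

# Usage: starting from Gaussian noise, the toy field is integrated back to the target.
target = np.array([0.5, -1.0, 2.0])
noise = np.random.default_rng(0).standard_normal(3)
sample = sample_rectified_flow(point_target_velocity(target), noise, num_steps=30)
```

In CoVAR the state would be the joint video latent and action embedding rather than a single small vector, with both parts updated in the same loop.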

4. Experiments and results: video quality, action success rate and ablation

4.1 Dataset and training settings

Dataset	Size/Features	CoVAR settings
CALVIN	About 20k teleoperated demonstrations, with text instructions, video resolution 200×200.	Training setting ABCD, randomly generate 200 novel test scenes for rollout.
Libero90	90 tasks, 50 expert demonstrations per task, video resolution 128×128.	Use action refinement; the refinement model is fine-tuned with 450 video-action pairs.
Real dataset	1K demos collected by the authors, including bowl stacking, nut/screw/dowel pick-and-place, etc.	Resolution 180×320; UR5 platform; one video-action pair is generated per 35-frame segment.

The model has about 1.4B parameters in total: 1.1B in the video diffusion part and 0.3B in the new modules. Training takes about 1 day on 4 GPUs. For real-robot inference, the number of rectified flow sampling steps is set to 30, and generating one 35-frame video-action pair takes about 4 seconds; because the robot is controlled at 100 Hz, the generated open-loop action sequence must be interpolated.
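
The interpolation step can be sketched with per-dimension linear interpolation. This is an assumption-laden illustration: the paper does not state the interpolation scheme or the frame rate of the generated chunk, so `source_hz` is a hypothetical parameter and the 10 Hz value in the usage lines is only a placeholder.

```python
import numpy as np

def upsample_actions(actions, source_hz, control_hz=100.0):
    """Linearly interpolate a (T, L) action chunk from source_hz to control_hz."""
    actions = np.asarray(actions, dtype=float)
    num_frames = actions.shape[0]
    duration = (num_frames - 1) / source_hz          # seconds spanned by the chunk
    num_out = int(round(duration * control_hz)) + 1  # samples at the control rate
    t_src = np.linspace(0.0, duration, num_frames)
    t_out = np.linspace(0.0, duration, num_out)
    # np.interp handles one dimension at a time, so loop over action dimensions.
    columns = [np.interp(t_out, t_src, actions[:, j]) for j in range(actions.shape[1])]
    return np.stack(columns, axis=1)

# Usage with a placeholder 35-frame, 6-dimensional chunk (both sizes illustrative).
chunk = np.linspace(0.0, 1.0, 35)[:, None] * np.ones((1, 6))
dense = upsample_actions(chunk, source_hz=10.0)
```

Linear interpolation is the simplest choice; a real controller might prefer a smoother scheme such as splines to limit joint acceleration, which is again outside what the paper specifies.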

4.2 Video quality

Video quality is measured with PSNR, SSIM, LPIPS and FVD. On CALVIN and Libero90, CoVAR outperforms the joint-model baselines UVA, PAD, and UWM on all four metrics and is close to the pure video model OpenSora-1.2. This supports the authors' core claim: adding the action modality does not significantly damage the video generation capability of the pre-trained video model.

Dataset	method	PSNR ↑	SSIM ↑	LPIPS ↓	FVD ↓
CALVIN	UVA	19.01	0.758	0.180	97.90
	PAD	18.72	0.734	0.174	83.40
	UWM	18.04	0.730	0.181	85.85
	OpenSora	19.60	0.768	0.171	61.00
	CoVAR	19.95	0.766	0.156	72.42
Libero90	UVA	19.57	0.716	0.154	86.21
	PAD	19.65	0.781	0.218	98.39
	UWM	19.87	0.735	0.212	87.83
	OpenSora	20.18	0.817	0.156	63.33
	CoVAR	20.09	0.826	0.143	70.64

Figure 4: Comparison of generated video quality. The authors claim that CoVAR produces fewer artifacts than joint-model baselines and maintains better robot and object consistency.

4.3 Action success rate

The action evaluation reflects the paper's value more directly. On CALVIN, CoVAR outperforms UVA/UWM/PAD/Unipi on all five tasks: drawer, cabinet, light, pick, and push. On Libero90, the full CoVAR is substantially better than the version without refinement, indicating that refinement is not a decorative add-on but one of the main sources of success in low-resolution scenes.

CALVIN method	Drawer	Cabinet	Light	Pick	Push
UVA	0.875	0.667	0.711	0.758	0.785
UWM	0.813	0.733	0.644	0.576	0.714
PAD	0.781	0.467	0.489	0.485	0.642
Unipi	0.469	0.267	0.289	0.182	0.452
CoVAR	1.000	0.800	0.867	0.909	0.929

Libero90 method	Pick-and-place	Open/Close	Combination
UVA	0.676	0.640	0.489
UWM	0.606	0.600	0.400
PAD	0.625	0.480	0.355
CoVAR w/o refinement	0.592	0.520	0.422
CoVAR	0.873	0.860	0.711

Real method	Nut	Screw	Dowel
Unipi	0.00	0.06	0.02
RoboEnvision	0.04	0.10	0.12
CoVAR	0.64	0.74	0.70
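
The CALVIN and Libero90 success-rate tables above can be summarized with a few lines of arithmetic. The numbers are copied from the tables; the unweighted mean over tasks is an assumption made here for illustration, since the number of trials per task is not given.

```python
# Success rates copied from the CALVIN table (Drawer, Cabinet, Light, Pick, Push).
calvin = {
    "UVA": [0.875, 0.667, 0.711, 0.758, 0.785],
    "UWM": [0.813, 0.733, 0.644, 0.576, 0.714],
    "PAD": [0.781, 0.467, 0.489, 0.485, 0.642],
    "Unipi": [0.469, 0.267, 0.289, 0.182, 0.452],
    "CoVAR": [1.000, 0.800, 0.867, 0.909, 0.929],
}
# Libero90 rows (Pick-and-place, Open/Close, Combination).
libero = {
    "CoVAR w/o refinement": [0.592, 0.520, 0.422],
    "CoVAR": [0.873, 0.860, 0.711],
}

def mean(values):
    return sum(values) / len(values)

# Assumption: every task counts equally in the average.
calvin_mean = {method: round(mean(rates), 3) for method, rates in calvin.items()}

# Absolute gain from the refinement module, per Libero90 task group.
refinement_gain = [
    round(full - coarse, 3)
    for full, coarse in zip(libero["CoVAR"], libero["CoVAR w/o refinement"])
]
```

Under this unweighted averaging, CoVAR reaches about 0.901 on CALVIN against about 0.759 for the strongest baseline (UVA), and refinement adds roughly 0.28 to 0.34 absolute success on each Libero90 task group.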

Real-world generated videos and rollouts

Figure 5: Generated video and rollout in a real robot experiment. The paper uses sequential pick-and-place to illustrate that video-action alignment carries over to physical execution.

4.4 Ablation experiment

The ablation is performed on the real dataset collected by the authors. Bridge Attention, the UNet action head, and the video branch all contribute significantly to the results. In particular, removing the video branch drops the action success rate to 0.08, indicating that the action branch does not simply learn a policy from language and the initial image but relies strongly on the dynamics information and pre-training prior provided by the video branch.

Variants	PSNR ↑	SSIM ↑	LPIPS ↓	FVD ↓	Success ↑
w/o BA (SA)	16.83	0.693	0.255	137.66	0.32
w/o BA (CA)	16.56	0.645	0.263	145.26	0.20
w/o UNet	16.85	0.690	0.255	141.62	0.24
w/o video	-	-	-	-	0.08
CoVAR	17.67	0.736	0.238	133.89	0.68

Figure 6: Ablation visualization. Red lines are ground-truth actions and blue lines are generated actions; trajectories for the alternative attention variants (w/o BA) are also shown.

5. Close reading of figures and tables

5.1 Fig. 1: What the paper really wants to say is not "joint generation" itself

The three-column comparison in Fig. 1 is critical. The problem with the two-stage model is that video and action are not aligned end to end; the problem with the joint model is that all modalities are mixed too early in the same DiT; CoVAR's position is that "actions need the video prior, but they should not overwrite the video backbone". The technical core of the paper is therefore not yet another multi-modal diffusion model, but a communication structure designed for the asymmetric video/action modalities in robotics.

5.2 Fig. 2: Positioning of Action DiT

Action DiT is not a small inverse dynamics head. It takes part in rectified flow denoising and generates the complete action sequence together with the video branch. The action branch reads the text directly and reads video branch information through Bridge Attention. This positioning explains why the w/o video variant has a very low success rate: the video branch does not merely output visualizations, it also serves as an intermediate representation carrying the dynamics prior.

5.3 Fig. 7: Video-action trajectory alignment

Figure 7: Generating video-action pairs. The red line in the figure marks the ground truth, and the blue line marks the generated action, showing trajectory matching on different data sets and platforms.

This diagram is most suitable for discussing the relationship between "video quality indicators" and "action executability" in group meetings. Just because a video looks reasonable does not automatically mean that the action is executable; the advantage of CoVAR is that it makes action generation directly constrained by video dynamics modeling, rather than just estimating actions from video after the fact.

5.4 How to read tables: Don't just read the bold text

In the video quality table, CoVAR does not always beat OpenSora; in particular, OpenSora is better on FVD on both datasets. This is reasonable: OpenSora is a pure video generation model, whereas CoVAR must also generate actions. What the paper actually needs to show is that CoVAR produces better videos than the joint-model baselines while also achieving a higher action success rate. By this standard, the results support the claim.
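
The reading above can be checked mechanically against the video quality table. The snippet below copies the table values and tests, metric by metric, whether CoVAR beats a given set of rivals. The helper names are illustrative and not from the paper.

```python
METRICS = ["PSNR", "SSIM", "LPIPS", "FVD"]
HIGHER_IS_BETTER = {"PSNR": True, "SSIM": True, "LPIPS": False, "FVD": False}

# Values copied from the video quality table, in METRICS order.
video_metrics = {
    "CALVIN": {
        "UVA": (19.01, 0.758, 0.180, 97.90),
        "PAD": (18.72, 0.734, 0.174, 83.40),
        "UWM": (18.04, 0.730, 0.181, 85.85),
        "OpenSora": (19.60, 0.768, 0.171, 61.00),
        "CoVAR": (19.95, 0.766, 0.156, 72.42),
    },
    "Libero90": {
        "UVA": (19.57, 0.716, 0.154, 86.21),
        "PAD": (19.65, 0.781, 0.218, 98.39),
        "UWM": (19.87, 0.735, 0.212, 87.83),
        "OpenSora": (20.18, 0.817, 0.156, 63.33),
        "CoVAR": (20.09, 0.826, 0.143, 70.64),
    },
}

def wins_against(dataset, method, rivals):
    """Per metric: does `method` strictly beat every rival on this dataset?"""
    rows = video_metrics[dataset]
    result = {}
    for i, metric in enumerate(METRICS):
        mine = rows[method][i]
        if HIGHER_IS_BETTER[metric]:
            result[metric] = all(mine > rows[r][i] for r in rivals)
        else:
            result[metric] = all(mine < rows[r][i] for r in rivals)
    return result

joint_baselines = ["UVA", "PAD", "UWM"]
vs_joint = {ds: wins_against(ds, "CoVAR", joint_baselines) for ds in video_metrics}
vs_opensora = {ds: wins_against(ds, "CoVAR", ["OpenSora"]) for ds in video_metrics}
```

On these numbers CoVAR beats all three joint baselines on every metric on both datasets, while against OpenSora it wins LPIPS on both datasets, loses FVD on both, and splits PSNR and SSIM.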

6. Reproducibility checklist and implementation details

6.1 Reproducibility parameters that can be extracted directly from the paper

Item	Setting given in the paper
Base code/model	OpenSora-1.2 codebase; the video diffusion part has about 1.1B parameters.
New modules	About 0.3B parameters, including Action DiT, Bridge Attention related parameters, UNet action decoder/refinement, etc.
Total parameters	About 1.4B.
Number of training frames	35 frames are sampled from each data item.
Training resources	About 1 day, 4 GPUs.
Real data resolution	180×320, for faster convergence and inference.
Real-robot inference	35-frame video-action pair; rectified flow sampling steps = 30; each segment takes about 4 seconds to generate.
Control execution	UR5 platform; 100 Hz robot control; the open-loop action sequence is interpolated.
Libero90 refinement	The action refinement model is fine-tuned with 450 video-action pairs.

6.2 Details that still need to be confirmed from code

The specific VAE/patch/token shape of the video latent, and the depth, width, and attention insertion positions of the Action DiT.
Whether the video loss and the action loss are scale-normalized, and whether there are additional weights.
Whether the two modalities share the same time step $t$ and noise schedule; intuitively they should, but the paper does not spell this out in the formula.
The specific input layout of the UNet action decoder: whether the action tokens form a 1D/2D structure along the time dimension, or are mapped into something resembling a feature map.
The training objective and learning rate of action refinement, whether the input coarse action is detached from the gradient, and whether refinement is always enabled during inference.
The number of trials and the failure statistics for each task in the real robot experiments.

Reproduction risk: The paper gives the high-level structure and the main training settings but lacks a hyperparameter table detailed enough to run from. Without official code, the difficulty of reproduction will concentrate on the modification of OpenSora-1.2, the layers where Bridge Attention is inserted, action tokenization, and action normalization on the real robot.

6.3 Relationship with existing routes

CoVAR belongs to the same "joint video and action modeling" camp as UVA/UWM/PAD, but places more emphasis on reusing a pre-trained video diffusion model. Compared with two-stage methods such as Unipi/RoboEnvision, it avoids handing action learning over entirely to post-hoc inverse dynamics. It can be understood as an architectural choice within embodied diffusion: the video branch keeps its strong visual prior, the action branch is handled by an independent DiT, and the two are coupled through controlled attention.

7. Critical discussion and group meeting questions

7.1 Strong points of the paper

The problem is well chosen: The action-label gap between video diffusion and robot policies is a real problem.
The architectural motivation is clear: A parallel Action DiT fits the asymmetric video/action modalities better than stuffing actions into the video DiT.
The design is supported by ablation: Bridge Attention, the UNet decoder, and the video branch all show their contribution in the ablation.
The real experiments are convincing: The success rate on real small-object manipulation is much higher than that of the two-stage baselines, which shows that the action accuracy is not limited to simulation.

7.2 Points to be cautious about

Use of large-scale action-free video has not yet been demonstrated: The paper's motivation is that CoVAR helps exploit large-scale video data, but the experiments are still mainly on datasets with action demonstrations.
Inference is slow: On the real robot, generating each 35-frame segment takes about 4 seconds, which suits open-loop chunking but is still far from high-frequency closed-loop control.
The generalization boundary of action refinement is unclear: Libero90 improves greatly, but the module is itself fine-tuned on 450 video-action pairs, and whether it transfers to more complex real scenes still needs to be verified.
Missing 3D geometry: The authors also acknowledge that only monocular video is used at present, so the understanding of spatial geometry is limited.
The real dataset is small: 1K demos are enough for a proof of concept, but not enough to demonstrate robustness in large-scale real-world operation.

7.3 Group meeting discussion question 1: Where does Bridge Attention's gain come from?

The paper compares three information exchange schemes: direct self-attention, bidirectional cross-attention, and Bridge Attention. The key question is whether the improvement from Bridge Attention comes from the modality-specific Q/K/V projections, or from a larger parameter count and a better initialization path. To settle this, the ideal experiment would control the number of parameters, scale the self-attention baseline to the same size, and report attention maps or the strength of cross-modal token reads.
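
The parameter-count concern can be made concrete with a back-of-the-envelope calculation. The sketch assumes each Q/K/V projection is a single C×C linear map and ignores output projections and multi-head details. The width of 1024 is a hypothetical value, not the paper's.

```python
def qkv_param_count(channels, num_modalities=1, bias=False):
    """Parameters in the Q/K/V projections, assuming one (C x C) linear map each."""
    per_projection = channels * channels + (channels if bias else 0)
    return 3 * num_modalities * per_projection

channels = 1024  # hypothetical hidden width, for illustration only
shared_qkv = qkv_param_count(channels, num_modalities=1)  # shared self-attention
bridge_qkv = qkv_param_count(channels, num_modalities=2)  # separate video/action sets
ratio = bridge_qkv / shared_qkv
```

Under these assumptions Bridge Attention carries twice the Q/K/V parameters of shared self-attention at every insertion point, which is why a size-matched self-attention baseline would make the ablation easier to interpret.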

7.4 Group meeting discussion question 2: Is CoVAR a world model, a policy, or a data generator?

CoVAR generates future videos and actions at the same time, so it can be interpreted as a world model, used as an open-loop policy, or used to add action labels to video data. The three positionings imply different evaluation criteria: a world model is judged on long-horizon prediction consistency, a policy on closed-loop success rate and safety, and a data generator on whether the generated samples improve a downstream policy. The strongest evidence in the paper is currently the policy success rate; the "scalable data generator" claim still needs downstream data augmentation experiments.

7.5 Follow-up research directions

Add 3D representations: Combine depth, point clouds, or a 3D foundation model to extend the monocular video prior to more reliable spatial inference.
Close the loop: Replace the 35-frame open-loop chunking with a receding horizon, and use real-time observations to correct deviations.
Verify action-free video scaling: Pre-train the video branch on a large number of action-free robot videos or human manipulation videos, then align it with a small amount of action data.
Stricter action refinement ablation: Test how much refinement helps the other baselines, to determine whether it is a CoVAR-specific advantage or a general post-processor.

Final verdict: The value of this paper is that it gives a very specific architectural answer to the question of how a pre-trained video diffusion model can be turned into a robot action generator. It does not fully solve the problems of VLA or world models, but for junior PhD students who want to study video diffusion for robotics, it is a paper worth reading for its architecture and ablations.