FrameBridge: Improving Image-to-Video Generation with Bridge Models
Anonymous Authors
Abstract
Diffusion models have achieved remarkable progress on image-to-video (I2V) generation, yet their noise-to-data generation process is inherently mismatched with this task, which may lead to suboptimal synthesis quality. In this work, we present FrameBridge.
By modeling the frame-to-frames generation process with the data-to-data generative process of bridge models, we fully exploit the information contained in the given image and improve the consistency between the generative process and the I2V task.
Moreover, we propose two novel techniques, one for each of the two common settings of training I2V models: fine-tuning a pre-trained text-to-video (T2V) model and training from scratch.
First, we propose SNR-Aligned Fine-tuning (SAF), which makes the first attempt to fine-tune a diffusion model into a bridge model and therefore allows us to utilize pre-trained diffusion-based T2V models. Second, we propose a neural prior, which further improves the synthesis quality of FrameBridge when training from scratch.
Experiments conducted on WebVid-2M and UCF-101 demonstrate the superior quality of FrameBridge in comparison with the diffusion counterpart (zero-shot FVD 95 vs. 192 on MSR-VTT and non-zero-shot FVD 122 vs. 171 on UCF-101), and the advantages of our proposed SAF and neural prior for bridge-based I2V models.
Innovations
FrameBridge models the I2V (frame-to-frames) generation task with the data-to-data generation process of bridge models, rather than the conventional noise-to-data process of diffusion models, improving the consistency between the generative model and the generation task.
SNR-aligned fine-tuning (SAF) aligns the noisy intermediate latents of the bridge process with those of the diffusion process while preserving the difference between the two generative models, which facilitates leveraging pre-trained text-to-video (T2V) diffusion models for FrameBridge (a rough sketch follows this list).
The neural prior provides a stronger prior for the video target than the static image, further improving the performance of FrameBridge when training from scratch.
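A rough sketch of the SNR-alignment idea behind SAF, written against a hypothetical Brownian-bridge parameterization x_t = a_t * x0 + b_t * y + sigma_t * eps (x0: target video latent, y: condition-image prior). The coefficient choices, helper names, and the rescaling rule below are illustrative assumptions, not the paper's exact procedure:

import math
import torch

def bridge_coeffs(t):
    # Illustrative Brownian-bridge coefficients on t in (0, 1):
    # x_t = a_t * x0 + b_t * y + sigma_t * eps, with the prior y reached at t = 1.
    a_t = 1.0 - t                       # weight of the target video latent x0
    b_t = t                             # weight of the condition-image prior y
    sigma_t = math.sqrt(t * (1.0 - t))  # bridge noise scale
    return a_t, b_t, sigma_t

def snr_aligned_input(x_t, t, alphas_cumprod):
    # Map a bridge latent x_t at bridge time t to (x_aligned, s) so that a
    # pre-trained VP-diffusion T2V backbone sees an input whose x0-signal scale
    # and noise scale match its own forward marginal at diffusion step s.
    # alphas_cumprod: 1-D tensor of cumulative \bar{alpha}_s from the diffusion schedule.
    # The leftover image-prior component b_t * y is deliberately kept: it is the
    # part that distinguishes the bridge process from the diffusion process.
    a_t, _, sigma_t = bridge_coeffs(t)
    snr_bridge = (a_t ** 2) / (sigma_t ** 2 + 1e-12)
    snr_diffusion = alphas_cumprod / (1.0 - alphas_cumprod)
    s = torch.argmin((snr_diffusion - snr_bridge).abs())  # matched diffusion step
    scale = torch.sqrt(alphas_cumprod[s]) / a_t           # rescale the x0 component
    return scale * x_t, s

During fine-tuning, the aligned latent and the matched step s would be fed to the pre-trained network in place of the raw bridge latent and bridge time.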
Figure 1: Overview of FrameBridge and diffusion-based I2V models.
The sampling process of FrameBridge (upper) starts from the informative given image, while diffusion models (lower) synthesize videos from uninformative Gaussian noise.
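To make the contrast concrete, below is a minimal bridge-style ancestral sampler that starts from the condition-image latent instead of Gaussian noise. The Brownian-bridge posterior used here and the assumed interface net(x_t, t, y, text_emb), which predicts the clean video latent, are illustrative assumptions rather than the exact FrameBridge sampler:

import math
import torch

@torch.no_grad()
def sample_framebridge(net, image_latent, text_emb, num_steps=50):
    # Prior y: the condition-image latent (e.g. repeated over all frames).
    # The trajectory starts at the prior (t = 1) and reaches the data at t = 0.
    y = image_latent
    x_t = y.clone()
    ts = torch.linspace(1.0, 0.0, num_steps + 1).tolist()
    for t, s in zip(ts[:-1], ts[1:]):
        x0_hat = net(x_t, t, y, text_emb)  # predicted clean video latent
        if s <= 0.0:
            return x0_hat
        # Brownian-bridge posterior between x0_hat (time 0) and x_t (time t),
        # evaluated at the earlier time s < t.
        mean = x0_hat + (s / t) * (x_t - x0_hat)
        var = s * (t - s) / t
        x_t = mean + math.sqrt(var) * torch.randn_like(x_t)
    return x_t

A diffusion-based I2V sampler would instead initialize x_t with torch.randn_like(y) and rely on conditioning mechanisms to reintroduce the image information, as in the lower row of Figure 1.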
FrameBridge vs. Diffusion Counterpart
Figure 2: Visualization of the mean values of the marginal distributions.
We visualize the decoded mean values $D(\mu_t(\cdot))$ of the bridge process and the diffusion process, where $D$ is the pre-trained VAE decoder.
As shown, the prior and target of FrameBridge are naturally suited to I2V synthesis: data information is preserved along the bridge process, while it gradually vanishes in the forward diffusion process.
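For reference, one standard instantiation of the two marginals visualized above (a Brownian bridge on $t \in [0, 1]$ versus a variance-preserving diffusion); FrameBridge may use different schedules, so this is only an illustrative choice:
$$p^{\mathrm{bridge}}_t(x_t \mid x_0, y) = \mathcal{N}\!\big(x_t;\ \underbrace{(1-t)\,x_0 + t\,y}_{\mu_t},\ t(1-t)\,I\big), \qquad q^{\mathrm{diff}}_t(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\big),$$
where $x_0$ is the video latent and $y$ the condition-image latent. The bridge mean $\mu_t$ always interpolates between the data and the given image, whereas the diffusion mean $\sqrt{\bar{\alpha}_t}\,x_0$ shrinks toward zero, so the diffusion prior at $t = 1$ is (nearly) pure Gaussian noise.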
Demo samples of FrameBridge (each sample shows the condition image followed by the I2V result):
Sample 1: "waves splash onto the beach, birds flying"
Sample 2: "pot boiling on the campfire"
Sample 3: "huge waves crashing against the shore"
Sample 4: "leaves of maple trees flutter down"
Sample 5: "young man typing keyboard, looking at the screen"
Sample 6: "car driving on the road"
Sample 7: "camera zoom-in, close look at the flower"