Table 1. Updated VBench-I2V scores. CM denotes Camera Motion. *: Scores are taken directly from Table 3 of [1], and the total score is computed with the code released with that paper (https://github.com/Vchitect/VBench). The default resolution is 256x256; note that VBench-I2V is a benchmark suitable for different video resolutions. For FrameBridge and the baselines not reported in [1], we compute the VBench-I2V scores with the code of [1].
| Model | Total Score | I2V Score | Quality Score | CM | I2V SC | I2V BC | SC | BC | MS | DD | AQ | IQ | TF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DynamiCrafter-256 | 84.35 | 91.29 | 77.42 | 22.18 | 95.40 | 96.22 | 94.60 | 98.30 | 97.82 | 38.69 | 59.40 | 62.29 | 97.03 |
| SEINE | 82.12 | 88.60 | 75.64 | 15.91 | 93.45 | 94.21 | 93.94 | 97.01 | 96.20 | 24.55 | 56.55 | 70.52 | 95.07 |
| SEINE-512x320* | 83.49 | 89.62 | 77.37 | 23.36 | 94.85 | 94.02 | 94.20 | 97.26 | 96.68 | 34.31 | 58.42 | 70.97 | 96.72 |
| SparseCtrl | 80.34 | 85.13 | 75.54 | 25.82 | 88.39 | 92.46 | 85.08 | 93.81 | 94.25 | 81.95 | 49.88 | 69.35 | 91.78 |
| ConsistI2V* | 83.30 | 90.38 | 76.22 | 33.60 | 94.69 | 94.57 | 95.27 | 98.28 | 97.38 | 18.62 | 59.00 | 66.92 | 97.56 |
| FrameBridge-VideoCrafter | 85.37 | 92.83 | 77.92 | 30.72 | 96.24 | 97.25 | 94.63 | 98.92 | 98.51 | 35.77 | 59.38 | 63.28 | 98.01 |
| FrameBridge-CogVideoX | 85.93 | 95.22 | 76.65 | 92.06 | 95.42 | 97.13 | 93.60 | 98.62 | 97.57 | 48.29 | 54.28 | 60.00 | 96.61 |
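As a sanity check on how the Total Score column relates to the two sub-scores, the sketch below tests whether each reported total in Table 1 matches a plain average of the I2V Score and the Quality Score. The equal weighting is an assumption inferred from the table rows, not a statement of what the released VBench code does internally, and the helper names `equal_weight_total` and `max_abs_gap` are hypothetical. Small gaps are expected because the table reports rounded values.

```python
# Rows copied from Table 1: (model, total score, I2V score, quality score).
ROWS = [
    ("DynamiCrafter-256", 84.35, 91.29, 77.42),
    ("SEINE", 82.12, 88.60, 75.64),
    ("SEINE-512x320*", 83.49, 89.62, 77.37),
    ("SparseCtrl", 80.34, 85.13, 75.54),
    ("ConsistI2V*", 83.30, 90.38, 76.22),
    ("FrameBridge-VideoCrafter", 85.37, 92.83, 77.92),
    ("FrameBridge-CogVideoX", 85.93, 95.22, 76.65),
]


def equal_weight_total(i2v_score, quality_score):
    """Hypothetical helper: plain average of the two sub-scores (assumed weighting)."""
    return (i2v_score + quality_score) / 2.0


def max_abs_gap(rows):
    """Largest absolute gap between the reported total and the assumed average."""
    return max(abs(equal_weight_total(i2v, q) - total) for _, total, i2v, q in rows)


if __name__ == "__main__":
    for model, total, i2v, q in ROWS:
        print(f"{model}: reported={total:.2f}, average={equal_weight_total(i2v, q):.3f}")
    print(f"largest gap: {max_abs_gap(ROWS):.3f}")
```

Under this assumption every row agrees with the reported total to within rounding of the two-decimal table entries.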
Table 2. Dynamic Degree (DD) scores evaluated with VBench-I2V. Two techniques used in diffusion I2V models also improve the dynamic degree of FrameBridge. FrameStride: add a frame-stride condition. NoisyCondition: add noise to the image condition.
| Model | DD |
|---|---|
| DynamiCrafter-256 | 38.69 |
| FrameBridge-VideoCrafter | 35.77 |
| FrameBridge-VideoCrafter-FrameStride | 46.26 |
| FrameBridge-VideoCrafter-NoisyCondition | 48.62 |
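To make the size of these improvements explicit, the following sketch computes the DD gain of each variant over a chosen reference model. It uses only the values in Table 2; the dictionary `DD` and the helper `dd_gain` are hypothetical names introduced for illustration.

```python
# Dynamic Degree scores copied from Table 2.
DD = {
    "DynamiCrafter-256": 38.69,
    "FrameBridge-VideoCrafter": 35.77,
    "FrameBridge-VideoCrafter-FrameStride": 46.26,
    "FrameBridge-VideoCrafter-NoisyCondition": 48.62,
}


def dd_gain(variant, reference="FrameBridge-VideoCrafter"):
    """Hypothetical helper: DD difference (in points) between a variant and a reference."""
    return round(DD[variant] - DD[reference], 2)


if __name__ == "__main__":
    for name in ("FrameBridge-VideoCrafter-FrameStride",
                 "FrameBridge-VideoCrafter-NoisyCondition"):
        print(f"{name}: {dd_gain(name):+.2f} vs. FrameBridge-VideoCrafter, "
              f"{dd_gain(name, 'DynamiCrafter-256'):+.2f} vs. DynamiCrafter-256")
```

With the Table 2 values, both variants move FrameBridge-VideoCrafter from below DynamiCrafter-256 to above it in DD.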
Table 3. Ablation study of fine-tuning from VideoCrafter.
| Model | Base Model | FVD | CLIPSIM | PIC |
|---|---|---|---|---|
| Diffusion (i.e., DynamiCrafter) | VideoCrafter | 192 | 0.2245 | 0.6131 |
| FrameBridge w/o SAF | VideoCrafter | 299 | 0.2246 | 0.5559 |
| FrameBridge w/ SAF | VideoCrafter | 99 | 0.2250 | 0.6963 |
Table 4. Ablation study of fine-tuning from CogVideoX.
| Model | Base Model | FVD | CLIPSIM | PIC |
|---|---|---|---|---|
| Diffusion | CogVideoX | 118 | 0.2250 | 0.7659 |
| FrameBridge w/o SAF | CogVideoX | 178 | 0.2250 | 0.7104 |
| FrameBridge w/ SAF | CogVideoX | 107 | 0.2250 | 0.7731 |
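The effect of SAF in the two ablations can be summarized as a relative FVD reduction. The sketch below computes it from the FVD values reported for the two base models (VideoCrafter and CogVideoX); the names `FVD` and `relative_reduction` are hypothetical and used only for this illustration.

```python
# FVD values copied from the two ablation tables (lower is better).
FVD = {
    "VideoCrafter": {"without_saf": 299, "with_saf": 99},
    "CogVideoX": {"without_saf": 178, "with_saf": 107},
}


def relative_reduction(before, after):
    """Hypothetical helper: fractional decrease from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    for base_model, scores in FVD.items():
        drop = relative_reduction(scores["without_saf"], scores["with_saf"])
        print(f"{base_model}: FVD {scores['without_saf']} -> {scores['with_saf']} "
              f"({drop:.1%} lower with SAF)")
```

On these numbers SAF lowers FVD for both base models, with the larger relative reduction when fine-tuning from VideoCrafter.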
Table 5. Zero-shot MSR-VTT results and VBench-I2V total score with 24 frames.
| Model | FVD ↓ | CLIPSIM ↑ | PIC ↑ | VBench-I2V Total Score ↑ |
|---|---|---|---|---|
| Diffusion | 171.25 | 0.2251 | 0.7231 | 82.98 |
| FrameBridge | 147.73 | 0.2251 | 0.7586 | 84.01 |
Figure 1. CD-FVD curves of DynamiCrafter and FrameBridge.
Table 6. Videos generated with VBench-I2V prompts. (1) The high Dynamic Degree (DD) score of SparseCtrl may come at the cost of temporal consistency, which can degrade video quality, as reflected by its lower VBench-I2V total score. All videos are displayed at the original resolution generated by each model (i.e., 256 x 384 for SparseCtrl and 256 x 256 for the other models). The VBench-I2V comparison remains fair because the benchmark is suitable for different video resolutions. (2) Compared with the two diffusion baselines DynamiCrafter (DD=38.69) and ConsistI2V (DD=18.62), FrameBridge-CogVideoX (DD=48.29) achieves a higher dynamic degree.
Text condition: "a bridge that is in the middle of a river, camera zooms out"
Panels: static image; FrameBridge-CogVideoX (DD=48.29, Total Score=85.93); DynamiCrafter (DD=38.69, Total Score=84.35); ConsistI2V (DD=18.62, Total Score=83.30); SparseCtrl (DD=81.95, Total Score=80.34).
Table 7. Our experiments on FrameBridge-VideoCrafter demonstrate that FrameBridge can also leverage techniques used in diffusion I2V models to improve dynamic degree.
Text condition: "leaves blown off by the wind"
Panels: static image; FrameBridge-VideoCrafter; DynamiCrafter; FrameBridge-VideoCrafter-MI.
Table 8. Our experiments on FrameBridge-VideoCrafter demonstrate that FrameBridge can also leverage techniques used in diffusion I2V models to improve dynamic degree.
Text condition: "a blue car driving down a dirt road near train tracks"
Panels: static image; FrameBridge-VideoCrafter; DynamiCrafter; FrameBridge-VideoCrafter-MI.
Table 9. More samples generated by FrameBridge.
Sample 1: "camera zoom-in, close look at the apple" (panels: condition image, I2V result)
Sample 2: "fireworks in the night sky over a city" (panels: condition image, I2V result)
Table 10. More samples generated by FrameBridge.
Sample 3: "bird flying off the tree" (panels: condition image, I2V result)
Sample 4: "a castle on top of a hill covered in snow, camera pans left" (panels: condition image, I2V result)
Table 11. More samples generated by FrameBridge.
Sample 5: "the table is rotating" (panels: condition image, I2V result)
Sample 6: "a great white shark swimming in the ocean" (panels: condition image, I2V result)
Table 12. Zero-shot metrics on MSR-VTT and UCF-101.
| Model | MSR-VTT FVD ↓ | MSR-VTT CLIPSIM ↑ | MSR-VTT PIC ↑ | UCF-101 FVD ↓ | UCF-101 IS ↑ | UCF-101 PIC ↑ |
|---|---|---|---|---|---|---|
| Coupling Flow-Matching, σ = 0 | 1047 | 0.2249 | 0.4484 | 2066 | 14.87 | 0.4275 |
| Coupling Flow-Matching, σ = 1 | 110 | 0.2249 | 0.6936 | 342 | 40.81 | 0.6419 |
| Vanilla Flow-Matching | 204 | 0.2249 | 0.5701 | 370 | 36.27 | 0.6070 |
| Diffusion (i.e., DynamiCrafter) | 192 | 0.2245 | 0.6163 | 485 | 29.46 | 0.6266 |
| FrameBridge | 99 | 0.2250 | 0.6963 | 312 | 39.89 | 0.6697 |