Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders
Abstract
Video diffusion models have advanced rapidly in recent years as a result of a series of architectural innovations and the use of novel training objectives. In this work, we show that training video diffusion models can benefit from aligning the intermediate features of the video generator with feature representations from pre-trained vision encoders. We propose Align4Gen, a novel multi-feature fusion and alignment method integrated into video diffusion model training.
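The abstract does not spell out the training objective, so as a rough illustration only, the sketch below shows one common way such feature alignment is implemented (in the spirit of REPA-style regularization): project intermediate diffusion-transformer features into a shared space, fuse features from several frozen self-supervised encoders, and penalize negative cosine similarity. The class name, the averaging fusion, and all dimensions are assumptions for illustration, not the Align4Gen method itself.

```python
import torch
import torch.nn.functional as F
from torch import nn

class MultiFeatureAlignment(nn.Module):
    """Illustrative sketch (hypothetical, not the paper's implementation):
    align intermediate generator features with fused features from
    multiple frozen self-supervised vision encoders."""

    def __init__(self, dit_dim: int, encoder_dims: list[int], fused_dim: int = 768):
        super().__init__()
        # One linear projection per pre-trained encoder, so features of
        # different widths (e.g. DINOv2, CLIP) can be fused in a shared space.
        self.encoder_projs = nn.ModuleList(
            nn.Linear(d, fused_dim) for d in encoder_dims
        )
        # Small MLP head mapping generator features into the same space.
        self.dit_proj = nn.Sequential(
            nn.Linear(dit_dim, fused_dim),
            nn.SiLU(),
            nn.Linear(fused_dim, fused_dim),
        )

    def forward(
        self,
        dit_feats: torch.Tensor,            # (B, N, dit_dim) generator tokens
        encoder_feats: list[torch.Tensor],  # each (B, N, d_i), frozen encoders
    ) -> torch.Tensor:
        # Fuse encoder features by averaging their projections
        # (one simple fusion choice among many).
        fused = torch.stack(
            [proj(f) for proj, f in zip(self.encoder_projs, encoder_feats)]
        ).mean(dim=0)
        pred = self.dit_proj(dit_feats)
        # Negative cosine similarity per token; encoder targets are detached
        # so gradients flow only into the generator and projection heads.
        return -F.cosine_similarity(pred, fused.detach(), dim=-1).mean()
```

In such a setup, the alignment term is typically added to the usual denoising loss with a weighting coefficient, e.g. `loss = diffusion_loss + lam * aligner(hidden_states, [dino_feats, clip_feats])`; the layer tapped for `hidden_states` and the value of `lam` are further assumptions here.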
Type: Publication
arXiv preprint arXiv:2509.09547 (work done with Adobe Research)