Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders
Abstract
Video diffusion models have advanced rapidly in recent years as a result of a series of architectural innovations and the use of novel training objectives. In this work, we show that training video diffusion models can benefit from aligning the intermediate features of the video generator with feature representations from pre-trained vision encoders. We propose Align4Gen, a novel multi-feature fusion and alignment method integrated into video diffusion model training.
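The abstract does not spell out the training objective, so as a rough illustration only, the sketch below shows one common way such feature alignment is implemented (in the spirit of REPA-style regularization): project intermediate diffusion-transformer features into a shared space, fuse features from several frozen self-supervised encoders, and penalize negative cosine similarity. The class name, the averaging fusion, and all dimensions are assumptions for illustration, not the Align4Gen method itself.

```python
import torch
import torch.nn.functional as F
from torch import nn

class MultiFeatureAlignment(nn.Module):
    """Illustrative sketch (hypothetical, not the paper's implementation):
    align intermediate generator features with fused features from
    multiple frozen self-supervised vision encoders."""

    def __init__(self, dit_dim: int, encoder_dims: list[int], fused_dim: int = 768):
        super().__init__()
        # One linear projection per pre-trained encoder, so features of
        # different widths (e.g. DINOv2, CLIP) can be fused in a shared space.
        self.encoder_projs = nn.ModuleList(
            nn.Linear(d, fused_dim) for d in encoder_dims
        )
        # Small MLP head mapping generator features into the same space.
        self.dit_proj = nn.Sequential(
            nn.Linear(dit_dim, fused_dim),
            nn.SiLU(),
            nn.Linear(fused_dim, fused_dim),
        )

    def forward(
        self,
        dit_feats: torch.Tensor,            # (B, N, dit_dim) generator tokens
        encoder_feats: list[torch.Tensor],  # each (B, N, d_i), frozen encoders
    ) -> torch.Tensor:
        # Fuse encoder features by averaging their projections
        # (one simple fusion choice among many).
        fused = torch.stack(
            [proj(f) for proj, f in zip(self.encoder_projs, encoder_feats)]
        ).mean(dim=0)
        pred = self.dit_proj(dit_feats)
        # Negative cosine similarity per token; encoder targets are detached
        # so gradients flow only into the generator and projection heads.
        return -F.cosine_similarity(pred, fused.detach(), dim=-1).mean()
```

In such a setup, the alignment term is typically added to the usual denoising loss with a weighting coefficient, e.g. `loss = diffusion_loss + lam * aligner(hidden_states, [dino_feats, clip_feats])`; the layer tapped for `hidden_states` and the value of `lam` are further assumptions here.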
Type: Publication
arXiv preprint arXiv:2509.09547 (work done with Adobe Research)