3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation

Nov 4, 2025
Seonho Lee*, Jiho Choi*, Inha Kang, Jiwook Kim, Junsung Park, Hyunjung Shim
Abstract
We propose a lightweight, annotation-free fine-tuning framework that injects human-inspired geometric cues into pretrained VLMs without modifying their architecture. By distilling sparse correspondences, relative depth relations, and dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT), our method shapes representations to be geometry-aware while remaining compatible with natural image–text inputs.
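One of the distilled cues above is the set of relative depth relations. As a minimal illustration (not the paper's actual loss; the function name and the softplus surrogate are assumptions), a pairwise depth-ordering distillation objective could look like this: for every pair of patches, the teacher's depths fix an ordering, and the student is penalized smoothly whenever its predicted depths disagree.

```python
import math

def relative_depth_loss(student_depths, teacher_depths):
    """Hypothetical pairwise depth-ordering distillation loss.

    For each patch pair (i, j), the teacher's depths define an ordering;
    the student incurs a smooth logistic (softplus) penalty when its
    predicted ordering disagrees. Teacher ties are skipped.
    """
    total, count = 0.0, 0
    n = len(student_depths)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            t_diff = teacher_depths[i] - teacher_depths[j]
            if t_diff == 0:
                continue  # no ordering signal from the teacher
            sign = 1.0 if t_diff > 0 else -1.0
            s_diff = student_depths[i] - student_depths[j]
            # Softplus ranking surrogate: near zero when the student's
            # ordering agrees strongly with the teacher's.
            total += math.log1p(math.exp(-sign * s_diff))
            count += 1
    return total / max(count, 1)
```

Because only relative orderings are supervised, such a loss is invariant to the global scale of the teacher's depth predictions, which suits depths distilled from off-the-shelf 3D models like MASt3R or VGGT.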
In Findings of the Association for Computational Linguistics: EMNLP 2025