3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation

Nov 4, 2025
Seonho Lee*, Jiho Choi*, Inha Kang, Jiwook Kim, Junsung Park, Hyunjung Shim
Abstract
We propose a lightweight, annotation-free fine-tuning framework that injects human-inspired geometric cues into pretrained VLMs without modifying their architecture. By distilling sparse correspondences, relative depth relations, and dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT), our method shapes representations to be geometry-aware while remaining compatible with natural image–text inputs.
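One of the distilled cues above is the set of relative depth relations. As a minimal illustration (not the paper's actual loss; the function name and the softplus surrogate are assumptions), a pairwise depth-ordering distillation objective could look like this: for every pair of patches, the teacher's depths fix an ordering, and the student is penalized smoothly whenever its predicted depths disagree.

```python
import math

def relative_depth_loss(student_depths, teacher_depths):
    """Hypothetical pairwise depth-ordering distillation loss.

    For each patch pair (i, j), the teacher's depths define an ordering;
    the student incurs a smooth logistic (softplus) penalty when its
    predicted ordering disagrees. Teacher ties are skipped.
    """
    total, count = 0.0, 0
    n = len(student_depths)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            t_diff = teacher_depths[i] - teacher_depths[j]
            if t_diff == 0:
                continue  # no ordering signal from the teacher
            sign = 1.0 if t_diff > 0 else -1.0
            s_diff = student_depths[i] - student_depths[j]
            # Softplus ranking surrogate: near zero when the student's
            # ordering agrees strongly with the teacher's.
            total += math.log1p(math.exp(-sign * s_diff))
            count += 1
    return total / max(count, 1)
```

Because only relative orderings are supervised, such a loss is invariant to the global scale of the teacher's depth predictions, which suits depths distilled from off-the-shelf 3D models like MASt3R or VGGT.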
In Findings of the Association for Computational Linguistics: EMNLP 2025