3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation
Nov 4, 2025
Seonho Lee*
Jiho Choi*
Inha Kang
Jiwook Kim
Junsung Park
Hyunjung Shim
Abstract
We propose a lightweight, annotation-free fine-tuning framework that injects human-inspired geometric cues into pretrained VLMs without modifying their architecture. By distilling sparse correspondences, relative depth relations, and dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT), our method shapes representations to be geometry-aware while remaining compatible with natural image–text inputs.
Publication
In Findings of the Association for Computational Linguistics: EMNLP 2025