SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

1 Texas A&M University, 2 UCLA, 3 University of Wisconsin-Madison
* Equal contribution
CVPR 2026
SpatialStack teaser figure
SpatialStack: Layered Geometry-Language Fusion. Conventional VLM-geometry fusion (a) is typically performed only once, injecting a single deep geometry feature into the vision-language stack and limiting both fine-grained perception and high-level spatial reasoning. SpatialStack (b) stacks multi-level geometry features and injects them hierarchically into decoder layers, yielding stronger 3D spatial understanding across benchmarks.

Abstract

Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding.

To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building on this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks. These results establish SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.

ROI Similarity

ROI similarity comparison between geometry and vision features across encoder depths
ROI similarity across encoder depths. For two indoor scenes, the top row shows the RGB image with the ROI marked in red. The lower rows display similarity maps (brighter means more similar) at 50%, 75%, and 100% depths of the geometry encoder (left) and the vision encoder (right). Geometry features preserve meaningful spatial structure, while vision features are noisy and become nearly uniform at deeper layers.

SpatialStack stacks geometry tokens because the geometry stream preserves ROI structure across encoder depths, while vision features become noisy and nearly uniform at deeper layers, motivating hierarchical geometry-language fusion.
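The probe behind this figure can be sketched as a cosine-similarity map between the mean ROI feature and every patch feature. Below is a minimal NumPy version; the grid size, feature dimension, and toy features are illustrative stand-ins, not the paper's actual encoder outputs:

```python
import numpy as np

def roi_similarity_map(feats: np.ndarray, roi_mask: np.ndarray) -> np.ndarray:
    """Cosine similarity between the mean ROI feature and each patch feature.

    feats:    (H, W, C) patch features from one encoder layer.
    roi_mask: (H, W) boolean mask marking the ROI patches (the red box).
    Returns an (H, W) map where brighter values mean "more similar to the ROI".
    """
    roi_feat = feats[roi_mask].mean(axis=0)                 # (C,) mean ROI descriptor
    flat = feats.reshape(-1, feats.shape[-1])               # (H*W, C)
    sims = flat @ roi_feat / (
        np.linalg.norm(flat, axis=1) * np.linalg.norm(roi_feat) + 1e-8
    )
    return sims.reshape(feats.shape[:2])

# Toy check: patches sharing a common component with the ROI score higher.
rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 16, 64))
mask = np.zeros((16, 16), dtype=bool)
mask[4:8, 4:8] = True
feats[mask] += 3.0                                          # correlate ROI patches
sim = roi_similarity_map(feats, mask)
print(sim[mask].mean() > sim[~mask].mean())                 # True: ROI stands out
```

A structured feature map yields a similarity map that highlights the ROI's region; features that have collapsed toward uniformity (as the deep vision-encoder layers do in the figure) produce a near-flat map instead.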

Architecture

Architecture diagram
Architecture of SpatialStack. A standard VLM backbone is coupled with a multi-view geometry encoder whose layer-wise features are processed by lightweight projectors and sequentially injected into the decoder, progressively integrating geometric cues to preserve fine structure and global spatial context.

SpatialStack keeps the vision encoder unchanged and augments the language decoder with a parallel VGGT geometry stream. Geometry tokens from several semantic depths are compressed by lightweight mergers and injected as residual adapters throughout the decoder, so each stage reasons over aligned visual, geometric, and textual cues while preserving both local structure and global scene context.
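The layered-injection idea can be sketched as follows. This is a minimal PyTorch illustration of the description above; the class names (`GeometryMerger`, `ResidualGeometryAdapter`), the pair-wise token merging, the gating, and all dimensions are our illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class GeometryMerger(nn.Module):
    """Stand-in for a lightweight merger: halves the token count by averaging
    adjacent pairs, then projects geometry tokens to the decoder width."""
    def __init__(self, geo_dim: int, dec_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.LayerNorm(geo_dim), nn.Linear(geo_dim, dec_dim))

    def forward(self, geo_tokens: torch.Tensor) -> torch.Tensor:
        B, N, C = geo_tokens.shape
        merged = geo_tokens.reshape(B, N // 2, 2, C).mean(dim=2)  # compress tokens
        return self.proj(merged)

class ResidualGeometryAdapter(nn.Module):
    """Injects geometry tokens into decoder hidden states via cross-attention,
    added as a zero-initialized residual so the pretrained path is preserved."""
    def __init__(self, dec_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dec_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts as identity mapping

    def forward(self, hidden: torch.Tensor, geo: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(hidden, geo, geo)    # text/vision queries, geometry keys
        return hidden + torch.tanh(self.gate) * fused

# Toy demo: inject three geometry levels at three decoder stages.
B, T, G, geo_dim, dec_dim = 2, 16, 32, 128, 256
geo_levels = [torch.randn(B, G, geo_dim) for _ in range(3)]   # e.g. 50/75/100% depth
mergers = nn.ModuleList(GeometryMerger(geo_dim, dec_dim) for _ in range(3))
adapters = nn.ModuleList(ResidualGeometryAdapter(dec_dim) for _ in range(3))

hidden = torch.randn(B, T, dec_dim)
for merger, adapter, geo in zip(mergers, adapters, geo_levels):
    hidden = adapter(hidden, merger(geo))        # one injection per decoder stage
print(hidden.shape)  # torch.Size([2, 16, 256])
```

The zero-initialized gate is one common way to add adapters without perturbing a pretrained decoder at the start of training; each stage then learns how strongly to mix in its geometry level.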

VSI-Bench Results

| Model | Rank | Avg. | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order |
|---|---|---|---|---|---|---|---|---|---|---|
| LongVILA-8B | 14 | 21.6 | 29.1 | 9.1 | 16.7 | 0.0 | 29.6 | 30.7 | 32.5 | 25.5 |
| Qwen2.5-VL-3B | 13 | 28.7 | 33.1 | 19.4 | 17.4 | 24.8 | 37.3 | 44.3 | 31.4 | 22.0 |
| VILA-1.5-8B | 12 | 28.9 | 17.4 | 21.8 | 50.3 | 18.8 | 32.1 | 34.8 | 31.0 | 24.8 |
| LongVA-7B | 11 | 29.2 | 38.0 | 16.6 | 38.9 | 22.2 | 33.1 | 43.3 | 25.4 | 15.7 |
| Qwen2.5-VL-7B | 10 | 29.2 | 25.2 | 10.9 | 35.8 | 29.2 | 38.7 | 37.5 | 29.4 | 26.7 |
| VILA-1.5-40B | 9 | 31.2 | 22.4 | 24.8 | 48.7 | 22.7 | 40.5 | 25.7 | 31.5 | 32.9 |
| LLaVA-OneVision-7B | 8 | 32.4 | 47.7 | 20.2 | 47.4 | 12.3 | 42.5 | 35.2 | 29.4 | 24.4 |
| LLaVA-Video-7B | 7 | 35.6 | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | 34.0 | 30.6 |
| LLaVA-OneVision-72B | 6 | 40.2 | 43.5 | 23.9 | 57.6 | 37.5 | 42.5 | 39.9 | 32.5 | 44.6 |
| LLaVA-Video-72B | 5 | 40.9 | 48.9 | 22.8 | 57.4 | 35.3 | 42.4 | 36.7 | 35.0 | 48.6 |
| Spatial-MLLM-4B | 4 | 47.0 | 65.3 | 34.8 | 63.1 | 45.1 | 41.3 | 46.2 | 33.5 | 46.3 |
| VG-LLM-4B | 3 | 47.3 | 66.0 | 37.8 | 55.2 | 59.2 | 44.6 | 45.6 | 33.5 | 36.4 |
| Cambrian-S-3B | 2 | 57.3 | 70.7 | 40.6 | 68.0 | 46.3 | 64.8 | 61.9 | 27.3 | 78.8 |
| SpatialStack-4B | 1 | 60.9 | – | – | – | – | – | – | – | – |

VSI-Bench leaderboard. SpatialStack-4B reaches the top open-source average (60.9) and leads most sub-tasks.

VSI-Bench spans more than 5,000 egocentric QA pairs across eight categories: four numerical tasks (object count, absolute distance, object size, room size) and four multiple-choice tasks (relative distance, relative direction, route planning, appearance order). Despite having no route-planning supervision, SpatialStack generalizes strongly to that category while remaining competitive on the rest of the benchmark.
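For the four numerical tasks, VSI-Bench does not use exact-match accuracy; to our reading of the benchmark (this is the benchmark's convention, not something this paper defines), it uses Mean Relative Accuracy: the prediction is counted correct at each confidence threshold θ from 0.50 to 0.95 when its relative error stays below 1 − θ, and these indicators are averaged. A minimal sketch:

```python
import numpy as np

def mean_relative_accuracy(pred: float, gt: float) -> float:
    """Mean Relative Accuracy over thresholds 0.50..0.95 in steps of 0.05
    (our reading of VSI-Bench's numerical-answer metric, stated as an
    assumption rather than this paper's definition)."""
    thresholds = np.arange(0.50, 1.00, 0.05)          # 0.50, 0.55, ..., 0.95
    rel_err = abs(pred - gt) / abs(gt)                # relative error vs. ground truth
    return float(np.mean(rel_err < (1.0 - thresholds)))

print(mean_relative_accuracy(10.0, 10.0))  # 1.0: exact prediction passes all thresholds
print(mean_relative_accuracy(20.0, 10.0))  # 0.0: 100% relative error fails all
```

The multiple-choice tasks (relative distance/direction, route planning, appearance order) are scored by plain accuracy, and the benchmark average combines all eight sub-task scores.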

CV-Bench Results

| Model | 2D (%) | 3D (%) | Avg. (%) |
|---|---|---|---|
| *Proprietary Models (API)* | | | |
| GPT-4o | 74.8 | 83.0 | 78.9 |
| *Open-source Models* | | | |
| Mini-Gemini-HD-34B | 71.5 | 79.2 | 75.4 |
| LLaVA-NeXT-34B | 73.0 | 74.8 | 73.9 |
| Cambrian-1-34B | 74.0 | 79.7 | 76.9 |
| SAT-LLaVA-Video-7B | 73.0 | 83.8 | 78.4 |
| SPAR-8B | 72.3 | 89.1 | 80.7 |
| Qwen2.5-VL-3B | 67.9 | 70.4 | 69.2 |
| Qwen2.5-VL-7B | 73.9 | 80.9 | 77.4 |
| Cambrian-S-7B | 74.3 | 83.0 | 78.7 |
| Cambrian-S-3B | 76.1 | 76.3 | 76.2 |
| *Dual-Encoder MLLMs* | | | |
| VG-LLM-4B | 71.3 | 87.7 | 79.5 |
| VG-LLM-8B | 72.2 | 91.1 | 81.7 |
| SpatialStack-4B | 75.4 | 87.0 | 81.2 |
| SpatialStack-8B | 76.1 | 90.8 | 83.5 |

CV-Bench leaderboard. SpatialStack-4B and -8B outperform existing dual-encoder baselines across splits.

CV-Bench reformulates classical 2D relation/counting tasks and 3D depth/distance questions into QA evaluations. SpatialStack-4B already overtakes VG-LLM-4B on both splits, while SpatialStack-8B delivers the best overall accuracy (83.5) and matches the strongest reported 2D score.
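The Avg. column is consistent with a simple mean of the two splits. A quick sanity check on a few leaderboard rows, assuming the table rounds to one decimal place:

```python
# (2D %, 3D %, reported Avg. %) for three rows of the CV-Bench table
rows = {
    "VG-LLM-8B":       (72.2, 91.1, 81.7),
    "SpatialStack-4B": (75.4, 87.0, 81.2),
    "SpatialStack-8B": (76.1, 90.8, 83.5),
}
for name, (d2, d3, avg) in rows.items():
    # mean of the splits, allowing ~0.05 slack for one-decimal rounding
    assert abs((d2 + d3) / 2 - avg) < 0.051, name
print("Avg. column matches the 2D/3D mean")
```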