SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

1 XMU, 2 UCLA, 3 Google, 4 UW-Madison, 5 TAMU
* Equal contribution

Video

A concise walkthrough of SpatialStack's architecture, benchmark performance, and key spatial reasoning behaviors.

Abstract

Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding.

To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.

ROI Similarity

ROI similarity comparison between geometry and vision features across encoder depths
ROI similarity across encoder depths. For two indoor scenes, the top row shows the RGB image with the ROI marked in red. The lower rows display similarity maps (brighter means more similar) at 50%, 75%, and 100% depths of the geometry encoder (left) and the vision encoder (right). Geometry features preserve meaningful spatial structure, while vision features are noisy and become nearly uniform at deeper layers.

SpatialStack stacks geometry tokens because the geometry stream preserves ROI structure across encoder depths, while vision features become noisy and nearly uniform at deeper layers, motivating hierarchical geometry-language fusion.
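The ROI-similarity probe behind the figure can be sketched as follows: average the patch features inside the ROI into one query vector, then compute cosine similarity against every patch feature at a given encoder depth. This is a minimal numpy sketch of the analysis, not the paper's code; all names and the toy data are illustrative.

```python
import numpy as np

def roi_similarity_map(features, roi_mask):
    """Cosine similarity between the mean ROI feature and every patch feature.

    features: (H, W, C) patch-token grid from one encoder depth.
    roi_mask: (H, W) boolean mask marking the region of interest.
    Returns an (H, W) map where larger values mean higher similarity.
    """
    # Average the features inside the ROI into a single unit-norm query vector.
    query = features[roi_mask].mean(axis=0)
    query = query / (np.linalg.norm(query) + 1e-8)

    # Normalize every patch feature and take the dot product with the query.
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    unit = features / (norms + 1e-8)
    return unit @ query  # (H, W) cosine-similarity map

# Toy example: a 4x4 grid where the top-left 2x2 ROI shares a common direction.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8))
feats[:2, :2] += 5.0 * np.ones(8)   # make the ROI patches mutually coherent
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
sim = roi_similarity_map(feats, mask)
```

Running this probe at 50%, 75%, and 100% depth of each encoder, as in the figure, reveals whether a stream keeps the ROI distinguishable from the rest of the scene or collapses toward a uniform map.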

Architecture

Architecture diagram
Architecture of SpatialStack. A standard VLM backbone is coupled with a multi-view geometry encoder whose layer-wise features are processed by lightweight projectors and sequentially injected into the decoder, progressively integrating geometric cues to preserve fine structure and global spatial context.

SpatialStack keeps the vision encoder unchanged and augments the language decoder with a parallel VGGT geometry stream. Geometry tokens from several semantic depths are compressed by lightweight mergers and injected as residual adapters throughout the decoder, so each stage reasons over aligned visual, geometric, and textual cues while preserving both local structure and global scene context.
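The injection pattern described above can be sketched in a few lines: geometry features from several encoder depths each pass through a lightweight projector and are added as residuals between successive decoder stages. This is a schematic numpy sketch under our reading of the architecture, not the released implementation; the real model uses transformer decoder blocks and learned token mergers, whereas here both are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def projector(d_geo, d_lm):
    # Stand-in for a lightweight learned merger/projector (random weights here).
    return rng.normal(scale=0.02, size=(d_geo, d_lm))

def decoder_stage(h):
    # Stand-in for one decoder block (identity; attention + MLP in practice).
    return h

def hierarchical_fusion(hidden, geo_feats_by_depth, d_lm):
    """Inject projected geometry features as residuals between decoder stages.

    hidden:             (T, d_lm) token states entering the decoder.
    geo_feats_by_depth: list of (T, d_geo) geometry features taken from
                        several encoder depths, ordered shallow -> deep.
    """
    for geo in geo_feats_by_depth:
        W = projector(geo.shape[-1], d_lm)        # one projector per depth
        hidden = decoder_stage(hidden) + geo @ W  # residual geometry injection
    return hidden

T, d_geo, d_lm = 6, 32, 16
hidden = rng.normal(size=(T, d_lm))
geo_levels = [rng.normal(size=(T, d_geo)) for _ in range(3)]  # e.g. 50%/75%/100% depths
out = hierarchical_fusion(hidden, geo_levels, d_lm)
```

The key design point the sketch captures is that each decoder stage sees geometry from a matched encoder depth, so shallow stages receive fine local structure while deeper stages receive global scene context.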

VSI-Bench Results

| Model | Rank | Avg | Obj Count | Abs Dist | Obj Size | Room Size | Rel Dist | Rel Dir | Route Plan | Appr Order |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Baseline* | | | | | | | | | | |
| Chance Level (Random) | - | - | - | - | - | - | 25.0 | 36.1 | 28.3 | 25.0 |
| Chance Level (Frequency) | - | 34.0 | 62.1 | 32.0 | 29.9 | 33.1 | 25.1 | 47.9 | 28.4 | 25.2 |
| *Proprietary Models (API)* | | | | | | | | | | |
| GPT-4o | 2 | 34.0 | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 |
| Gemini-2.5 Pro | 1 | 51.5 | 43.8 | 34.9 | 64.3 | 42.8 | 61.1 | 47.8 | 45.9 | 71.3 |
| *Open-source Models* | | | | | | | | | | |
| LongVILA-8B | 15 | 21.6 | 29.1 | 9.1 | 16.7 | 0.0 | 29.6 | 30.7 | 32.5 | 25.5 |
| Qwen2.5-VL-3B | 14 | 28.7 | 33.1 | 19.4 | 17.4 | 24.8 | 37.3 | 44.3 | 31.4 | 22.0 |
| VILA-1.5-8B | 13 | 28.9 | 17.4 | 21.8 | 50.3 | 18.8 | 32.1 | 34.8 | 31.0 | 24.8 |
| LongVA-7B | 12 | 29.2 | 38.0 | 16.6 | 38.9 | 22.2 | 33.1 | 43.3 | 25.4 | 15.7 |
| VILA-1.5-40B | 11 | 31.2 | 22.4 | 24.8 | 48.7 | 22.7 | 40.5 | 25.7 | 31.5 | 32.9 |
| LLaVA-OneVision-7B | 10 | 32.4 | 47.7 | 20.2 | 47.4 | 12.3 | 42.5 | 35.2 | 29.4 | 24.4 |
| LLaVA-Video-7B | 9 | 35.6 | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | 34.0 | 30.6 |
| LLaVA-OneVision-72B | 8 | 40.2 | 43.5 | 23.9 | 57.6 | 37.5 | 42.5 | 39.9 | 32.5 | 44.6 |
| LLaVA-Video-72B | 7 | 40.9 | 48.9 | 22.8 | 57.4 | 35.3 | 42.4 | 36.7 | 35.0 | 48.6 |
| Spatial-MLLM-4B | 6 | 47.0 | 65.3 | 34.8 | 63.1 | 45.1 | 41.3 | 46.2 | 33.5 | 46.3 |
| VG-LLM-4B | 5 | 47.3 | 66.0 | 37.8 | 55.2 | 59.2 | 44.6 | 45.6 | 33.5 | 36.4 |
| Qwen3.5-4B | 4 | 53.6 | 56.5 | 36.5 | 67.5 | 53.8 | 60.3 | 57.5 | 34.0 | 62.3 |
| Cambrian-S-3B | 3 | 57.3 | 70.7 | 40.6 | 68.0 | 46.3 | 64.8 | 61.9 | 27.3 | 78.8 |
| SpatialStack-5B (Qwen3.5) | 1 | 67.5 | 71.0 | 55.6 | 69.1 | 68.2 | 67.3 | 84.1 | 41.2 | 83.5 |

VSI-Bench leaderboard. SpatialStack-5B reaches the best overall average (67.5), ahead of both proprietary and open-source models, and leads every benchmark category among open-source entries.

VSI-Bench spans more than 5,000 egocentric QA pairs across eight categories: four numerical tasks (object count, absolute distance, object size, room size) and four multiple-choice tasks (relative distance, relative direction, route planning, appearance order). Built on Qwen3.5, SpatialStack-5B improves over both Qwen3.5-4B and prior dual-encoder baselines, setting a new open-source state of the art across the full benchmark.
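Assuming the Avg column is the unweighted mean of the eight category scores (the standard VSI-Bench convention), the SpatialStack-5B row can be checked directly:

```python
# VSI-Bench category scores for SpatialStack-5B, in table order:
# Obj Count, Abs Dist, Obj Size, Room Size, Rel Dist, Rel Dir, Route Plan, Appr Order.
scores = [71.0, 55.6, 69.1, 68.2, 67.3, 84.1, 41.2, 83.5]
avg = sum(scores) / len(scores)
print(round(avg, 1))  # 67.5, matching the reported Avg
```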

CV-Bench Results

| Model | 2D (%) | 3D (%) | Avg. (%) |
| --- | --- | --- | --- |
| *Proprietary Models (API)* | | | |
| GPT-4o | 74.8 | 83.0 | 78.9 |
| *Open-source Models* | | | |
| Mini-Gemini-HD-34B | 71.5 | 79.2 | 75.4 |
| LLaVA-NeXT-34B | 73.0 | 74.8 | 73.9 |
| Cambrian-1-34B | 74.0 | 79.7 | 76.9 |
| SAT-LLaVA-Video-7B | 73.0 | 83.8 | 78.4 |
| SPAR-8B | 72.3 | 89.1 | 80.7 |
| Qwen2.5-VL-3B | 67.9 | 70.4 | 69.2 |
| Qwen3.5-4B | 79.7 | 90.2 | 85.0 |
| Cambrian-S-3B | 76.1 | 76.3 | 76.2 |
| *Dual-Encoder MLLMs* | | | |
| VG-LLM-4B | 71.3 | 87.7 | 79.5 |
| SpatialStack-4B (Qwen2.5) | 75.4 | 87.0 | 81.2 |
| SpatialStack-5B (Qwen3.5) | 78.9 | 92.2 | 85.5 |

CV-Bench leaderboard. SpatialStack-5B sets the strongest reported average (85.5) and best 3D score (92.2) among compared models.

CV-Bench reformulates classical 2D relation/counting tasks and 3D depth/distance questions into QA evaluations. Built on Qwen3.5, SpatialStack-5B outperforms its base model as well as prior dual-encoder baselines, delivering the best overall accuracy (85.5) and strongest 3D result (92.2).

Citation

If you find this work useful for your research, please consider citing our paper:

@article{zhang2026spatialstack,
  title={SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning},
  author={Jian Zhang and Shijie Zhou and Bangya Liu and Achuta Kadambi and Zhiwen Fan},
  journal={arXiv preprint arXiv:2603.27437},
  year={2026}
}