3AM: Segment Anything with Geometric Consistency in Videos

1National Yang Ming Chiao Tung University 2NVIDIA Research
3AM Overview

3AM maintains consistent object tracks across large viewpoint changes, cluttered scenes, and variations in capture conditions, where traditional 2D VOS methods fail.

Abstract

Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing.

We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2's appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We further propose a field-of-view-aware sampling strategy that ensures training frames observe spatially consistent object regions, enabling reliable 3D correspondence learning.

Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and its extensions, achieving 90.6% IoU on ScanNet++'s Selected Subset and improving over state-of-the-art VOS methods by +15.9 points.

Method Overview

3AM Pipeline

3AM Pipeline: Our Feature Merger fuses multi-level MUSt3R features, learned from multi-view consistency to encode implicit geometric correspondence, with SAM2's appearance features via cross-attention and convolutional refinement. These merged geometry-aware representations then undergo memory attention with previous frames and mask decoding, enabling spatially-consistent object recognition that maintains identity across large viewpoint changes without requiring camera poses at inference.
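The fusion step above can be sketched with a single cross-attention pass in which SAM2's appearance tokens query MUSt3R's geometry tokens. This is a minimal, hypothetical illustration: the projection matrices are random stand-ins for learned weights, and the paper's convolutional refinement stage is omitted, so only the attention-based merging is shown.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def merge_features(appearance, geometry, d_head=32, seed=0):
    """Fuse appearance tokens (queries) with geometry tokens (keys/values).

    appearance: (N, C) SAM2-style appearance tokens.
    geometry:   (M, C) MUSt3R-style geometry tokens.
    Returns fused tokens of shape (N, C).
    """
    rng = np.random.default_rng(seed)
    C = appearance.shape[1]
    # Hypothetical learned projections (random here, for illustration only).
    Wq = rng.standard_normal((C, d_head)) / np.sqrt(C)
    Wk = rng.standard_normal((C, d_head)) / np.sqrt(C)
    Wv = rng.standard_normal((C, C)) / np.sqrt(C)

    q, k, v = appearance @ Wq, geometry @ Wk, geometry @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_head), axis=-1)  # (N, M) attention map
    # Residual fusion: each appearance token absorbs geometric context.
    return appearance + attn @ v

app = np.random.default_rng(1).standard_normal((16, 64))
geo = np.random.default_rng(2).standard_normal((16, 64))
fused = merge_features(app, geo)
print(fused.shape)
```

The residual form keeps the appearance pathway intact, so the geometric signal acts as an additive correction rather than a replacement.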

We also introduce a Field-of-View (FoV) Aware Sampling strategy during training to ensure that sampled frames actually observe overlapping regions of the object, preventing the model from learning ambiguous correspondences in wide-baseline scenarios.
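The idea behind FoV-aware sampling can be illustrated with a simple overlap check. In this hedged sketch (the real criterion and data structures are not specified here), each frame carries the set of object surface-point IDs it observes, information available at training time, and only frame pairs whose visible sets overlap enough are kept:

```python
def fov_aware_pairs(visible_ids, min_overlap=0.3):
    """Keep training frame pairs whose views share the object.

    visible_ids: list of sets; visible_ids[i] holds IDs of object
    surface points seen in frame i (a hypothetical training-time signal).
    Returns frame-index pairs with sufficient shared coverage, so the
    model never trains on pairs with no physical overlap.
    """
    pairs = []
    n = len(visible_ids)
    for i in range(n):
        for j in range(i + 1, n):
            inter = len(visible_ids[i] & visible_ids[j])
            union = len(visible_ids[i] | visible_ids[j])
            # Intersection-over-union of the visible object surface.
            if union and inter / union >= min_overlap:
                pairs.append((i, j))
    return pairs

frames = [{1, 2, 3, 4}, {3, 4, 5, 6}, {9, 10}]
print(fov_aware_pairs(frames))  # [(0, 1)] — frame 2 shares nothing
```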

Visual Comparison

Comparison on ScanNet++ and Replica datasets. 3AM consistently tracks objects under extreme viewpoint changes where baselines drift or lose the target.

Detailed Frame Comparisons

Detailed Comparison

Quantitative Results

Our method significantly outperforms state-of-the-art methods on the ScanNet++ dataset, especially on the Selected Subset, which is designed to test robustness under object reappearance and large viewpoint changes.

Method        Whole Set                         Selected Subset
              IoU ↑    Pos IoU ↑   Suc IoU ↑    IoU ↑    Pos IoU ↑   Suc IoU ↑
SAM2          0.4392   0.0235      0.0831       0.3397   0.0179      0.0395
SAM2Long      0.8233   0.4166      0.6855       0.7474   0.4133      0.6382
DAM4SAM       0.8205   0.4193      0.6783       0.7648   0.4356      0.6650
3AM (Ours)    0.8898   0.5630      0.7155       0.9061   0.7168      0.7737

Class-Agnostic 3D Instance Segmentation

By lifting our consistent 2D tracks to 3D, 3AM can perform 3D instance segmentation without explicit 3D supervision or complex merging algorithms.
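One plausible way to realize this lifting, sketched here under assumptions not specified by the source (per-pixel 3D point maps, e.g. from MUSt3R, and a fixed 5 cm voxel grid), is to vote each observed 3D point's instance ID across frames. Because the 2D tracks are already consistent, a simple majority vote per voxel replaces complex merging:

```python
import numpy as np
from collections import Counter, defaultdict

def lift_tracks_to_3d(point_maps, masks, voxel_size=0.05):
    """Lift consistent 2D instance masks into a 3D instance labeling.

    point_maps: list of (H, W, 3) arrays of per-pixel 3D points
                (hypothetical inputs, e.g. predicted point maps).
    masks:      list of (H, W) integer instance-ID maps, 0 = background.
    Returns a dict mapping voxel coordinates to the winning instance ID.
    """
    votes = defaultdict(Counter)
    for pts, ids in zip(point_maps, masks):
        for p, inst in zip(pts.reshape(-1, 3), ids.reshape(-1)):
            if inst > 0:
                # Quantize the 3D point onto a coarse voxel grid.
                voxel = tuple(int(c) for c in np.floor(p / voxel_size))
                votes[voxel][int(inst)] += 1
    # Majority vote: each voxel takes its most frequent instance ID.
    return {v: c.most_common(1)[0][0] for v, c in votes.items()}

# Toy example: two frames seeing the same two points, labels agree.
pts = np.array([[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]])  # (1, 2, 3)
m = np.array([[1, 2]])
labels = lift_tracks_to_3d([pts, pts], [m, m])
print(labels)
```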

3D Instance Segmentation