Ilov3Splat: Instance-Level Open-Vocabulary 3D Scene Understanding in Gaussian Splatting

¹Queensland University of Technology, ²CSIRO Robotics

Abstract

We introduce Ilov3Splat, a novel framework for instance-level open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D-GS). Most prior work depends on 2D rendering-based matching or point-level semantic association, which undermines cross-view consistency, lacks coherent instance-level reasoning, and limits precision in downstream 3D tasks. To address these limitations, our method jointly optimizes scene geometry and semantic representations by augmenting Gaussian splats with view-consistent feature fields. Specifically, we leverage a multi-resolution hash embedding to efficiently encode language-aligned CLIP features, enabling dense and coherent language grounding in 3D space. We further train an instance feature field using a contrastive loss over SAM masks, supporting fine-grained object distinction across views. At inference time, CLIP-encoded queries are matched against the learned features, followed by a two-stage 3D clustering step that retrieves the relevant Gaussian groups. This enables our framework to identify arbitrary objects in 3D scenes from natural language descriptions, without requiring category supervision or manual annotations. Experiments on standard benchmarks demonstrate that Ilov3Splat outperforms prior open-vocabulary 3D-GS methods in both object selection and instance segmentation, offering a flexible and accurate solution for language-driven 3D scene understanding.
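The inference-time matching step mentioned in the abstract can be illustrated with a small, hedged sketch. It assumes each Gaussian already carries a language-aligned feature vector and that the text query has been embedded into the same space. The names `cosine` and `select_gaussians`, the toy vectors, and the threshold of 0.5 are hypothetical choices for illustration, not values from the paper, and the two-stage 3D clustering is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as matching nothing
    return dot / (norm_a * norm_b)


def select_gaussians(query, features, threshold=0.5):
    """Return indices of Gaussians whose feature is close to the query.

    `threshold` is an assumed hyperparameter, not taken from the paper.
    """
    return [i for i, f in enumerate(features) if cosine(query, f) >= threshold]


# Toy stand-ins for a CLIP text embedding and per-Gaussian features.
query = [1.0, 0.0, 0.0]
features = [
    [0.9, 0.1, 0.0],  # nearly aligned with the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.7, 0.0, 0.7],  # partially aligned
]
selected = select_gaussians(query, features)
```

In a full pipeline the selected Gaussians would then be grouped spatially so that a query returns whole object instances rather than scattered points.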

Method

An overview of Ilov3Splat. Left: Our method learns language-aligned and instance-aware features for 3D Gaussians, computed via compact multi-resolution hash encoding and lightweight projection MLPs. Right: Feature learning is guided by multi-view 2D signals, leveraging CLIP for language alignment, DINO for object boundary regularization, and SAM for instance-aware contrastive learning.
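To make the instance-aware contrastive learning signal concrete, here is a minimal sketch of a generic supervised contrastive loss over features labeled by mask id. The exact loss used by Ilov3Splat is not given on this page, so the formulation, the function name `instance_contrastive_loss`, and the temperature of 0.1 are assumptions. Features sharing a mask id are pulled together and all others are pushed apart.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def instance_contrastive_loss(features, mask_ids, temperature=0.1):
    """Generic supervised contrastive loss; not the paper's exact loss.

    features: list of feature vectors, e.g. rendered at sampled pixels.
    mask_ids: id of the 2D mask (e.g. from SAM) covering each sample.
    """
    feats = [_normalize(f) for f in features]
    n = len(feats)
    total, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and mask_ids[j] == mask_ids[i]]
        if not positives:
            continue  # anchors without a same-mask partner contribute nothing
        logits = {
            j: sum(a * b for a, b in zip(feats[i], feats[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all other samples.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0


# Toy example: two well-separated feature clusters.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_consistent = instance_contrastive_loss(toy_features, [0, 0, 1, 1])
loss_mismatched = instance_contrastive_loss(toy_features, [0, 1, 0, 1])
```

When the mask ids agree with the feature clusters the loss is close to zero. When they disagree it is large, and that gap is the gradient signal that would reshape the instance feature field during training.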

Experiments

3D Object Selection

Qualitative results of 3D object selection on the LERF dataset.

3D Instance Segmentation

Qualitative results of category-agnostic 3D instance segmentation on the ScanNet dataset.

Videos

Feature Fields

Visualization of the RGB rendering and the learned instance feature field of Ilov3Splat.

Multi-view Consistency

Multi-view consistency of Ilov3Splat for 3D object selection on the LERF dataset.

Acknowledgements

This work was supported in part by the Australian Research Council Discovery Project under Grant DP250103634, and in part by the Commonwealth Scientific and Industrial Research Organisation (CSIRO). The authors acknowledge continued support from the CSIRO's Embodied AI Cluster.

BibTeX

@inproceedings{nguyen2026ilov3splat,
  author    = {Nguyen, Binh Long and Nguyen, Kien and Sridharan, Sridha and Fookes, Clinton and Moghadam, Peyman},
  title     = {Ilov3Splat: Instance-Level Open-Vocabulary 3D Scene Understanding in Gaussian Splatting},
  booktitle = {International Conference on Pattern Recognition (ICPR)},
  year      = {2026},
}