We introduce Ilov3Splat, a novel framework for instance-level
open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D-GS). Most prior work depends on
2D rendering-based matching or point-level semantic association, which undermines cross-view consistency,
lacks coherent instance-level reasoning, and limits precision in downstream 3D tasks.
To address these limitations, our method jointly optimizes scene geometry and semantic representations by
augmenting Gaussian splats with view-consistent feature fields. Specifically, we use a multi-resolution
hash embedding to efficiently encode language-aligned CLIP features, enabling dense and coherent language
grounding in 3D space. We further train an instance feature field using a contrastive loss over SAM masks,
supporting fine-grained object distinction across views.
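
As an illustration of the contrastive objective over masks, the following is a minimal sketch, not the loss used in this work. It assumes a generic supervised-contrastive form in which rendered pixel features that share a 2D mask id are pulled together and all others are pushed apart. The function names, the temperature value, and the use of cosine similarity are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def mask_contrastive_loss(features, mask_ids, temperature=0.1):
    """Supervised-contrastive-style loss over sampled pixel features.

    features: list of feature vectors, one per sampled pixel (hypothetical input).
    mask_ids: list of integer mask labels of the same length, e.g. from a 2D segmenter.
    Pixels sharing a mask id are treated as positives; all others are negatives.
    """
    n = len(features)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and mask_ids[j] == mask_ids[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled similarities to every other sample.
        logits = {j: cosine(features[i], features[j]) / temperature
                  for j in range(n) if j != i}
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values()))
        # Average negative log-probability of the positives for this anchor.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

Under this formulation, the loss is small when features inside a mask agree with one another and differ from features in other masks, which is the behaviour needed for distinguishing object instances.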
At inference time, CLIP-encoded queries are matched against the learned features, followed by two-stage 3D
clustering to retrieve relevant Gaussian groups. This enables our framework to identify arbitrary objects
in 3D scenes based on natural language descriptions, without requiring category supervision or manual
annotations. Experiments on standard benchmarks demonstrate that Ilov3Splat
outperforms prior open-vocabulary 3D-GS methods in both object selection and instance segmentation,
offering a flexible and accurate solution for language-driven 3D scene understanding.
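
The query-matching step at inference can be sketched as a cosine-similarity lookup. This is a simplified example under stated assumptions: the text-query embedding and the per-Gaussian language features are taken as precomputed vectors, the `select_gaussians` helper and its `threshold` parameter are hypothetical, and the subsequent two-stage 3D clustering is not reproduced here.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def select_gaussians(query_embedding, gaussian_features, threshold=0.5):
    """Return indices of Gaussians whose feature matches the query.

    query_embedding: precomputed text embedding (assumed input).
    gaussian_features: list of per-Gaussian feature vectors (assumed input).
    A Gaussian is kept when its cosine similarity to the query is at least
    `threshold`; results are sorted by descending similarity.
    """
    q = normalize(query_embedding)
    scored = []
    for idx, feat in enumerate(gaussian_features):
        sim = sum(a * b for a, b in zip(q, normalize(feat)))
        if sim >= threshold:
            scored.append((sim, idx))
    # Sort by similarity (highest first), breaking ties by index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored]


# Toy usage with 2D vectors standing in for high-dimensional embeddings.
query = [1.0, 0.0]
features = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [-1.0, 0.0]]
selected = select_gaussians(query, features, threshold=0.5)
```

In a full pipeline, the selected indices would then be grouped spatially and by instance feature so that a single query returns coherent objects and not scattered Gaussians.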