Scene Understanding
516 papers with code • 3 benchmarks • 43 datasets
Scene Understanding is the task of analyzing an image or video to recognize the objects, layout, and overall context of a scene. For instance, the iPhone includes an accessibility feature that helps blind and low-vision users take photos by describing what the camera sees. This is an example of Scene Understanding.
Benchmarks
These leaderboards are used to track progress in Scene Understanding.
Libraries
Use these libraries to find Scene Understanding models and implementations
Datasets
Subtasks
Latest papers
Volumetric Environment Representation for Vision-Language Navigation
To achieve a comprehensive 3D representation with fine-grained details, we introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
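The voxelization idea described above can be illustrated with a minimal sketch: quantizing a 3D point cloud into a fixed grid of structured cells. This is only a demonstration of the general technique, not the paper's implementation; the `voxelize` helper, cell size, and grid shape are assumed for illustration.

```python
import numpy as np

def voxelize(points, cell_size=0.5, grid_shape=(8, 8, 8), origin=(0.0, 0.0, 0.0)):
    """Quantize 3D points into a boolean occupancy grid of structured cells.

    points: (N, 3) array of xyz coordinates.
    Returns an occupancy volume of shape grid_shape.
    """
    grid = np.zeros(grid_shape, dtype=bool)
    # Map each point to the index of the cell it falls into.
    idx = np.floor((points - np.asarray(origin)) / cell_size).astype(int)
    # Discard points that fall outside the grid bounds.
    valid = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    i, j, k = idx[valid].T
    grid[i, j, k] = True
    return grid

pts = np.array([[0.1, 0.1, 0.1],   # lands in cell (0, 0, 0)
                [1.2, 0.3, 2.9],   # lands in cell (2, 0, 5)
                [10.0, 0.0, 0.0]]) # outside the 8x8x8 grid, dropped
occ = voxelize(pts)
print(occ.sum())  # 2 occupied cells
```

A full VER-style representation would store learned features per cell rather than a boolean flag, but the spatial discretization step is the same.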
Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation
Most Vision-and-Language Navigation (VLN) algorithms tend to make decision errors, primarily due to a lack of visual common sense and insufficient reasoning capabilities.
GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding
To address this issue, we propose GroupContrast, a novel approach that combines segment grouping and semantic-aware contrastive learning.
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models.
Stealing Stable Diffusion Prior for Robust Monocular Depth Estimation
This paper introduces a novel approach named Stealing Stable Diffusion (SSD) prior for robust monocular depth estimation.
Embodied Understanding of Driving Scenarios
Hereby, we introduce the Embodied Language Model (ELM), a comprehensive framework tailored for agents' understanding of driving scenes with large spatial and temporal spans.
FusionVision: A comprehensive approach of 3D object reconstruction and segmentation from RGB-D cameras using YOLO and fast segment anything
Therefore, this paper introduces FusionVision, an exhaustive pipeline adapted for the robust 3D segmentation of objects in RGB-D imagery.
One model to use them all: Training a segmentation model with complementary datasets
In this work, we propose a method to combine multiple partially annotated datasets, which provide complementary annotations, into one model, enabling better scene segmentation and the use of multiple readily available datasets.
Swin3D++: Effective Multi-Source Pretraining for 3D Indoor Scene Understanding
Data diversity and abundance are essential for improving the performance and generalization of models in natural language processing and 2D vision.
Semantically-aware Neural Radiance Fields for Visual Scene Understanding: A Comprehensive Review
This review thoroughly examines the role of semantically-aware Neural Radiance Fields (NeRFs) in visual scene understanding, covering an analysis of over 250 scholarly papers.