Language-EXtended Indoor SLAM (LEXIS): A Versatile System for Real-time Visual Scene Understanding

Christina Kassab, Matias Mattamala, Lintong Zhang, Maurice Fallon

IEEE International Conference on Robotics and Automation (ICRA) 2024

Abstract

Versatile and adaptive semantic understanding would enable autonomous systems to comprehend and interact with their surroundings. Existing fixed-class models limit the adaptability of indoor mobile and assistive autonomous systems. In this work, we introduce LEXIS, a real-time indoor Simultaneous Localization and Mapping (SLAM) system that harnesses the open-vocabulary nature of Large Language Models (LLMs) to create a unified approach to scene understanding and place recognition. The approach first builds a topological SLAM graph of the environment (using visual-inertial odometry) and embeds Contrastive Language-Image Pretraining (CLIP) features in the graph nodes. We use this representation for flexible room classification and segmentation, which serves as a basis for room-centric place recognition. This allows loop closure searches to be directed towards semantically relevant places. Our proposed system is evaluated using both public, simulated data and real-world data, covering office and home environments. It successfully categorizes rooms with varying layouts and dimensions and outperforms the state-of-the-art (SOTA). For place recognition and trajectory estimation tasks, we achieve performance equivalent to the SOTA while using the same pre-trained model for all tasks. Lastly, we demonstrate the system’s potential for planning.
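The room-centric idea described in the abstract can be illustrated with a minimal sketch. This is not the LEXIS implementation: the three-dimensional vectors below are toy stand-ins for CLIP image and text embeddings, and the names `classify_node` and `candidate_nodes` are hypothetical. The sketch only shows the general pattern of labelling each graph node with the room prompt whose embedding is most similar, then restricting loop closure candidates to nodes that share the query's room label.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_node(image_embedding, label_embeddings):
    """Return the room label whose text embedding is closest to the image embedding."""
    return max(
        label_embeddings,
        key=lambda label: cosine(image_embedding, label_embeddings[label]),
    )


def candidate_nodes(node_labels, query_label):
    """Keep only the graph nodes whose room label matches the query's label."""
    return [node_id for node_id, label in node_labels.items() if label == query_label]


# Toy stand-ins for text embeddings of room prompts (assumed values, not real CLIP output).
label_embeddings = {
    "kitchen": [1.0, 0.0, 0.0],
    "office": [0.0, 1.0, 0.0],
    "corridor": [0.0, 0.0, 1.0],
}

# Toy stand-ins for the image embeddings stored in the topological graph nodes.
node_embeddings = {
    0: [0.9, 0.1, 0.0],
    1: [0.8, 0.2, 0.1],
    2: [0.1, 0.9, 0.2],
    3: [0.0, 0.2, 0.9],
}

# Label every node, then search for loop closures only among same-room nodes.
node_labels = {n: classify_node(e, label_embeddings) for n, e in node_embeddings.items()}
query_label = classify_node([0.7, 0.3, 0.0], label_embeddings)
loop_closure_candidates = candidate_nodes(node_labels, query_label)
```

Because the room labels are plain text prompts, the label set in such a scheme could be changed without retraining, which is the open-vocabulary property the abstract refers to.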

Citation

@inproceedings{kassab2024lexis,
  author = {Kassab, Christina and Mattamala, Matias and Zhang, Lintong and Fallon, Maurice},
  title = {Language-EXtended Indoor SLAM (LEXIS): A Versatile System for Real-time Visual Scene Understanding},
  booktitle = {IEEE Int. Conf. Robot. Autom.},
  year = {2024},
}

Acknowledgement

We thank Nathan Hughes for providing the ground truth for the uHumans2 dataset.