By Abrar Anwar
Robots are increasingly expected to perceive and interact with their environments over extended periods. They are deployed for hours, if not days, at a time, and along the way they incidentally perceive different objects, events, and locations. To help robots understand and respond to questions that require complex multi-step reasoning over these long deployments, we built ReMEmbR, a retrieval-augmented memory for embodied robots. In this blog post, we will walk you through the main insights behind ReMEmbR. Also, check out the NVIDIA blog post on ReMEmbR here.
ReMEmbR builds scalable long-horizon memory and reasoning systems for robots, which improve their capacity for perceptual question-answering and semantic action-taking. ReMEmbR consists of two phases: memory-building and querying.

Memory Building
When your robot has been deployed for hours or days, you need an efficient way of storing this information. Videos are easy to store, but hard to query and understand. Past and concurrent work has considered approaches like scene graphs, semantic maps, or queryable scene representations. However, as a robot’s memory grows over extended periods of time, memory building becomes difficult: you either have to update your memory as scenes change, or you have to handle its constant growth. In ReMEmbR, we take a simpler approach: we use a vector database.
During memory building, we take short segments of video, caption them with the NVIDIA VILA captioning VLM, and then embed them into a MilvusDB vector database. We also store timestamps and coordinate information from the robot in the vector database.
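To make this concrete, here is a minimal sketch of the memory-building loop, assuming a local Milvus Lite database. The `caption_segment` function is a hypothetical stand-in for the VILA captioning call, and the sentence-transformer embedder is just a placeholder; neither is necessarily what the actual system uses.

```python
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder text embedder (384-dim)
client = MilvusClient("remembr_memory.db")           # local Milvus Lite database file
client.create_collection(collection_name="robot_memory", dimension=384)

def add_segment(segment_id, frames, timestamp, position):
    """Caption a short video segment and store it with time and pose metadata."""
    caption = caption_segment(frames)                 # hypothetical wrapper around the VILA VLM
    client.insert(
        collection_name="robot_memory",
        data=[{
            "id": segment_id,
            "vector": embedder.encode(caption).tolist(),
            "caption": caption,
            "time": timestamp,                        # seconds since deployment start
            "x": float(position[0]),                  # robot pose in the map frame
            "y": float(position[1]),
        }],
    )
```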
This setup enables us to efficiently store and query all kinds of information from the robot’s memory. By captioning video segments with VILA and embedding those captions into a MilvusDB vector database, the system can remember anything that VILA can capture, from dynamic events such as people walking around, to specific small objects, all the way to more general categories. Using a vector database also makes it easy to add new kinds of information for ReMEmbR to take into consideration.
ReMEmbR Agent
Given such a long memory stored in the database, a standard LLM would struggle to reason quickly over the long context. As such, we build an LLM agent that is able to iteratively query its memory over different kinds of information to answer a user’s question. The LLM backend for the ReMEmbR agent can be NVIDIA NIM microservices, local on-device LLMs, or other LLM APIs. In our paper, we use GPT-4o.
When a user poses a question, the LLM generates queries to the database, retrieving relevant information iteratively. The LLM can query for text, time, or position information depending on what the user is asking, and this process repeats until the question is answered. Equipping the agent with these different tools lets the robot go beyond answering questions about how to get to specific places; it can also reason spatially and temporally.
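As a rough illustration (not the exact implementation), the loop below shows how an agent could alternate between retrieving from the Milvus memory and deciding it has enough context to answer, reusing the `client` and `embedder` from the memory-building sketch above. The `llm.plan` and `llm.answer` calls and the stopping criterion are hypothetical stand-ins for the agent's function-calling backend; only text retrieval is spelled out, with time and position retrieval following the same pattern.

```python
def text_retrieval(query_text, k=5):
    """Return the k memory entries whose captions best match the query."""
    hits = client.search(
        collection_name="robot_memory",
        data=[embedder.encode(query_text).tolist()],
        limit=k,
        output_fields=["caption", "time", "x", "y"],
    )
    return [hit["entity"] for hit in hits[0]]

def answer_question(question, llm, max_steps=5):
    """Iteratively query the memory until the agent decides it can answer."""
    context = []
    for _ in range(max_steps):
        step = llm.plan(question, context)            # hypothetical: choose a tool call or answer
        if step.action == "answer":
            return step.text
        if step.action == "text_retrieval":
            context.extend(text_retrieval(step.argument))
        # "time_retrieval" and "position_retrieval" would be handled analogously
    return llm.answer(question, context)              # hypothetical fallback after max_steps
```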
Below is a GIF showing an example of the reasoning process ReMEmbR can take.

Evaluation Dataset and Results
We introduce the NaVQA dataset, which is composed of 210 examples across three different time ranges up to 20 minutes in length. The dataset consists of spatial, temporal, and descriptive questions, each of which has different types of outputs as shown below.
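To give a feel for the question types (these are made-up examples, not actual dataset entries), spatial questions expect a position, temporal questions expect a time or duration, and descriptive questions expect text or a yes/no answer:

```python
# Illustrative examples of the three NaVQA question types; not real dataset entries.
examples = [
    {"type": "spatial",     "question": "Where did you see a fire extinguisher?", "answer": (12.4, -3.1)},  # position (x, y)
    {"type": "temporal",    "question": "How long ago did you pass the kitchen?", "answer": 540.0},         # duration in seconds
    {"type": "descriptive", "question": "Was anyone sitting in the lobby?",       "answer": "yes"},         # text / yes-no
]
```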

We compare ReMEmbR to an approach that processes all captions at once and another that processes all frames at once. We find that GPT-4o-based approaches perform the best, and that ReMEmbR outperforms the LLM-based method and remains competitive with the VLM-based approach on the Short videos. The Medium and Long videos are too long for the VLM to process, and are thus marked with an ✗.

Deploying ReMEmbR on a Real Robot
To demonstrate how ReMEmbR can be integrated into a real robot, we built a demo using ReMEmbR with NVIDIA Isaac ROS and Nova Carter. The library is open source, can be found here, and does not require ROS to use! The NVIDIA blog post here has detailed instructions on getting ReMEmbR to work with AMCL-based localization, which should transfer to most robots.
In the demo, the robot answers questions and guides people around an office environment. To build the robot’s memory, we simply drove the robot around the NVIDIA office building for ~25 minutes without prompting it to watch for specific objects. Then, when we queried it with open-ended questions, we found some impressive results!
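To give a sense of the glue between the agent and the navigation stack (a simplified sketch, not the demo's actual code), the position that ReMEmbR returns can be published as a navigation goal in ROS 2; the topic name and map frame here are assumptions that depend on your setup.

```python
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PoseStamped

class ReMEmbRGoalPublisher(Node):
    """Publishes a navigation goal at the position returned by the ReMEmbR agent."""

    def __init__(self):
        super().__init__("remembr_goal_publisher")
        self.pub = self.create_publisher(PoseStamped, "/goal_pose", 10)  # assumed Nav2 goal topic

    def send_goal(self, x, y):
        goal = PoseStamped()
        goal.header.frame_id = "map"                   # assumed localization frame (e.g., from AMCL)
        goal.header.stamp = self.get_clock().now().to_msg()
        goal.pose.position.x = x
        goal.pose.position.y = y
        goal.pose.orientation.w = 1.0                  # no preferred heading; leave it to the planner
        self.pub.publish(goal)
```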
Concluding Remarks
I think ReMEmbR is a good first step towards building memory representations for robots that will live in our homes and offices for hours to days at a time. The representation is not perfect. For example, ReMEmbR does not have a sense of “closeness”: its answers point to the first place it saw a specific object, not the closest one. This flaw makes our approach not particularly useful for mobile manipulation. On the other hand, other forms of spatial memory struggle to keep up with changes in the environment, and temporal memory is often ignored entirely, whereas ReMEmbR handles both easily.
Since the release of ReMEmbR, there have been other concurrent works that consider memory for navigation robots. Embodied-RAG integrates a topological map and a semantic forest, along with retrieval, to build a spatial memory. H-EMV builds a memory over time and uses Python code to query it in an intelligent manner, which is more structured than our function calling. RONAR uses captioning to build a narration of the robot’s experience over time, which they use for various downstream tasks. More structured approaches like DynaMem are also pretty exciting! We can’t wait to see where this field goes!
We will be presenting ReMEmbR at ICRA 2025! For more information about ReMEmbR, check out our website here and GitHub here.