Using Generative AI to Enable Robots to Reason and Act with ReMEmbR

Vision-language models (VLMs) combine the powerful language understanding of foundational LLMs with the vision capabilities of vision transformers (ViTs) by projecting text and images into the same embedding space. They can take unstructured multimodal data, reason over it, and return the output in a structured format. Building on a broad base of pretraining, they can be easily adapted to different vision-related tasks through new prompts or parameter-efficient fine-tuning.

They can also be integrated with live data sources and tools so that they can request more information when they don’t know the answer or take action when they do. LLMs and VLMs can act as agents, reasoning over data to help robots perform meaningful tasks that might be hard to define.

In a previous post, Bringing Generative AI to Life with NVIDIA Jetson, we demonstrated that you can run LLMs and VLMs on NVIDIA Jetson Orin devices, enabling a breadth of new capabilities like zero-shot object detection, video captioning, and text generation on edge devices.

But how can you apply these advances to perception and autonomy in robotics? What are the challenges you face when deploying these models into the field?

In this post, we discuss ReMEmbR, a project that combines LLMs, VLMs, and retrieval-augmented generation (RAG) to enable robots to reason and take actions based on what they see during a long-horizon deployment, on the order of hours to days.

ReMEmbR’s memory-building phase uses VLMs and vector databases to efficiently build a long-horizon semantic memory. Then ReMEmbR’s querying phase uses an LLM agent to reason over that memory. ReMEmbR is fully open source and runs on-device.
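
To make the two phases concrete, here is a minimal Python sketch of the idea, not ReMEmbR's actual code. It assumes that each memory entry holds a caption (the text a VLM might produce for a short video segment), the robot's position, and a timestamp. The names `embed`, `MemoryEntry`, and `SemanticMemory` are hypothetical, and a simple bag-of-words vector stands in for a learned embedding model and a real vector database.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy stand-in for a learned text embedding: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class MemoryEntry:
    caption: str                   # assumed: text a VLM produced for a video segment
    position: tuple[float, float]  # assumed: robot (x, y) when the segment was seen
    timestamp: float               # assumed: seconds since the deployment started
    vector: Counter = field(init=False)

    def __post_init__(self) -> None:
        self.vector = embed(self.caption)


class SemanticMemory:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.entries: list[MemoryEntry] = []

    def add(self, caption: str, position: tuple[float, float], timestamp: float) -> None:
        # Memory-building phase: store the caption with where and when it was seen.
        self.entries.append(MemoryEntry(caption, position, timestamp))

    def search(self, query: str, k: int = 1) -> list[MemoryEntry]:
        # Querying phase: return the k entries most similar to the query text.
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e.vector), reverse=True)
        return ranked[:k]


memory = SemanticMemory()
memory.add("an empty hallway with blue carpet", (4.0, 1.0), 20.0)
memory.add("a vending machine with snacks next to the elevator", (12.0, 3.5), 60.0)
memory.add("a kitchen area with a coffee machine and a sink", (20.0, 8.0), 300.0)

best = memory.search("where can I get snacks", k=1)[0]
print(best.caption, best.position, best.timestamp)
```

Because only compact captions and vectors are stored rather than raw video, a memory of this shape can cover hours of deployment while a query still retrieves just a handful of relevant entries for the LLM to reason over.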

ReMEmbR addresses many of the challenges faced when using LLMs and VLMs in a robotics application:

How to handle large contexts.
How to reason over a spatial memory.
How to build a prompt-based agent to query more data until a user’s question is answered.
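
The last point, an agent that keeps querying until it can answer, can be illustrated with a small loop. This is a sketch under stated assumptions rather than ReMEmbR's implementation: `decide` is a hypothetical stub standing in for a prompted LLM, `retrieve` is a hypothetical stub standing in for a vector-database lookup, and `NOTES` is made-up example data.

```python
from typing import Optional

# Made-up example data standing in for a robot's stored memory.
NOTES = {
    "snacks": ["t=60s, position (12.0, 3.5): vending machine with snacks"],
    "elevator": ["t=65s, position (13.0, 3.5): elevator doors"],
}


def retrieve(query: str) -> list[str]:
    """Stub retrieval: a real system would query a vector database here."""
    return NOTES.get(query, [])


def decide(question: str, context: list[str], tried: list[str]) -> tuple[str, str]:
    """Stub policy: a real system would prompt an LLM with the question and context."""
    for note in context:
        if "snacks" in note:
            return "answer", note
    # Try a poor query first, then a better one, to show more than one round.
    for candidate in ("food", "snacks"):
        if candidate not in tried:
            return "query", candidate
    return "answer", "I could not find that."


def run_agent(question: str, max_steps: int = 5) -> Optional[str]:
    """Alternate between deciding and retrieving until an answer is produced."""
    context: list[str] = []
    tried: list[str] = []
    for _ in range(max_steps):
        action, payload = decide(question, context, tried)
        if action == "answer":
            return payload
        tried.append(payload)              # remember the query that was issued
        context.extend(retrieve(payload))  # grow the context with new results
    return None                            # step budget exhausted without an answer


result = run_agent("Where can I get snacks?")
print(result)
```

The step budget (`max_steps`) is a design choice in this sketch: it bounds how long the agent can keep asking for more data before giving up.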

To take things a step further, we also built an example of using ReMEmbR on a real robot with Nova Carter and NVIDIA Isaac ROS, and we share the code and the steps that we took. For more information, see the following resources: