Google DeepMind unveils its first “thinking” robotics AI

https://arstechnica.com/google/2025/09/google-deepmind-unveils-its-first-thinking-robotics-ai/

Ryan Whitwam · Sep 25, 2025 · 3 min read

Generative AI systems that create text, images, audio, and even video are becoming commonplace. In the same way AI models output those data types, they can also be used to output robot actions. That's the foundation of Google DeepMind's Gemini Robotics project, which has announced a pair of new models that work together to create the first robots that "think" before acting. Traditional LLMs have their own set of problems, but the introduction of simulated reasoning did significantly upgrade their capabilities, and now the same could be happening with AI robotics.

The team at DeepMind contends that generative AI is a uniquely important technology for robotics because it unlocks general functionality. Current robots have to be trained intensively on specific tasks, and they are typically bad at doing anything else. "Robots today are highly bespoke and difficult to deploy, often taking many months in order to install a single cell that can do a single task," said Carolina Parada, head of robotics at Google DeepMind.

The fundamentals of generative systems make AI-powered robots more general. They can be presented with entirely new situations and workspaces without needing to be reprogrammed. DeepMind's current approach to robotics relies on two models: one that thinks and one that does.

The two new models are known as Gemini Robotics 1.5 and Gemini Robotics-ER 1.5. The former is a vision-language-action (VLA) model, meaning it uses visual and text data to generate robot actions. The "ER" in the other model stands for embodied reasoning. This is a vision-language model (VLM) that takes visual and text input to generate the steps needed to complete a complex task.

The thinking machines

Gemini Robotics-ER 1.5 is the first robotics AI capable of simulated reasoning like modern text-based chatbots—Google likes to call this "thinking," but that's a bit of a misnomer in the realm of generative AI. DeepMind says the ER model achieves top marks in both academic and internal benchmarks, which shows that it can make accurate decisions about how to interact with a physical space. It doesn't undertake any actions, though. That's where Gemini Robotics 1.5 comes in.

Imagine that you want a robot to sort a pile of laundry into whites and colors. Gemini Robotics-ER 1.5 would process the request along with images of the physical environment (a pile of clothing). This AI can also call tools like Google Search to gather more data. The ER model then generates natural language instructions: the specific steps the robot should follow to complete the given task.
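To give a rough sense of what that planning step might look like for a developer, here is a minimal sketch using Google's public google-genai Python SDK. The model ID, image file, and prompt are placeholders for illustration; they are assumptions, not details from DeepMind's announcement.

```python
# Illustrative only: ask a Gemini-family model to break a task into steps.
# The model ID below is an assumed placeholder; substitute whatever Google
# exposes for Gemini Robotics-ER 1.5 in AI Studio.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("laundry_pile.jpg", "rb") as f:  # hypothetical scene image
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # placeholder model name
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Plan the steps a two-armed robot should follow to sort this laundry "
        "into whites and colors. Return a numbered list of short instructions.",
    ],
)
print(response.text)  # a numbered, natural-language plan
```

The output of a call like this would be the kind of numbered, natural-language plan that the action model then works through step by step.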

Gemini Robotics 1.5 (the action model) takes these instructions from the ER model and generates robot actions while using visual input to guide its movements. But it also goes through its own thinking process to consider how to approach each step. "There are all these kinds of intuitive thoughts that help [a person] guide this task, but robots don't have this intuition," said DeepMind's Kanishka Rao. "One of the major advancements that we've made with 1.5 in the VLA is its ability to think before it acts."
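As described, the division of labor roughly amounts to a planner feeding an actor that re-checks the scene and reasons before every move. The sketch below is purely conceptual: plan_steps, ActionModel, and the stubbed camera frame are hypothetical stand-ins, not the actual Gemini Robotics interfaces.

```python
# Conceptual sketch only: every name here is a hypothetical stand-in for the
# planner/actor split described above, not a real Gemini Robotics API.

def plan_steps(task: str) -> list[str]:
    # Stand-in for the ER ("thinking") model: turns a request into ordered steps.
    return [f"survey the scene for: {task}",
            "pick up the next item",
            "place the item in the matching pile"]

class ActionModel:
    # Stand-in for the VLA (action) model, which also reasons before each move.
    def think(self, step: str, frame: str) -> str:
        return f"approach for '{step}' given {frame}"

    def act(self, thought: str) -> str:
        return f"motor command derived from: {thought}"

def run(task: str) -> None:
    actor = ActionModel()
    for step in plan_steps(task):            # natural-language plan from the ER model
        frame = "camera frame (stubbed)"     # fresh visual input for each step
        thought = actor.think(step, frame)   # "think before acting"
        print(actor.act(thought))            # the resulting low-level action

run("sort laundry into whites and colors")
```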

Both of DeepMind's new robotic AIs are built on the Gemini foundation models but have been fine-tuned with data that adapts them to operating in a physical space. This approach, the team says, gives robots the ability to undertake more complex multi-stage tasks, bringing agentic capabilities to robotics.

The DeepMind team tests Gemini Robotics with a few different machines, like the two-armed Aloha 2 and the humanoid Apollo. In the past, AI researchers had to create customized models for each robot, but that's no longer necessary. DeepMind says that Gemini Robotics 1.5 can learn across different embodiments, transferring skills learned from Aloha 2's grippers to the more intricate hands on Apollo with no specialized tuning.

All this talk of physical agents powered by AI is fun, but we're still a long way from a robot you can order to do your laundry. Gemini Robotics 1.5, the model that actually controls robots, is still only available to trusted testers. However, the thinking ER model is now rolling out in Google AI Studio, allowing developers to generate robotic instructions for their own physically embodied experiments.