Summary
Tasks requiring two-hand coordination and precise, fine-grained manipulation are challenging for current robotic systems. We propose a sample-efficient, language-conditioned, voxel-based method that utilizes Vision Language Models (VLMs) to prioritize key regions within the scene, which reduces the cost of processing voxels, and to reconstruct a voxel grid for bimanual manipulation policies.
Bimanual manipulation is essential for robots to manipulate objects as capably as humans do. It becomes necessary when an object is too large to be controlled by one hand, or when stabilizing an object with one hand makes it easier for the other hand to manipulate it. In this work, we focus on robotic bimanual manipulation that is “asymmetric”. Here, “asymmetry” refers to the functions of the two arms: one is the stabilizing arm, while the other is the acting arm. Asymmetric tasks are common in household and industrial settings, such as cutting food, opening bottles, and packaging boxes. They typically require two-hand coordination and high-precision, fine-grained manipulation, which are challenging for current robotic manipulation systems.
In contrast to single-arm manipulation, bimanual manipulation is challenging due to its higher-dimensional action space. To tackle bimanual manipulation, prior works train policies on large datasets or exploit primitive actions (Figure 1). However, these approaches are generally sample-inefficient, and primitives can hinder generalization because they are not easily adapted to other types of tasks. To overcome these issues, we propose VoxAct-B (Figure 2), a language-conditioned, voxel-based method that leverages Vision Language Models (VLMs) to prioritize key regions within the robot’s field of view and construct a voxel grid. We provide this voxel grid to our bimanual manipulation policy to learn acting and stabilizing actions. This approach enables more efficient policy learning from voxels and generalizes to different tasks.
Voxel Representation for Robotic Manipulation
Voxels are the three-dimensional equivalent of pixels, representing points in 3D space. One of the most prominent works that uses voxels for robotic manipulation is Perceiver-Actor (PerAct). PerAct is a multi-task, language-conditioned behavior cloning (BC) agent capable of learning various 6-DoF manipulation tasks from only a few demonstrations per task. It encodes language goals and voxels with a Perceiver Transformer and outputs the discretized pose of the next best voxel within a 3D spatial action map that has the same dimensions as the input voxel grid. PerAct is known to be a sample-efficient method: it learns 7 real-world tasks with just 53 demonstrations. Its problem formulation, specifically the discretized action space with a cross-entropy loss, allows it to model multi-modal actions.
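To make the discretized action space concrete, below is a minimal sketch (not the authors' code; the grid size and tensor shapes are illustrative assumptions) of a cross-entropy loss over a 3D map of next-best-voxel logits. Treating every voxel as a class lets the softmax place probability mass on several distinct voxels, which is what allows multi-modal actions to be modeled.

```python
# A minimal sketch (not the authors' code): cross-entropy over a discretized
# 3D action map. Grid size and tensor shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

GRID = 100  # assumed number of voxels per side

def translation_loss(voxel_logits: torch.Tensor, expert_voxel_idx: torch.Tensor) -> torch.Tensor:
    """voxel_logits: (B, GRID, GRID, GRID) scores for the next best voxel.
    expert_voxel_idx: (B, 3) integer (x, y, z) voxel indices from the demonstration."""
    b = voxel_logits.shape[0]
    flat_logits = voxel_logits.reshape(b, -1)                # (B, GRID^3)
    flat_target = (expert_voxel_idx[:, 0] * GRID * GRID
                   + expert_voxel_idx[:, 1] * GRID
                   + expert_voxel_idx[:, 2])                 # (B,)
    # Treating every voxel as a class lets the softmax spread probability mass
    # over several distinct voxels, i.e., model multi-modal actions.
    return F.cross_entropy(flat_logits, flat_target)

# Usage with dummy tensors:
loss = translation_loss(torch.randn(2, GRID, GRID, GRID), torch.randint(0, GRID, (2, 3)))
```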
However, the main criticism of PerAct is that processing voxels is computationally demanding. Follow-up works, such as RVT and Act3D, reduce this cost by avoiding voxel representations, but they often need multiple views of the scene to achieve optimal performance and may be less interpretable than a voxel grid. In this work, we retain the spatial-equivariance benefits of voxel representations (transformations of the input lead to corresponding transformations of the output) while reducing the cost of processing voxels by “zooming” into part of the voxel grid.
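As a concrete illustration of the voxel representation itself, the sketch below (with assumed workspace bounds and grid size, not the paper's exact code) converts an RGB-D point cloud into an occupancy voxel grid.

```python
# A minimal sketch (assumed workspace bounds and grid size, not the paper's code)
# of converting an RGB-D point cloud into an occupancy voxel grid.
import numpy as np

def voxelize(points: np.ndarray, bounds_min: np.ndarray, bounds_max: np.ndarray,
             grid_size: int = 100) -> np.ndarray:
    """points: (N, 3) XYZ coordinates in the workspace frame.
    Returns a (grid_size, grid_size, grid_size) boolean occupancy grid."""
    voxel_size = (bounds_max - bounds_min) / grid_size
    # Keep only points inside the workspace bounds.
    in_bounds = np.all((points >= bounds_min) & (points < bounds_max), axis=1)
    idx = np.floor((points[in_bounds] - bounds_min) / voxel_size).astype(int)
    idx = np.clip(idx, 0, grid_size - 1)
    grid = np.zeros((grid_size,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# Because a voxel's index is a rigid function of 3D position, shifting the scene
# shifts the occupied voxels by the same amount (the spatial-equivariance property).
```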
Method
VoxAct-B takes RGB-D images, two language goals, and proprioception data of the two robot arms as input. One language goal designates the left arm as the acting arm and the right arm as the stabilizing arm; the other designates the reverse. We input an RGB image from the front camera and a text query extracted from the language goals into the Vision Language Models (VLMs).
Vision Language Models
We use a two-stage approach for our VLMs. First, we input a text query and an RGB image from the front camera to OWL-ViT, an open-vocabulary object detector, to detect the object. Then, we use Segment Anything, a foundational image segmentation model, to obtain a segmentation mask of the object, and we use the mask’s centroid along with point cloud data, obtained from the front camera’s RGB-D image, to retrieve the object’s pose with respect to the front camera. We use this information to determine the task-specific role of each arm and the language goal. For example, in the drawer tasks, we use the drawer’s pose relative to the front camera to determine which robot arm the drawer is facing. If it faces the left robot arm, the language goal that assigns the left arm as the acting arm is selected, because this orientation gives the left acting arm a better angle for opening the drawer; the same applies symmetrically for the right robot arm.
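Below is a minimal sketch of this two-stage pipeline, using the open-source OWL-ViT (via Hugging Face Transformers) and Segment Anything releases. The checkpoint names, file paths, text query, and camera intrinsics are placeholder assumptions, not values from the paper.

```python
# A minimal sketch of the two-stage pipeline, using the open-source OWL-ViT
# (Hugging Face Transformers) and Segment Anything releases. Checkpoint names,
# file paths, the text query, and the camera intrinsics are placeholder assumptions.
import numpy as np
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection
from segment_anything import sam_model_registry, SamPredictor

image = Image.open("front_camera_rgb.png").convert("RGB")  # placeholder path
depth = np.load("front_camera_depth.npy")                  # placeholder path, in meters
query = "a drawer"                                         # query extracted from the language goal

# Stage 1: open-vocabulary detection with OWL-ViT.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
detector = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
inputs = processor(text=[[query]], images=image, return_tensors="pt")
with torch.no_grad():
    outputs = detector(**inputs)
target_sizes = torch.tensor([image.size[::-1]])            # (height, width)
result = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes)[0]
box = result["boxes"][result["scores"].argmax()].numpy()   # best box, XYXY format

# Stage 2: segment the detected object with Segment Anything, prompted by the box.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(image))
masks, _, _ = predictor.predict(box=box, multimask_output=False)
mask = masks[0]

# Mask centroid -> 3D position via the depth image and pinhole intrinsics.
rows, cols = np.nonzero(mask)
cu, cv = int(cols.mean()), int(rows.mean())
fx, fy, cx, cy = 615.0, 615.0, 320.0, 240.0                # assumed intrinsics
z = depth[cv, cu]
obj_pos_cam = np.array([(cu - cx) * z / fx, (cv - cy) * z / fy, z])
```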
We use the object’s position with the RGB-D images to reconstruct a voxel grid that spans s × d meters of the workspace using a fixed number of voxels, where s is a fraction that determines the size of the crop and d is the number of meters that spans the workspace. This allows zooming into the more important region of interest. Figure 5 illustrates the effect of s on the voxel resolution with the same number of voxels. An appropriate value of s (e.g., 0.3) allows the voxel grid to achieve high resolution, which provides sufficient detail of the scene for accurate, fine-grained bimanual manipulation.
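The sketch below (with an assumed number of voxels per side) shows how the zoomed-in bounds could be computed from the VLM-estimated object position, and how shrinking s shrinks each voxel's edge length, increasing the effective resolution.

```python
# A minimal sketch (assumed grid size) of computing the zoomed-in workspace bounds
# from the VLM-estimated object position; s and d follow the definitions above.
import numpy as np

def zoomed_bounds(obj_pos: np.ndarray, s: float, d: float):
    """obj_pos: (3,) object position in the workspace frame.
    s: crop fraction (e.g., 0.3); d: workspace extent in meters.
    Returns (bounds_min, bounds_max) of a cube spanning s * d meters per side."""
    half_extent = 0.5 * s * d
    return obj_pos - half_extent, obj_pos + half_extent

# With a fixed number of voxels per side, shrinking s shrinks each voxel's edge
# length proportionally, so the cropped grid resolves finer geometry.
grid_size = 100                                  # assumed voxel count per side
bmin, bmax = zoomed_bounds(np.array([0.3, 0.0, 0.8]), s=0.3, d=1.0)
voxel_edge = (bmax - bmin)[0] / grid_size        # 0.003 m, versus 0.01 m at s = 1.0
```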
Bimanual Manipulation Policies
The zoomed-in voxel grid, the language goal, proprioception data of both robot arms, and an arm ID are provided to an acting policy and a stabilizing policy, both based on the PerAct architecture. We made two important modifications to PerAct for bimanual manipulation. First, instead of training separate left-arm and right-arm policies, we exploit the discretized action space that predicts the next best voxel by formulating a system with an acting policy and a stabilizing policy. This formulation enables more efficient learning from multi-modal demonstrations than a joint-space control policy. Figure 6 shows a valid next-best voxel, denoted as a red dot, from the acting policy for both left and right arms. Second, we added an arm ID head to predict the arm ID, which allows the policies to learn to map the appropriate acting or stabilizing actions to a given arm. In total, the acting and stabilizing policies predict the discretized pose of the next best voxel, a gripper open action, a collision avoidance flag, and an arm ID for bimanual manipulation.
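To make the output format concrete, here is a minimal sketch (head shapes and the 5-degree rotation bins are illustrative assumptions) of decoding one policy's discretized outputs into a continuous end-effector action plus the arm ID.

```python
# A minimal sketch (head shapes and 5-degree rotation bins are illustrative
# assumptions) of decoding one policy's discretized outputs into a continuous
# end-effector action plus the arm ID.
import numpy as np

def decode_action(trans_logits, rot_logits, grip_logits, collision_logits,
                  arm_id_logits, bounds_min, bounds_max, grid_size=100, rot_bins=72):
    """trans_logits: (grid_size, grid_size, grid_size) next-best-voxel scores.
    rot_logits: (3, rot_bins) per-axis discretized Euler-angle scores.
    grip_logits, collision_logits, arm_id_logits: (2,) binary-head scores."""
    # Translation: argmax voxel, mapped to the center of that voxel in world coordinates.
    idx = np.unravel_index(np.argmax(trans_logits), trans_logits.shape)
    voxel_size = (bounds_max - bounds_min) / grid_size
    position = bounds_min + (np.array(idx) + 0.5) * voxel_size
    # Rotation: one discretized bin per Euler axis.
    euler_deg = np.argmax(rot_logits, axis=1) * (360.0 / rot_bins) - 180.0
    return {
        "position": position,
        "euler_deg": euler_deg,
        "gripper_open": bool(np.argmax(grip_logits)),
        "ignore_collisions": bool(np.argmax(collision_logits)),
        "arm_id": int(np.argmax(arm_id_logits)),   # which arm should execute this action
    }
```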
Experiments and Results
In simulation, we compare against several strong baseline methods: Action Chunking with Transformers (ACT), Diffusion Policy, and VoxPoser. Each method is trained on 10 or 100 demonstrations, where half are left-acting and right-stabilizing demonstrations, and the other half are right-acting and left-stabilizing demonstrations. Methods are evaluated on 25 episodes of unseen test data. Environment variations include object spawn locations, sizes, orientations, and colors. Our method outperforms all baselines by a large margin in a low-data regime. Even with additional training data, it continues to surpass all baselines.
We also conducted real-world experiments on Open Jar (Figure 9) and Open Drawer (Figure 10). For Open Jar, we train VoxAct-B on 10 left-acting, right-stabilizing and 10 right-acting, left-stabilizing demonstrations. It succeeds in 5 out of 10 trials, demonstrating its ability to learn from multi-modal, real-world data. For Open Drawer, we train the policy on 10 right-acting, left-stabilizing demonstrations. It succeeds in 6 out of 10 trials.
The most common failure cases of VoxAct-B result from imprecise object grasping. Occasionally, VLMs fail to detect or segment objects, causing VoxAct-B to perform undesirable actions. These issues could be mitigated by increasing the number of voxels and using better VLMs.
For future work, we would like to extend this method to long-horizon and multi-modal tasks. For more information, please visit our website: https://voxact-b.github.io. We will be presenting VoxAct-B at the Conference on Robot Learning (CoRL) 2024.