with Dongwon Son, Jaehyung Kim, and Sanghyeon Son

The object estimation problem

Think of a robot arm working in settings such as households or logistics. The robot needs to interact with objects, for example picking one up and safely placing it somewhere else. This seems easy to a human, so why don't we see such robots around us yet?


Over the last decade, deep learning has enabled computer vision tasks such as object detection and semantic segmentation. However, to a robot, the world is not just an assortment of static images but a dynamic environment full of objects that need to be grasped, moved, or manipulated. The key is the robot's ability to perceive and track objects, with their shapes and poses, together with an understanding of physics.

Our goal is to develop a robust real-time perception algorithm and transition model with these capabilities for diverse objects. The main hurdle is partial observability: the information from the sensor is often incomplete and noisy, so the robot's estimate must be able to express uncertainty.

Our approach

We build our approach on the question of what representation an object should have. In our previous work, BRAX-LOCC [1], we proposed an analytical physics engine that operates directly on a latent neural shape representation. This enables a computationally efficient transition model without reconstructing the shape into an explicit representation such as a mesh. We further extend [1] to a more manipulable representation by adopting methods from geometric deep learning [2, 3].
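To make the idea concrete, here is a minimal sketch of a transition model that steps an object state consisting of a latent shape code plus a rigid-body pose, querying an implicit (SDF-style) decoder for contact instead of rebuilding a mesh. All names, dimensions, and the toy decoder are illustrative assumptions, not the actual BRAX-LOCC API.

```python
import numpy as np

# Hypothetical latent object state: a fixed-size shape code plus pose.
LATENT_DIM = 32

def make_state(z, position, velocity):
    """Pack a latent shape code and rigid-body state into one dict."""
    return {"z": z, "pos": position, "vel": velocity}

def signed_distance(z, query_points):
    """Stand-in for a learned SDF decoder mapping (latent, points) -> distances.
    A real system would use a trained network here; this toy version treats
    the latent norm as a sphere radius."""
    radius = np.linalg.norm(z) / np.sqrt(LATENT_DIM)
    return np.linalg.norm(query_points, axis=-1) - radius

def step(state, dt=0.01, g=9.81, ground_height=0.0):
    """One transition step operating directly on the latent state:
    apply gravity, then resolve penetration when the decoded SDF at the
    ground point goes negative. No mesh is ever reconstructed."""
    vel = state["vel"] + np.array([0.0, 0.0, -g * dt])
    pos = state["pos"] + vel * dt
    # Query the implicit shape at the ground point, expressed in the object frame.
    ground = np.array([[0.0, 0.0, ground_height]])
    d = signed_distance(state["z"], ground - pos)
    if d[0] < 0.0:  # penetration -> zero downward velocity, push out
        vel[2] = max(vel[2], 0.0)
        pos[2] -= d[0]
    return make_state(state["z"], pos, vel)

# Drop a unit-radius "object" from z = 1; it settles resting on the ground.
state = make_state(np.ones(LATENT_DIM), np.array([0.0, 0.0, 1.0]), np.zeros(3))
for _ in range(200):
    state = step(state)
```

The point of the sketch is the data flow: the shape stays a latent vector end to end, and only cheap point queries against the decoder are needed inside the physics step.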

On top of this representation, we build a perception model for vision observations. Diffusion models [4, 5] were a recent breakthrough in generative modeling, and it has since become clear that their abilities are not restricted to image generation but extend to 3D shape modeling [6]. We incorporate the ideas of shape refinement and stochastic behavior into 3D object detection [7] to estimate the representation from vision observations.

This video shows preliminary results with primitive shapes: the system runs in real time while predicting the shapes and poses of objects with multiple hypotheses. Furthermore, the warm-start trick of the diffusion model enables tracking across time sequences.

https://youtu.be/0SJfSS2KqI8?si=jz_BPqj4nBczKNvK
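The warm-start trick can be sketched as follows: the first frame is denoised from pure noise with many refinement steps, while each subsequent frame starts from the previous estimate with a little added noise and only a few steps. The toy denoiser below is a placeholder assumption; a real model would be a trained network conditioned on the camera images.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x, t, observation):
    """Stand-in for a learned denoising network: nudges the noisy state
    toward the observation, more aggressively at low noise levels t."""
    return x + (observation - x) / (t + 1)

def estimate(observation, n_steps, x_init=None):
    """Reverse-diffusion-style refinement.
    Cold start: begin from pure noise with many steps.
    Warm start: begin from the previous frame's estimate plus small noise,
    and run only a few steps."""
    if x_init is None:
        x = rng.normal(size=observation.shape)
    else:
        x = x_init + rng.normal(scale=0.1, size=observation.shape)
    for t in reversed(range(n_steps)):
        x = denoiser(x, t, observation)
    return x

# Frame 0: cold start with many refinement steps.
obs0 = np.array([0.0, 0.0, 0.5])          # e.g. a coarse pose observation
est = estimate(obs0, n_steps=50)

# Subsequent frames: warm-start from the last estimate with few steps.
for k in range(1, 5):
    obs = obs0 + 0.01 * k                 # the object moves slightly each frame
    est = estimate(obs, n_steps=5, x_init=est)
```

Because the warm-started chains run far fewer denoising steps, per-frame cost drops sharply, which is what makes real-time tracking feasible.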

This perception is based on a representation tailored to robotics tasks such as pick-and-place, so we can readily execute tasks in the real world with it. For example, we demonstrated pick-and-place, including grasp prediction, using this perception module. An interesting benefit of our system is that it supports transparent objects, because perception relies solely on RGB images from multiple viewpoints.


What’s left?

We aim to make our approach as practical as possible for real robot problems. We summarize the remaining challenges in three points:

  1. **Shape quality.** Robot manipulation tasks often require very precise motions. Think of grasping a mug, which typically has a thin sidewall and a handle. A shape representation with low expressiveness may miss such high-frequency details and become the bottleneck of the entire robot system.
  2. **Transition / world model.** We plan to implement a transition model that supports rich contact cases [8].
  3. **Planning under stochasticity.** Our perception module provides multiple hypotheses to account for partial observability. We want to develop a planning algorithm that can exploit this stochasticity [9].
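One simple way to plan over multiple hypotheses is to score each candidate action by its expected success across the weighted hypotheses, rather than committing to the single most likely pose. The sketch below illustrates this for grasp selection; the evaluator, tolerances, and all names are hypothetical, not our actual planner.

```python
import numpy as np

def grasp_success(grasp, pose, tol=0.12):
    """Stand-in for a grasp evaluator: succeed iff the grasp point is
    within a tolerance of the (hypothesized) object pose."""
    return float(np.linalg.norm(grasp - pose) <= tol)

def best_grasp(candidate_grasps, hypotheses, weights):
    """Pick the grasp maximizing expected success over all pose
    hypotheses produced by the perception module."""
    scores = [
        sum(w * grasp_success(g, h) for h, w in zip(hypotheses, weights))
        for g in candidate_grasps
    ]
    return candidate_grasps[int(np.argmax(scores))]

# Two equally likely pose hypotheses, 0.2 m apart.
hypotheses = [np.array([0.0, 0.0]), np.array([0.2, 0.0])]
weights = [0.5, 0.5]

# Candidate grasps: on each mode, and hedged between them.
grasps = [np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([0.3, 0.0])]
chosen = best_grasp(grasps, hypotheses, weights)
```

Note that the hedged middle grasp wins here: it succeeds under both hypotheses, whereas a grasp committed to either single mode succeeds only half the time. That is precisely the behavior a stochasticity-aware planner should exhibit.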
