Published on Thu Jul 30 2020

Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild

Jason Y. Zhang, Sam Pepose, Hanbyul Joo, Deva Ramanan, Jitendra Malik, Angjoo Kanazawa


Abstract

We present a method that infers spatial arrangements and shapes of humans and objects in a globally consistent 3D scene, all from a single image in-the-wild captured in an uncontrolled environment. Notably, our method runs on datasets without any scene- or object-level 3D supervision. Our key insight is that considering humans and objects jointly gives rise to "3D common sense" constraints that can be used to resolve ambiguity. In particular, we introduce a scale loss that learns the distribution of object size from data; an occlusion-aware silhouette re-projection loss to optimize object pose; and a human-object interaction loss to capture the spatial layout of objects with which humans interact. We empirically validate that our constraints dramatically reduce the space of likely 3D spatial configurations. We demonstrate our approach on challenging, in-the-wild images of humans interacting with large objects (such as bicycles, motorcycles, and surfboards) and handheld objects (such as laptops, tennis rackets, and skateboards). We quantify the ability of our approach to recover human-object arrangements and outline remaining challenges in this relatively unexplored domain. The project webpage can be found at https://jasonyzhang.com/phosa.
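To make the three constraint terms concrete, below is a minimal PyTorch-style sketch of how such losses could be written. All function names, tensor shapes, weights, and the centroid-based interaction term are illustrative assumptions on our part, not the authors' implementation; the actual code is available from the project webpage.

```python
# Hedged sketch of the three "3D common sense" constraints from the abstract.
# Names, shapes, and weights are assumptions for illustration only; see
# https://jasonyzhang.com/phosa for the authors' actual implementation.
import torch

def scale_loss(log_scale, prior_mean, prior_std):
    # Penalize object scales that are unlikely under a per-category
    # log-normal size prior whose statistics are learned from data.
    z = (log_scale - prior_mean) / prior_std
    return 0.5 * (z ** 2).sum()

def occlusion_aware_silhouette_loss(rendered_sil, instance_mask, occluded_mask):
    # Compare the rendered object silhouette to the detected instance mask,
    # counting only pixels not occluded by other people or objects.
    visible = 1.0 - occluded_mask
    diff = (rendered_sil - instance_mask) * visible
    return (diff ** 2).sum() / visible.sum().clamp(min=1.0)

def interaction_loss(human_region, object_region):
    # Encourage plausible contact by pulling the centroids of corresponding
    # human/object interaction regions together (e.g., hands <-> handlebars).
    return ((human_region.mean(dim=0) - object_region.mean(dim=0)) ** 2).sum()

# Toy usage: jointly optimize an object's log-scale under all three terms.
log_scale = torch.zeros(1, requires_grad=True)
rendered = torch.rand(128, 128)           # stand-in for a differentiable render
mask = (torch.rand(128, 128) > 0.5).float()
occluded = torch.zeros(128, 128)
hands = torch.randn(32, 3)                # sampled human interaction vertices
handlebars = torch.randn(64, 3)           # sampled object interaction vertices

total = (scale_loss(log_scale, prior_mean=0.0, prior_std=0.3)
         + occlusion_aware_silhouette_loss(rendered, mask, occluded)
         + 0.1 * interaction_loss(hands, handlebars))
total.backward()
```

In the paper these terms are weighted and minimized jointly over all detected humans and objects in the image; the weights and placeholder inputs above are arbitrary stand-ins.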

Related Papers

Mon Jun 04 2018
Machine Learning
Digging Into Self-Supervised Monocular Depth Estimation
Self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. We show that a surprisingly simple model, and associated design choices, lead to superior predictions.
Mon Oct 23 2017
Neural Networks
Generic 3D Representation via Pose Estimation and Matching
A large body of computer vision research has investigated developing generic semantic representations, but efforts towards developing a similar representation for 3D have been limited. In this paper, we learn a generic 3D representation through solving a set of foundational proxy 3D tasks.
Sun Jul 26 2020
Computer Vision
OASIS: A Large-Scale Dataset for Single Image 3D in the Wild
Single-view 3D is the task of recovering 3D properties such as depth and surface normals from a single image. We hypothesize that a major obstacle to single-image 3D in the wild is data. We present Open Annotations of Single Image Surfaces (OASIS)…
Mon Mar 23 2020
Computer Vision
Weakly Supervised 3D Human Pose and Shape Reconstruction with Normalizing Flows
Monocular 3D human pose and shape estimation is challenging due to the many degrees of freedom of the human body. In this paper we show that the proposed methods outperform the state of the art.
Fri Dec 18 2020
Computer Vision
Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations
3D object detection has recently become popular due to many applications in robotics, augmented reality, autonomy, and image retrieval. We introduce the Objectron dataset to advance the state of the art in 3D object detection. The dataset contains 4 million annotated images in 14,819 annotated videos.
Sat Nov 25 2017
Computer Vision
Learning 3D Human Pose from Structure and Motion
3D human pose estimation from a single image is a challenging problem, especially for in-the-wild settings due to the lack of 3D annotated data. We propose two anatomically inspired loss functions and use them with a weakly-supervised learning framework to jointly learn from…