Published on Wed Apr 29 2015

Anticipating Visual Representations from Unlabeled Video

Carl Vondrick, Hamed Pirsiavash, Antonio Torralba

Abstract

Anticipating actions and objects before they start or appear is a difficult problem in computer vision with several real-world applications. This task is challenging partly because it requires leveraging extensive knowledge of the world that is difficult to write down. We believe that a promising resource for efficiently learning this knowledge is through readily available unlabeled video. We present a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate human actions and objects. The key idea behind our approach is that we can train deep networks to predict the visual representation of images in the future. Visual representations are a promising prediction target because they encode images at a higher semantic level than pixels yet are automatic to compute. We then apply recognition algorithms on our predicted representation to anticipate objects and actions. We experimentally validate this idea on two datasets, anticipating actions one second in the future and objects five seconds in the future.
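The pipeline the abstract describes — encode the current frame, regress its future representation, then run recognition on the predicted features — can be sketched compactly. The following is a minimal illustration, not the paper's method: it substitutes synthetic feature vectors for CNN activations, a closed-form ridge regressor for the deep network, and nearest-centroid matching for the recognition step; all names and dimensions are fabricated for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for deep features: the paper encodes frames with a pretrained
# CNN; here we fabricate D-dimensional feature vectors for N frame pairs.
D, N = 64, 500
phi_now = rng.normal(size=(N, D))                 # representation at time t
W_true = rng.normal(size=(D, D)) / np.sqrt(D)
phi_future = phi_now @ W_true + 0.1 * rng.normal(size=(N, D))  # at t + 1s

# "Train to predict the future representation": ridge regression is the
# simplest stand-in for the paper's deep network, fit in closed form.
lam = 1e-2
W = np.linalg.solve(phi_now.T @ phi_now + lam * np.eye(D),
                    phi_now.T @ phi_future)

def anticipate(phi):
    """Predict the visual representation one second ahead."""
    return phi @ W

# "Apply recognition on the predicted representation": nearest class
# centroid in feature space, with two fabricated action classes.
centroids = rng.normal(size=(2, D))

def recognize(phi_pred):
    dists = np.linalg.norm(phi_pred[:, None, :] - centroids[None], axis=-1)
    return dists.argmin(axis=1)

pred = anticipate(phi_now)
labels = recognize(pred)
mse = float(np.mean((pred - phi_future) ** 2))
```

The key design point the abstract argues for is visible even in this toy version: the regression target is a feature vector rather than raw pixels, so the recognition step can reuse an ordinary classifier unchanged on the anticipated representation.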

Tue Apr 03 2018
Computer Vision
When will you do what? - Anticipating Temporal Occurrences of Activities
A CNN and an RNN are trained to anticipate future video labels based on previously seen content. We show that our methods generate accurate predictions of the future.
Mon Aug 31 2020
Computer Vision
Future Frame Prediction of a Video Sequence
Predicting future frames of a video sequence has been a problem of high interest in the field of Computer Vision. A latent variable model often struggles to produce realistic results. An adversarially trained model underutilizes latent variables and thus fails to produce diverse predictions.
Thu Mar 26 2020
Computer Vision
Action Localization through Continual Predictive Learning
The problem of action recognition involves locating the action in the video, both over time and spatially in the image. The dominant current approaches use supervised learning to solve this problem. In contrast, our approach does not require any training annotations in the form of frame-level bounding boxes around the region of interest.
Mon Jun 10 2019
Computer Vision
Learning from Unlabelled Videos Using Contrastive Predictive Neural 3D Mapping
Predictive coding theories suggest that the brain learns by predicting observations at various levels of abstraction. One of the most basic prediction tasks is view prediction. This paper explores the role of view prediction in the development of 3D visual recognition. We propose contrastive prediction losses to replace the standard color regression loss.
Sat Aug 19 2017
Computer Vision
Visual Forecasting by Imitating Dynamics in Natural Sequences
We introduce a general framework for visual forecasting, which directly imitates visual sequences without additional supervision. As a result, our model can be applied at several semantic levels and does not require any domain knowledge or handcrafted features. At all levels, our approach outperforms existing methods.
Mon May 23 2016
Artificial Intelligence
Unsupervised Learning for Physical Interaction through Video Prediction
A core challenge for an agent learning to interact with the world is predicting how its actions affect objects in its environment. Many existing methods for learning the dynamics of physical interactions require labeled information. To learn about physical object motion without labels, we develop an action-conditioned video prediction model.