Published on Mon Apr 23 2018

To Create What You Tell: Generating Videos from Captions

Yingwei Pan, Zhaofan Qiu, Ting Yao, Houqiang Li, Tao Mei

Generative Adversarial Networks (GANs) can be used to generate videos, where a video is a sequence of visually coherent and semantically dependent frames. The discriminator network consists of three discriminators: a video discriminator that distinguishes realistic videos from generated ones and optimizes video-caption matching, a frame discriminator that aligns individual frames with the caption, and a motion discriminator that enforces smooth transitions between adjacent frames.

Abstract

We create multimedia content every day and everywhere. While automatic content generation has posed a fundamental challenge to the multimedia community for decades, recent advances in deep learning have made this problem feasible. For example, Generative Adversarial Networks (GANs) are a rewarding approach to synthesizing images. Nevertheless, it is not trivial to capitalize on GANs to generate videos. The difficulty originates from the intrinsic structure of video: a sequence of visually coherent and semantically dependent frames. This motivates us to explore semantic and temporal coherence in designing GANs for video generation. In this paper, we present a novel Temporal GANs conditioning on Captions, namely TGANs-C, in which the input to the generator network is a concatenation of a latent noise vector and a caption embedding, which is then transformed into a frame sequence with 3D spatio-temporal convolutions. Unlike a naive discriminator that only judges pairs as fake or real, our discriminator additionally notes whether the video matches the correct caption. In particular, the discriminator network consists of three discriminators: a video discriminator that classifies realistic videos against generated ones and optimizes video-caption matching, a frame discriminator that discriminates between real and fake frames and aligns frames with the conditioning caption, and a motion discriminator that emphasizes the philosophy that adjacent frames in generated videos should be smoothly connected, as in real ones. We qualitatively demonstrate the capability of our TGANs-C to generate plausible videos conditioned on the given captions on two synthetic datasets (SBMG and TBMG) and one real-world dataset (MSVD). Moreover, quantitative experiments on MSVD are performed to validate our proposal via the Generative Adversarial Metric and a human study.
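Below is a minimal PyTorch sketch of the architecture described in the abstract: the generator concatenates a latent noise vector with a caption embedding and upsamples it into a frame sequence with 3D transposed convolutions, while the video-level discriminator both scores real vs. fake clips and checks video-caption matching. All layer sizes, module names, and dimensionalities are illustrative assumptions, not the authors' released implementation; the frame and motion discriminators would follow the same pattern, using 2D convolutions over individual frames and over adjacent-frame differences, respectively.

```python
import torch
import torch.nn as nn

class CaptionConditionedGenerator(nn.Module):
    """Turns [noise ; caption embedding] into a clip of 16 frames at 32x32 via 3D deconvolutions."""
    def __init__(self, noise_dim=100, text_dim=256, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            # seed volume: (B, noise+text, 1, 1, 1) -> (B, 8*ch, 2, 4, 4)
            nn.ConvTranspose3d(noise_dim + text_dim, ch * 8, kernel_size=(2, 4, 4)),
            nn.BatchNorm3d(ch * 8), nn.ReLU(inplace=True),
            # each block doubles the temporal and spatial resolution
            nn.ConvTranspose3d(ch * 8, ch * 4, 4, stride=2, padding=1),
            nn.BatchNorm3d(ch * 4), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(ch * 4, ch * 2, 4, stride=2, padding=1),
            nn.BatchNorm3d(ch * 2), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(ch * 2, 3, 4, stride=2, padding=1),
            nn.Tanh(),  # RGB frames in [-1, 1]
        )

    def forward(self, noise, caption_emb):
        z = torch.cat([noise, caption_emb], dim=1)        # (B, noise_dim + text_dim)
        return self.net(z.view(z.size(0), -1, 1, 1, 1))   # (B, 3, 16, 32, 32)


class VideoDiscriminator(nn.Module):
    """Scores whole clips as real/fake and, conditioned on the caption, as matching/mismatching."""
    def __init__(self, text_dim=256, ch=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(ch, ch * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(ch * 2, ch * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),  # (B, 4*ch) clip feature
        )
        self.real_or_fake = nn.Linear(ch * 4, 1)
        self.caption_match = nn.Linear(ch * 4 + text_dim, 1)

    def forward(self, video, caption_emb):
        feat = self.encoder(video)
        return self.real_or_fake(feat), self.caption_match(torch.cat([feat, caption_emb], dim=1))


# Shape check only (hypothetical dimensions):
z, s = torch.randn(2, 100), torch.randn(2, 256)
clip = CaptionConditionedGenerator()(z, s)    # (2, 3, 16, 32, 32)
rf, match = VideoDiscriminator()(clip, s)     # two (2, 1) logits
```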

Tue Dec 04 2018
Computer Vision
Conditional Video Generation Using Action-Appearance Captions
Most existing methods cannot control the contents of the generated video using a text caption. This particularly affects human videos due to their great variety of actions and appearances. This paper presents Conditional Flow and Texture GAN (CFT-GAN), a GAN-based video generation method.
Mon Jul 15 2019
Machine Learning
Adversarial Video Generation on Complex Datasets
Large Generative Adversarial Networks trained on the complex Kinetics-600 dataset are able to produce video simulations of substantially higher complexity and fidelity than previous work. Our proposed model, Dual Video Discriminator GAN (DVD-GAN), scales to longer and higher resolution videos by leveraging …
Thu Aug 13 2020
Computer Vision
Recurrent Deconvolutional Generative Adversarial Networks with Application to Text Guided Video Generation
Mon Aug 20 2018
Machine Learning
Video-to-Video Synthesis
The video-to-video synthesis problem is to learn a mapping function from an input source video to an output video that precisely depicts the content of the source video. Without understanding temporal dynamics, directly applying existing image synthesis approaches to an input video often results in temporally incoherent videos of low visual quality.
Thu Nov 30 2017
Computer Vision
Improving Video Generation for Multi-functional Applications
In this paper, we aim to improve the state-of-the-art video generative networks with a view towards multi-functional applications. Our improved video GAN model does not separate foreground from background or dynamic from static patterns, but learns to generate the entire video clip jointly.
Mon Dec 03 2018
Computer Vision
TwoStreamVAN: Improving Motion Modeling in Video Generation
Video generation requires modeling realistic temporal dynamics and spatial content. Existing methods struggle to simultaneously generate plausible motion and content. We propose a two-stream model that disentangles motion generation from content generation. Given an action label and a noise vector, our model creates clear and consistent motion.