Video-based human pose estimation in crowded scenes is a challenging problem
due to occlusion, motion blur, scale variation, viewpoint change, etc. Prior
approaches often fail in such scenes because they (1) make little use of
temporal information and (2) lack training data for crowded scenes.
In this paper, we focus on improving human pose estimation in videos of crowded
scenes from the perspectives of exploiting temporal context and collecting new
data. In particular, we first follow the top-down strategy to detect persons
and perform single-person pose estimation for each frame. Then, we refine the
frame-based pose estimation with temporal contexts derived from optical flow.
Specifically, for each frame, we propagate the historical poses from previous
frames forward and the future poses from subsequent frames backward to the
current frame, leading to stable and accurate human pose estimation in videos.
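To make the propagation step concrete, below is a minimal sketch (not the implementation used in this work) of how frame-level poses could be warped to the current frame with dense optical flow and then fused. The Farneback flow, the helper names `warp_pose_with_flow` and `refine_pose`, and the fixed fusion weights are illustrative assumptions rather than our actual design.

```python
import numpy as np
import cv2


def warp_pose_with_flow(pose, flow):
    """Shift 2-D joint coordinates by the optical flow sampled at each joint.

    pose: (num_joints, 2) array of (x, y) pixel coordinates.
    flow: (H, W, 2) dense flow field from the source frame to the target frame.
    """
    h, w = flow.shape[:2]
    warped = pose.astype(np.float32).copy()
    for j, (x, y) in enumerate(pose):
        xi = int(np.clip(round(x), 0, w - 1))
        yi = int(np.clip(round(y), 0, h - 1))
        warped[j] += flow[yi, xi]  # add the (dx, dy) displacement at the joint
    return warped


def refine_pose(prev_frame, cur_frame, next_frame,
                prev_pose, cur_pose, next_pose,
                weights=(0.25, 0.5, 0.25)):
    """Fuse the current pose with poses propagated from neighbouring frames."""
    gray = lambda im: cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
    # Flow from the previous frame to the current frame (forward propagation)
    # and from the subsequent frame to the current frame (backward propagation).
    flow_fwd = cv2.calcOpticalFlowFarneback(gray(prev_frame), gray(cur_frame),
                                            None, 0.5, 3, 15, 3, 5, 1.2, 0)
    flow_bwd = cv2.calcOpticalFlowFarneback(gray(next_frame), gray(cur_frame),
                                            None, 0.5, 3, 15, 3, 5, 1.2, 0)
    pose_from_prev = warp_pose_with_flow(prev_pose, flow_fwd)
    pose_from_next = warp_pose_with_flow(next_pose, flow_bwd)
    w_prev, w_cur, w_next = weights
    return w_prev * pose_from_prev + w_cur * cur_pose + w_next * pose_from_next
```

In practice, a learned flow network and confidence-aware matching of person instances across frames would replace the simple weighted average used here for illustration.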
In addition, we mine new data from the Internet in scenes similar to those of
the HIE dataset to improve the diversity of the training set. In this way, our
model achieves the best performance on 7 out of 13 videos and a 56.33 average
w\_AP on the test set of the HIE challenge.