Cross-Conditioned Recurrent Networks for Long-Term Synthesis of Inter-Person Human Motion Interactions
Modeling the dynamics of human motion is one of the most challenging sequence modeling problems, with diverse applications in the animation industry, human-robot interaction, motion-based surveillance, etc. Existing attempts to use auto-regressive techniques for long-term single-person motion generation usually fail, resulting in stagnated motion or divergence to unrealistic pose patterns. In this paper, we propose a novel cross-conditioned recurrent framework targeting long-term synthesis of inter-person interactions beyond several minutes. We carefully integrate the advantages of both auto-regressive and encoder-decoder recurrent architectures by interchangeably utilizing two separate fixed-length cross-person motion prediction models for long-term generation in a novel hierarchical fashion. As opposed to prior approaches, we guarantee structural plausibility of the 3D pose by training the recurrent model to regress the latent representation of a separately trained generative pose embedding network. Different variants of the proposed framework are evaluated through extensive experiments on SBU-interaction, CMU-MoCAP, and an in-house duet-dance dataset. Qualitative and quantitative evaluations against prior state-of-the-art approaches on several tasks, such as short-term motion prediction, long-term motion synthesis, and interaction-based motion retrieval, clearly highlight the superiority of the proposed framework.