Self-supervised Discovery of Human Actons from Long Kinematic Videos

29 Sep 2021 · Kenneth Li, Xiao Sun, Zhirong Wu, Fangyun Wei, Stephen Lin

For human action understanding, a popular research direction is to analyze short video clips with unambiguous semantic content, such as jumping and drinking. However, methods for understanding short semantic actions cannot be directly translated to long kinematic sequences such as dancing, where it becomes challenging even to label the human movements semantically. To promote the analysis of long videos of complex human motion, we propose a self-supervised method for learning a representation of such motion sequences, analogous to words in a sentence: videos are segmented and clustered into recurring temporal patterns, called actons. Our approach first obtains a frame-wise representation by contrasting two augmented views of video frames conditioned on their temporal context. These frame-wise representations, computed across a collection of videos, are then clustered with K-means. Actons are extracted automatically by merging consecutive frames that fall into the same cluster into continuous motion segments. We evaluate the self-supervised representation with temporal alignment metrics, and the clustering results with normalized mutual information and language entropy. We also study an application of this tokenization by using it to classify dance genres. On the AIST++ and PKU-MMD datasets, actons bring significant performance improvements over several baselines.
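
The clustering and acton-extraction step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `extract_actons`, the array `frame_embeddings`, and the cluster count are illustrative assumptions, and the learned contrastive features are replaced by random vectors in the usage example.

```python
# Minimal sketch of the clustering + acton-extraction step from the abstract:
# frame-wise embeddings are clustered with K-means, and consecutive frames that
# share a cluster label are merged into contiguous segments ("actons").
# NOT the authors' code; names and hyperparameters here are assumptions.
import numpy as np
from sklearn.cluster import KMeans


def extract_actons(frame_embeddings: np.ndarray, num_clusters: int = 32):
    """Return a list of (start_frame, end_frame_exclusive, cluster_id) segments."""
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(frame_embeddings)

    actons = []
    start = 0
    for t in range(1, len(labels) + 1):
        # Close the current segment when the label changes or the video ends.
        if t == len(labels) or labels[t] != labels[start]:
            actons.append((start, t, int(labels[start])))
            start = t
    return actons


if __name__ == "__main__":
    # Random features stand in for the learned frame-wise representation.
    rng = np.random.default_rng(0)
    dummy_features = rng.normal(size=(300, 128))  # 300 frames, 128-dim features
    segments = extract_actons(dummy_features, num_clusters=8)
    print(f"{len(segments)} actons, first three: {segments[:3]}")
```

In the paper's pipeline, the input to this step would be the frame-wise features learned by the contrastive objective rather than random vectors.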
