no code implementations • 14 Jul 2023 • Zuozhuo Dai, Fangtao Shao, Qingkun Su, Zilong Dong, Siyu Zhu
In the second stage, we propose a novel decoupled video-text cross-attention module to capture fine-grained multimodal information in the spatial and temporal dimensions.
no code implementations • 20 Jan 2023 • Zhenghao Zhang, Fangtao Shao, Zuozhuo Dai, Siyu Zhu
In this paper, we observe that temporal information is important as well, and we propose TAFormer to aggregate spatio-temporal features in both the transformer encoder and decoder.
1 code implementation • ECCV 2020 • Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, Zhiwei Yang
Violence detection has been studied in computer vision for years.