Search Results for author: Zhan Tong

Found 18 papers, 13 papers with code

Contextual AD Narration with Interleaved Multimodal Sequence

no code implementations • 19 Mar 2024 • Hanlin Wang, Zhan Tong, Kecheng Zheng, Yujun Shen, LiMin Wang

With video features, text, a character bank, and context information as inputs, the generated ADs can refer to characters by name and provide reasonable, contextual descriptions that help the audience understand the movie's storyline.
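As a rough illustration of what such an interleaved multimodal sequence could look like, here is a hypothetical PyTorch sketch (the class, names, and dimensions are invented for illustration, not taken from the paper): modality-specific projections map context ADs, character-bank entries, and video features into one shared sequence for a text decoder.

```python
# Hypothetical sketch (not the authors' code): assembling an interleaved
# multimodal sequence from video features, a character bank, and context ADs.
import torch
import torch.nn as nn

class InterleavedADInput(nn.Module):
    """Projects each modality into a shared embedding space and
    concatenates them into one interleaved sequence for a text decoder."""

    def __init__(self, vis_dim=768, txt_dim=512, d_model=1024):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)   # video clip features
        self.txt_proj = nn.Linear(txt_dim, d_model)   # context ADs / character names

    def forward(self, video_feats, char_bank, context_ads):
        # video_feats: (T, vis_dim); char_bank, context_ads: (N, txt_dim)
        parts = [
            self.txt_proj(context_ads),   # previous ADs give story context
            self.txt_proj(char_bank),     # character-name embeddings
            self.vis_proj(video_feats),   # current clip
        ]
        return torch.cat(parts, dim=0)    # (L, d_model) interleaved sequence

feats = InterleavedADInput()(torch.randn(16, 768), torch.randn(4, 512), torch.randn(3, 512))
print(feats.shape)  # torch.Size([23, 1024])
```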

Bootstrapping SparseFormers from Vision Foundation Models

1 code implementation • 4 Dec 2023 • Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou

In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way.

Advancing Vision Transformers with Group-Mix Attention

1 code implementation • 26 Nov 2023 • Chongjian Ge, Xiaohan Ding, Zhan Tong, Li Yuan, Jiangliu Wang, Yibing Song, Ping Luo

The attention map is computed from mixtures of tokens and group proxies and is used to re-combine the tokens and groups in the Value.

Image Classification • Object Detection +2
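A minimal, hypothetical sketch of the group-proxy idea (simplified relative to the paper's GMA block; the names and the average-pooling choice here are assumptions): group proxies are pooled aggregates of neighboring tokens, and the keys and values mix individual tokens with these proxies before attention re-combines both.

```python
# Hypothetical sketch of Group-Mix attention (simplified; see the paper/repo
# for the real block): group proxies are pooled aggregates of adjacent tokens,
# and K/V mix individual tokens with these proxies.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupMixAttention(nn.Module):
    def __init__(self, dim=256, group=4):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.group = group

    def _proxies(self, x):
        # Aggregate every `group` consecutive tokens into one group proxy:
        # (B, N, C) -> (B, N // group, C) via average pooling.
        return F.avg_pool1d(x.transpose(1, 2), self.group).transpose(1, 2)

    def forward(self, x):                     # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Mix tokens with group proxies in K and V.
        k = torch.cat([k, self._proxies(k)], dim=1)
        v = torch.cat([v, self._proxies(v)], dim=1)
        attn = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
        return attn.softmax(dim=-1) @ v       # re-combine tokens and groups

out = GroupMixAttention()(torch.randn(2, 64, 256))
print(out.shape)  # torch.Size([2, 64, 256])
```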

TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

1 code implementation • 23 May 2023 • Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan

We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal since it may overfit in terms of styles, thereby losing its original generalization ability to capture the semantics of various language registers.

Representation Learning
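A minimal sketch of the training recipe this argument suggests, under the assumption that the fix is to keep the language tower frozen and train only a small projection on top (the paper's exact degree of partial tuning may differ; all names here are illustrative):

```python
# Hypothetical sketch: avoid end-to-end text-encoder tuning by freezing the
# language tower and learning only a light projection into the video space.
import torch.nn as nn

def freeze_text_tower(text_encoder: nn.Module, proj: nn.Module):
    for p in text_encoder.parameters():
        p.requires_grad = False   # keep the encoder's general language ability
    for p in proj.parameters():
        p.requires_grad = True    # learn only the alignment to video features

text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(512, 8, batch_first=True), num_layers=2)
proj = nn.Linear(512, 256)
freeze_text_tower(text_encoder, proj)
print(sum(p.requires_grad for p in text_encoder.parameters()))  # 0
```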

SparseFormer: Sparse Visual Recognition via Limited Latent Tokens

1 code implementation • 7 Apr 2023 • Ziteng Gao, Zhan Tong, LiMin Wang, Mike Zheng Shou

In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate human sparse visual recognition in an end-to-end manner.

Sparse Representation-based Classification • Video Classification
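A hypothetical, stripped-down sketch of recognizing from a limited set of latent tokens: compute scales with the (small, fixed) number of latents rather than the number of pixels. The real SparseFormer also adjusts token regions of interest and samples image features sparsely; this toy version shows only the latent-token bottleneck.

```python
# Hypothetical sketch: a small, fixed set of latent tokens cross-attends to
# image features, and classification is done from the latents alone.
import torch
import torch.nn as nn

class LatentTokenRecognizer(nn.Module):
    def __init__(self, num_latents=49, dim=256, num_classes=1000):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img_feats):                  # (B, N, dim) image features
        B = img_feats.shape[0]
        q = self.latents.unsqueeze(0).expand(B, -1, -1)
        z, _ = self.cross_attn(q, img_feats, img_feats)
        return self.head(z.mean(dim=1))            # classify from latents only

logits = LatentTokenRecognizer()(torch.randn(2, 196, 256))
print(logits.shape)  # torch.Size([2, 1000])
```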

Soft Neighbors are Positive Supporters in Contrastive Visual Representation Learning

no code implementations • 30 Mar 2023 • Chongjian Ge, Jiangliu Wang, Zhan Tong, Shoufa Chen, Yibing Song, Ping Luo

We evaluate our soft neighbor contrastive learning method (SNCLR) on standard visual recognition benchmarks, including image classification, object detection, and instance segmentation.

Contrastive Learning • Image Classification +6
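A hypothetical sketch of what "soft neighbors as positives" could look like in an InfoNCE-style loss, assuming neighbors are drawn from a feature bank and weighted by a softmax over their similarities (SNCLR's actual candidate selection and weighting differ in detail; this helper is illustrative):

```python
# Hypothetical sketch: besides the usual augmented positive, nearest neighbors
# from a support bank contribute as positives, weighted by a soft score.
import torch
import torch.nn.functional as F

def soft_neighbor_nce(q, k_pos, bank, tau=0.1, topk=5):
    q, k_pos, bank = map(lambda t: F.normalize(t, dim=-1), (q, k_pos, bank))
    sim_bank = q @ bank.t()                          # (B, M) similarity to bank
    sims, idx = sim_bank.topk(topk, dim=-1)          # candidate neighbors
    soft_w = sims.softmax(dim=-1)                    # softness scores
    pos = (q * k_pos).sum(-1, keepdim=True)          # (B, 1) the hard positive
    log_p = (torch.cat([pos, sim_bank], dim=1) / tau).log_softmax(dim=1)
    # Positive mass: the augmented view plus soft-weighted neighbors
    # (idx + 1 skips the hard-positive column prepended above).
    loss = -(log_p[:, 0] + (soft_w * log_p.gather(1, idx + 1)).sum(1)) / 2
    return loss.mean()

loss = soft_neighbor_nce(torch.randn(8, 128), torch.randn(8, 128), torch.randn(256, 128))
print(loss.item())
```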

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

1 code implementation • CVPR 2023 • LiMin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao

Finally, we successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance on Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2).

 Ranked #1 on Self-Supervised Action Recognition on UCF101 (using extra training data)

Action Classification • Action Recognition In Videos +3
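A hypothetical sketch of dual masking, assuming the simple scheme of masking most tokens for the encoder and letting the decoder reconstruct only a subset of the masked tokens, which is what cuts decoder cost (indices and ratios here are illustrative):

```python
# Hypothetical sketch of dual masking: the encoder sees only visible tokens
# (high masking ratio), and the decoder reconstructs only a masked subset
# rather than all masked tokens (simplified relative to the paper).
import torch

def dual_mask(tokens, enc_keep=0.1, dec_keep=0.5):
    B, N, C = tokens.shape
    perm = torch.rand(B, N).argsort(dim=1)       # random token permutation
    n_vis = int(N * enc_keep)
    vis_idx = perm[:, :n_vis]                    # encoder input (visible)
    masked = perm[:, n_vis:]
    n_dec = int(masked.shape[1] * dec_keep)
    dec_idx = masked[:, :n_dec]                  # decoder reconstructs this subset
    gather = lambda idx: tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))
    return gather(vis_idx), gather(dec_idx)

vis, dec = dual_mask(torch.randn(2, 1568, 768))
print(vis.shape, dec.shape)  # (2, 156, 768) (2, 706, 768)
```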

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

2 code implementations • 26 May 2022 • Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, Ping Luo

To address this challenge, we propose an effective adaptation approach for Transformers, namely AdaptFormer, which can efficiently adapt pre-trained ViTs to many different image and video tasks.

Action Recognition • Video Recognition
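A minimal sketch of the adapter idea: freeze the ViT and add a lightweight bottleneck branch in parallel with each block's MLP, training only the adapter. The class below is a simplified stand-in for AdaptFormer's AdaptMLP module; the dimensions and scaling constant are illustrative.

```python
# Hypothetical simplified adapter in the AdaptFormer style: a bottleneck
# branch runs in parallel with a frozen block's MLP and is added back, scaled.
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64, scale=0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, dim)
        self.scale = scale

    def forward(self, x, mlp_out):
        # Frozen MLP output plus a scaled low-rank correction.
        return mlp_out + self.scale * self.up(self.act(self.down(x)))

adapter = ParallelAdapter()
x = torch.randn(2, 197, 768)
out = adapter(x, mlp_out=torch.randn_like(x))
print(out.shape)  # torch.Size([2, 197, 768])
```

During fine-tuning, only the adapter's parameters receive gradients, so the tunable footprint per task stays a small fraction of the backbone.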

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

4 code implementations • 23 Mar 2022 • Zhan Tong, Yibing Song, Jue Wang, LiMin Wang

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets.

4k • Action Classification +3
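VideoMAE's key design is tube masking with an extremely high ratio: one spatial mask is sampled and repeated across all frames, so a masked patch cannot be trivially recovered from its temporal neighbors. A minimal sketch (the ~90% ratio follows the paper; the helper itself is illustrative):

```python
# Hypothetical sketch of tube masking: the same spatial mask is applied to
# every frame, preventing reconstruction by simply copying nearby frames.
import torch

def tube_mask(B, T, H, W, ratio=0.9):
    n_keep = int(H * W * (1 - ratio))
    noise = torch.rand(B, H * W)
    ranks = noise.argsort(dim=1).argsort(dim=1)       # rank of each position
    spatial = ranks < n_keep                          # True = visible patch
    keep = spatial.unsqueeze(1).expand(B, T, H * W)   # same mask in every frame
    return keep.reshape(B, T * H * W)

mask = tube_mask(2, 8, 14, 14)
print(mask.sum(dim=1))  # tensor([152, 152]) = 8 frames x 19 visible patches
```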

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

1 code implementation • 16 Feb 2022 • Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie

Second, at the same computational cost, our method empowers ViTs to take more image tokens as input to improve recognition accuracy, where the image tokens come from higher-resolution images.

Efficient ViTs
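The token reorganization itself can be sketched roughly as follows, following the published recipe of ranking patch tokens by the attention they receive from [CLS], keeping the top-k, and fusing the inattentive remainder into a single token (this standalone helper is illustrative, not the repository code):

```python
# Hypothetical sketch of token reorganization: keep the top-k attentive
# tokens and fuse the rest into one attention-weighted token.
import torch

def reorganize_tokens(x, cls_attn, keep_ratio=0.7):
    # x: (B, N, C) patch tokens; cls_attn: (B, N) attention received from [CLS]
    B, N, C = x.shape
    k = int(N * keep_ratio)
    idx = cls_attn.topk(k, dim=1).indices
    kept = x.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))
    inattentive = torch.ones(B, N, dtype=torch.bool).scatter(1, idx, False)
    w = cls_attn * inattentive                     # zero out the kept tokens
    w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-6)
    fused = (w.unsqueeze(-1) * x).sum(dim=1, keepdim=True)
    return torch.cat([kept, fused], dim=1)         # (B, k + 1, C)

out = reorganize_tokens(torch.randn(2, 196, 768), torch.rand(2, 196))
print(out.shape)  # torch.Size([2, 138, 768])
```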

EViT: Expediting Vision Transformers via Token Reorganizations

1 code implementation • ICLR 2022 • Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie

Second, at the same computational cost, our method empowers ViTs to take more image tokens as input to improve recognition accuracy, where the image tokens come from higher-resolution images.

MGSampler: An Explainable Sampling Strategy for Video Action Recognition

1 code implementation • ICCV 2021 • Yuan Zhi, Zhan Tong, LiMin Wang, Gangshan Wu

First, we present two different motion representations that let us efficiently distinguish motion-salient frames from the background.

Action Recognition • Temporal Action Localization
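A hypothetical sketch of motion-guided sampling in the spirit of the paper: per-frame motion magnitude (here approximated by mean absolute frame difference, one of the cheapest motion representations) defines a cumulative distribution, and frames are picked at evenly spaced quantiles of accumulated motion, so high-motion segments contribute more frames.

```python
# Hypothetical sketch: sample frames uniformly along the cumulative motion
# curve rather than uniformly in time.
import torch

def motion_guided_sample(video, num_frames=8):
    # video: (T, C, H, W); frame difference as a cheap motion representation
    diff = (video[1:] - video[:-1]).abs().mean(dim=(1, 2, 3))
    motion = torch.cat([diff[:1], diff])            # pad the first frame
    cdf = motion.cumsum(0) / motion.sum()           # cumulative motion in [0, 1]
    targets = torch.linspace(0, 1, num_frames)      # evenly spaced quantiles
    idx = torch.searchsorted(cdf, targets).clamp(max=len(cdf) - 1)
    return video[idx]

clip = motion_guided_sample(torch.randn(64, 3, 32, 32))
print(clip.shape)  # torch.Size([8, 3, 32, 32])
```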

Temporal Difference Networks for Action Recognition

no code implementations • 1 Jan 2021 • LiMin Wang, Bin Ji, Zhan Tong, Gangshan Wu

To mitigate this issue, this paper presents a new video architecture, termed Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition.

Action Recognition In Videos

TDN: Temporal Difference Networks for Efficient Action Recognition

1 code implementation • CVPR 2021 • LiMin Wang, Zhan Tong, Bin Ji, Gangshan Wu

To mitigate this issue, this paper presents a new video architecture, termed Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition.

Action Classification • Action Recognition In Videos
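A hypothetical sketch of the short-term temporal-difference idea: feed the network explicit differences of neighboring frames as a motion signal alongside RGB. TDN's actual modules are more elaborate and also include a long-term, cross-segment difference; the module below is illustrative only.

```python
# Hypothetical sketch: adjacent-frame differences as an explicit short-term
# motion signal, encoded with a small convolution and fused over time.
import torch
import torch.nn as nn

class ShortTermTDM(nn.Module):
    def __init__(self, channels=3, dim=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, dim, 3, padding=1)

    def forward(self, frames):                  # frames: (B, T, C, H, W)
        diff = frames[:, 1:] - frames[:, :-1]   # adjacent-frame differences
        B, Td, C, H, W = diff.shape
        motion = self.conv(diff.reshape(B * Td, C, H, W))
        return motion.reshape(B, Td, -1, H, W).mean(dim=1)  # fused motion map

m = ShortTermTDM()(torch.randn(2, 5, 3, 56, 56))
print(m.shape)  # torch.Size([2, 64, 56, 56])
```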
