Search Results for author: Zhan Tong

Found 18 papers, 13 papers with code

Contextual AD Narration with Interleaved Multimodal Sequence

no code implementations • 19 Mar 2024 • Hanlin Wang, Zhan Tong, Kecheng Zheng, Yujun Shen, LiMin Wang

With video features, text, a character bank, and context information as inputs, the generated ADs can refer to characters by name and provide reasonable, contextual descriptions that help the audience understand the movie's storyline.
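As a rough illustration of what such an interleaved multimodal sequence could look like, here is a hypothetical PyTorch sketch (the class, names, and dimensions are invented for illustration, not taken from the paper): modality-specific projections map context ADs, character-bank entries, and video features into one shared sequence for a text decoder.

```python
# Hypothetical sketch (not the authors' code): assembling an interleaved
# multimodal sequence from video features, a character bank, and context ADs.
import torch
import torch.nn as nn

class InterleavedADInput(nn.Module):
    """Projects each modality into a shared embedding space and
    concatenates them into one interleaved sequence for a text decoder."""

    def __init__(self, vis_dim=768, txt_dim=512, d_model=1024):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)   # video clip features
        self.txt_proj = nn.Linear(txt_dim, d_model)   # context ADs / character names

    def forward(self, video_feats, char_bank, context_ads):
        # video_feats: (T, vis_dim); char_bank, context_ads: (N, txt_dim)
        parts = [
            self.txt_proj(context_ads),   # previous ADs give story context
            self.txt_proj(char_bank),     # character-name embeddings
            self.vis_proj(video_feats),   # current clip
        ]
        return torch.cat(parts, dim=0)    # (L, d_model) interleaved sequence

feats = InterleavedADInput()(torch.randn(16, 768), torch.randn(4, 512), torch.randn(3, 512))
print(feats.shape)  # torch.Size([23, 1024])
```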

Bootstrapping SparseFormers from Vision Foundation Models

1 code implementation • 4 Dec 2023 • Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou

In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way.

Advancing Vision Transformers with Group-Mix Attention

1 code implementation • 26 Nov 2023 • Chongjian Ge, Xiaohan Ding, Zhan Tong, Li Yuan, Jiangliu Wang, Yibing Song, Ping Luo

The attention map is computed from mixtures of tokens and group proxies and is used to re-combine the tokens and groups in the Value.

Image Classification • Object Detection +2
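A minimal, hypothetical sketch of the group-proxy idea (simplified relative to the paper's GMA block; the names and the average-pooling choice here are assumptions): group proxies are pooled aggregates of neighboring tokens, and the keys and values mix individual tokens with these proxies before attention re-combines both.

```python
# Hypothetical sketch of Group-Mix attention (simplified; see the paper/repo
# for the real block): group proxies are pooled aggregates of adjacent tokens,
# and K/V mix individual tokens with these proxies.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupMixAttention(nn.Module):
    def __init__(self, dim=256, group=4):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.group = group

    def _proxies(self, x):
        # Aggregate every `group` consecutive tokens into one group proxy:
        # (B, N, C) -> (B, N // group, C) via average pooling.
        return F.avg_pool1d(x.transpose(1, 2), self.group).transpose(1, 2)

    def forward(self, x):                     # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Mix tokens with group proxies in K and V.
        k = torch.cat([k, self._proxies(k)], dim=1)
        v = torch.cat([v, self._proxies(v)], dim=1)
        attn = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
        return attn.softmax(dim=-1) @ v       # re-combine tokens and groups

out = GroupMixAttention()(torch.randn(2, 64, 256))
print(out.shape)  # torch.Size([2, 64, 256])
```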

TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

1 code implementation • 23 May 2023 • Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan

We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal since it may overfit in terms of styles, thereby losing its original generalization ability to capture the semantics of various language registers.

Representation Learning
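A minimal sketch of the training recipe this argument suggests, under the assumption that the fix is to keep the language tower frozen and train only a small projection on top (the paper's exact degree of partial tuning may differ; all names here are illustrative):

```python
# Hypothetical sketch: avoid end-to-end text-encoder tuning by freezing the
# language tower and learning only a light projection into the video space.
import torch.nn as nn

def freeze_text_tower(text_encoder: nn.Module, proj: nn.Module):
    for p in text_encoder.parameters():
        p.requires_grad = False   # keep the encoder's general language ability
    for p in proj.parameters():
        p.requires_grad = True    # learn only the alignment to video features

text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(512, 8, batch_first=True), num_layers=2)
proj = nn.Linear(512, 256)
freeze_text_tower(text_encoder, proj)
print(sum(p.requires_grad for p in text_encoder.parameters()))  # 0
```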

SparseFormer: Sparse Visual Recognition via Limited Latent Tokens

1 code implementation • 7 Apr 2023 • Ziteng Gao, Zhan Tong, LiMin Wang, Mike Zheng Shou

In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate human sparse visual recognition in an end-to-end manner.

Sparse Representation-based Classification • Video Classification
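A hypothetical, stripped-down sketch of recognizing from a limited set of latent tokens: compute scales with the (small, fixed) number of latents rather than the number of pixels. The real SparseFormer also adjusts token regions of interest and samples image features sparsely; this toy version shows only the latent-token bottleneck.

```python
# Hypothetical sketch: a small, fixed set of latent tokens cross-attends to
# image features, and classification is done from the latents alone.
import torch
import torch.nn as nn

class LatentTokenRecognizer(nn.Module):
    def __init__(self, num_latents=49, dim=256, num_classes=1000):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img_feats):                  # (B, N, dim) image features
        B = img_feats.shape[0]
        q = self.latents.unsqueeze(0).expand(B, -1, -1)
        z, _ = self.cross_attn(q, img_feats, img_feats)
        return self.head(z.mean(dim=1))            # classify from latents only

logits = LatentTokenRecognizer()(torch.randn(2, 196, 256))
print(logits.shape)  # torch.Size([2, 1000])
```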

Soft Neighbors are Positive Supporters in Contrastive Visual Representation Learning

no code implementations • 30 Mar 2023 • Chongjian Ge, Jiangliu Wang, Zhan Tong, Shoufa Chen, Yibing Song, Ping Luo

We evaluate our soft neighbor contrastive learning method (SNCLR) on standard visual recognition benchmarks, including image classification, object detection, and instance segmentation.

Contrastive Learning • Image Classification +6
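A hypothetical sketch of what "soft neighbors as positives" could look like in an InfoNCE-style loss, assuming neighbors are drawn from a feature bank and weighted by a softmax over their similarities (SNCLR's actual candidate selection and weighting differ in detail; this helper is illustrative):

```python
# Hypothetical sketch: besides the usual augmented positive, nearest neighbors
# from a support bank contribute as positives, weighted by a soft score.
import torch
import torch.nn.functional as F

def soft_neighbor_nce(q, k_pos, bank, tau=0.1, topk=5):
    q, k_pos, bank = map(lambda t: F.normalize(t, dim=-1), (q, k_pos, bank))
    sim_bank = q @ bank.t()                          # (B, M) similarity to bank
    sims, idx = sim_bank.topk(topk, dim=-1)          # candidate neighbors
    soft_w = sims.softmax(dim=-1)                    # softness scores
    pos = (q * k_pos).sum(-1, keepdim=True)          # (B, 1) the hard positive
    log_p = (torch.cat([pos, sim_bank], dim=1) / tau).log_softmax(dim=1)
    # Positive mass: the augmented view plus soft-weighted neighbors
    # (idx + 1 skips the hard-positive column prepended above).
    loss = -(log_p[:, 0] + (soft_w * log_p.gather(1, idx + 1)).sum(1)) / 2
    return loss.mean()

loss = soft_neighbor_nce(torch.randn(8, 128), torch.randn(8, 128), torch.randn(256, 128))
print(loss.item())
```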

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

1 code implementation • CVPR 2023 • LiMin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao

Finally, we successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance on Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2).

 Ranked #1 on Self-Supervised Action Recognition on UCF101 (using extra training data)

Action Classification • Action Recognition In Videos +3
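A hypothetical sketch of dual masking, assuming the simple scheme of masking most tokens for the encoder and letting the decoder reconstruct only a subset of the masked tokens, which is what cuts decoder cost (indices and ratios here are illustrative):

```python
# Hypothetical sketch of dual masking: the encoder sees only visible tokens
# (high masking ratio), and the decoder reconstructs only a masked subset
# rather than all masked tokens (simplified relative to the paper).
import torch

def dual_mask(tokens, enc_keep=0.1, dec_keep=0.5):
    B, N, C = tokens.shape
    perm = torch.rand(B, N).argsort(dim=1)       # random token permutation
    n_vis = int(N * enc_keep)
    vis_idx = perm[:, :n_vis]                    # encoder input (visible)
    masked = perm[:, n_vis:]
    n_dec = int(masked.shape[1] * dec_keep)
    dec_idx = masked[:, :n_dec]                  # decoder reconstructs this subset
    gather = lambda idx: tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))
    return gather(vis_idx), gather(dec_idx)

vis, dec = dual_mask(torch.randn(2, 1568, 768))
print(vis.shape, dec.shape)  # (2, 156, 768) (2, 706, 768)
```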

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

2 code implementations • 26 May 2022 • Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, Ping Luo

To address this challenge, we propose an effective adaptation approach for Transformers, namely AdaptFormer, which can efficiently adapt pre-trained ViTs to many different image and video tasks.

Action Recognition • Video Recognition
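A minimal sketch of the adapter idea: freeze the ViT and add a lightweight bottleneck branch in parallel with each block's MLP, training only the adapter. The class below is a simplified stand-in for AdaptFormer's AdaptMLP module; the dimensions and scaling constant are illustrative.

```python
# Hypothetical simplified adapter in the AdaptFormer style: a bottleneck
# branch runs in parallel with a frozen block's MLP and is added back, scaled.
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64, scale=0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, dim)
        self.scale = scale

    def forward(self, x, mlp_out):
        # Frozen MLP output plus a scaled low-rank correction.
        return mlp_out + self.scale * self.up(self.act(self.down(x)))

adapter = ParallelAdapter()
x = torch.randn(2, 197, 768)
out = adapter(x, mlp_out=torch.randn_like(x))
print(out.shape)  # torch.Size([2, 197, 768])
```

During fine-tuning, only the adapter's parameters receive gradients, so the tunable footprint per task stays a small fraction of the backbone.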

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

4 code implementations • 23 Mar 2022 • Zhan Tong, Yibing Song, Jue Wang, LiMin Wang

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets.

4k • Action Classification +3
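VideoMAE's key design is tube masking with an extremely high ratio: one spatial mask is sampled and repeated across all frames, so a masked patch cannot be trivially recovered from its temporal neighbors. A minimal sketch (the ~90% ratio follows the paper; the helper itself is illustrative):

```python
# Hypothetical sketch of tube masking: the same spatial mask is applied to
# every frame, preventing reconstruction by simply copying nearby frames.
import torch

def tube_mask(B, T, H, W, ratio=0.9):
    n_keep = int(H * W * (1 - ratio))
    noise = torch.rand(B, H * W)
    ranks = noise.argsort(dim=1).argsort(dim=1)       # rank of each position
    spatial = ranks < n_keep                          # True = visible patch
    keep = spatial.unsqueeze(1).expand(B, T, H * W)   # same mask in every frame
    return keep.reshape(B, T * H * W)

mask = tube_mask(2, 8, 14, 14)
print(mask.sum(dim=1))  # tensor([152, 152]) = 8 frames x 19 visible patches
```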

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

1 code implementation • 16 Feb 2022 • Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie

Second, at the same computational cost, our method empowers ViTs to take more image tokens as input to improve recognition accuracy, where the image tokens come from higher-resolution images.

Efficient ViTs
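The token reorganization itself can be sketched roughly as follows, following the published recipe of ranking patch tokens by the attention they receive from [CLS], keeping the top-k, and fusing the inattentive remainder into a single token (this standalone helper is illustrative, not the repository code):

```python
# Hypothetical sketch of token reorganization: keep the top-k attentive
# tokens and fuse the rest into one attention-weighted token.
import torch

def reorganize_tokens(x, cls_attn, keep_ratio=0.7):
    # x: (B, N, C) patch tokens; cls_attn: (B, N) attention received from [CLS]
    B, N, C = x.shape
    k = int(N * keep_ratio)
    idx = cls_attn.topk(k, dim=1).indices
    kept = x.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))
    inattentive = torch.ones(B, N, dtype=torch.bool).scatter(1, idx, False)
    w = cls_attn * inattentive                     # zero out the kept tokens
    w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-6)
    fused = (w.unsqueeze(-1) * x).sum(dim=1, keepdim=True)
    return torch.cat([kept, fused], dim=1)         # (B, k + 1, C)

out = reorganize_tokens(torch.randn(2, 196, 768), torch.rand(2, 196))
print(out.shape)  # torch.Size([2, 138, 768])
```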

EViT: Expediting Vision Transformers via Token Reorganizations

1 code implementation • ICLR 2022 • Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie

Second, at the same computational cost, our method empowers ViTs to take more image tokens as input to improve recognition accuracy, where the image tokens come from higher-resolution images.

MGSampler: An Explainable Sampling Strategy for Video Action Recognition

1 code implementation • ICCV 2021 • Yuan Zhi, Zhan Tong, LiMin Wang, Gangshan Wu

First, we present two different motion representations that let us efficiently distinguish motion-salient frames from the background.

Action Recognition • Temporal Action Localization
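A hypothetical sketch of motion-guided sampling in the spirit of the paper: per-frame motion magnitude (here approximated by mean absolute frame difference, one of the cheapest motion representations) defines a cumulative distribution, and frames are picked at evenly spaced quantiles of accumulated motion, so high-motion segments contribute more frames.

```python
# Hypothetical sketch: sample frames uniformly along the cumulative motion
# curve rather than uniformly in time.
import torch

def motion_guided_sample(video, num_frames=8):
    # video: (T, C, H, W); frame difference as a cheap motion representation
    diff = (video[1:] - video[:-1]).abs().mean(dim=(1, 2, 3))
    motion = torch.cat([diff[:1], diff])            # pad the first frame
    cdf = motion.cumsum(0) / motion.sum()           # cumulative motion in [0, 1]
    targets = torch.linspace(0, 1, num_frames)      # evenly spaced quantiles
    idx = torch.searchsorted(cdf, targets).clamp(max=len(cdf) - 1)
    return video[idx]

clip = motion_guided_sample(torch.randn(64, 3, 32, 32))
print(clip.shape)  # torch.Size([8, 3, 32, 32])
```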

Temporal Difference Networks for Action Recognition

no code implementations • 1 Jan 2021 • LiMin Wang, Bin Ji, Zhan Tong, Gangshan Wu

To mitigate this issue, this paper presents a new video architecture, termed Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition.

Action Recognition In Videos

TDN: Temporal Difference Networks for Efficient Action Recognition

1 code implementation • CVPR 2021 • LiMin Wang, Zhan Tong, Bin Ji, Gangshan Wu

To mitigate this issue, this paper presents a new video architecture, termed Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition.

Action Classification • Action Recognition In Videos
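A hypothetical sketch of the short-term temporal-difference idea: feed the network explicit differences of neighboring frames as a motion signal alongside RGB. TDN's actual modules are more elaborate and also include a long-term, cross-segment difference; the module below is illustrative only.

```python
# Hypothetical sketch: adjacent-frame differences as an explicit short-term
# motion signal, encoded with a small convolution and fused over time.
import torch
import torch.nn as nn

class ShortTermTDM(nn.Module):
    def __init__(self, channels=3, dim=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, dim, 3, padding=1)

    def forward(self, frames):                  # frames: (B, T, C, H, W)
        diff = frames[:, 1:] - frames[:, :-1]   # adjacent-frame differences
        B, Td, C, H, W = diff.shape
        motion = self.conv(diff.reshape(B * Td, C, H, W))
        return motion.reshape(B, Td, -1, H, W).mean(dim=1)  # fused motion map

m = ShortTermTDM()(torch.randn(2, 5, 3, 56, 56))
print(m.shape)  # torch.Size([2, 64, 56, 56])
```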
