Search Results for author: Mohamed Omar

Found 6 papers, 0 papers with code

Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment

no code implementations • ICCV 2023 • Sarah Ibrahimi, Xiaohang Sun, Pichao Wang, Amanmeet Garg, Ashutosh Sanan, Mohamed Omar

Nonetheless, the objective of the text-to-video retrieval task is to capture the complementary audio and video information that is pertinent to the text query rather than simply achieving better audio and video alignment.

Ranked #10 on Video Retrieval on MSR-VTT

Retrieval, Text to Video Retrieval, +2

Selective Structured State-Spaces for Long-Form Video Understanding

no code implementations • CVPR 2023 • Jue Wang, Wentao Zhu, Pichao Wang, Xiang Yu, Linda Liu, Mohamed Omar, Raffay Hamid

To address this limitation, we present a novel Selective S4 (i.e., S5) model that employs a lightweight mask generator to adaptively select informative image tokens, resulting in more efficient and accurate modeling of long-term spatiotemporal dependencies in videos.

Ranked #2 on Video Classification on Breakfast

Contrastive Learning, Token Reduction, +2

Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

no code implementations • 19 Mar 2023 • Wentao Zhu, Mohamed Omar

Audio events have a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes.

Ranked #12 on Audio Classification on VGGSound

Audio Classification, Representation Learning

Dynamic Inference With Grounding Based Vision and Language Models

no code implementations • CVPR 2023 • Burak Uzkent, Amanmeet Garg, Wentao Zhu, Keval Doshi, Jingru Yi, Xiaolong Wang, Mohamed Omar

For example, recent image and language models with more than 200M parameters have been proposed to learn visual grounding in the pre-training step and show impressive results on downstream vision and language tasks.

Language Modelling, Referring Expression, +3

AVT: Audio-Video Transformer for Multimodal Action Recognition

no code implementations • Submitted to ICLR 2022 • Wentao Zhu, Jingru Yi, Kevin Hsu, Xiaohang Sun, Xiang Hao, Linda Liu, Mohamed Omar

AVT uses a combination of video and audio signals to improve action recognition accuracy, leveraging the effective spatio-temporal representation by the video Transformer.

Ranked #4 on Multi-modal Classification on VGG-Sound

Action Recognition, Audio Classification, +3

Multiscale Multimodal Transformer for Multimodal Action Recognition

no code implementations • Submitted to ICLR 2022 • Wentao Zhu, Jingru Yi, Xiaohang Sun, Xiang Hao, Linda Liu, Mohamed Omar

In this work, we develop a multiscale multimodal Transformer (MMT) that employs hierarchical representation learning.

Ranked #1 on Multi-modal Classification on VGG-Sound

Action Recognition, Audio Classification, +2
