Search Results for author: Xiangyu Wu

Found 14 papers, 3 papers with code

The Solution for the CVPR2024 NICE Image Captioning Challenge

no code implementations • 19 Apr 2024 • Longfei Huang, Shupeng Zhong, Xiangyu Wu, Ruoxuan Li, QingGuo Chen, Yang Yang

Subsequently, we propose a caption-level strategy for the high-quality caption data generated by the image caption models and integrate it with a retrieval augmentation strategy in the template, compelling the model to generate higher-quality, better-matching, and semantically richer captions based on the retrieval augmentation prompts.

Image Captioning, Retrieval

Solution for SMART-101 Challenge of ICCV Multi-modal Algorithmic Reasoning Task 2023

no code implementations • 10 Oct 2023 • Xiangyu Wu, Yang Yang, Shengdong Xu, Yifeng Wu, QingGuo Chen, Jianfeng Lu

At the data level, inspired by the challenge paper, we categorized all questions into eight types and used the llama-2-chat model to directly generate the type for each question in a zero-shot manner.

Object Detection, +3

ContentCTR: Frame-level Live Streaming Click-Through Rate Prediction with Multimodal Transformer

no code implementations • 26 Jun 2023 • Jiaxin Deng, Dong Shen, Shiyao Wang, Xiangyu Wu, Fan Yang, Guorui Zhou, Gaofeng Meng

However, most previous works treat the live stream as a whole item and explore the Click-Through Rate (CTR) prediction framework at the item level, neglecting the dynamic changes that occur even within the same live room.

Click-Through Rate Prediction, Dynamic Time Warping, +1

CoVLR: Coordinating Cross-Modal Consistency and Intra-Modal Structure for Vision-Language Retrieval

no code implementations • 15 Apr 2023 • Yang Yang, Zhongtian Fu, Xiangyu Wu, Wenjie Li

To address this challenge, in this paper, we experimentally observe that the vision-language divergence may give rise to strong and weak modalities, and that hard cross-modal consistency cannot guarantee that the relationships among strong-modality instances are unaffected by the weak modality, so these relationships are perturbed even though consistent representations are learned. To this end, we propose a novel, directly Coordinated Vision-Language Retrieval method (dubbed CoVLR), which aims to study and alleviate the desynchrony problem between the cross-modal alignment and single-modal cluster-preserving tasks.

Cross-Modal Retrieval, Instance Search, +1

Generation-Guided Multi-Level Unified Network for Video Grounding

no code implementations • 14 Mar 2023 • Xing Cheng, Xiangyu Wu, Dong Shen, Hezheng Lin, Fan Yang

Video grounding aims to locate the timestamps best matching the query description within an untrimmed video.

Video Grounding

A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset

no code implementations • 19 Nov 2022 • Jiaxin Deng, Dong Shen, Haojie Pan, Xiangyu Wu, Ximan Liu, Gaofeng Meng, Fan Yang, Size Li, Ruiji Fu, Zhongyuan Wang

Furthermore, based on this dataset, we propose an end-to-end model that jointly optimizes the video understanding objective with knowledge graph embedding, which can not only better inject factual knowledge into video understanding but also generate effective multi-modal entity embeddings for the knowledge graph (KG).

Common Sense Reasoning, Knowledge Graph Embedding, +4

Learning a Single Near-hover Position Controller for Vastly Different Quadcopters

no code implementations • 19 Sep 2022 • Dingqi Zhang, Antonio Loquercio, Xiangyu Wu, Ashish Kumar, Jitendra Malik, Mark W. Mueller

This paper proposes an adaptive near-hover position controller for quadcopters, which can be deployed to quadcopters of very different mass, size and motor constants, and also shows rapid adaptation to unknown disturbances during runtime.

Drone Controller Position

Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss

2 code implementations • 9 Sep 2021 • Xing Cheng, Hezheng Lin, Xiangyu Wu, Fan Yang, Dong Shen

In this paper, we propose a multi-stream Corpus Alignment network with single gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss (DSL) to address these two kinds of heterogeneity.

Ranked #9 on Video Retrieval on MSVD (using extra training data)

Retrieval, Text Retrieval, +1

Real-time Geo-localization Using Satellite Imagery and Topography for Unmanned Aerial Vehicles

no code implementations • 7 Aug 2021 • Shuxiao Chen, Xiangyu Wu, Mark W. Mueller, Koushil Sreenath

The capabilities of autonomous flight with unmanned aerial vehicles (UAVs) have significantly increased in recent times.

Image-Based Localization

CAT: Cross Attention in Vision Transformer

1 code implementation • 10 Jun 2021 • Hezheng Lin, Xing Cheng, Xiangyu Wu, Fan Yang, Dong Shen, Zhongyuan Wang, Qing Song, Wei Yuan

In this paper, we propose a new attention mechanism in Transformer termed Cross Attention, which alternates attention within each image patch, rather than over the whole image, to capture local information, with attention between image patches divided from single-channel feature maps to capture global information.
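
As a rough illustration of the "attention within each image patch" half of this description, the following NumPy sketch restricts self-attention to non-overlapping patches of a feature map. It is not the authors' implementation: the function name `inner_patch_attention`, the use of identity query/key/value projections, and the single attention head are all simplifying assumptions made here, and the cross-patch attention over single-channel feature maps is not shown.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def inner_patch_attention(feat, patch):
    """Self-attention applied independently inside each patch (hypothetical helper).

    feat:  array of shape (H, W, C); H and W must be divisible by `patch`.
    patch: side length of the square, non-overlapping patches.
    Assumption: Q = K = V = the input tokens (no learned projections).
    """
    H, W, C = feat.shape
    assert H % patch == 0 and W % patch == 0

    # Partition: (H, W, C) -> (num_patches, patch*patch, C)
    x = feat.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch, C)

    # Scaled dot-product attention among the tokens of each patch only.
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(C)
    out = softmax(scores, axis=-1) @ x

    # Undo the partition: (num_patches, patch*patch, C) -> (H, W, C)
    out = out.reshape(H // patch, W // patch, patch, patch, C)
    return out.transpose(0, 2, 1, 3, 4).reshape(H, W, C)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.normal(size=(8, 8, 4))
    print(inner_patch_attention(feat, patch=4).shape)
```

Because each token attends only to the tokens in its own patch, the attention cost grows with the patch size rather than with the full image size, which is the locality the snippet above refers to.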
