no code implementations • 12 Mar 2024 • Jiuniu Wang, Zehua Du, Yuyuan Zhao, Bo Yuan, Kexiang Wang, Jian Liang, Yaxi Zhao, Yihen Lu, Gengliang Li, Junlong Gao, Xin Tu, Zhenyu Guo
In the Horizontal Layer, we introduce a novel RAG-based evolutionary system that optimizes both the overall video generation workflow and the individual steps within it.
1 code implementation • 3 Feb 2024 • Wenjia Xu, Jiuniu Wang, Zhiwei Wei, Mugen Peng, Yirong Wu
Moreover, pioneering ZSL models use convolutional neural networks pre-trained on ImageNet, which focus on the main objects appearing in each image, neglecting the background context that also matters in RS scene classification.
no code implementations • 12 Oct 2023 • Haoling Li, Jiuniu Wang, Zhiwei Wei, Wenjia Xu
Our GLVL network is a two-stage visual localization approach that combines a large-scale retrieval module, which finds regions similar to the UAV flight scene, with a fine-grained matching module that localizes the precise UAV coordinates, enabling real-time and accurate localization.
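The retrieve-then-match idea can be illustrated with a minimal sketch. This is not the paper's actual code: the descriptors, the cosine measure, and the match threshold are all illustrative stand-ins for the learned retrieval and matching modules.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def coarse_retrieve(query_desc, region_descs, top_k=2):
    """Stage 1: rank candidate map regions by global-descriptor similarity."""
    ranked = sorted(region_descs.items(),
                    key=lambda kv: cosine(query_desc, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

def fine_match(query_kps, region_kps, thresh=0.9):
    """Stage 2: count local keypoint correspondences (similarity proxy)."""
    return sum(1 for q in query_kps
               if any(cosine(q, r) > thresh for r in region_kps))

def localize(query_desc, query_kps, regions):
    """Pick the candidate region with the most fine-grained matches."""
    candidates = coarse_retrieve(query_desc,
                                 {n: r["desc"] for n, r in regions.items()})
    return max(candidates, key=lambda n: fine_match(query_kps, regions[n]["kps"]))
```

The two-stage split keeps the expensive local matching restricted to a handful of retrieved candidates, which is what makes real-time operation plausible.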
3 code implementations • 12 Aug 2023 • Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang
This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion).
Ranked #8 on Text-to-Video Generation on MSR-VTT
4 code implementations • NeurIPS 2023 • Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou
The pursuit of controllability as a higher standard of visual content creation has yielded remarkable progress in customizable image synthesis.
Ranked #5 on Text-to-Video Generation on EvalCrafter Text-to-Video (ECTV) Dataset (using extra training data)
no code implementations • 8 Aug 2022 • Youyuan Zhang, Jiuniu Wang, Hao Wu, Wenjia Xu
Image captioning models are usually trained according to human annotated ground-truth captions, which could generate accurate but generic captions.
no code implementations • 18 Jul 2022 • Wenjia Xu, Jiuniu Wang, Yirong Wu
In this paper, we propose a Multi-dimension Feature Learning Model (MDFL) using high-dimensional GBD data in conjunction with RS images for urban region function recognition.
no code implementations • 8 Apr 2022 • Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan
First, we propose a distinctiveness metric -- between-set CIDEr (CIDErBtw) to evaluate the distinctiveness of a caption with respect to those of similar images.
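The between-set idea can be sketched as follows. Here a simple n-gram cosine overlap stands in for the full CIDEr computation (which uses TF-IDF-weighted n-grams up to 4); the function names and the proxy score are assumptions for illustration, not the paper's implementation. A caption that closely resembles the references of similar images gets a high between-set score, i.e., it is less distinctive.

```python
import math
from collections import Counter

def ngram_cosine(cap1, cap2, n=1):
    """Toy proxy for CIDEr: cosine similarity over n-gram count vectors."""
    def grams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    g1, g2 = grams(cap1), grams(cap2)
    dot = sum(g1[g] * g2[g] for g in g1)
    norm = (math.sqrt(sum(v * v for v in g1.values())) *
            math.sqrt(sum(v * v for v in g2.values())))
    return dot / norm if norm else 0.0

def cider_btw(caption, similar_image_captions):
    """Between-set score: mean similarity of a caption to the reference
    captions of visually similar images. Lower means more distinctive."""
    scores = [ngram_cosine(caption, ref) for ref in similar_image_captions]
    return sum(scores) / len(scores)
```

A generic caption scores high against similar images' references, while a caption mentioning image-specific details scores low, which matches the intended use of the metric as a distinctiveness signal.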
no code implementations • 4 Apr 2022 • Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, Zeynep Akata
While a visual-semantic embedding layer learns global features, local features are learned through an attribute prototype network that simultaneously regresses and decorrelates attributes from intermediate features.
Ranked #5 on GZSL Video Classification on ActivityNet-GZSL (main)
1 code implementation • CVPR 2022 • Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, Zeynep Akata
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity, and further imposes their class discrimination and semantic relatedness.
no code implementations • 20 Aug 2021 • Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan
In particular, we propose a group-based memory attention (GMA) module, which stores object features that are unique among the image group (i.e., with low similarity to objects in other images).
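The uniqueness test behind this idea can be sketched in a few lines. This is a hypothetical simplification, not the GMA module itself: it keeps an object feature only when its best cosine match against objects from the other images in the group falls below a threshold, which is the "low similarity to objects in other images" criterion stated above.

```python
import math

def _cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def unique_object_features(own_feats, other_image_feats, threshold=0.8):
    """Keep features whose max similarity to any object appearing in the
    other images of the group is below the threshold (illustrative proxy
    for what a GMA-style memory would store)."""
    kept = []
    for f in own_feats:
        max_sim = max((_cosine(f, g)
                       for feats in other_image_feats for g in feats),
                      default=0.0)
        if max_sim < threshold:
            kept.append(f)
    return kept
```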
no code implementations • 29 Sep 2020 • Wenjia Xu, Jiuniu Wang, Yang Wang, Guangluan Xu, Wei Dai, Yirong Wu
We generate attribute-based textual explanations for the network and ground the attributes on the image to show visual explanations.
1 code implementation • 2 Sep 2020 • Jiuniu Wang, Wenjia Xu, Xingyu Fu, Guangluan Xu, Yirong Wu
Under such circumstances, how to make full use of the information captured by word embeddings requires more in-depth research.
1 code implementation • 2 Sep 2020 • Jiuniu Wang, Wenjia Xu, Xingyu Fu, Yang Wei, Li Jin, Ziyan Chen, Guangluan Xu, Yirong Wu
This model enhances the question answering system in the multi-document scenario from three aspects: model structure, optimization goal, and training method, corresponding to Multilayer Attention (MA), Cross Evidence (CE), and Adversarial Training (AT) respectively.
no code implementations • NeurIPS 2020 • Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, Zeynep Akata
As an additional benefit, our model points to the visual evidence of the attributes in an image, e.g., for the CUB dataset, confirming the improved attribute localization ability of our image representation.
no code implementations • ECCV 2020 • Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan
A wide range of image captioning models has been developed, achieving significant improvement based on popular metrics, such as BLEU, CIDEr, and SPICE.
no code implementations • 3 Sep 2018 • Jiuniu Wang, Xingyu Fu, Guangluan Xu, Yirong Wu, Ziyan Chen, Yang Wei, Li Jin
Meanwhile, we construct A3Net for the WebQA dataset.