no code implementations • 14 Mar 2024 • Sipeng Zheng, Bohan Zhou, Yicheng Feng, Ye Wang, Zongqing Lu
In this paper, we propose UniCode, a novel approach within the domain of multimodal large language models (MLLMs) that learns a unified codebook to efficiently tokenize visual, text, and potentially other types of signals.
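As a rough illustration of the unified-codebook idea, the sketch below quantizes continuous features from any modality encoder against a single shared code table, so visual and text inputs end up in one discrete token vocabulary. The module names, sizes, and nearest-neighbour lookup are assumptions for illustration, not UniCode's actual implementation.

```python
# Hypothetical sketch of a shared (unified) codebook across modalities.
import torch
import torch.nn as nn

class UnifiedCodebook(nn.Module):
    def __init__(self, num_codes=8192, dim=512):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)  # one code table shared by all modalities

    def forward(self, z):
        # z: (batch, seq, dim) continuous features from any modality encoder
        flat = z.reshape(-1, z.size(-1))
        dist = torch.cdist(flat, self.codes.weight)      # distance to every code
        idx = dist.argmin(dim=-1).reshape(z.shape[:-1])  # nearest-code ids = discrete tokens
        return idx, self.codes(idx)                      # token ids + quantized embeddings

codebook = UnifiedCodebook()
vis_tokens, _ = codebook(torch.randn(2, 196, 512))  # e.g. image patch features
txt_tokens, _ = codebook(torch.randn(2, 32, 512))   # e.g. projected text features
```

Both modalities map into the same discrete vocabulary, which is what lets a single LLM consume and generate them uniformly.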
1 code implementation • 9 Mar 2024 • Boshen Xu, Sipeng Zheng, Qin Jin
We humans are good at translating third-person observations of hand-object interactions (HOI) into an egocentric view.
1 code implementation • 9 Mar 2024 • Boshen Xu, Sipeng Zheng, Qin Jin
We introduce SPAFormer, an innovative model designed to overcome the combinatorial explosion challenge in the 3D Part Assembly (3D-PA) task.
no code implementations • 20 Oct 2023 • Sipeng Zheng, Jiazheng Liu, Yicheng Feng, Zongqing Lu
Steve-Eye integrates the LLM with a visual encoder, enabling it to process visual-text inputs and generate multimodal feedback.
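A minimal sketch of that coupling, assuming the common recipe of projecting patch features into the LLM's embedding space and prepending them to the text token embeddings; the class and dimensions below are placeholders, not Steve-Eye's actual code.

```python
import torch
import torch.nn as nn

class VisualTextPrefix(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)  # map vision features into the LLM space

    def forward(self, patch_feats, text_embeds):
        # patch_feats: (B, N_patches, vis_dim); text_embeds: (B, N_tokens, llm_dim)
        vis_embeds = self.proj(patch_feats)
        # the LLM then consumes the concatenated multimodal sequence
        return torch.cat([vis_embeds, text_embeds], dim=1)

fusion = VisualTextPrefix()
seq = fusion(torch.randn(1, 256, 1024), torch.randn(1, 16, 4096))  # (1, 272, 4096)
```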
no code implementations • 13 Oct 2023 • Yicheng Feng, Yuxuan Wang, Jiazheng Liu, Sipeng Zheng, Zongqing Lu
Recently, various studies have leveraged Large Language Models (LLMs) to aid decision-making and planning in environments, trying to align the LLMs' knowledge with world conditions.
no code implementations • 20 Jul 2023 • Qi Zhang, Sipeng Zheng, Qin Jin
Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video.
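As a generic baseline for what TVG asks of a model (not the specific method of the paper above), the sketch below scores candidate (start, end) windows by the similarity between a sentence embedding and pooled clip features, then returns the best-scoring interval.

```python
import torch

def ground_query(clip_feats, query_emb, proposals):
    # clip_feats: (T, D) per-second video features; query_emb: (D,) sentence embedding
    # proposals: list of candidate (start, end) indices into the T timesteps
    scores = []
    for s, e in proposals:
        pooled = clip_feats[s:e].mean(dim=0)  # average features inside the window
        scores.append(torch.cosine_similarity(pooled, query_emb, dim=0))
    best = int(torch.stack(scores).argmax())
    return proposals[best]

# Example: pick the best of three candidate intervals in a 120-second video.
feats, query = torch.randn(120, 512), torch.randn(512)
print(ground_query(feats, query, [(0, 30), (30, 80), (80, 120)]))
```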
1 code implementation • 12 Mar 2023 • Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin Jin
In this paper, we extend the state-of-the-art vision-language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
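A hedged sketch of the training signal such an extension typically relies on: an audio encoder is projected into the CLIP joint space and aligned to its paired modality with a symmetric contrastive (InfoNCE) loss. The function below is illustrative and not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(audio_emb, other_emb, temperature=0.07):
    # audio_emb, other_emb: (B, D) embeddings assumed paired row-by-row
    a = F.normalize(audio_emb, dim=-1)
    o = F.normalize(other_emb, dim=-1)
    logits = a @ o.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(a.size(0))          # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```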
no code implementations • CVPR 2023 • Sipeng Zheng, Boshen Xu, Qin Jin
Human-object interaction (HOI) has long been plagued by the conflict between limited supervised data and a vast number of possible interaction combinations in real life.
no code implementations • 10 Aug 2022 • Sipeng Zheng, Qi Zhang, Bei Liu, Qin Jin, Jianlong Fu
In this paper, we present our technical report for the Ego4D Natural Language Query Challenge at CVPR 2022.
no code implementations • CVPR 2022 • Sipeng Zheng, ShiZhe Chen, Qin Jin
Most previous works adopt a multi-stage framework for video visual relation detection (VidVRD), which cannot capture long-term spatiotemporal context across stages and also suffers from inefficiency.