Search Results for author: Sipeng Zheng

Found 10 papers, 3 papers with code

UniCode: Learning a Unified Codebook for Multimodal Large Language Models

no code implementations • 14 Mar 2024 • Sipeng Zheng, Bohan Zhou, Yicheng Feng, Ye Wang, Zongqing Lu

In this paper, we propose UniCode, a novel approach within the domain of multimodal large language models (MLLMs) that learns a unified codebook to efficiently tokenize visual, text, and potentially other types of signals.

Quantization • Visual Question Answering (VQA)

POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World

1 code implementation • 9 Mar 2024 • Boshen Xu, Sipeng Zheng, Qin Jin

We humans are good at translating third-person observations of hand-object interactions (HOI) into an egocentric view.

SPAFormer: Sequential 3D Part Assembly with Transformers

1 code implementation • 9 Mar 2024 • Boshen Xu, Sipeng Zheng, Qin Jin

We introduce SPAFormer, an innovative model designed to overcome the combinatorial explosion challenge in the 3D Part Assembly (3D-PA) task.

Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds

no code implementations • 20 Oct 2023 • Sipeng Zheng, Jiazheng Liu, Yicheng Feng, Zongqing Lu

Steve-Eye integrates the LLM with a visual encoder which enables it to process visual-text inputs and generate multimodal feedback.

LLaMA Rider: Spurring Large Language Models to Explore the Open World

no code implementations • 13 Oct 2023 • Yicheng Feng, Yuxuan Wang, Jiazheng Liu, Sipeng Zheng, Zongqing Lu

Recently, various studies have leveraged Large Language Models (LLMs) to support decision-making and planning in environments and to align the LLMs' knowledge with world conditions.

Decision Making

No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection

no code implementations • 20 Jul 2023 • Qi Zhang, Sipeng Zheng, Qin Jin

Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video.

Boundary Detection • Video Grounding

Accommodating Audio Modality in CLIP for Multimodal Processing

1 code implementation • 12 Mar 2023 • Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin Jin

In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.

AudioCaps • Contrastive Learning • +4

Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework

no code implementations • CVPR 2023 • Sipeng Zheng, Boshen Xu, Qin Jin

Human-object interaction (HOI) has long been plagued by the conflict between limited supervised data and a vast number of possible interaction combinations in real life.

Human-Object Interaction Detection • Language Modelling

Exploring Anchor-based Detection for Ego4D Natural Language Query

no code implementations • 10 Aug 2022 • Sipeng Zheng, Qi Zhang, Bei Liu, Qin Jin, Jianlong Fu

In this paper, we present the technical report for the Ego4D natural language query challenge at CVPR 2022.

Video Understanding

VRDFormer: End-to-End Video Visual Relation Detection With Transformers

no code implementations • CVPR 2022 • Sipeng Zheng, ShiZhe Chen, Qin Jin

Most previous works adopt a multi-stage framework for video visual relation detection (VidVRD), which cannot capture long-term spatiotemporal contexts in different stages and also suffers from inefficiency.

Object • Relation • +3
