In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image.
Motivated by these challenges, this paper presents AIOS, an LLM agent operating system that embeds large language models into the operating system (OS) as the brain of the OS, enabling an OS "with soul" -- an important step toward AGI.
Sora is the first large-scale generalist video generation model that has garnered significant attention across society.
We introduce VoiceCraft, a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) for audiobooks, internet videos, and podcasts.
Recent advancements in diffusion models have positioned them at the forefront of image generation.
We attempt to narrow this gap by mining the potential of VLMs for better performance and an any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation.
Recognizing the complementary strengths and weaknesses of both text and visual prompts, we introduce T-Rex2 that synergizes both prompts within a single model through contrastive learning.
To overcome these limitations, we introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200, or more frames with smooth transitions.
In this work, we present GLEE, an object-level foundation model for locating and identifying objects in images and videos.
Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks.