Sora is the first large-scale generalist video generation model to garner significant attention across society.
Recent advancements in diffusion models have positioned them at the forefront of image generation.
To overcome these limitations, we introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1,200, or more frames with smooth transitions.
Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks.
Recognizing the complementary strengths and weaknesses of text and visual prompts, we introduce T-Rex2, which synergizes both prompt types within a single model through contrastive learning.
Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime.
Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets.
We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos.
In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning.
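The slow inference the abstract refers to comes from the sequential nature of diffusion sampling: each output requires many denoising steps, and the steps cannot be parallelized because each one consumes the previous result. A minimal, purely illustrative sketch (a toy scalar "sampler", not any model from the paper) shows why cost scales linearly with the step count:

```python
def toy_iterative_denoise(x_noisy: float, steps: int = 50) -> float:
    """Toy stand-in for a diffusion sampler's reverse process.

    Each iteration applies a small correction that depends on the
    previous iterate, so the `steps` iterations must run sequentially;
    this serial chain is the inference bottleneck the text describes.
    The 'learned' correction here is just shrinkage toward zero.
    """
    x = x_noisy
    for t in range(steps, 0, -1):
        x = x - x / t  # hypothetical per-step correction
    return x

# Halving `steps` roughly halves wall-clock cost, which is why
# distillation to few-step samplers is attractive.
```

The sketch only demonstrates the serial dependency structure; real samplers replace the shrinkage line with a learned noise-prediction network evaluated once per step.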
The challenge is that information entropy may be a suboptimal compression metric: (i) it only leverages unidirectional context and may fail to capture all essential information needed for prompt compression; (ii) it is not aligned with the prompt compression objective.
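Point (i) can be made concrete with a small sketch of entropy-scored prompt compression: tokens are scored by their surprisal under a unidirectional model, and the lowest-information tokens are dropped. The unigram probabilities below are a deliberately crude stand-in for a causal LM (an assumption for illustration, not the method the text critiques), but they exhibit the same failure mode: the score sees only limited context, so it may discard tokens that matter for the downstream task.

```python
import math
from collections import Counter

def token_information(tokens):
    # Surprisal info(w) = -log2 p(w), with p estimated from the
    # prompt itself (unigram stand-in for an LM's unidirectional score).
    counts = Counter(tokens)
    total = len(tokens)
    return {w: -math.log2(counts[w] / total) for w in counts}

def compress(tokens, keep_ratio=0.5):
    # Rank positions by surprisal, keep the top-k most informative
    # tokens, and restore their original order.
    info = token_information(tokens)
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)),
                 key=lambda i: info[tokens[i]],
                 reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

# Frequent tokens carry low surprisal and are dropped first,
# regardless of whether the task actually needs them.
```

Because the score is computed per token from one direction only and is never tied to the compression objective, frequent but task-critical tokens (negations, entity mentions repeated for emphasis) can be pruned, which is exactly the misalignment the passage identifies.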