For fine-grained language understanding, we train a Multimodal Large Language Model to refine image captions.
We propose GLACE, which integrates pre-trained global and local encodings and enables scene coordinate regression (SCR) to scale to large scenes with only a single small-sized network.
We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy.
Image diffusion models have been utilized in various tasks, such as text-to-image generation and controllable image synthesis.
Diffusion models have demonstrated great success in text-to-video (T2V) generation.
Moreover, since current diffusion-based approaches are typically built on pre-trained text-to-image (T2I) models, training a video VAE without regard for compatibility with existing T2I models creates a latent-space gap between the two; bridging this gap requires substantial training compute, even when the T2I models are used as initialization.
In our GNN-RAG framework, the GNN acts as a dense subgraph reasoner to extract useful graph information, while the LLM leverages its natural language processing ability for ultimate KGQA.
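The division of labor described above (GNN as retriever over the subgraph, LLM as the final reasoner) can be sketched minimally as follows. This is an illustrative outline, not the authors' actual implementation: `gnn_score`, `llm_generate`, and the triple-keyed `subgraph` representation are hypothetical placeholders for the trained GNN, the LLM, and the extracted reasoning paths.

```python
def gnn_rag_answer(question, subgraph, gnn_score, llm_generate, top_k=2):
    """Hypothetical GNN-RAG-style pipeline for KGQA.

    subgraph maps each candidate answer node to the (head, relation, tail)
    triples on its reasoning path from the question entities.
    """
    # 1. The GNN acts as a dense subgraph reasoner: score candidate nodes.
    scores = {node: gnn_score(question, node) for node in subgraph}
    candidates = sorted(scores, key=scores.get, reverse=True)[:top_k]

    # 2. Verbalize the reasoning paths of the top-scored candidates.
    facts = [f"{h} --{r}--> {t}"
             for node in candidates
             for (h, r, t) in subgraph[node]]

    # 3. The LLM consumes the verbalized graph facts and answers in
    #    natural language.
    prompt = ("Question: " + question + "\n"
              "Knowledge graph facts:\n" + "\n".join(facts) + "\n"
              "Answer:")
    return llm_generate(prompt)
```

With stub scoring and generation functions, only the paths of the highest-scored candidates reach the LLM prompt, which mirrors the retrieve-then-reason split the framework describes.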
Large Language Models (LLMs) are often described as instances of foundation models: models that transfer strongly across various tasks and conditions in a few-shot or zero-shot manner, while exhibiting scaling laws that predict performance improvements as the pre-training scale increases.
While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale.
Photorealistic 3D reconstruction of street scenes is a critical technique for developing real-world simulators for autonomous driving.