For vision tasks, since image classification aligns with neither characteristic (it is neither long-sequence nor autoregressive), we hypothesize that Mamba is not necessary for this task. Detection and segmentation are likewise not autoregressive, but they do exhibit the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential for them.
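To make the two characteristics concrete, here is a minimal sketch of the standard linear state-space recurrence that underlies Mamba-style token mixing (illustrative NumPy, not code from the paper): the hidden state is updated strictly left-to-right in constant memory, which is precisely what rewards long, causally ordered sequences.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    The state is updated strictly left-to-right, so memory cost is O(1) in
    sequence length, but each output can only attend to earlier positions --
    the two properties that suit autoregressive, long-sequence tasks.
    """
    seq_len, _ = x.shape
    h = np.zeros(A.shape[0])
    y = np.empty((seq_len, C.shape[0]))
    for t in range(seq_len):
        h = A @ h + B @ x[t]   # causal state update
        y[t] = C @ h           # readout for position t
    return y

# Toy usage: 1024-token sequence, 4-dim inputs, 8-dim state.
rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 4))
A = 0.9 * np.eye(8)                # decaying memory
B = 0.1 * rng.normal(size=(8, 4))
C = rng.normal(size=(2, 8))
print(ssm_scan(x, A, B, C).shape)  # (1024, 2)
```

An image classifier's patch sequence is short (196 tokens at 224x224 resolution with 16x16 patches) and carries no causal ordering, so neither property of this scan is exercised; detection and segmentation inputs, by contrast, can run to thousands of tokens.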
For fine-grained language understanding, we train a Multimodal Large Language Model to refine image captions.
Empirical results demonstrate the effectiveness of Grounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining a 54.3 AP on the COCO detection benchmark and a 55.7 AP on the LVIS-minival zero-shot transfer benchmark, setting new records for open-set object detection.
Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks.
Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset.
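As a sketch of what "out-of-the-box zero-shot" means operationally (a hypothetical interface; `LastValueModel`, `predict`, and `zero_shot_eval` are illustrative names, not the paper's API): the pretrained model sees the target series only as inference-time context and is never fit on it.

```python
import numpy as np

class LastValueModel:
    """Stand-in for a pretrained forecaster (hypothetical interface):
    a naive persistence baseline that repeats the last observed value."""
    def predict(self, context: np.ndarray, horizon: int) -> np.ndarray:
        return np.repeat(context[-1], horizon)

def zero_shot_eval(model, series: np.ndarray, horizon: int) -> float:
    """Zero-shot protocol: the target series supplies only inference-time
    context; no parameters are fit on it. Returns MAE over the horizon."""
    context, future = series[:-horizon], series[-horizon:]
    forecast = model.predict(context=context, horizon=horizon)
    return float(np.mean(np.abs(forecast - future)))

series = np.sin(np.linspace(0, 20, 224))  # unseen toy target dataset
print(zero_shot_eval(LastValueModel(), series, horizon=24))
```

Supervised baselines would instead train on each dataset's own history, which is the gap the zero-shot model is claimed to nearly close.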
The evolution of artificial intelligence (AI) has profoundly impacted human society, driving significant advancements in multiple sectors.
The paper introduces AniTalker, an innovative framework designed to generate lifelike talking faces from a single portrait.
Can we harness the success of the scaling paradigm to benefit cartoon research?
Sora unveils the potential of scaling Diffusion Transformers to generate photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet its implementation details remain largely undisclosed.
Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs).
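For intuition, a minimal sketch of the idea (not the authors' implementation; the paper parameterizes edge functions as B-splines, while this toy uses a small sine-plus-linear basis): where an MLP applies a fixed activation at each node, a KAN layer learns a separate 1D function on every edge and merely sums at the nodes; stacking two such layers mirrors the Kolmogorov-Arnold form f(x) = Σ_q Φ_q(Σ_p φ_{q,p}(x_p)).

```python
import numpy as np

class KANLayer:
    """Toy KAN layer: one learnable 1D function per (input, output) edge.

    Each edge function is phi(x) = sum_k w_k * sin(k * x) + b * x, a small
    Fourier-plus-linear basis standing in for the paper's B-splines.
    Output j is sum_i phi_{j,i}(x_i): learnable activations live on edges,
    and nodes only sum -- the reverse of an MLP's fixed node activations.
    """
    def __init__(self, d_in, d_out, n_basis=5, rng=None):
        rng = rng or np.random.default_rng(0)
        self.k = np.arange(1, n_basis + 1)                   # basis frequencies
        self.w = rng.normal(0, 0.1, (d_out, d_in, n_basis))  # per-edge coefficients
        self.b = rng.normal(0, 0.1, (d_out, d_in))           # per-edge linear term

    def __call__(self, x):                                   # x: (batch, d_in)
        basis = np.sin(x[:, None, :, None] * self.k)         # (batch, 1, d_in, n_basis)
        edge = (self.w * basis).sum(-1) + self.b * x[:, None, :]  # (batch, d_out, d_in)
        return edge.sum(-1)                                  # sum edges into each node

x = np.random.default_rng(1).normal(size=(8, 3))
layer = KANLayer(d_in=3, d_out=4)
print(layer(x).shape)  # (8, 4)
```

In a trainable version the per-edge coefficients `w` and `b` would be optimized by gradient descent, just as an MLP's weight matrices are.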