CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification

14 Mar 2024 · Yiming Ma, Victor Sanchez, Tanaya Guha

The CLIP (Contrastive Language-Image Pretraining) model has exhibited outstanding performance in recognition problems, such as zero-shot image classification and object detection. However, its ability to count remains understudied due to the inherent challenges of transforming counting, a regression task, into a recognition task. In this paper, we investigate CLIP's potential in counting, focusing specifically on estimating crowd sizes. Existing classification-based crowd-counting methods have encountered issues, including inappropriate discretization strategies, which impede the application of CLIP and result in suboptimal performance. To address these challenges, we propose the Enhanced Blockwise Classification (EBC) framework. In contrast to previous methods, EBC relies on integer-valued bins that facilitate the learning of robust decision boundaries. Within our model-agnostic EBC framework, we introduce CLIP-EBC, the first fully CLIP-based crowd-counting model capable of generating density maps. Comprehensive evaluations across diverse crowd-counting datasets demonstrate the state-of-the-art performance of our methods. In particular, EBC can improve existing models by up to 76.9%. Moreover, our CLIP-EBC model surpasses current crowd-counting methods, achieving mean absolute errors of 55.0 and 6.3 on the ShanghaiTech part A and part B datasets, respectively. The code will be made publicly available.
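To make the idea of blockwise classification over integer-valued bins more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each image block receives class logits over a small set of integer count bins, and that a density map is obtained as the probability-weighted average of the bin values. The bin layout (`BIN_VALUES`), the representative value of the open-ended last bin, and the function names are all hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical integer-valued bins for per-block counts: 0, 1, 2, 3 and "4 or more".
# Using 4.0 as the representative value of the open-ended last bin is an assumption.
BIN_VALUES = np.array([0.0, 1.0, 2.0, 3.0, 4.0])


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def count_to_bin(count):
    """Map an integer block count to its class index (counts above the last bin are clipped)."""
    return int(min(count, len(BIN_VALUES) - 1))


def blockwise_density(logits):
    """Convert per-block logits of shape (H, W, n_bins) into a density map of shape (H, W).

    Each block's predicted count is the expectation of the bin values
    under that block's class probabilities.
    """
    probs = softmax(logits, axis=-1)
    return probs @ BIN_VALUES


# Illustrative input: a 2x2 grid of blocks with near-one-hot logits.
logits = np.zeros((2, 2, len(BIN_VALUES)))
logits[0, 0, count_to_bin(0)] = 50.0
logits[0, 1, count_to_bin(2)] = 50.0
logits[1, 0, count_to_bin(3)] = 50.0
logits[1, 1, count_to_bin(7)] = 50.0  # clipped into the "4 or more" bin

density = blockwise_density(logits)
total_count = density.sum()  # image-level count is the sum over blocks
```

Under these assumptions, summing the density map over all blocks gives the image-level count estimate, which is the quantity the reported MAE and RMSE figures evaluate.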

Results from the Paper


Task: Crowd Counting (all rows). Ranks are global leaderboard ranks.

Dataset            Model                       MAE     MAE Rank   RMSE     RMSE Rank
NWPU-Crowd (Val)   DMCount-EBC                 39.6    # 2        95.8     # 2
NWPU-Crowd (Val)   CLIP-EBC (ResNet50)         38.6    # 1        90.3     # 1
NWPU-Crowd (Val)   CSRNet-EBC                  42.9    # 3        100.1    # 3
ShanghaiTech A     DMCount-EBC                 62.3    # 13       98.9     # 3
ShanghaiTech A     CSRNet-EBC                  66.3    # 16       105.0    # 4
ShanghaiTech A     CLIP-EBC (ResNet50)         55.0    # 5        88.7     # 2
ShanghaiTech B     CLIP-EBC (ResNet50)         6.3     # 2        10.2     # 1
ShanghaiTech B     CSRNet-EBC                  6.9     # 8        11.3     # 3
ShanghaiTech B     DMCount-EBC                 7.0     # 10       10.9     # 2
UCF-QNRF           DMCount-EBC                 77.2    # 3        130.4    # 2
UCF-QNRF           CSRNet-EBC                  79.3    # 4        135.8    # 4
UCF-QNRF           DMCount-EBC (32, dynamic)   76.06   # 2        127.72   # 1
UCF-QNRF           DMCount-EBC (16, dynamic)   75.90   # 1        130.48   # 3
UCF-QNRF           CLIP-EBC (ResNet50)         80.5    # 5        136.6    # 5
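
For readers unfamiliar with the two metrics reported above, the sketch below shows how MAE (mean absolute error) and RMSE (root mean squared error) are conventionally computed from per-image predicted and ground-truth counts. The function names and the example counts are illustrative assumptions and are not taken from the paper or its datasets.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error between predicted and ground-truth counts."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Made-up per-image counts, for illustration only.
predicted_counts = [10.0, 20.0, 30.0]
actual_counts = [12.0, 20.0, 26.0]

example_mae = mae(predicted_counts, actual_counts)    # (2 + 0 + 4) / 3
example_rmse = rmse(predicted_counts, actual_counts)  # sqrt((4 + 0 + 16) / 3)
```

Because RMSE squares each error before averaging, it penalizes large per-image mistakes more heavily than MAE, which is why a model can rank differently on the two metrics for the same dataset.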

Methods


CLIP • EBC