Focal Loss for Dense Object Detection

The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.

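Since the abstract describes the loss only in words, the sketch below shows the focal-loss idea in PyTorch: the standard binary cross-entropy term is scaled by a modulating factor (1 - p_t)^gamma plus an alpha balancing weight, using the paper's default settings gamma = 2 and alpha = 0.25. This is a minimal illustration, not the Detectron implementation linked above; the function name and tensor shapes are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    logits:  raw per-anchor classification scores, shape (N,)
    targets: binary labels in {0., 1.} as floats, shape (N,)
    alpha, gamma: the paper's defaults (alpha = 0.25, gamma = 2).
    """
    p = torch.sigmoid(logits)
    # Plain cross entropy, i.e. -log(p_t), computed per example.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # Probability assigned to the ground-truth class.
    p_t = p * targets + (1 - p) * (1 - targets)
    # Class-balancing weight alpha_t.
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # Modulating factor down-weights well-classified (easy) examples.
    loss = alpha_t * (1 - p_t) ** gamma * ce
    # Normalize by the number of positive anchors, as described in the paper.
    return loss.sum() / targets.sum().clamp(min=1)
```

In a dense detector such as RetinaNet, `logits` would be the classification scores of every anchor across all feature-map locations, so the very large pool of easy background anchors contributes little to the summed loss and training is dominated by the hard examples.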

Results from the Paper


| Task | Dataset | Model | Metric | Value | Rank |
|---|---|---|---|---|---|
| Long-tail Learning | COCO-MLT | Focal Loss (ResNet-50) | Average mAP | 49.46 | #8 |
| Object Detection | COCO-O | RetinaNet (ResNet-50) | Average mAP | 16.6 | #39 |
| | | | Effective Robustness | 0.18 | #33 |
| Object Detection | COCO test-dev | RetinaNet (ResNeXt-101-FPN) | box mAP | 40.8 | #171 |
| | | | AP50 | 61.1 | #113 |
| | | | AP75 | 44.1 | #122 |
| | | | APS | 24.1 | #97 |
| | | | APM | 44.2 | #108 |
| | | | APL | 51.2 | #122 |
| | | | Hardware Burden | 4G | #1 |
| | | | Operations per network pass | None | #1 |
| Region Proposal | COCO test-dev | RPN + Focal Loss | AR100 | 50.2 | #3 |
| | | | AR1000 | 60.9 | #3 |
| | | | ARL | 67.5 | #3 |
| | | | ARM | 58.2 | #3 |
| | | | ARS | 33.9 | #2 |
| | | | AR300 | 56.6 | #2 |
| Object Detection | COCO test-dev | RetinaNet (ResNet-101-FPN) | box mAP | 39.1 | #191 |
| | | | AP50 | 59.1 | #132 |
| | | | AP75 | 42.3 | #132 |
| | | | APS | 21.8 | #122 |
| | | | APM | 42.7 | #119 |
| | | | APL | 50.2 | #131 |
| | | | Hardware Burden | 4G | #1 |
| | | | Operations per network pass | None | #1 |
| Long-tail Learning | EGTEA | Focal Loss (3D-ResNeXt-101) | Average Precision | 59.09 | #3 |
| | | | Average Recall | 59.17 | #3 |
| 2D Object Detection | SARDet-100K | RetinaNet | box mAP | 47.4 | #10 |
| Pedestrian Detection | TJU-Ped-campus | RetinaNet | R (miss rate) | 34.73 | #5 |
| | | | RS (miss rate) | 82.99 | #3 |
| | | | HO (miss rate) | 71.31 | #3 |
| | | | R+HO (miss rate) | 42.26 | #5 |
| | | | ALL (miss rate) | 44.34 | #5 |
| Pedestrian Detection | TJU-Ped-traffic | RetinaNet | R (miss rate) | 23.89 | #5 |
| | | | RS (miss rate) | 37.92 | #4 |
| | | | HO (miss rate) | 61.60 | #5 |
| | | | R+HO (miss rate) | 28.45 | #4 |
| | | | ALL (miss rate) | 41.40 | #5 |
| Face Verification | Trillion Pairs Dataset | F-Softmax | Accuracy | 37.14 | #5 |
| Face Identification | Trillion Pairs Dataset | F-Softmax | Accuracy | 39.80 | #5 |
| Long-tail Learning | VOC-MLT | Focal Loss (ResNet-50) | Average mAP | 73.88 | #10 |

Results from Other Papers


| Task | Dataset | Model | Metric | Value | Rank |
|---|---|---|---|---|---|
| Object Counting | CARPK | RetinaNet (2018) | MAE | 24.58 | #11 |
| Dense Object Detection | SKU-110K | RetinaNet | AP | 45.5 | #5 |
| | | | AP75 | 0.389 | #1 |

Methods