A Method for Curation of Web-Scraped Face Image Datasets

7 Apr 2020 · Kai Zhang, Vítor Albiero, Kevin W. Bowyer

Web-scraped, in-the-wild datasets have become the norm in face recognition research. The number of subjects and images in such datasets is usually very large, with image counts running into the millions. A variety of issues arise when collecting a dataset in the wild, including images with the wrong identity label, duplicate images, duplicate subjects, and variation in quality. With millions of images, manual cleaning is not feasible, and the fully automated methods used to date leave datasets at a less-than-ideal level of cleanliness. We propose a semi-automated method whose goal is a clean dataset for testing face recognition methods, with similar quality across men and women, to support comparison of accuracy across gender. Our approach removes near-duplicate images, merges duplicate subjects, corrects mislabeled images, and removes images outside a defined range of pose and quality. We apply the curation to the Asian Face Dataset (AFD) and the VGGFace2 test dataset. The experiments show that a state-of-the-art method achieves much higher accuracy on the datasets after they are curated. Finally, we release our cleaned versions of both datasets to the research community.
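To make the curation steps concrete, below is a minimal sketch of one of them, near-duplicate removal, assuming face embeddings have already been extracted with some face matcher. The greedy logic, the `remove_near_duplicates` function name, and the 0.95 cosine-similarity threshold are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of near-duplicate removal within a subject, assuming precomputed
# face embeddings (e.g. 512-d features from a deep face matcher).
# The threshold value and greedy strategy are assumptions for illustration.
import numpy as np


def l2_normalize(embeddings: np.ndarray) -> np.ndarray:
    """Normalize each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)


def remove_near_duplicates(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedily keep images whose cosine similarity to every already-kept
    image of the same subject stays below `threshold`; return kept indices."""
    feats = l2_normalize(embeddings)
    kept: list[int] = []
    for i in range(feats.shape[0]):
        if all(float(feats[i] @ feats[j]) < threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy example: four random "embeddings", with image 1 an exact copy of image 0.
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(4, 512))
    emb[1] = emb[0]
    print(remove_near_duplicates(emb))  # image 1 is dropped as a near-duplicate
```

The other steps described in the abstract (merging duplicate subjects, correcting mislabeled images, filtering by pose and quality) would similarly operate on embedding similarities plus per-image pose and quality estimates, with a human in the loop for borderline cases.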


Datasets


Introduced in the Paper:

Curated AFD

Used in the Paper:

LFW

VGGFace2
