ParsVQA-Caps: A Benchmark for Visual Question Answering and Image Captioning in Persian
Despite recent advances in vision-and-language tasks, most progress still focuses on resource-rich languages such as English. Furthermore, widely used vision-and-language datasets directly adopt images representative of American or European cultures, resulting in bias. Hence, we introduce ParsVQA-Caps, the first benchmark in Persian for the Visual Question Answering and Image Captioning tasks. We use two collection methods for each task: human-based and template-based for VQA, and human-based and web-based for image captioning.
Datasets
Introduced in the Paper:
ParsVQA-Caps