The dataset offers tag and mask annotations for image-text pairs from the CC3M validation set. Tag annotations denote words that aptly describe the relationship between the image and the corresponding text. These annotations provide valuable insights into the semantic connection between each pair's visual and textual elements.
5 PAPERS • 2 BENCHMARKS
The LLM-Seg40K dataset contains 14K images in total. It is divided into training, validation, and test sets containing 11K, 1K, and 2K images, respectively. In the training split, each image has 3.95 questions on average, and the average question length is 15.2 words. The training set contains 1458 different categories in total.
1 PAPER • NO BENCHMARKS YET