1 code implementation • 12 Jul 2023 • Gengyuan Zhang, Yurui Zhang, Kerui Zhang, Volker Tresp
This makes us wonder whether Vision-Language Models pre-trained on large-scale image-text resources can, based on visual cues, match or even outperform human capability in times and location reasoning.