Max-Shot Cross-Lingual Visual Natural Language Inference