Instance-aware Image and Sentence Matching with Selective Multimodal LSTM

Effective image and sentence matching depends on how to well measure their global visual-semantic similarity. Based on the observation that such a global similarity arises from a complex aggregation of multiple local similarities between pairwise instances of image (objects) and sentence (words), we propose a selective multimodal Long Short-Term Memory network (sm-LSTM) for instance-aware image and sentence matching... (read more)

Results in Papers With Code
(↓ scroll down to see all results)