Visual Keyword Spotting with Attention

29 Oct 2021 · K R Prajwal, Liliane Momeni, Triantafyllos Afouras, Andrew Zisserman

In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms prior state-of-the-art visual keyword spotting and lip reading methods on the challenging LRW, LRS2, and LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
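
For intuition, the sketch below shows one way such a two-stream model can be wired up in PyTorch: visual features and keyword phoneme embeddings are concatenated into a single token sequence so that a standard Transformer encoder applies full cross-modal attention over both streams, and a per-frame head scores where the keyword occurs. All module names, dimensions, and head choices here are illustrative assumptions, not the authors' exact Transpotter configuration.

```python
# Minimal sketch (PyTorch) of a two-stream Transformer keyword spotter.
# Dimensions, layer counts, and the localisation head are illustrative
# assumptions, not the published Transpotter configuration.
import torch
import torch.nn as nn


class KeywordSpotter(nn.Module):
    def __init__(self, vid_dim=512, n_phones=40, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, d_model)        # project frame-level visual features
        self.phone_emb = nn.Embedding(n_phones, d_model)   # embed the keyword's phonemes
        self.type_emb = nn.Embedding(2, d_model)           # stream identity: video vs. phonetic
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # joint attention over both streams
        self.frame_head = nn.Linear(d_model, 1)                # per-frame keyword localisation score

    def forward(self, video_feats, phone_ids):
        # video_feats: (B, T, vid_dim) visual features from a lip-reading backbone
        # phone_ids:   (B, P) phoneme indices of the query keyword
        v = self.vid_proj(video_feats) + self.type_emb.weight[0]
        p = self.phone_emb(phone_ids) + self.type_emb.weight[1]
        x = torch.cat([v, p], dim=1)   # one token sequence -> full cross-modal attention
        x = self.encoder(x)
        # score only the video tokens; high values mark frames containing the keyword
        return self.frame_head(x[:, : video_feats.size(1)]).squeeze(-1)  # (B, T)


# Usage: score a 6-phoneme keyword query against a 75-frame clip
model = KeywordSpotter()
scores = model(torch.randn(1, 75, 512), torch.randint(0, 40, (1, 6)))
print(scores.shape)  # torch.Size([1, 75])
```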


Results from the Paper


Task                     Dataset    Model        Metric           Value   Global Rank
Visual Keyword Spotting  LRS2       Transpotter  Top-1 Accuracy   65      #1
Visual Keyword Spotting  LRS2       Transpotter  Top-5 Accuracy   87.1    #1
Visual Keyword Spotting  LRS2       Transpotter  mAP              69.2    #1
Visual Keyword Spotting  LRS2       Transpotter  mAP IOU@0.5      68.3    #1
Visual Keyword Spotting  LRS3-TED   Transpotter  Top-1 Accuracy   52      #1
Visual Keyword Spotting  LRS3-TED   Transpotter  Top-5 Accuracy   77.1    #1
Visual Keyword Spotting  LRS3-TED   Transpotter  mAP              55.4    #1
Visual Keyword Spotting  LRS3-TED   Transpotter  mAP IOU@0.5      53.6    #1
Visual Keyword Spotting  LRW        Transpotter  Top-1 Accuracy   85.8    #1
Visual Keyword Spotting  LRW        Transpotter  Top-5 Accuracy   99.6    #1
Visual Keyword Spotting  LRW        Transpotter  mAP              64.1    #1
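
For reference, the mAP IOU@0.5 rows score a predicted keyword occurrence as correct only if its temporal window overlaps the ground-truth window with an intersection-over-union of at least 0.5. Below is a minimal sketch of that overlap check; the (start_frame, end_frame) interval format is an assumption for illustration.

```python
# Temporal-IoU check behind a metric like "mAP IOU@0.5": a predicted keyword
# interval counts as correct only if its overlap with the ground truth is >= 0.5.
# The (start_frame, end_frame) interval format is assumed for illustration.
def temporal_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((10, 20), (12, 22)))          # 0.666...
print(temporal_iou((10, 20), (12, 22)) >= 0.5)   # True -> counts at IoU@0.5
```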
