Visual Keyword Spotting with Attention

29 Oct 2021 · K R Prajwal, Liliane Momeni, Triantafyllos Afouras, Andrew Zisserman

In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms prior state-of-the-art visual keyword spotting and lip reading methods on the challenging LRW, LRS2, and LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
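
For intuition, the sketch below shows one way such a two-stream model can be wired up in PyTorch: visual features and keyword phoneme embeddings are concatenated into a single token sequence so that a standard Transformer encoder applies full cross-modal attention over both streams, and a per-frame head scores where the keyword occurs. All module names, dimensions, and head choices here are illustrative assumptions, not the authors' exact Transpotter configuration.

```python
# Minimal sketch (PyTorch) of a two-stream Transformer keyword spotter.
# Dimensions, layer counts, and the localisation head are illustrative
# assumptions, not the published Transpotter configuration.
import torch
import torch.nn as nn


class KeywordSpotter(nn.Module):
    def __init__(self, vid_dim=512, n_phones=40, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, d_model)        # project frame-level visual features
        self.phone_emb = nn.Embedding(n_phones, d_model)   # embed the keyword's phonemes
        self.type_emb = nn.Embedding(2, d_model)           # stream identity: video vs. phonetic
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # joint attention over both streams
        self.frame_head = nn.Linear(d_model, 1)                # per-frame keyword localisation score

    def forward(self, video_feats, phone_ids):
        # video_feats: (B, T, vid_dim) visual features from a lip-reading backbone
        # phone_ids:   (B, P) phoneme indices of the query keyword
        v = self.vid_proj(video_feats) + self.type_emb.weight[0]
        p = self.phone_emb(phone_ids) + self.type_emb.weight[1]
        x = torch.cat([v, p], dim=1)   # one token sequence -> full cross-modal attention
        x = self.encoder(x)
        # score only the video tokens; high values mark frames containing the keyword
        return self.frame_head(x[:, : video_feats.size(1)]).squeeze(-1)  # (B, T)


# Usage: score a 6-phoneme keyword query against a 75-frame clip
model = KeywordSpotter()
scores = model(torch.randn(1, 75, 512), torch.randint(0, 40, (1, 6)))
print(scores.shape)  # torch.Size([1, 75])
```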


Results from the Paper


Task                     Dataset    Model        Metric           Value   Global Rank
Visual Keyword Spotting  LRS2       Transpotter  Top-1 Accuracy   65      #1
Visual Keyword Spotting  LRS2       Transpotter  Top-5 Accuracy   87.1    #1
Visual Keyword Spotting  LRS2       Transpotter  mAP              69.2    #1
Visual Keyword Spotting  LRS2       Transpotter  mAP IOU@0.5      68.3    #1
Visual Keyword Spotting  LRS3-TED   Transpotter  Top-1 Accuracy   52      #1
Visual Keyword Spotting  LRS3-TED   Transpotter  Top-5 Accuracy   77.1    #1
Visual Keyword Spotting  LRS3-TED   Transpotter  mAP              55.4    #1
Visual Keyword Spotting  LRS3-TED   Transpotter  mAP IOU@0.5      53.6    #1
Visual Keyword Spotting  LRW        Transpotter  Top-1 Accuracy   85.8    #1
Visual Keyword Spotting  LRW        Transpotter  Top-5 Accuracy   99.6    #1
Visual Keyword Spotting  LRW        Transpotter  mAP              64.1    #1
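
For reference, the mAP IOU@0.5 rows score a predicted keyword occurrence as correct only if its temporal window overlaps the ground-truth window with an intersection-over-union of at least 0.5. Below is a minimal sketch of that overlap check; the (start_frame, end_frame) interval format is an assumption for illustration.

```python
# Temporal-IoU check behind a metric like "mAP IOU@0.5": a predicted keyword
# interval counts as correct only if its overlap with the ground truth is >= 0.5.
# The (start_frame, end_frame) interval format is assumed for illustration.
def temporal_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((10, 20), (12, 22)))          # 0.666...
print(temporal_iou((10, 20), (12, 22)) >= 0.5)   # True -> counts at IoU@0.5
```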
