TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Visual Storytelling	VIST	ViT-model	BLEU-1	63	# 10
Visual Storytelling	VIST	ViT-model	BLEU-2	37.5	# 11
Visual Storytelling	VIST	ViT-model	BLEU-3	21.5	# 13
Visual Storytelling	VIST	ViT-model	BLEU-4	12.3	# 25
Visual Storytelling	VIST	ViT-model	METEOR	35.4	# 16
Visual Storytelling	VIST	ViT-model	CIDEr	4.4	# 29
Visual Storytelling	VIST	ViT-model	ROUGE-L	31	# 2

Vision Transformer Based Model for Describing a Set of Images as a Story

6 Oct 2022 · Zainy M. Malakan, Ghulam Mubashar Hassan, Ajmal Mian

Visual storytelling is the process of forming a multi-sentence story from a set of images. Appropriately including the visual variation and contextual information captured in the input images is one of its most challenging aspects. Consequently, stories developed from a set of images often lack cohesiveness, relevance, and semantic relationships. In this paper, we propose a novel Vision Transformer based model for describing a set of images as a story. The proposed method extracts distinct features from the input images using a Vision Transformer (ViT). First, each input image is divided into 16×16 patches, which are flattened and passed through a linear projection. This transformation from a single image to multiple image patches captures the visual variety of the input visual patterns. The resulting features are fed to a bidirectional LSTM, which is part of the sequence encoder and captures the past and future image context of all image patches. An attention mechanism then increases the discriminatory capacity of the data fed into the language model, a Mogrifier-LSTM. We evaluate the proposed model on the Visual Storytelling dataset (VIST), and the results show that it outperforms current state-of-the-art models.
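
The patch step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code (none has been released). It assumes that "16×16" refers to the patch size in pixels, as in the standard ViT formulation, and the names `split_into_patches` and `linear_projection` are hypothetical helpers introduced only for illustration. The sketch uses plain Python lists so it runs without any dependencies; a real implementation would use a tensor library and learned weights.

```python
from typing import List

Image = List[List[List[float]]]  # height x width x channels


def split_into_patches(image: Image, patch_size: int = 16) -> List[List[float]]:
    """Split an H x W x C image into non-overlapping square patches.

    Each patch is flattened into one vector of length
    patch_size * patch_size * C. Patches are returned in row-major order
    (left to right, then top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(flat)
    return patches


def linear_projection(
    patches: List[List[float]], weights: List[List[float]]
) -> List[List[float]]:
    """Project each flattened patch with a (patch_dim x embed_dim) matrix.

    The weights are an assumption here: in a trained ViT they are learned.
    """
    columns = list(zip(*weights))  # one tuple of patch_dim weights per output unit
    return [
        [sum(p * w for p, w in zip(patch, column)) for column in columns]
        for patch in patches
    ]
```

Under these assumptions, a 224×224 RGB input (a common ViT resolution, not stated in the abstract) would yield 196 patches of 768 values each. The projected patch vectors would then form the sequence passed on to the bidirectional LSTM encoder.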

Code

No code implementations yet.

Tasks

Language Modelling

Sentence

Visual Storytelling

Datasets

VIST

Results from the Paper

Ranked #25 on Visual Storytelling on VIST

Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Vision Transformer
