SACT: Self-Aware Multi-Space Feature Composition Transformer for Multinomial Attention for Video Captioning

Video captioning rests on two fundamental concepts: feature detection and feature composition. While modern transformers are effective at composing features, they do not address the fundamental problems of selecting and understanding the contents...
