VTC is a large-scale multimodal dataset containing ~300k video-caption pairs alongside comments, which can be used for multimodal representation learning.
2 PAPERS • NO BENCHMARKS YET
SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset.
1 PAPER • NO BENCHMARKS YET