We propose a new model for speaker naming in movies that leverages visual, textual, and acoustic modalities in a unified optimization framework. To evaluate the performance of our model, we introduce a new dataset consisting of six episodes of The Big Bang Theory TV show and eighteen full movies covering different genres...
| METHOD | TYPE |
|---|---|
| | Working Memory Models |