Video Alignment
21 papers with code • 2 benchmarks • 4 datasets
Latest papers with no code
Listen Then See: Video Alignment with Speaker Attention
Our approach exhibits an improved ability to leverage the video modality by using the audio modality as a bridge with the language modality.
AniClipart: Clipart Animation with Text-to-Video Priors
To generate cartoon-style and smooth motion, we first define Bézier curves over keypoints of the clipart image as a form of motion regularization.
Scaling Up Video Summarization Pretraining with Large Language Models
Long-form video content constitutes a significant portion of internet traffic, making automated video summarization an essential research problem.
The Effects of Short Video-Sharing Services on Video Copy Detection
From the experimental results focusing on segment-level and video-level situations, we can see three effects: "Segment-level VCD in short video-sharing services is more difficult than in general video-sharing services", "Video-level VCD in short video-sharing services is easier than in general video-sharing services", and "The video alignment component mainly suppresses the detection performance in short video-sharing services".
CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility
To this end, this paper proposes a novel text-guided video inpainting model that achieves better consistency, controllability and compatibility.
FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing
By leveraging the self-consistency property of CMs, we eliminate the need for time-consuming inversion or additional condition extraction, reducing editing time.
Towards A Better Metric for Text-to-Video Generation
Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore in offering a better metric for text-to-video generation.
STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment
Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks in our ever-evolving world.
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment
Nonetheless, the objective of the text-to-video retrieval task is to capture the complementary audio and video information that is pertinent to the text query rather than simply achieving better audio and video alignment.
ContentCTR: Frame-level Live Streaming Click-Through Rate Prediction with Multimodal Transformer
However, most previous works treat the live stream as a whole item and explore the Click-Through Rate (CTR) prediction framework at the item level, neglecting the dynamic changes that occur even within the same live room.