no code implementations • ICCV 2023 • Jiang-Tian Zhai, Qi Zhang, Tong Wu, Xing-Yu Chen, Jiang-Jiang Liu, Ming-Ming Cheng
By aggregating vision-language information, the region filter selects key regions and the region adaptor updates their coordinates with text guidance.
no code implementations • 28 Nov 2022 • Jiang-Tian Zhai, Qi Zhang, Tong Wu, Xing-Yu Chen, Jiang-Jiang Liu, Bo Ren, Ming-Ming Cheng
By aggregating cross-modal information, the region filter selects key regions and the region adaptor updates their coordinates with text guidance.
no code implementations • 22 Jan 2022 • Xing-Yu Chen, Qiu-Shi Zhu, Jie Zhang, Li-Rong Dai
By training the network on the acoustic signals of each task separately, we build an individual model for each of the three tasks. Their parameters are then averaged to obtain an average model, which serves as the initialization for the BiLSTM model training of each task.
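The parameter-averaging step described in this snippet can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter names (`w`, `b`), the toy values, and the helper `average_parameters` are all hypothetical, and plain Python lists stand in for real BiLSTM weight tensors. It assumes only that the three per-task models share the same architecture, so their parameters can be averaged element-wise.

```python
from typing import Dict, List

# A "model" here is just a mapping from parameter name to a flat list of
# floats. This is a stand-in for real BiLSTM weights (assumption).
Params = Dict[str, List[float]]


def average_parameters(models: List[Params]) -> Params:
    """Return the element-wise mean of identically shaped parameter dicts."""
    if not models:
        raise ValueError("need at least one model to average")
    averaged: Params = {}
    for name in models[0]:
        # Collect the same parameter from every model, then average per position.
        vectors = [m[name] for m in models]
        averaged[name] = [sum(vals) / len(models) for vals in zip(*vectors)]
    return averaged


# Hypothetical parameters of three models, each trained on one task.
task_models = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [3.0]},
    {"w": [5.0, 6.0], "b": [6.0]},
]

# The average model.
init = average_parameters(task_models)

# Each task starts its own training from an independent copy of the average.
task_inits = [{k: list(v) for k, v in init.items()} for _ in task_models]
```

Copying the averaged parameters per task keeps the three subsequent training runs independent, so updating one task's model does not change the starting point of the others.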