1 code implementation • 2 Dec 2023 • Cheng-Fu Yang, Haoyang Xu, Te-Lin Wu, Xiaofeng Gao, Kai-Wei Chang, Feng Gao
In this paper, we aim to tackle this problem with a unified framework consisting of an end-to-end trainable method and a planning algorithm.
1 code implementation • 18 Oct 2023 • Cheng-Fu Yang, Yen-Chun Chen, Jianwei Yang, Xiyang Dai, Lu Yuan, Yu-Chiang Frank Wang, Kai-Wei Chang
Additional analysis shows that the contrastive objective and meta-actions are complementary in achieving the best results, and the resulting agent better aligns its states with corresponding instructions, making it more suitable for real-world embodied agents.
no code implementations • 26 Nov 2022 • Wan-Cyuan Fan, Cheng-Fu Yang, Chiao-An Yang, Yu-Chiang Frank Wang
We tackle the problem of target-free text-guided image manipulation, which requires one to modify the input reference image based on the given text instruction, while no ground truth target image is observed during training.
no code implementations • 25 Sep 2022 • Cheng-Fu Yang, Yao-Hung Hubert Tsai, Wan-Cyuan Fan, Ruslan Salakhutdinov, Louis-Philippe Morency, Yu-Chiang Frank Wang
Since no ground truth captions are available for novel object images during training, our P2C leverages cross-modality (image-text) association modules to ensure the above caption characteristics can be properly preserved.
no code implementations • CVPR 2022 • Chiao-An Yang, Cheng-Yo Tan, Wan-Cyuan Fan, Cheng-Fu Yang, Meng-Lin Wu, Yu-Chiang Frank Wang
In particular, we propose a novel network of Scene Graph Transformer (SGT), which is designed to take node and edge features as inputs for modeling the associated structural information.
no code implementations • 29 Sep 2021 • Cheng-Fu Yang, Yao-Hung Hubert Tsai, Wan-Cyuan Fan, Yu-Chiang Frank Wang, Louis-Philippe Morency, Ruslan Salakhutdinov
Novel object captioning (NOC) learns image captioning models for describing objects or visual concepts which are unseen (i. e., novel) in the training captions.
1 code implementation • CVPR 2021 • Cheng-Fu Yang, Wan-Cyuan Fan, Fu-En Yang, Yu-Chiang Frank Wang
To better exploit the text input, so that implicit objects or relationships can be properly inferred during layout generation, we propose a LayoutTransformer Network (LT-Net) in this paper.
no code implementations • 1 Jan 2021 • Cheng-Fu Yang, Wan-Cyuan Fan, Fu-En Yang, Yu-Chiang Frank Wang
In the areas of machine learning and computer vision, text-to-image synthesis aims at producing image outputs given the input text.