1 code implementation • 14 Apr 2024 • Ya-Qi Yu, Minghui Liao, Jihao Wu, Yongxin Liao, Xiaoyu Zheng, Wei Zeng
We conduct extensive experiments on both general and document-oriented MLLM benchmarks, and show that TextHawk outperforms the state-of-the-art methods, demonstrating its effectiveness and superiority in fine-grained document perception and general abilities.
1 code implementation • 5 Mar 2024 • Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, Duyu Tang
To address this, this work presents Chain-of-Action-Thought (dubbed CoAT), which takes the description of the previous actions, the current screen, and more importantly the action thinking of what actions should be performed and the outcomes led by the chosen action.
no code implementations • 14 Dec 2023 • Tao Hu, Honglong Zhang, Fan Zeng, Min Du, XiangKun Du, Yue Zheng, Quanqi Li, Mengran Zhang, Dan Yang, Jihao Wu
However, temporal and spatial dimensions are extremely critical in the logistics field, and this limitation may directly affect the precision of subsidy and pricing strategies.
no code implementations • 27 Oct 2023 • Chaowei Liu, Jichun Li, Yihua Teng, Chaoqun Wang, Nuo Xu, Jihao Wu, Dandan Tu
Thus, we propose DocStormer, a novel algorithm designed to restore multi-degraded colored documents to their potential pristine PDF.
no code implementations • 18 Dec 2022 • Ning Wang, Jiangrong Xie, Hang Luo, Qinglin Cheng, Jihao Wu, Mingbo Jia, Linlin Li
On the other hand, we transfer the image-text retrieval design of CLIP to image captioning scenarios by devising a novel visual concept extractor and a cross-modal modulator.
no code implementations • 4 Dec 2022 • Ning Wang, Jiahao Xie, Jihao Wu, Mingbo Jia, Linlin Li
Despite the remarkable progress of image captioning, existing captioners typically lack the controllable capability to generate desired image captions, e. g., describing the image in a rough or detailed manner, in a factual or emotional view, etc.