PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese
In this paper we present a novel scheme for multimodal learning named the Parallel Attention mechanism. In addition, to exploit the grammatical and contextual characteristics of Vietnamese, we propose the Hierarchical Linguistic Features Extractor as a replacement for the LSTM network commonly used to extract linguistic features. Building on these two novel modules, we introduce the Parallel Attention Transformer (PAT), which achieves the best accuracy on the benchmark ViVQA dataset, outperforming all baselines as well as other state-of-the-art methods, including SAAA and MCAN.
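The abstract does not spell out the architecture, but the idea of attending over the two modalities in parallel can be illustrated with a short sketch. The PyTorch snippet below is only a rough illustration under our own assumptions: the class name ParallelAttentionBlock, the symmetric cross-attention between image regions and question tokens, and all dimensions are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ParallelAttentionBlock(nn.Module):
    """Toy sketch: attend over visual and linguistic features in two parallel
    branches, then fuse the pooled outputs (assumed design, not the paper's code)."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # Each branch cross-attends to the other modality (assumption).
        self.vis_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vis_norm = nn.LayerNorm(d_model)
        self.txt_norm = nn.LayerNorm(d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, vis, txt):
        # vis: (B, Nv, d_model) region features; txt: (B, Nt, d_model) token features.
        v, _ = self.vis_attn(vis, txt, txt)   # visual branch queries the question
        t, _ = self.txt_attn(txt, vis, vis)   # linguistic branch queries the image
        v = self.vis_norm(vis + v)            # residual connection + layer norm
        t = self.txt_norm(txt + t)
        # Mean-pool each branch and fuse into a joint representation.
        joint = torch.cat([v.mean(dim=1), t.mean(dim=1)], dim=-1)
        return self.fuse(joint)               # (B, d_model)

# Example usage with random features.
block = ParallelAttentionBlock()
vis = torch.randn(2, 36, 512)   # e.g. 36 detected regions per image
txt = torch.randn(2, 20, 512)   # e.g. 20 question tokens
print(block(vis, txt).shape)    # torch.Size([2, 512])
```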
Methods
Absolute Position Encodings, Adam, BPE, Dense Connections, Dropout, Label Smoothing, Layer Normalization, Linear Layer, LSTM, Multi-Head Attention, Position-Wise Feed-Forward Layer, Residual Connection, Scaled Dot-Product Attention, Sigmoid Activation, Softmax, Tanh Activation, Transformer
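Several of the listed components (Scaled Dot-Product Attention, Multi-Head Attention, Softmax) follow the standard Transformer formulation. For quick reference, a minimal implementation of scaled dot-product attention is sketched below; it is the textbook definition, softmax(QK^T / sqrt(d_k)) V, not code from the paper.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Standard scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., Lq, Lk)
    weights = torch.softmax(scores, dim=-1)            # attention distribution
    return weights @ v                                  # (..., Lq, d_v)

# Example: a batch of 2, with 4 query positions attending over 6 key/value positions.
q = torch.randn(2, 4, 64)
k = torch.randn(2, 6, 64)
v = torch.randn(2, 6, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 4, 64])
```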