Search Results for author: Fajri Koto

Found 30 papers, 18 papers with code

Cloze Evaluation for Deeper Understanding of Commonsense Stories in Indonesian

1 code implementation • CSRR (ACL) 2022 • Fajri Koto, Timothy Baldwin, Jey Han Lau

Story comprehension that involves complex causal and temporal relations is a critical task in NLP, but previous studies have focused predominantly on English, leaving open the question of how the findings generalize to other languages, such as Indonesian.

Cloze Test Sentence +1

Handling Variance of Pretrained Language Models in Grading Evidence in the Medical Literature

no code implementations • ALTA 2021 • Fajri Koto, Biaoyan Fang

In this paper, we investigate the utility of modern pretrained language models for the evidence grading system in the medical literature based on the ALTA 2021 shared task.

Easy-First Bottom-Up Discourse Parsing via Sequence Labelling

no code implementations • COLING (CODI, CRAC) 2022 • Andrew Shen, Fajri Koto, Jey Han Lau, Timothy Baldwin

We propose a novel unconstrained bottom-up approach for rhetorical discourse parsing based on sequence labelling of adjacent pairs of discourse units (DUs), based on the framework of Koto et al. (2021).

Discourse Parsing

Can Pretrained Language Models Generate Persuasive, Faithful, and Informative Ad Text for Product Descriptions?

no code implementations • ECNLP (ACL) 2022 • Fajri Koto, Jey Han Lau, Timothy Baldwin

For any e-commerce service, persuasive, faithful, and informative product descriptions can attract shoppers and improve sales.

Text Generation

LipKey: A Large-Scale News Dataset for Absent Keyphrases Generation and Abstractive Summarization

no code implementations • COLING 2022 • Fajri Koto, Timothy Baldwin, Jey Han Lau

Summaries, keyphrases, and titles are different ways of concisely capturing the content of a document.

Abstractive Text Summarization Document Summarization

Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

no code implementations • 9 Apr 2024 • Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, Jhonson Lee, Nuur Shadieq, Wawan Cenggoro, Salsabil Maulana Akbar, Muhammad Ihza Mahendra, Dea Annisayanti Putri, Bryan Wilie, Genta Indra Winata, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung

To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures across a range of model sizes.

IndoCulture: Exploring Geographically-Influenced Cultural Commonsense Reasoning Across Eleven Indonesian Provinces

no code implementations • 2 Apr 2024 • Fajri Koto, Rahmad Mahendra, Nurul Aisyah, Timothy Baldwin

Although commonsense reasoning is greatly shaped by cultural and geographical factors, previous studies on language models have predominantly centered on English cultures, potentially resulting in an Anglocentric bias.

Language Modelling

ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic

1 code implementation • 20 Feb 2024 • Fajri Koto, Haonan Li, Sara Shatnawi, Jad Doughman, Abdelrahman Boda Sadallah, Aisha Alraeesi, Khalid Almubarak, Zaid Alyafeai, Neha Sengupta, Shady Shehata, Nizar Habash, Preslav Nakov, Timothy Baldwin

The focus of language model evaluation has transitioned towards reasoning and knowledge-intensive tasks, driven by advancements in pretraining large models.

Language Modelling Multiple-choice +1

Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon

no code implementations • 3 Feb 2024 • Fajri Koto, Tilman Beck, Zeerak Talat, Iryna Gurevych, Timothy Baldwin

Improving the capabilities of multilingual language models in low-resource languages is generally difficult due to the scarcity of large-scale data in those languages.

Sentence Sentiment Analysis

LLM360: Towards Fully Transparent Open-Source LLMs

1 code implementation • 11 Dec 2023 • Zhengzhong Liu, Aurick Qiao, Willie Neiswanger, Hongyi Wang, Bowen Tan, Tianhua Tao, Junbo Li, Yuqi Wang, Suqi Sun, Omkar Pangarkar, Richard Fan, Yi Gu, Victor Miller, Yonghao Zhuang, Guowei He, Haonan Li, Fajri Koto, Liping Tang, Nikhil Ranjan, Zhiqiang Shen, Xuguang Ren, Roberto Iriondo, Cun Mu, Zhiting Hu, Mark Schulze, Preslav Nakov, Tim Baldwin, Eric P. Xing

The recent surge in open-source Large Language Models (LLMs), such as LLaMA, Falcon, and Mistral, provides diverse options for AI practitioners and researchers.

Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU

1 code implementation • 7 Oct 2023 • Fajri Koto, Nurul Aisyah, Haonan Li, Timothy Baldwin

In this work, we introduce IndoMMLU, the first multi-task language understanding benchmark for Indonesian culture and languages, which consists of questions from primary school to university entrance exams in Indonesia.

Multi-task Language Understanding World Knowledge

NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

1 code implementation • 19 Sep 2023 • Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Dea Adhista, Emmanuel Dave, Sarah Oktavianti, Salsabil Maulana Akbar, Jhonson Lee, Nuur Shadieq, Tjeng Wawan Cenggoro, Hanung Wahyuning Linuwih, Bryan Wilie, Galih Pradipta Muridan, Genta Indra Winata, David Moeljadi, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung

We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.

Document Translation Translation

Are Multilingual LLMs Culturally-Diverse Reasoners? An Investigation into Multicultural Proverbs and Sayings

1 code implementation • 15 Sep 2023 • Chen Cecilia Liu, Fajri Koto, Timothy Baldwin, Iryna Gurevych

Large language models (LLMs) are highly adept at question answering and reasoning tasks, but when reasoning in a situational context, human expectations vary depending on the relevant cultural common ground.

Question Answering

Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models

no code implementations • 30 Aug 2023 • Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, Alham Fikri Aji, Zhiqiang Shen, Zhengzhong Liu, Natalia Vassilieva, Joel Hestness, Andy Hock, Andrew Feldman, Jonathan Lee, Andrew Jackson, Hector Xuguang Ren, Preslav Nakov, Timothy Baldwin, Eric Xing

We release two open versions of the model -- the foundation Jais model, and an instruction-tuned Jais-chat variant -- with the aim of promoting research on Arabic LLMs.

CMMLU: Measuring massive multitask language understanding in Chinese

1 code implementation • 15 Jun 2023 • Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, Timothy Baldwin

As the capabilities of large language models (LLMs) continue to advance, evaluating their performance becomes increasingly crucial and challenging.

Large Language Model

Bactrian-X: Multilingual Replicable Instruction-Following Models with Low-Rank Adaptation

1 code implementation • 24 May 2023 • Haonan Li, Fajri Koto, Minghao Wu, Alham Fikri Aji, Timothy Baldwin

However, research on multilingual instruction tuning has been limited due to the scarcity of high-quality instruction-response datasets across different languages.

Instruction Following

NusaCrowd: Open Source Initiative for Indonesian NLP Resources

1 code implementation • 19 Dec 2022 • Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Fajri Koto, Jennifer Santoso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Ivan Halim Parmonangan, Ika Alfina, Muhammad Satrio Wicaksono, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali Akbar Septiandri, James Jaya, Kaustubh D. Dhole, Arie Ardiyanti Suryani, Rifki Afina Putri, Dan Su, Keith Stevens, Made Nindyatama Nityasya, Muhammad Farid Adilazuarda, Ryan Ignatius, Ryandito Diandaru, Tiezheng Yu, Vito Ghifari, Wenliang Dai, Yan Xu, Dyah Damapuspita, Cuk Tho, Ichwanul Muslim Karo Karo, Tirana Noor Fatyanosa, Ziwei Ji, Pascale Fung, Graham Neubig, Timothy Baldwin, Sebastian Ruder, Herry Sujaini, Sakriani Sakti, Ayu Purwarianti

We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources.

Automatic Speech Recognition (ASR) +1

NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages

no code implementations • 21 Jul 2022 • Samuel Cahyawijaya, Alham Fikri Aji, Holy Lovenia, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Fajri Koto, David Moeljadi, Karissa Vincentio, Ade Romadhony, Ayu Purwarianti

At the center of the underlying issues that halt Indonesian natural language processing (NLP) research advancement, we find data scarcity.

NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

2 code implementations • 31 May 2022 • Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau, Rico Sennrich, Sebastian Ruder

In this work, we focus on developing resources for languages in Indonesia.

Machine Translation Translation

One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia

no code implementations • ACL 2022 • Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya, Ade Romadhony, Rahmad Mahendra, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Timothy Baldwin, Jey Han Lau, Sebastian Ruder

NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects.

IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization

1 code implementation • EMNLP 2021 • Fajri Koto, Jey Han Lau, Timothy Baldwin

We present IndoBERTweet, the first large-scale pretrained model for Indonesian Twitter that is trained by extending a monolingually-trained Indonesian BERT model with additive domain-specific vocabulary.

Language Modelling

Evaluating the Efficacy of Summarization Evaluation across Languages

1 code implementation • Findings (ACL) 2021 • Fajri Koto, Jey Han Lau, Timothy Baldwin

We take a summarization corpus for eight different languages, and manually annotate generated summaries for focus (precision) and coverage (recall).

Discourse Probing of Pretrained Language Models

1 code implementation • NAACL 2021 • Fajri Koto, Jey Han Lau, Timothy Baldwin

Existing work on probing of pretrained language models (LMs) has predominantly focused on sentence-level syntactic tasks.

Sentence

Top-down Discourse Parsing via Sequence Labelling

1 code implementation • EACL 2021 • Fajri Koto, Jey Han Lau, Timothy Baldwin

We introduce a top-down approach to discourse parsing that is conceptually simpler than its predecessors (Kobayashi et al., 2020; Zhang et al., 2020).

Ranked #7 on Discourse Parsing on RST-DT (Standard Parseval (Span) metric)

Discourse Parsing

FFCI: A Framework for Interpretable Automatic Evaluation of Summarization

2 code implementations • 27 Nov 2020 • Fajri Koto, Timothy Baldwin, Jey Han Lau

In this paper, we propose FFCI, a framework for fine-grained summarization evaluation that comprises four elements: faithfulness (degree of factual consistency with the source), focus (precision of summary content relative to the reference), coverage (recall of summary content relative to the reference), and inter-sentential coherence (document fluency between adjacent sentences).

Question Answering Semantic Textual Similarity +2

Liputan6: A Large-scale Indonesian Dataset for Text Summarization

1 code implementation • Asian Chapter of the Association for Computational Linguistics 2020 • Fajri Koto, Jey Han Lau, Timothy Baldwin

In this paper, we introduce a large-scale Indonesian summarization dataset.

Abstractive Text Summarization

IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP

no code implementations • COLING 2020 • Fajri Koto, Afshin Rahimi, Jey Han Lau, Timothy Baldwin

Although the Indonesian language is spoken by almost 200 million people and is the 10th most spoken language in the world, it is under-represented in NLP research.

Benchmarking Language Modelling

Towards Computational Linguistics in Minangkabau Language: Studies on Sentiment Analysis and Machine Translation

1 code implementation • PACLIC 2020 • Fajri Koto, Ikhwan Koto

Although some linguists (Rusmali et al., 1985; Crouch, 2009) have attempted to define the morphology and syntax of Minangkabau, information processing in this language is still absent due to the scarcity of annotated resources.

Machine Translation Sentiment Analysis +2

Improved Document Modelling with a Neural Discourse Parser

1 code implementation • ALTA 2019 • Fajri Koto, Jey Han Lau, Timothy Baldwin

We empirically investigate the benefit of the proposed approach on two different tasks: abstractive summarization and popularity prediction of online petitions.

Abstractive Text Summarization Text Generation

A Publicly Available Indonesian Corpora for Automatic Abstractive and Extractive Chat Summarization

no code implementations • LREC 2016 • Fajri Koto

In this paper we report our effort to construct the first ever Indonesian corpora for chat summarization.
