LogiQA consists of 8,678 QA instances, covering multiple types of deductive reasoning. Results show that state-of-the-art neural models perform by far worse than human ceiling. The dataset can also serve as a benchmark for reinvestigating logical AI under the deep learning NLP setting.
71 PAPERS • NO BENCHMARKS YET
PearRead is a dataset of scientific peer reviews. The dataset consists of over 14K paper drafts and the corresponding accept/reject decisions in top-tier venues including ACL, NIPS and ICLR, as well as over 10K textual peer reviews written by experts for a subset of the papers.
32 PAPERS • NO BENCHMARKS YET
Evidence Inference is a corpus for this task comprising 10,000+ prompts coupled with full-text articles describing RCTs.
25 PAPERS • NO BENCHMARKS YET