Segmenting Scientific Abstracts into Discourse Categories: A Deep Learning-Based Approach for Sparse Labeled Data

The abstract of a scientific paper distills the contents of the paper into a short paragraph. In the biomedical literature, it is customary to structure an abstract into discourse categories such as BACKGROUND, OBJECTIVE, METHOD, RESULT, and CONCLUSION, but this segmentation is uncommon in other fields such as computer science. Explicit categories could support more granular, discourse-level search and recommendation. The sparsity of labeled data makes it challenging to build supervised machine learning solutions for automatic discourse-level segmentation of abstracts in non-biomedical domains. In this paper, we address this problem using transfer learning. In particular, we define three discourse categories for an abstract (BACKGROUND, TECHNIQUE, and OBSERVATION) because these three categories are the most common. We train a deep neural network on structured abstracts from PubMed, then fine-tune it on a small hand-labeled corpus of computer science papers. We observe an accuracy of 75% on the test corpus. We perform an ablation study to highlight the roles of the different parts of the model. Our method appears to be a promising solution to the automatic segmentation of abstracts when labeled data is sparse.
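The two-stage idea in the abstract (train on a large labeled source corpus, then fine-tune on a small target corpus) can be illustrated with a toy sketch. This is not the paper's deep neural network: it swaps in a bag-of-words softmax classifier so the example stays self-contained, and every sentence, function name, learning rate, and epoch count below is hypothetical. It also assumes the source sentences already carry the three target labels; how the PubMed categories are mapped onto them is not stated in the abstract.

```python
import math
import random

LABELS = ["BACKGROUND", "TECHNIQUE", "OBSERVATION"]

# Hypothetical stand-in for the large source corpus (structured abstracts).
SOURCE_DATA = [
    ("previous studies suggest the disease is common", "BACKGROUND"),
    ("little is known about long term outcomes", "BACKGROUND"),
    ("we conducted a randomized controlled trial", "TECHNIQUE"),
    ("we propose a protocol to measure response", "TECHNIQUE"),
    ("results show significant improvement in patients", "OBSERVATION"),
    ("the treatment group showed lower mortality", "OBSERVATION"),
]

# Hypothetical stand-in for the small hand-labeled target corpus.
TARGET_DATA = [
    ("deep networks are widely used in vision", "BACKGROUND"),
    ("we propose a convolutional architecture", "TECHNIQUE"),
    ("experiments show improved accuracy", "OBSERVATION"),
]


def build_vocab(data):
    """Map each distinct lowercase token to a column index."""
    vocab = {}
    for sentence, _ in data:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def featurize(sentence, vocab):
    """Bag-of-words counts over the fixed vocabulary."""
    counts = [0.0] * len(vocab)
    for token in sentence.lower().split():
        if token in vocab:
            counts[vocab[token]] += 1.0
    return counts


def predict_proba(weights, x):
    """Softmax over one linear score per label."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(weights, data, vocab, lr, epochs, seed=0):
    """Plain SGD on cross-entropy loss; updates `weights` in place."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for sentence, label in data:
            x = featurize(sentence, vocab)
            probs = predict_proba(weights, x)
            target = LABELS.index(label)
            for k in range(len(LABELS)):
                grad = probs[k] - (1.0 if k == target else 0.0)
                for j, xj in enumerate(x):
                    if xj:
                        weights[k][j] -= lr * grad * xj
    return weights


def predict(weights, sentence, vocab):
    """Return the label with the highest predicted probability."""
    probs = predict_proba(weights, featurize(sentence, vocab))
    return LABELS[probs.index(max(probs))]


def pretrain_then_finetune(source, target):
    """Stage 1: fit on source data. Stage 2: continue on target data."""
    vocab = build_vocab(list(source) + list(target))
    weights = [[0.0] * len(vocab) for _ in LABELS]
    train(weights, source, vocab, lr=0.5, epochs=100)  # pre-training
    train(weights, target, vocab, lr=0.1, epochs=50)   # fine-tuning, smaller step
    return weights, vocab


if __name__ == "__main__":
    weights, vocab = pretrain_then_finetune(SOURCE_DATA, TARGET_DATA)
    for sentence, _ in TARGET_DATA:
        print(predict(weights, sentence, vocab), "<-", sentence)
```

The point of the sketch is the order of the two `train` calls: the second call starts from the weights learned on the source data instead of from zeros, which is what lets a small target corpus suffice.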


Task                              Dataset                            Model                               Metric Name  Metric Value  Global Rank
Sequential sentence segmentation  Abstracts from Arxiv.NI            Fine_tuned_Transfer_Learning model  Accuracy     65.22         # 1
Sequential sentence segmentation  Abstracts from Arxiv.NI+TLT+TPAMI  Fine_tuned_Transfer_Learning model  Accuracy     75.18         # 1
Sequential sentence segmentation  Abstracts from TLT                 Fine_tuned_Transfer_Learning model  Accuracy     79.45         # 1
Sequential sentence segmentation  Abstracts from TPAMI               Fine_tuned_Transfer_Learning model  Accuracy     83.44         # 1
