# OpenAsp

**OpenAsp** is an Open Aspect-based Multi-Document Summarization dataset derived from the DUC and MultiNews summarization datasets.
## Dataset Access

To generate OpenAsp, you require access to the DUC dataset, from which OpenAsp is derived. Once NIST grants you access, you should have DUC credentials and the `fwdrequestingducdata.zip` file.
### Steps

1. Clone this repository:

   ```bash
   git clone https://github.com/liatschiff/OpenAsp.git
   ```

2. Optionally create a `conda` or `virtualenv` environment:

   ```bash
   conda create -n openasp 'python>3.10,<3.11'
   conda activate openasp
   ```

3. Install the Python requirements (the pinned versions, e.g. of `spacy`, matter for reproducing the dataset):

   ```bash
   pip install -r requirements.txt
   ```

4. Copy `fwdrequestingducdata.zip` into the `OpenAsp` repo directory.

5. Run the prepare script:

   ```bash
   python prepare_openasp_dataset.py --nist-duc2001-user '<2001-user>' --nist-duc2001-password '<2001-pwd>' --nist-duc2006-user '<2006-user>' --nist-duc2006-password '<2006-pwd>'
   ```
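After the prepare script finishes, it should leave one gzipped JSONL file per split. A quick sanity check before loading (a sketch; `missing_splits` is a helper name introduced here, and `openasp-v1` is the output directory the loading example below globs):

```python
from glob import glob
import os

def missing_splits(directory, expected=('train', 'valid', 'test')):
    """Return the split names with no matching .jsonl.gz file in `directory`."""
    found = {
        os.path.basename(p).split('.')[0]
        for p in glob(os.path.join(directory, '*.jsonl.gz'))
    }
    return sorted(set(expected) - found)

print(missing_splits('openasp-v1'))
```

An empty list means all three splits were generated; otherwise, re-check your DUC credentials and re-run the prepare script.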
Finally, load the generated dataset with Hugging Face `datasets`:

```python
from glob import glob
import os
import gzip
import shutil
from datasets import load_dataset

openasp_files = os.path.join('openasp-v1', '*.jsonl.gz')

data_files = {
    os.path.basename(fname).split('.')[0]: fname
    for fname in glob(openasp_files)
}

# decompress each split so load_dataset can read the plain JSONL
for ftype, fname in data_files.copy().items():
    with gzip.open(fname, 'rb') as gz_file:
        with open(fname[:-3], 'wb') as output_file:
            shutil.copyfileobj(gz_file, output_file)
    data_files[ftype] = fname[:-3]

# load OpenAsp as huggingface's dataset
openasp = load_dataset('json', data_files=data_files)

# print first sample from every split
for split in ['train', 'valid', 'test']:
    sample = openasp[split][0]

    # print title, aspect_label, summary and documents for the sample
    title = sample['title']
    aspect_label = sample['aspect_label']
    summary = '\n'.join(sample['summary_text'])
    input_docs_text = ['\n'.join(d['text']) for d in sample['documents']]

    print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *')
    print(f'Sample from {split}\nSplit title={title}\nAspect label={aspect_label}')
    print(f'\naspect-based summary:\n {summary}')
    print('\ninput documents:\n')
    for i, doc_txt in enumerate(input_docs_text):
        print(f'---- doc #{i} ----')
        print(doc_txt[:256] + '...')
    print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *\n\n\n')
```
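The printing loop above can be factored into a reusable helper. `format_sample` is a name introduced here for illustration; it assumes the per-sample schema shown above (`title`, `aspect_label`, `summary_text` as a list of sentences, `documents` as dicts with a `text` list):

```python
def format_sample(sample, doc_preview=256):
    """Render one OpenAsp sample (schema as in the loading example) as text."""
    lines = [
        f"title: {sample['title']}",
        f"aspect label: {sample['aspect_label']}",
        'summary:',
        '\n'.join(sample['summary_text']),
        'documents:',
    ]
    for i, doc in enumerate(sample['documents']):
        lines.append(f'---- doc #{i} ----')
        lines.append('\n'.join(doc['text'])[:doc_preview])
    return '\n'.join(lines)
```

For example, `print(format_sample(openasp['test'][0]))` prints one test sample in the same layout as the loop above.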
### Troubleshooting

* **Dataset fails to load with `load_dataset()`** - you may want to delete the huggingface `datasets` cache folder.
* **Sentence tokenization differs from the released dataset** - you may be using a different `NLTK` or `spacy` model version, which affects the sentence tokenization process. You must use the exact versions pinned in `requirements.txt`.
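To see which versions are actually installed before re-running, you can query package metadata with the standard library; `installed_version` is a helper name introduced here, and its output should be compared against the pins in `requirements.txt`:

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(pkg):
    """Return the installed version of `pkg`, or None if it is not installed."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None

for pkg in ('nltk', 'spacy'):
    print(pkg, installed_version(pkg))
```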
### Under the Hood

The `prepare_openasp_dataset.py` script downloads the DUC and Multi-News source files, uses the `sacrerouge` package to prepare the datasets, and uses the `openasp_v1_dataset_metadata.json` file to extract the relevant aspect summaries and compile the final OpenAsp dataset.
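The metadata-driven extraction step can be pictured with a toy sketch. Note that this is illustrative only: the field names and structure below are invented for the example and do not reflect the actual format of `openasp_v1_dataset_metadata.json`:

```python
def extract_aspect_summaries(metadata, generic_summaries):
    """Toy sketch: for each metadata entry, pull the listed sentence
    indices out of that topic's generic summary to form an aspect summary.
    The entry fields (topic_id, aspect_label, sentence_indices) are
    hypothetical, not the real metadata schema."""
    samples = []
    for entry in metadata:
        sentences = generic_summaries[entry['topic_id']]
        samples.append({
            'aspect_label': entry['aspect_label'],
            'summary_text': [sentences[i] for i in entry['sentence_indices']],
        })
    return samples
```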
## License

This repository, including `openasp_v1_dataset_metadata.json` and `prepare_openasp_dataset.py`, is released under the Apache license.

The OpenAsp summary and source documents of each sample, which are generated by running the script, are licensed under the respective generic summarization dataset licenses: the Multi-News license and the DUC license.