Genocide Transcript Corpus (GTC): Topic-Based Paragraph Classification in Genocide-Related Court Transcripts

Introduced by Schirmer et al. in A New Dataset for Topic-Based Paragraph Classification in Genocide-Related Court Transcripts

The Genocide Transcript Corpus (GTC) is the first reference corpus of annotated samples from genocide tribunals held at different international criminal courts. It is made up of witness statements about experienced violence. The material consists of 1475 text passages drawn from transcripts of about 40 to 120 pages each, covering 3 tribunals: the Extraordinary Chambers in the Courts of Cambodia (ECCC) - 438 pages, the International Criminal Tribunal for Rwanda (ICTR) - 566 pages, and the International Criminal Tribunal for the former Yugoslavia (ICTY) - 416 pages. Because no dataset of genocide court transcripts had been published, and no other pre-structured or annotated text data existed in this field of research, the aim was to close this gap by providing a systematically annotated dataset.
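
As a quick illustration of how the material is distributed across the three tribunals, the page counts given above can be tallied in a few lines. This is only a minimal sketch: the dictionary and the helper `page_shares` are hypothetical names introduced here, and the only data used are the page counts stated in the description.

```python
# Page counts per tribunal, copied from the dataset description above.
# The variable and function names are illustrative, not part of the dataset.
PAGES_PER_TRIBUNAL = {"ECCC": 438, "ICTR": 566, "ICTY": 416}


def page_shares(pages):
    """Return each tribunal's fraction of the total page count."""
    total = sum(pages.values())
    return {name: count / total for name, count in pages.items()}


total_pages = sum(PAGES_PER_TRIBUNAL.values())
shares = page_shares(PAGES_PER_TRIBUNAL)

for name, share in shares.items():
    print(f"{name}: {PAGES_PER_TRIBUNAL[name]} pages ({share:.1%})")
print(f"Total: {total_pages} pages")
```

Note that these figures count transcript pages, which is a different unit from the 1475 annotated text passages.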

Potential use cases include genocide-related inquiry by researchers, lawyers, and other practitioners who need to better access, explore, and search extensive documentation on these topics. Broadly, its stated aim is to serve three purposes:

  • (1) to provide a first reference corpus for the community
  • (2) to establish benchmark performances (using state-of-the-art transformer-based approaches) for the new classification task of paragraph identification of violence-related witness statements
  • (3) to explore first steps towards transfer learning within the domain

License


  • Unknown
