Papers with Code Dataset Licensing Guide

On Licences

What is it?

Datasets indexed on Papers with Code helps to improve accessibility, reproducibility, and responsible use of a particular dataset artefact. To ensure proper referencing and usage of a dataset, information regarding its licensing is important.

A dataset license should clearly describe how to properly and responsibly use a dataset. It’s intended to help guide the user of the dataset in making informed decisions about how the dataset can and cannot be used.

Why is it important?

Having a proper license attached to a dataset encourages the user of the dataset to make informed decisions on how to use the dataset. In addition, it protects the creator of the dataset from liabilities associated with the use of the dataset. Without a license, there is also risk of malicious and incorrect (intended or unintended) usage of a dataset that could lead to undesired consequences.

Common Dataset Licences

Types of Licenses

Similar to licenses used for software products, licences applied to datasets also vary in restrictions and permissions. Below are some of the main types of dataset licenses used in the field of machine learning. The summaries for the CC licenses provided below are not a substitute for the license. Please refer to the official licensing guide for complete guidance. 

Public domain (CC0)is a public dedication tool, which allows creators to give up their copyright and put their works into the worldwide public domain. CC0 allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, with no conditions.

CC BYThis license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.

CC BY-SA - This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.

CC BY-NCThis license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator.

CC BY-NC-SAThis license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.

CC BY-ND -  This license allows reusers to copy and distribute the material in any medium or format in unadapted form only, and only so long as attribution is given to the creator. The license allows for commercial use. 

CC BY-NC-NDThis license allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.

MIT - This license is primarily used for licensing software products. See official license details here.

GPL - This license is a free, copyleft license for software and other kinds of works. See official license details here.

Apache License, Version 2.0 - This license is primarily used for licensing software products. See official license details here.

BSD-3-Clause - This license is primarily used for licensing software products. See official license details here.

Custom - a non-standard license. Check the link to the license to learn more about the exact terms of usage. In some cases we include a brief summary in parenthesis, e.g. Custom (non-commercial) is a custom license stipulating non-commercial usage. 

Unknown - refers to a dataset for which a license is unknown or hasn’t been annotated yet. Check the associated paper and dataset website for more information.

How to license your work?

  • For CC-licensing it is straightforward as you only need to choose the appropriate license that fits your needs. This must be communicated clearly, including a direct link to the license you have chosen. An example could be: This work is licensed under a CC BY 4.0 license.  See official instructions here.
  • For all other types of license used primarily for software products like MIT, GPL, and Apache License 2.0, the instructions vary. Find more information about such licenses in the Open Source Initiative documentations.

Import note

This is not an exhaustive list of licenses available for datasets. These refer to some of the most commonly used licenses. The Creative Common licenses described above refer to the 4.0 version. The descriptions provided for each license above are not a substitute for the license. For the official descriptions of each license you can refer to the provided links.

Common Pitfalls

Regarding dataset licensing, here are some common pitfalls that may occur:

  • Annotation-only licenses. In some cases, especially in Computer Vision it’s common for only the annotations to be released under a free license, but images to remain proprietary (e.g. owned by Flickr) . Check the license terms to understand what is covered and what not. 
  • Code-only licenses. In some cases, only the code to e.g. scrape the data is available. While the code might be open source and freely license, the resulting data may still not be made open and available for everything.
  • Data ownership. Check the description of the dataset, as well as how it was collected, to determine whether you have permission to use the dataset for your purposes. 

Disclaimer

Papers with Code doesn’t make any warranties regarding any of the licensing information provided by the datasets we index. Furthermore, Papers with code disclaims liability for damages resulting from use of the licensing information provided by the creators of the datasets.