The HH Red Teaming dataset comprises two distinct types of data, each serving a unique purpose:

  1. Human Preference Data about Helpfulness and Harmlessness:
    • This dataset accompanies the paper "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback."
    • It provides insight into how humans judge the helpfulness and harmlessness of AI-generated responses.
    • The data format is straightforward: each line of the JSONL files contains a pair of texts, one "chosen" and one "rejected" (see the loading sketch after this list).
    • For helpfulness, the data are grouped into train/test splits from base models, from rejection sampling against an early preference model, and from a dataset sampled during an iterated "online" process.
    • For harmlessness, the data are collected for base models and formatted in the same way.
    • Details about the data collection process and the crowdworker population can be found in the paper¹.

  2. Red Teaming Data:
    • This dataset accompanies the paper "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned."
    • It aims to understand how crowdworkers "red team" language models and to assess those models' harmfulness.
    • Each line of the JSONL file contains a dictionary with the following fields (a loading sketch follows the note below):
      • transcript: A text transcript of a conversation between a human adversary (red team member) and an AI assistant.
      • min_harmlessness_score_transcript: A real-valued score for the harmlessness of the AI assistant across the transcript (lower scores indicate more harm).
      • num_params: The number of parameters in the language model powering the AI assistant.
      • model_type: The type of model powering the AI assistant.
      • rating: The red team member's rating of how successful they were at breaking the AI assistant (Likert scale; higher ratings indicate more success)¹².
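
As a concrete illustration of the preference format, here is a minimal Python sketch for inspecting the "chosen"/"rejected" pairs. The file path is a placeholder, not something prescribed by the dataset; point it at whichever decompressed helpfulness or harmlessness split you have downloaded.

```python
import json

# Minimal sketch for inspecting the preference data.
# "helpful-base/train.jsonl" is a placeholder path; substitute whichever
# decompressed split you downloaded from https://github.com/anthropics/hh-rlhf.
path = "helpful-base/train.jsonl"

with open(path, "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)          # one comparison per line
        chosen, rejected = record["chosen"], record["rejected"]
        print("CHOSEN:\n", chosen[:200])
        print("REJECTED:\n", rejected[:200])
        break                              # only look at the first pair
```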

Please note that the data may contain content that could be offensive or upsetting, including discussions of abuse, violence, and other sensitive topics. Researchers should engage with the data responsibly and in accordance with their own risk tolerance¹.
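
Similarly, the sketch below iterates over the red-teaming data and reads only the fields listed above. Again, the path is a placeholder for the downloaded, decompressed JSONL file.

```python
import json

# Minimal sketch for inspecting red-teaming attempts.
# "red_team_attempts.jsonl" is a placeholder path for the downloaded,
# decompressed file; only the fields described above are accessed.
path = "red_team_attempts.jsonl"

with open(path, "r", encoding="utf-8") as f:
    for line in f:
        attempt = json.loads(line)
        print("model_type:", attempt["model_type"])
        print("num_params:", attempt["num_params"])
        print("min_harmlessness_score_transcript:",
              attempt["min_harmlessness_score_transcript"])
        print("red-team success rating:", attempt["rating"])
        print("transcript preview:\n", attempt["transcript"][:300])
        break  # inspect only the first attempt
```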

Sources:
  (1) GitHub - anthropics/hh-rlhf: Human preference data for "Training a ...". https://github.com/anthropics/hh-rlhf
  (2) Trelis/hh-rlhf-dpo · Datasets at Hugging Face. https://huggingface.co/datasets/Trelis/hh-rlhf-dpo
  (3) Anthropic/hh-rlhf at main - Hugging Face. https://huggingface.co/datasets/Anthropic/hh-rlhf/tree/main
