The dataset provides system prompts and user prompts for evaluating an assistant. To use it, form random pairs of system and user prompts, then run A/B tests to measure human preference on two criteria: system prompt obedience and user prompt relevance.
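The pairing and scoring steps described above could be sketched roughly as follows. This is a minimal illustration, not the dataset authors' protocol: the function names `make_random_pairs` and `preference_rates`, the vote labels "A", "B", and "tie", and the example prompts are all assumptions introduced here.

```python
import random
from collections import Counter


def make_random_pairs(system_prompts, user_prompts, n_pairs, seed=0):
    """Draw n_pairs random (system prompt, user prompt) combinations.

    Hypothetical helper: sampling is with replacement, so the same
    combination may appear more than once.
    """
    rng = random.Random(seed)
    return [
        (rng.choice(system_prompts), rng.choice(user_prompts))
        for _ in range(n_pairs)
    ]


def preference_rates(votes):
    """Turn a list of human votes ("A", "B", or "tie") into fractions.

    Hypothetical helper: the label set is an assumption, not something
    the dataset specifies.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    return {label: counts[label] / len(votes) for label in ("A", "B", "tie")}


if __name__ == "__main__":
    # Placeholder prompts, invented for illustration only.
    system_prompts = ["Answer only in French.", "Reply in one sentence."]
    user_prompts = ["Explain photosynthesis.", "What is a hash table?"]

    pairs = make_random_pairs(system_prompts, user_prompts, n_pairs=3, seed=42)
    for system_prompt, user_prompt in pairs:
        print(f"system={system_prompt!r} user={user_prompt!r}")

    # One vote list per criterion; each vote compares assistant A and B
    # on the same (system prompt, user prompt) pair.
    obedience_votes = ["A", "A", "B", "tie"]
    relevance_votes = ["B", "A", "B", "B"]
    print("obedience:", preference_rates(obedience_votes))
    print("relevance:", preference_rates(relevance_votes))
```

Keeping the two criteria as separate vote lists lets obedience and relevance be reported independently, since an assistant can follow the system prompt well while answering the user prompt poorly, or the reverse.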