pd4ml is a collection of datasets from fundamental physics research -- including particle physics, astroparticle physics, and hadron- and nuclear physics -- for supervised machine learning studies. These datasets, containing hadronic top quarks, cosmic-ray induced air showers, phase transitions in hadronic matter, and generator-level histories, are made public to simplify future work on cross-disciplinary machine learning and transfer learning in fundamental physics.
It currently consists of five datasets:
- Top Tagging Landscape (Classification)
  - Train/val/test: 1.2M/400k/400k
  - Structure: Four vectors
  - Dimension: 200 particles, 4 features/particle
- Smart Backgrounds (Classification)
  - Train/val/test: 157k/39k/84k
  - Structure: Decay Graph
  - Dimension: 100 particles, 9 features/particle
- Spinodal or Not (Classification)
  - Train/val/test: 16.3k/4k/8.7k
  - Structure: 2D Histogram
  - Dimension: 20x20 histogram of pion spectra
- EoS (Classification)
  - Train/val/test: 121k/25k/54k
  - Structure: 2D Histogram
  - Dimension: 24x24 histogram of pion spectra
- Air Showers (Regression)
  - Train/val/test: 56k/30k/14k
  - Structure: 81 1D Traces
  - Dimension: 81 stations, 80 signal bins + timing
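
To compare the split sizes listed above, it can help to hold them in a small data structure. The following is a minimal sketch only: `DatasetInfo` and `CATALOG` are hypothetical names introduced for illustration and are not part of the pd4ml package, and the counts are the rounded figures quoted in the list, not exact sample counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    """Summary of one dataset, transcribed from the list above (hypothetical helper)."""
    name: str
    task: str
    train: int
    val: int
    test: int

    @property
    def total(self):
        """Total number of samples across all three splits."""
        return self.train + self.val + self.test

    def fractions(self):
        """Train/val/test shares of the total, rounded to three decimals."""
        return tuple(round(n / self.total, 3) for n in (self.train, self.val, self.test))


# Rounded counts as quoted in the list; exact sizes may differ.
CATALOG = [
    DatasetInfo("Top Tagging Landscape", "classification", 1_200_000, 400_000, 400_000),
    DatasetInfo("Smart Backgrounds", "classification", 157_000, 39_000, 84_000),
    DatasetInfo("Spinodal or Not", "classification", 16_300, 4_000, 8_700),
    DatasetInfo("EoS", "classification", 121_000, 25_000, 54_000),
    DatasetInfo("Air Showers", "regression", 56_000, 30_000, 14_000),
]

if __name__ == "__main__":
    for d in CATALOG:
        print(f"{d.name:<24} {d.task:<15} total={d.total:>9,} split={d.fractions()}")
```

Seen this way, the datasets span roughly two orders of magnitude in size, from about 29k samples (Spinodal or Not) to about 2M (Top Tagging Landscape), which is worth keeping in mind when comparing models across them.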