no code implementations • 27 Jul 2020 • Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi
We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments.
no code implementations • 26 Oct 2018 • Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi
As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates.
Distributed, Parallel, and Cluster Computing
1 code implementation • 26 Jul 2018 • Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi
We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, with a focus on the management of complex experiments.
Distributed, Parallel, and Cluster Computing
no code implementations • 19 May 2015 • Alina Sîrbu, Ozalp Babaoglu
Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions.