BRC and LBL HPC senior engineer Michael Jennings will give a talk on Node Health Check (NHC) on Feb. 4, 2014, at the Stanford Conference and Exascale Workshop 2014, sponsored by the HPC Advisory Council.
TORQUE, SLURM, and other schedulers/resource managers can periodically run a "node health check" script on each compute node to verify that the node is working properly. Nodes found to be "unhealthy" can be marked as down or offline so that jobs are not scheduled or run on them. This increases the reliability and throughput of a cluster by reducing preventable job failures caused by misconfiguration, hardware faults, and similar problems. Though many sites have written their own scripts for this purpose, the vast majority are one-off efforts with little attention paid to extensibility, flexibility, reliability, speed, or reuse.
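To make the idea concrete, here is a minimal sketch in Python of what such a health check might look like. It is not NHC itself and does not reflect how any particular scheduler invokes its check; the function names (`check_path_exists`, `check_free_space`, `run_checks`, `node_state`) and the "online"/"offline" labels are hypothetical, chosen only for illustration. Each check returns a failure message or nothing, and a node with any failure would be reported as one that should not receive jobs.

```python
import os
import shutil
from typing import Callable, List, Optional

# Hypothetical convention: a check returns None when the node looks healthy,
# or a short failure message when it does not.
Check = Callable[[], Optional[str]]


def check_path_exists(path: str) -> Check:
    """Build a check that fails if a required path (e.g. a mount point) is missing."""
    def run() -> Optional[str]:
        return None if os.path.exists(path) else f"missing path: {path}"
    return run


def check_free_space(path: str, min_free_bytes: int) -> Check:
    """Build a check that fails if free space on `path` drops below a threshold."""
    def run() -> Optional[str]:
        try:
            free = shutil.disk_usage(path).free
        except OSError as exc:
            return f"cannot stat {path}: {exc}"
        if free < min_free_bytes:
            return f"low disk space on {path}: {free} bytes free"
        return None
    return run


def run_checks(checks: List[Check]) -> List[str]:
    """Run every check and collect the failure messages, if any."""
    failures = []
    for check in checks:
        message = check()
        if message is not None:
            failures.append(message)
    return failures


def node_state(failures: List[str]) -> str:
    """Map check results to an illustrative node state label."""
    return "offline" if failures else "online"


if __name__ == "__main__":
    results = run_checks([
        check_path_exists("/tmp"),
        check_free_space("/", 1024 * 1024),  # require at least 1 MiB free
    ])
    for line in results:
        print(line)
    print(node_state(results))
```

In a real deployment the scheduler or resource manager, not the script, would act on the result, typically based on the script's output or exit status as described in that scheduler's documentation.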
NHC, developed by Jennings, provides both a framework and an implementation for a highly reliable, flexible, and extensible node health check solution. It is now widely recommended by major HPC job scheduler vendors and is in use at many large HPC sites and research institutions.