Recently, Hama core committers Suraj Menon and Thomas Jungblut are working on Fault-tolerant BSP system. And I am trying to read the source code. Their design describes the new BSP computing system and API enabling checkpoint-based recovery. Furthermore, describes the confined recovery, which can be used to improve the cost and latency of recovery.
I didn't fully understand and test yet but quite nice!