Sean McGrath - TCD

With more than 400 physical servers across multiple High Performance Computing clusters, Research IT in Trinity College Dublin has enough to be doing without manually intervening on every server that breaks. Most failures come down to one of a small number of common problems that are relatively easy to remediate but take time and attention.

Alternative Site Reliability Engineering (ALT SRE) is like the SRE that Google does, but not as good. It's an in-house automation tool (i.e. a script) that ties together common tools (sometimes as simple as running 'service blah restart' on a server, or remotely power cycling it) so that the clusters heal themselves of common problems, freeing up admins for other tasks.
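
As a rough illustration of the idea only, and not the actual ALT SRE script, a tool of this kind might map each known fault to a remediation command. The fault names, the "-bmc" hostname suffix, the retry cap and the function names below are all assumptions made for this sketch; the ipmitool call also leaves out the credentials a real BMC would need.

```python
import shlex
import subprocess

# Hypothetical fault names and command templates; none of these come from the talk.
REMEDIATIONS = {
    "service_down": "ssh {host} service {service} restart",
    "unresponsive": "ipmitool -I lanplus -H {host}-bmc chassis power cycle",
}

MAX_ATTEMPTS = 2  # assumed cap: after this many tries, hand the node to a human


def plan(fault, host, service="blah"):
    """Return the argv list for a known fault, or None if the fault is unknown."""
    template = REMEDIATIONS.get(fault)
    if template is None:
        return None
    return shlex.split(template.format(host=host, service=service))


def remediate(fault, host, attempts, dry_run=True, **kwargs):
    """Decide what to do about a fault; returns (action, argv)."""
    argv = plan(fault, host, **kwargs)
    # Unknown faults and nodes that keep failing are escalated, not retried forever.
    if argv is None or attempts.get(host, 0) >= MAX_ATTEMPTS:
        return ("escalate", argv)
    attempts[host] = attempts.get(host, 0) + 1
    if not dry_run:
        subprocess.run(argv, check=False, timeout=120)
    return ("run", argv)
```

With dry_run left at its default, remediate() only reports the command it would run, which makes the fault-to-command table easy to check before it is let loose on real nodes.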

This lightning talk will discuss why the idea of automating the remediation of common hardware faults came about, how it is implemented, the poor design decisions made along the way, some of the pitfalls of having a self-healing computing cluster, and some possible future developments.