Q&A: A supercomputing compound's management strategy

The National Institute for Computational Sciences (NICS), a joint project between Oak Ridge National Laboratory and the University of Tennessee, provides computing resources to researchers nationwide. The institute boasts the world's most powerful computing complex, including one of the fastest supercomputers, Kraken.

Needless to say, this is a lot to manage and maintain, requiring a centralized management system to ensure consistency across the servers. What's more, thousands of research scientists rely on the supercomputing resources, making server reliability and rapid system recovery top priorities. In an interview with FierceCIO, Stephen McNally, HPC systems administrator for the institute, talked about the decision at NICS to deploy the open source systems management platform from Puppet Labs. 

FierceCIO: Why are automatic provisioning and configuration tools so important in your environment?

Stephen McNally: The main goal of NICS is to provide computer resources to the research community, and our main goal is to keep the machines up and make them reliable enough so scientists can run a day's worth of data. We run about 160 servers. If I need to make a change in the way SSH is handled on each node, for example, I can do it via Puppet and it will propagate out to all the nodes.

Being a research environment, we're constantly on the bleeding edge of things, and a lot of researchers want bleeding-edge pieces of software. We are upgrading existing software or redeploying it in other places all the time. 

FCIO: What have you found to be the main benefits of the open source systems management technology you're using?

McNally: The main benefit is standardization across the infrastructure. One of the biggest advantages for us in using Puppet is that we can totally burn down a system and have it back up and running in an hour. If we ever had a system that was compromised, that would be easy to do. If a piece of hardware failed, with Puppet we would basically just reinstall the system via Kickstart. So far we've been fortunate and haven't had many failures where we've had to rebuild machines. We have systems set up in a secure and hardened way via Puppet to mitigate the risks.

FCIO: How easy is the technology to use, and why is ease-of-use so vital in this area?

McNally: I come from a background in which we never used configuration management utilities. When I came here, they had already decided to go with Puppet. The thought was to make the infrastructure easier to pick up. I am by no means a programmer, and I had never used anything configuration-management-wise, and it was extremely easy to pick up. 

The way we're set up, each person has their own areas of expertise and responsibility. In the event that I'm out, the software needs to be easy enough that one of my colleagues can pick it up and troubleshoot an issue.

FCIO: Why do you prefer open source when it comes to your systems management technology?

McNally: As Unix administrators, we're big fans of open source software. We just feel that most of the time open source software is developed to fill a specific need in the community. If I can use an open source piece of software, it's number one on my list. 

FCIO: What is an example of when you haven't been able to use an open source piece of software?

McNally: A scheduling system, for example. People have their preferences for scheduling systems, and while there are open source schedulers available, there are times when the commercial version has additional options.
