LBNL's Operational Data Analytics for Data Center Energy Management
Demand for high-performance computing (HPC) facilities continues to grow, driven by artificial intelligence, the internet of things (IoT), virtual reality, and other large-scale scientific applications. This growth calls for new technologies and operating procedures to collect and analyze operational data. Achieving and maintaining lasting operational efficiency in this type of environment requires gathering information from all the systems that support the HPC data center, analyzing it, and responding to events in near real time when necessary. To address this challenge at the National Energy Research Scientific Computing (NERSC) Center, Lawrence Berkeley National Laboratory (LBNL) developed a sophisticated operational data analytics (ODA) system called Operations Monitoring and Notification Infrastructure (OMNI).
HPC data centers often have high power usage with large fluctuations, extensive cooling requirements involving both air and water, and periods of high utilization. Achieving operational efficiency for this type of data center is made even more complex by the fact that data is coming from diverse sources across different functional units and the data sets are large and time-variant.
By using OMNI as a central repository, NERSC has its HPC, facilities, and environmental data seamlessly integrated, providing key operational insights that inform decisions about HPC performance, energy efficiency, procurement, and preventative maintenance.
OMNI collects, centralizes, standardizes, and archives the large amount of environmental and performance data from IT equipment, sensors, and devices on the HPC floor. Currently, OMNI can ingest new data at an average rate of 25,000 data points per second. The OMNI team’s philosophy is that it is more important to collect 100% of the data and only be able to utilize 80% of it than to collect 80% and be missing something. Data placed in OMNI lives in a centralized location, so all data categories can be analyzed, correlated, and used together in a single database to answer specialized questions.
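To put the ingest rate in perspective, the arithmetic can be sketched as below. The 25,000 points per second figure comes from the text; the 16 bytes per stored point is a purely hypothetical placeholder chosen for illustration, not an OMNI specification, and the function names are likewise illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def points_per_day(rate_per_second):
    """Number of data points ingested in one day at a steady average rate."""
    return rate_per_second * SECONDS_PER_DAY


def storage_per_day_gb(rate_per_second, bytes_per_point):
    """Rough daily raw volume in gigabytes (10^9 bytes), before any compression."""
    return points_per_day(rate_per_second) * bytes_per_point / 1e9


AVERAGE_RATE = 25_000          # data points per second, as stated in the text
ASSUMED_BYTES_PER_POINT = 16   # hypothetical value, for illustration only

daily_points = points_per_day(AVERAGE_RATE)
daily_gb = storage_per_day_gb(AVERAGE_RATE, ASSUMED_BYTES_PER_POINT)

print(f"{daily_points:,} points per day")
print(f"about {daily_gb:.2f} GB per day at {ASSUMED_BYTES_PER_POINT} bytes per point")
```

At the stated average rate this works out to roughly 2.16 billion data points per day, which gives a sense of why a centralized, purpose-built repository is needed.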
As more diverse data accumulated over time, the team gained the ability to answer more complex and specialized questions from it. The team groups the use cases for this data into three categories:
- Real-time: emergency response. Example: During an arc flash event in 2018, the team prioritized actions based on the measured temperature trend data around the equipment.
- Short-term: review of issues. Example: Correcting control sequences during the wildfire event in northern California in November 2018. The NERSC facility experienced unusually high air pollution and had to shut off outdoor air for an extended period. The OMNI data instrumentation and near real-time analytics capabilities were instrumental in correcting control sequences quickly in response to the situation.
- Long-term: design and warranty disputes. Example: Archived data supports failure analysis when disputes arise with vendors, which is another important benefit of large-scale data storage.
Shortly after the move into Shyh Wang Hall, a NERSC energy efficiency team was assembled, led by Berkeley Lab’s Chief Sustainability Officer with senior management’s support. This team has grown over time and currently includes the lab’s Energy Manager, facilities control engineer, data center energy efficiency researchers, NERSC’s Operations Manager, NERSC’s Energy Manager, and contractor consultants. The team met every week for the first year and established good collaboration practices and momentum. Now the team meets every other week to discuss operational issues and follow up on existing or new energy and water efficiency opportunities.
The implementation of OMNI data collection infrastructure allowed NERSC to use operational data to reduce energy consumption in its HPC data center. From a facility power planning standpoint, this data informed decisions about the type of facility upgrades and additions needed for each new HPC system added to the data center. Having the operational data of interest available in OMNI enabled the Facilities team to make design decisions that avoided major infrastructure upgrades.
While OMNI helps NERSC prepare for energy management decisions, it also gives the team a holistic view of the HPC data center and the environmental information that contributes to the data center’s overall status. This allows NERSC staff not only to treat the symptoms but also to determine the root cause; they can also see when a system is not behaving as expected and respond to hazards in real time as they occur. On December 31, 2018, NERSC had an arc flash and experienced a level one fire alarm (the lowest alarm level, which indicates detected smoke but not fire) resulting from a damaged piece of equipment. Operations staff analyzed OMNI’s environmental data sets alongside the Building Management System (BMS) software, which led them to discover that the air handlers had been turned off automatically in compliance with fire protocol. OMNI’s dashboard visualizations of BMS data helped Operations staff prioritize which equipment needed immediate attention before the lack of air handlers negatively impacted any of the assets on the HPC floor.
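One way to picture trend-based prioritization of this kind is to rank equipment by how quickly its temperature is rising. The sketch below is only an illustration under stated assumptions: it is not OMNI's actual logic, the rack names and readings are invented, and a plain least-squares slope stands in for whatever analysis the Operations staff performed on the dashboards.

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (minutes since start, temperature in degrees C)


def trend_slope(samples: List[Sample]) -> float:
    """Least-squares slope of temperature over time, in degrees C per minute."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    numerator = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("all samples share the same timestamp")
    return numerator / denominator


def rank_by_trend(readings: Dict[str, List[Sample]]) -> List[Tuple[str, float]]:
    """Return (name, slope) pairs sorted from fastest-warming to slowest."""
    slopes = {name: trend_slope(samples) for name, samples in readings.items()}
    return sorted(slopes.items(), key=lambda item: item[1], reverse=True)


# Hypothetical readings, invented for illustration only.
readings = {
    "rack_a": [(0, 24.0), (1, 24.5), (2, 25.0)],
    "rack_b": [(0, 24.0), (1, 26.0), (2, 28.0)],
    "rack_c": [(0, 24.0), (1, 24.0), (2, 24.0)],
}

ranking = rank_by_trend(readings)
for name, slope in ranking:
    print(f"{name}: {slope:+.2f} degrees C per minute")
```

In this made-up data, the equipment warming fastest sorts to the top of the list, which mirrors the idea of attending first to whatever is most at risk while the air handlers are off.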
The data from OMNI has been used to lower costs, save hardware, assist with business decisions, influence collaborations within LBNL, and reduce energy consumption. So far, the team has reduced non-IT energy use by 37%, a savings of 1,800 MWh ($104,000 in cost savings), by implementing multiple improvement measures. The current Level 2 PUE is stable at around 1.08, but the team is working on lowering it further. NERSC continues to achieve operational efficiency in its current facility, and the use of OMNI datasets helps make this possible.
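The relationship between these figures can be sketched with a little arithmetic. Power usage effectiveness (PUE) is the ratio of total facility energy to IT energy, so a PUE of 1.08 means non-IT overhead is about 8% of IT energy. The sketch below assumes the 37% figure is measured against pre-improvement non-IT energy use, which the text does not state explicitly; the 100 and 8 MWh values are hypothetical, and the implied baseline and energy cost are derived for illustration, not reported results.

```python
def pue(it_energy_mwh, non_it_energy_mwh):
    """Power usage effectiveness: total facility energy divided by IT energy."""
    return (it_energy_mwh + non_it_energy_mwh) / it_energy_mwh


def overhead_fraction(pue_value):
    """Non-IT energy as a fraction of IT energy for a given PUE."""
    return pue_value - 1.0


# Figures stated in the text.
SAVINGS_MWH = 1_800
SAVINGS_FRACTION = 0.37
COST_SAVINGS_USD = 104_000

# Assumption: the 37% is relative to the pre-improvement non-IT energy use.
implied_baseline_non_it_mwh = SAVINGS_MWH / SAVINGS_FRACTION
implied_cost_per_mwh = COST_SAVINGS_USD / SAVINGS_MWH

# Hypothetical energy values that would yield the reported PUE of 1.08.
example_pue = pue(it_energy_mwh=100.0, non_it_energy_mwh=8.0)

print(f"Implied baseline non-IT energy: {implied_baseline_non_it_mwh:,.0f} MWh")
print(f"Implied energy cost: ${implied_cost_per_mwh:.2f} per MWh")
print(f"Example PUE: {example_pue:.2f}")
print(f"Overhead at PUE 1.08: {overhead_fraction(1.08):.0%} of IT energy")
```

Under that reading, the savings imply a baseline of roughly 4,900 MWh of non-IT energy and an energy cost of roughly $58 per MWh.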
One of the keys to the NERSC team’s success is that it recognizes the need for continuous improvement. Because of this, the NERSC energy efficiency team is shifting its focus from retro-commissioning, where it optimized each of the building’s systems, to an ongoing commissioning (OCx) process.
To achieve and maintain lasting operational efficiency at the National Energy Research Scientific Computing (NERSC) Center, Lawrence Berkeley National Laboratory (LBNL) developed a sophisticated operational data analytics (ODA) system, Operations Monitoring and Notification Infrastructure (OMNI), to provide key operational insights about the data center’s performance and energy efficiency.
Data centers with high-performance computing systems face unique energy management challenges due to their scale and complexity.
LBNL developed an operational data analytics system that collects information from multiple HPC and facility systems to enable the operations staff to identify and respond to issues and events in real time.
Operational teams can use real-time data to keep the HPC systems highly available for the scientific community. This data also supports LBNL’s ability to identify new opportunities for improving the HPC systems’ energy efficiency, including utilization.