Evaluating modular data center design for supercomputing facilities

The modular approach in designing a data center for supercomputers is based on the computing technology. Therefore, an inside-out approach is needed when analyzing different design strategies.

12/05/2017


Learning objectives

  • Understand the differences between supercomputers and mainstream commercial computers.
  • Comprehend what the next generation of computers might look like.
  • Know that modular data centers can be located inside of traditional brick-and-mortar data centers.

For certain go-to-market strategies, modular data centers can present a clear strategic solution for new data center space. Modular data centers are characterized as solutions that create scalable building blocks, offer flexible physical space and infrastructure, and reduce the time required to become operational. These characterizations underscore the importance of how a modular data center works as an integral part of a broader solution. While organizations tend to focus on near-term business results and speed-to-market goals, modular data centers allow planners and engineers to take the long view of their data center strategy and understand the near-continual advancements in hardware and software. As with most other things driven by technology, modular data centers must evolve to keep up with the latest developments without under- or overbuilding, a key principle of modular design. Taking the long view requires answering the following questions. The answers will help explain how the current generation of modular data centers is evolving:

  • How do developments in computing technology potentially impact the modular data center?
  • What are the modular concepts that are most likely to be affected?
  • How does this impact the design and construction of modular data centers?

One way to answer these questions is to understand what the next generation of computers might look like. Examining the current class of supercomputing (SC) systems is the best way to do this, knowing that certain aspects of the technology behind these amazing machines will eventually trickle down into commercial systems. The second part is to understand that modular data centers can be located inside of traditional brick-and-mortar data centers; modularity doesn’t necessarily equate to stand-alone, containerized equipment.

Background on supercomputing facilities

Supercomputing facilities follow the basic ideas of modularity. In many cases, especially in major national programs, research institutions looking to develop a new SC site are awarded grant money by the National Science Foundation (NSF). Prior to awarding the grant to a research organization, the NSF will examine the existing and planned facilities that are included in the proposal to ensure current and future financial effectiveness of the facility. During this examination period, a typical question that could be asked by the NSF is, “What power and cooling infrastructure will be in place for future expansion?” This question is a fundamental part of facility modularity. The first cost of the facility cannot be excessive and must support the current power and cooling demands, and the infrastructure must be expandable for next-generation computing use while minimizing future capital costs. Therefore, many projects awarded by the NSF have some degree of modularity, for both current and future facilities.

Trajectory of computing electricity use

The current generation of SC systems has a power demand that is an order of magnitude higher than that of current state-of-the-art commercial data centers (see Table, “Comparison of power metrics for commercial data centers and high-performance/supercomputing installations”). For example, a new commercial data center could have server cabinets containing 15 kW of computer equipment. Compare this to SC systems, which can have cabinet densities in the 80- to 100-kW range, with some as high as 150 kW. Also, in the commercial data center, the 15-kW cabinets will most likely be grouped together in a row along with many more low-density cabinets. Most SC installations have dedicated areas for these ultra-high-density server enclosures, with storage and network equipment located in a separate area. When the commercial and SC data center examples are translated into electrical density within the data center space, the commercial facility will have densities of 150 to 200 W/sq ft of data center floor space. Compare this to the current fastest supercomputer in the world (Sunway TaihuLight, located at the National Supercomputing Center in Wuxi, China), which has a density estimated at 1,500 W/sq ft. Granted, this is certainly an extreme example, but it shows the trajectory that computing technology is on.
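The density comparison above is a simple ratio of installed IT power to floor area. The following Python sketch works through that arithmetic. The cabinet counts and floor areas are hypothetical values chosen only to land near the ranges quoted in the text; they are not figures from any real facility.

```python
def floor_density_w_per_sqft(cabinet_loads_kw, floor_area_sqft):
    """Return IT electrical density in W/sq ft for a list of cabinet loads (kW)."""
    if floor_area_sqft <= 0:
        raise ValueError("floor area must be positive")
    return sum(cabinet_loads_kw) * 1000.0 / floor_area_sqft


# Hypothetical commercial hall: ten 15-kW cabinets mixed with forty 5-kW cabinets.
commercial = floor_density_w_per_sqft([15] * 10 + [5] * 40, floor_area_sqft=2000)

# Hypothetical SC hall: twenty 100-kW cabinets in a dedicated area.
supercomputing = floor_density_w_per_sqft([100] * 20, floor_area_sqft=1400)

print(f"commercial: {commercial:.0f} W/sq ft")
print(f"supercomputing: {supercomputing:.0f} W/sq ft")
```

With these assumed inputs, the commercial hall falls inside the 150 to 200 W/sq ft range and the SC hall approaches the 1,500 W/sq ft estimate.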

Having examined the power at the cabinet level, we need to investigate the overall power of the computing system. The top 10 most powerful computing systems in the world have a combined total power demand of 80 MW, and the highest individual system draws 18 MW. A typical stand-alone commercial data center might fall into a range of 1 to 4 MW. This difference demonstrates why cooling strategies for SC systems can be very different from those in commercial data centers. But as enterprise information technology (IT) equipment continues to increase in power and capability, it will become more common to see the cooling solutions that are normally reserved for SC installations applied to the commercial data center market.

Trajectory of computational power

Figure 4: Over the period of 2007 to 2016, the minimum power demand of servers decreased by 66%. This has a positive impact on annual energy use, especially when the servers typically run light or intermittent workloads.

There is a direct correlation between computing power and energy use (see Figure 1). One measure of this correlation comes from The Green500, a listing of the top 500 most energy-efficient SC installations. (According to the Green500, the list was first developed to counter the “performance-at-any-cost” mentality, which was building within the SC industry at the time.) The Green500 uses an energy efficiency benchmark based on performance per watt. This benchmark demonstrates how a given computer will perform in relation to the power it consumes. The paradox is that the computers will use as much energy as the previous generation, or more, but will have a marked improvement in computing performance, thus leading to a higher performance-per-watt ratio (see Figure 2). The computational power-to-electricity-use ratio is also impacted by the medium that is used to cool the computers, specifically air or water. Using air cooling will increase energy use and reduce the computational power (see Figure 3). Using water to cool the internal components of the computer allows for a more effective transfer of heat away from the component and to the water. In simple terms, this allows the processors to run faster and produce higher computational effectiveness. This type of energy efficiency comes solely from the computing side of the equation; all else being equal, the facility will still see the same (or more) energy use as it did when using the previous generation of computing systems. This is an important reason to have the most energy-efficient cooling and power-conversion systems possible.
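The performance-per-watt paradox described above can be illustrated with a short calculation. This is a minimal Python sketch, and the two systems in it are hypothetical: the FLOP/s and power figures are made up for illustration and are not taken from the Green500 list.

```python
def gflops_per_watt(flops, watts):
    """Return energy efficiency in GFLOP/s per watt."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return flops / watts / 1e9


# Hypothetical previous-generation system: 10 PFLOP/s at 5 MW.
previous_gen = gflops_per_watt(flops=10e15, watts=5e6)

# Hypothetical next-generation system: 30 PFLOP/s at 6 MW.
# It draws more power, yet its efficiency is higher.
next_gen = gflops_per_watt(flops=30e15, watts=6e6)

print(f"previous generation: {previous_gen:.1f} GFLOP/s per watt")
print(f"next generation: {next_gen:.1f} GFLOP/s per watt")
```

In this example the newer machine is more efficient per watt, but the facility still has to deliver and reject 6 MW rather than 5 MW.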

Another important enhancement in the newer servers is the ability to more closely match actual electricity usage to the workload that is running on the computer. This is analogous to speeding up and slowing down a car: the more energy expended by the engine, the higher the speed, and vice versa. This may seem like a simple concept, but older generations of commercial servers would run at a minimum of 30% to 50% of the rated power even if the computer was idle. The energy use of the facility benefits from these enhancements to the servers because the cooling systems only need to provide cooling when the servers are truly running a workload, whereas previously the computers were being cooled even if there was no workload running (see Figure 4).
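To see why a lower idle power matters for annual energy use, consider the rough Python sketch below. It assumes a linear power model (idle power plus a utilization-proportional share of the remaining rated power), which is a simplification rather than a measured server curve. The 500-W rating and 20% utilization are hypothetical; the 50% idle fraction comes from the range above, and 17% is that value reduced by the 66% cited for Figure 4.

```python
HOURS_PER_YEAR = 8760


def annual_energy_kwh(rated_w, idle_fraction, utilization):
    """Annual energy (kWh) under an assumed linear idle-to-rated power model."""
    if not (0.0 <= idle_fraction <= 1.0 and 0.0 <= utilization <= 1.0):
        raise ValueError("fractions must be between 0 and 1")
    idle_w = rated_w * idle_fraction
    # Power rises linearly from idle_w at 0% utilization to rated_w at 100%.
    average_w = idle_w + (rated_w - idle_w) * utilization
    return average_w * HOURS_PER_YEAR / 1000.0


# Hypothetical 500-W server running at 20% average utilization.
older_server = annual_energy_kwh(rated_w=500, idle_fraction=0.50, utilization=0.20)
newer_server = annual_energy_kwh(rated_w=500, idle_fraction=0.17, utilization=0.20)

print(f"older server: {older_server:.0f} kWh/year")
print(f"newer server: {newer_server:.0f} kWh/year")
```

Under these assumptions, the lightly loaded newer server uses a little over half the annual energy of the older one, and the cooling load falls with it.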

Running software on supercomputers

Because the purpose of an enterprise server and a supercomputer may be very different, the types of software also will be very different, as will the way the computers are operated. This has a direct impact on the energy usage of the two types of computer systems. In commercial data centers, the servers and the network gear will typically operate anywhere from near 0% to 50% utilization, depending on the application and architecture of the computing system. Using virtualized computers, where many servers are imaged onto a single server, will drive this utilization upward, but it is unlikely that it would ever approach 100%. The percent utilization will change throughout the day and week, correlating to times of heavy use. Most of the applications for enterprise servers run nonstop, 24/7 year-round.

SC systems will typically be running analyses, modeling, simulations, or some other deeply complex research for a designated period of time. When the analysis is done, the computer is ramped down. This resembles more of an on-off operation as compared with the gradual fluctuations seen in the commercial servers. This on-off operation requires that the cooling systems react very quickly and provide nearly full cooling almost instantaneously when the computers are started.

High performance requires modular design

So why are these massively powerful computers affecting the current thinking on how modular data centers are designed and constructed? One answer is thermal management. For computers with hundreds of cores per single cabinet, a two-part solution is needed for optimal thermal management. The first is the usual: Keep the data center at an appropriate temperature to ensure proper operation of the computer, storage cabinets, and any ancillary systems housed in the data center. The second part consists of keeping the central processing unit (CPU), memory, and graphics processing unit (GPU) within the servers at the optimal temperature.

Generally, when these components are kept “cold,” they run faster and will be able to run more computations in a shorter amount of time. There end up being two systems, one dedicated to cooling the data center and maintaining the required environmental conditions (such as those defined in the ASHRAE thermal guidelines), and the other dedicated to cooling the computers directly. The system providing general cooling for the data center must have the capability to mitigate the heat transferred by convection from the computers to the data center. Depending on the cooling used for the computers, the percentage of heat released to the room could be anywhere between 10% and close to 30%. On a cabinet filled with very powerful computers (assume 40 kW), this will translate to 4 kW to 12 kW per cabinet that needs to be cooled by the room cooling system. Even if there is a secondary cooling system to cool the computers directly, the heat that is released to the data center is not insignificant and must be accounted for. Having a modular design for the data center will allow for the uniform and deliberate growth of this type of hybrid air and water system. But due to the extreme power density, a traditional modular approach will generally not be effective in providing the appropriate cooling resources.
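The 4-kW to 12-kW figure above follows directly from the 40-kW cabinet and the 10% to 30% range. The short Python sketch below reproduces that split between the room cooling system and the direct liquid loop; the function name is illustrative only.

```python
def split_heat_load(cabinet_kw, fraction_to_room):
    """Split a cabinet's heat into (room-cooled kW, liquid-cooled kW)."""
    if not 0.0 <= fraction_to_room <= 1.0:
        raise ValueError("fraction_to_room must be between 0 and 1")
    room_kw = cabinet_kw * fraction_to_room
    liquid_kw = cabinet_kw - room_kw
    return room_kw, liquid_kw


# The 40-kW cabinet from the text, at both ends of the 10% to 30% range.
low_room_kw, low_liquid_kw = split_heat_load(40, 0.10)
high_room_kw, high_liquid_kw = split_heat_load(40, 0.30)

print(f"10% to room: {low_room_kw:.0f} kW air, {low_liquid_kw:.0f} kW liquid")
print(f"30% to room: {high_room_kw:.0f} kW air, {high_liquid_kw:.0f} kW liquid")
```

Multiplying the room-side figure by the number of cabinets gives a first estimate of the capacity the general data center cooling system must still provide.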

How supercomputing is influencing modular design

Modular data center design for SC systems must be looked at from the inside out. What does this mean? Modular data centers are typically described in terms like “blocks” or, as the name implies, “modules.” The HVAC and electrical engineers size equipment so that, as the data center grows, new equipment (uninterruptible power supply systems, power distribution units, chillers, pumps, and air conditioning units) can be added in a uniform fashion to closely match the capacity of the new IT load. This method is completely logical and appropriate. It is an outside-in process where the engineers, for the most part, might have little information on the specific computer technology that will be put into the facility, except for the overall capacity and capability requirements of the central power and cooling systems. Colocation data centers are a good example of this approach. This type of facility should scale to a customer’s needs while at the same time minimizing capital costs. Ultimately, it must provide a solution to prospective tenants that is flexible, has expansion capabilities, and has a speed to market far better than other options. In general, the overall cooling and electrical capacity, power density (watts per square foot), and expansion capability are the primary metrics used in evaluating this type of facility, even before much is known about the specific IT hardware that will be installed.
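The outside-in sizing described above amounts to adding equal-sized blocks of capacity as the IT load grows. The Python sketch below shows one way to express that. The 500-kW module size, the growth plan, and the single redundant module (an N+1 arrangement) are all hypothetical assumptions, not values from this article.

```python
import math


def modules_required(it_load_kw, module_capacity_kw, redundant_modules=1):
    """Modules needed to carry the load, plus an assumed number of spares."""
    if module_capacity_kw <= 0:
        raise ValueError("module capacity must be positive")
    if it_load_kw < 0:
        raise ValueError("IT load cannot be negative")
    return math.ceil(it_load_kw / module_capacity_kw) + redundant_modules


# Hypothetical IT load (kW) at each phase of a facility build-out.
growth_plan_kw = [400, 900, 1600, 2400]

# Hypothetical 500-kW power and cooling modules in an N+1 arrangement.
module_counts = [modules_required(load, 500) for load in growth_plan_kw]

for load, count in zip(growth_plan_kw, module_counts):
    print(f"{load} kW of IT load -> {count} modules installed")
```

Nothing in this calculation depends on what the IT hardware actually is, which is exactly the outside-in characteristic the text describes.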

How does the outside-in process described above compare to an inside-out process? It starts with how much is known about the specific type of computing technology being proposed. Commercial data centers will have the capability to provide power and cooling systems that are flexible, depending on the design criteria. But there is a limit to how much is available. For example, the cooling systems may not be able to support a server cabinet rated at more than 10 kW. Or the electrical systems cannot carry a load of more than 150 to 200 W/sq ft. These are examples of a data center that is defined using an outside-in approach to ensure the robustness of the systems supporting the IT hardware. The paradox is that the flexibility and scalability capabilities also limit the type of IT hardware that can be installed. And while the design may be modular, the speculative nature of the data center limits the use of some computing systems, such as supercomputers. A quick example of this: Cabinets for current HPC/SC data centers are as high as 150 kW per cabinet, and the electric density is upwards of 1,500 W/sq ft (versus 10 kW and 200 W/sq ft in commercial data centers, respectively). This is not meant to suggest that a colocation facility should be designed to house a supercomputer, but rather to illustrate the differences between the different approaches used in modular design.

Challenges in designing power and cooling

Tenants, owners, and users of commercial data centers will likely install commercial off-the-shelf (COTS) servers, storage, and network gear. Compare this with supercomputers that are extremely powerful and use purpose-built power and cooling subsystems. These are far from being COTS. This is where the inside-out design approach comes in. Over the years, many innovative ideas on how to cool the computer hardware have been developed. And most industry experts would agree that it is necessary to cool highly concentrated, extreme IT loads using a liquid heat-transfer medium, such as water. Using water cooling is a key component of the modular design approach.

While there are variations among the computer manufacturers in the specific design approach for water-cooling the computers, there are some commonalities. Inside the servers, there are three primary heat sources: the CPU, memory (DIMM), and the GPU. When the computers are under load, the share of heat that each of these components produces will change based on the manufacturer, computation process, and external components, such as networking and storage. The internal components used in a computer with internal liquid cooling (referred to as close-coupled) will transfer heat to heat sinks mounted directly onto the component. These heat sinks have cold water circulating through them, and after the heat is transferred to the liquid, the warm water returns to a heat exchanger located in a pumping unit that is external to the computer cabinet. The pumping unit has the capacity for three server cabinets, so the number of pumping units depends on the number of computer cabinets in the data center. It is possible that as the power requirements of computers continue to grow, modular data centers of the future will have chilled-water piping distribution built directly into the prefabricated infrastructure, allowing for an inside-out approach when designing the cooling systems.
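The relationship between cabinets and pumping units mentioned above is a rounding-up calculation. The Python sketch below uses the three-cabinets-per-unit ratio stated in the text; the cabinet counts are hypothetical, and a real design would also need to check the heat-exchanger capacity against the actual cabinet loads.

```python
CABINETS_PER_PUMPING_UNIT = 3  # ratio stated in the text


def pumping_units_required(cabinets, cabinets_per_unit=CABINETS_PER_PUMPING_UNIT):
    """Return the number of pumping units needed, rounding up to whole units."""
    if cabinets < 0:
        raise ValueError("cabinet count cannot be negative")
    if cabinets_per_unit <= 0:
        raise ValueError("cabinets per unit must be positive")
    # Integer ceiling division: a partly used unit still has to be installed.
    return (cabinets + cabinets_per_unit - 1) // cabinets_per_unit


# Hypothetical deployment phases: 4, then 10, then 18 liquid-cooled cabinets.
units_by_phase = [pumping_units_required(n) for n in (4, 10, 18)]
print(units_by_phase)
```

Because units are added in whole steps, the pumping equipment can grow alongside the cabinets rather than being installed all at once.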

Liquid-cooling examples

When we start to look at options for the systems that directly cool the computers, there are two main categories of cooling:

External cooling. This term refers to cooling systems that are typically mounted on or near the server cabinet. Examples include:

  • Rear-door heat exchangers (RDHX). These are fan coils mounted directly on the rear door of the server cabinet. The concept behind this design is to treat the hot exhaust air coming from the servers by cooling it down to room temperature and then discharging it back into the room. This approach does not require a hot-aisle, cold-aisle design because the air discharging from the back of a server cabinet is the same temperature as the room.
  • Overhead modular cooling units. This equipment consists of fans and chilled-water or refrigerant coils either mounted on top of the server cabinet or hung from the ceiling. In high-density applications, there will be a one-to-one correlation between the cooling unit and the server cabinet. These units operate by drawing hot air from the hot aisle, cooling down the air, and discharging it directly into the cold aisle.
  • In-row fan coil units. Here, the units are mounted alongside the server cabinets. They draw air in from the hot aisle, cool it down, and then discharge it into the cold aisle. The row of server cabinets also can be enclosed to completely isolate the servers from the surrounding data center environment. The in-row fan coil units simply recirculate the air from back to front to maintain the required environmental conditions.

Internal cooling. This involves removing heat directly from the components in the server, which is the most efficient approach. There will typically be chilled-water piping located below the raised floor or mounted overhead in the aisles. The entire main piping loop will most likely have been installed during the initial construction of the facility. This also is highly modular; as new computers are installed, the cooling hoses to and from the new computer will be “plugged” into the chilled-water loop. (Note that the internal components are not specified or designed by the mechanical engineer.) Examples include:

  • Direct-to-chip CPU cooler, GPU cooler, and memory cooler. This approach uses cold plates (some with integral pumps) that mount directly on the respective component in the server and transfer heat to an external heat exchanger.
  • Thermal bus with dry disconnect. In this configuration, a copper bus is mounted inside the server cabinet. Servers with special side-mounted copper plates make direct contact with the thermal bus. The heat flows from inside the server to the water-cooled thermal bus, then the warm water flows to a heat exchanger.

These cooling solutions have an important commonality: Each can be used in a highly modular environment. As more servers come online, additional cooling equipment is deployed in the data center. Certainly, the central plant cooling and power equipment (chillers, pumps, switchgear, UPS, etc.) will need to expand in step with the server growth, but the installation of the cooling equipment in the data center becomes much more flexible as to when it needs to be put in place. (This, of course, assumes that the primary infrastructure, such as water or refrigerant piping, has been installed with the necessary connections.)

Final thoughts

Many of the solutions discussed in this article are not solely meant for SC systems; there is cross-over with other types of data centers, especially ones that have high densities or limited air-cooling ability. More important, the definition of a supercomputer continues to evolve: Some mainstream commercial computers are as powerful as, or more powerful than, supercomputers were just 5 or 10 years ago. The modular approach in designing a data center for a supercomputer is based on the computing technology itself; therefore, an inside-out approach is needed when analyzing different design strategies.

About the author

Bill Kosik is a senior mechanical engineer, mission critical at EXP U.S. Services Inc. in Chicago. He also is a member of the Consulting-Specifying Engineer editorial advisory board. 

Figure 1: Supercomputing power and electrical power from 1990 to 2020. Note that efficiency (MFLOP/s/w) is increasing more quickly than the computer power (FLOP/s). Courtesy: EXP U.S. Services Inc.
