Keeping Pace with High Tech: Evolving Power Strategies

By Bradley S. Walter, Active Power Inc., Austin, Texas. March 1, 2003

The power requirements of most current applications differ from traditional mainframe data centers. Modern information technology (IT), industrial and other facilities usually have several independent critical missions, rather than a single monolithic one. And power systems supporting these applications need to be designed for repeated expansion without disruption to any of the individual processes.

An Internet data center, for example, is often composed of many relatively small IT operations with different owners running independent of one another. Individual IT enterprises within the data center run numerous “copies” of an application on multiple servers connected to a large number of networked users. The data center operator must offer these owners widely varying levels of service and support, must provide security among the various IT operations, and must guarantee overall operational security for the entire center.

Industrial users, on the other hand, usually have a number of critical operations or tools that must be supported to continue the process while other production areas sustain a brief power outage until backup generation takes over.

A UPS user may also have multiple sites that require a similar “look and feel” to operating personnel and customers. Finally, multiple independent power paths from utility feeders to the nearest point to the load are a necessary feature of any power system for truly critical loads.

Two Paths Better Than One

In a single-path system, the probabilities of failure of every point in the system add algebraically. (See “Mean Time Between Failure,” below.) Suppose two power paths, each with the same probability of failure, independently supply the load, and that these paths are of poor reliability: each fails once every thousand hours and has a mean time to restore of one hour. The probability of a simultaneous failure of the two independent paths then becomes 2×10⁻⁶ per hour, and the mean time between failure (MTBF) becomes 500,000 hours. The problem is making the power paths truly independent from the farthest upstream point to the closest possible point to the load.
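The arithmetic above can be checked with a short sketch. It assumes the usual small-probability approximation for two independent, repairable paths, in which the combined failure rate is about 2λ²s (λ is the failure rate of each path and s the mean time to restore). The function and variable names are illustrative and do not come from the article.

```python
def dual_path_failure_rate(path_mtbf_hours: float, restore_hours: float) -> float:
    """Approximate failure rate (per hour) of two independent redundant paths.

    Assumes each path fails at a rate of 1/MTBF and that the load is lost only
    if the second path fails while the first is still being restored.
    """
    lam = 1.0 / path_mtbf_hours
    return 2.0 * lam * lam * restore_hours


# Figures from the text: one failure per 1,000 hours, one hour to restore.
rate = dual_path_failure_rate(path_mtbf_hours=1_000, restore_hours=1)
system_mtbf = 1.0 / rate

print(f"combined failure rate: {rate:.1e} per hour")
print(f"combined MTBF: {system_mtbf:,.0f} hours")
```

The approximation only holds when the restore time is short compared with the MTBF of each path, which is the case for the figures used here.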

Distributed power systems facilitate the creation of independent power paths and address the issues of scaling, expansion and consistent interface. The large, integrated parallel systems that were often used for mainframe data centers in the 1980s and early 1990s do not.

Other power system differences can result from differences in specific UPS designs. For example, most battery-free UPS systems use line-interactive rather than double-conversion technology. That means the output phase and frequency are not independent of the input except when operating on flywheel power. Synchronization between portions of large distributed systems must be accomplished differently from double-conversion UPS.

To illustrate the similarities and differences between UPS system designs, consider three frequently used high-reliability UPS system concepts: system-plus-system redundancy (SSR), distributed-system redundancy (DSR) and isolated-system redundancy (ISR).

All of these design strategies provide redundancy for the greatest possible portion of the power system and permit maintenance and testing of system elements concurrently with fully protected operation of the critical mission. All of these systems are modular, repeating identical power paths from utility connection to final distribution.

The SSR concept might use mirror image systems that are 100% redundant from utility feeders to the static transfer switches. The “normal” and “emergency” inputs of the static transfer switches are divided equally between the two systems. Each of the systems normally operates at less than half rated load. If either system of a pair fails—that is, cannot supply acceptable power to its static switches—the static switches with normal inputs supplied by the failed system then switch to the other system of the pair. If the UPS in one of the paired systems transfers to bypass, then the static switches with normal inputs supplied from that system transfer to the other system by swapping the normal and emergency priorities of their inputs.
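The transfer behavior described above can be sketched as a small model. This is an illustration under stated assumptions, not a description of any real switch controller: the class, the system names "A" and "B", and the 100-kVA loads are all hypothetical, and a system that is on bypass is treated the same as a failed one (it is simply left out of the `acceptable` set).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass
class StaticTransferSwitch:
    name: str
    normal: str       # system wired to the "normal" input
    emergency: str    # system wired to the "emergency" input
    load_kva: float

    def active_source(self, acceptable: Set[str]) -> Optional[str]:
        """Pick the normal input if it is acceptable, else the emergency input."""
        if self.normal in acceptable:
            return self.normal
        if self.emergency in acceptable:
            return self.emergency
        return None  # both inputs lost: the load on this switch is dropped


def system_loads(switches: Iterable[StaticTransferSwitch],
                 acceptable: Set[str]) -> Dict[str, float]:
    """Total load carried by each system that is supplying acceptable power."""
    loads = {system: 0.0 for system in acceptable}
    for sts in switches:
        source = sts.active_source(acceptable)
        if source is not None:
            loads[source] += sts.load_kva
    return loads


# Normal inputs are divided equally between the two systems of the pair.
switches = [
    StaticTransferSwitch("STS-1", normal="A", emergency="B", load_kva=100),
    StaticTransferSwitch("STS-2", normal="A", emergency="B", load_kva=100),
    StaticTransferSwitch("STS-3", normal="B", emergency="A", load_kva=100),
    StaticTransferSwitch("STS-4", normal="B", emergency="A", load_kva=100),
]

print(system_loads(switches, {"A", "B"}))  # both systems healthy
print(system_loads(switches, {"B"}))       # system A failed or on bypass
```

With both systems healthy, each carries half of the total load. When one drops out, the survivor picks up all of it, which is why each system must normally run at less than half of its rating.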

The outputs of the two UPS systems must be synchronized within approximately 15 degrees to assure that static switch transfers don’t cause excessive inrush currents or saturate any transformers or inductors downstream of the switches. Line-interactive, battery-free UPS systems cannot synchronize their outputs when connected to inputs that are not synchronized. As long as the transformers are connected properly when installed, this is not normally an issue when operating from utility power. However, this requirement means that the engine generators must be synchronized but not paralleled.
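A minimal sketch of the synchronization check follows. The 15-degree window is the approximate figure given in the text; the function names and the idea of comparing two measured phase angles in software are assumptions made for illustration only.

```python
def phase_difference_deg(phase_a_deg: float, phase_b_deg: float) -> float:
    """Smallest absolute angle between two phases, in degrees (0 to 180)."""
    diff = (phase_a_deg - phase_b_deg) % 360.0
    return min(diff, 360.0 - diff)


def transfer_permitted(phase_a_deg: float, phase_b_deg: float,
                       window_deg: float = 15.0) -> bool:
    """True when the two sources are close enough in phase for a transfer."""
    return phase_difference_deg(phase_a_deg, phase_b_deg) <= window_deg


print(transfer_permitted(350.0, 5.0))   # 15 degrees apart across the 0/360 wrap
print(transfer_permitted(10.0, 40.0))   # 30 degrees apart
```

The modulo step matters: a naive subtraction would treat 350 and 5 degrees as 345 degrees apart rather than 15.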

Double Play

Double-conversion UPS systems can synchronize their outputs when their inputs are not synchronized. However, reliability analysis of the system shows that it is important to maintain input synchronism for double-conversion UPS as well. The minimum cut sets (MCS) of each of the two systems upstream of the UPS and its static bypass have two or more components. MCS downstream of the UPS contain single components; i.e., they are single points of failure for each path individually.

If the inputs of the UPS systems remain synchronized, the UPS is not a single-point failure of a path because of its static bypass switch. If the inputs to the UPS systems are not synchronized, the static switch of one of the two UPS systems cannot operate, and the UPS becomes a single-point failure for either of the two systems. The difference is significant, because the MTBF of UPS modules without the bypass is typically in the range of 50,000 hours, compared to 500,000 hours or longer when the bypass can be used. The MTBFs of other single failure points in the critical path upstream of the static transfer switch are also one to two orders of magnitude longer than that of the UPS without bypass. The end result is that, whether a double-conversion or line-interactive UPS design is used, the inputs of the UPS systems as well as the outputs should be synchronized during operation from engine generator or utility.
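To see why the bypass matters, the following sketch combines series elements by adding their failure rates. The 50,000- and 500,000-hour UPS figures come from the text; the 2,000,000-hour value for the remaining single-point components is a hypothetical number chosen only to fall inside the "one to two orders of magnitude" range mentioned above.

```python
def series_mtbf(*component_mtbfs_hours: float) -> float:
    """MTBF of components in series: the failure rates (1/MTBF) add."""
    total_rate = sum(1.0 / mtbf for mtbf in component_mtbfs_hours)
    return 1.0 / total_rate


OTHER_COMPONENTS_MTBF = 2_000_000  # hypothetical, not from the article

without_bypass = series_mtbf(50_000, OTHER_COMPONENTS_MTBF)
with_bypass = series_mtbf(500_000, OTHER_COMPONENTS_MTBF)

print(f"path MTBF, bypass unavailable: {without_bypass:,.0f} hours")
print(f"path MTBF, bypass available:   {with_bypass:,.0f} hours")
```

Under these assumptions the UPS without bypass dominates the path: the combined MTBF stays just under 50,000 hours, against roughly 400,000 hours when the bypass can operate.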

This same analysis also points out that, all other factors being equal, reliability is improved by keeping the UPS system as close to the loads as possible, with as few intervening components as possible.

Up to this point, all failure modes discussed involve the loss of one of two redundant power paths to a portion of the critical load. This is the level at which the differences between battery-free and battery-based UPS systems apply. The above analysis demonstrates that the differences in UPS systems have little effect on the electrical design of an SSR system.

However, as a result of greatly reduced floor space for battery-free systems, there are big differences in the physical layout. Battery-free systems in the power ranges typical for large Internet data centers require less than 50% of the space of a battery-based UPS equipped with a five-minute VRLA battery and approximately 20% of the space required for a UPS with a 15-minute wet-cell battery.

Single Failure Points

No discussion of highly redundant systems can be complete without identifying single-point failures. Even though the SSR system is highly redundant, there are still single-point failures and common-cause failures that can affect multiple power paths simultaneously. Unless all loads have dual power cords, all circuit elements below the static transfer switch (STS) are common to the parallel systems and are single points of failure for the critical mission. Clearly, the failure rates and mean time to recover (MTTR) of these components dominate the unavailability of the critical mission. The STS should be as close to the loads as possible. The number of components between the STS and the loads should be minimized, and these components should be chosen for maximum MTBF and minimum MTTR.
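The point about unavailability can be illustrated with rough numbers. Everything in this sketch is hypothetical: the MTBF and MTTR figures are invented for illustration, unavailability is taken as MTTR / (MTBF + MTTR), and the two upstream paths are assumed to fail independently.

```python
def unavailability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time a repairable component is down."""
    return mttr_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, not taken from the article.
path_u = unavailability(mtbf_hours=1_000, mttr_hours=1)          # one upstream path
redundant_u = path_u ** 2                                        # both paths down together
downstream_u = unavailability(mtbf_hours=500_000, mttr_hours=4)  # one item below the STS

# For small values, the contributions add to give the mission unavailability.
share = downstream_u / (redundant_u + downstream_u)

print(f"redundant upstream paths: {redundant_u:.2e}")
print(f"single downstream item:   {downstream_u:.2e}")
print(f"downstream share of total: {share:.0%}")
```

Even with deliberately poor upstream paths and a very reliable downstream component, the single component below the STS accounts for most of the total in this example, because redundancy squares the upstream term.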

The common-cause failures of the two sides of the static switch are equally important. When properly designed, STS have a high degree of independence between the two sides. However, it is impossible to eliminate all components that constitute a common failure point to both sides of the switch. These components affect critical mission availability and MTBF as much as the components that follow the static switch.

There are also common-cause failures for the utility feeders to the building. Even if these feeders are from separate substations, there are still events that will cause simultaneous unavailability of multiple feeders. These common-cause failures are not single-point mission-critical failures, but they force all of the systems into a less reliable state and, therefore, increase the probability per unit time that any system will fail before utility power is restored.

Finally, there are common causes of mission-critical failure that remain within the facility. Fire, explosion, flood, human error, sabotage and lack of redundancy in systems that support the electrical system all come to mind as potential common causes of critical mission failures.

System Differences

SSR systems have the highest level of redundancy. They also require the most equipment, have the highest cost, require the most space and have the lowest operating efficiency.

DSR systems use a similar redundancy concept, but instead of being “1 of 2” redundant, they will be “1 of 3”, “1 of 4” or “1 of 5” redundant. If one system fails, an STS will transfer its loads to multiple systems rather than a single system. In a “1 of 3” redundant system, any two of the three systems must remain operating to support the entire critical load. In normal operation, with no systems failed, the maximum load on any system is two-thirds of system capacity. When one system fails, it transfers half of its load to each of the two other systems. This would allow, for example, three 1,500-kVA systems to be used in a DSR configuration instead of the four 1,000-kVA systems in an SSR system of equal design capacity. Efficiency is improved because systems operate at higher load levels. Also, when a system fails, the step loading of the remaining systems is less.
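The loading figures above generalize to any group size. The sketch below assumes the group is loaded to its design maximum and that the load of a failed system is shared evenly among the survivors; the function name is illustrative. A group of two corresponds to one SSR pair.

```python
from fractions import Fraction


def redundant_group_loading(n_systems: int):
    """Loading of a "1 of N" redundant group, as fractions of one system's capacity.

    Returns (normal_load, step_load_per_survivor, load_after_one_failure).
    """
    if n_systems < 2:
        raise ValueError("a redundant group needs at least two systems")
    normal = Fraction(n_systems - 1, n_systems)
    step = normal / (n_systems - 1)  # failed system's load split among survivors
    return normal, step, normal + step


for n in (2, 3, 4, 5):
    normal, step, after = redundant_group_loading(n)
    print(f"1 of {n}: normal {normal}, step per survivor {step}, after failure {after}")
```

The step load on each survivor is 1/N of its capacity, so it shrinks as the group grows, while the normal operating load rises toward full capacity. That is the efficiency gain, and also the reason load management becomes tighter.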

On the negative side, a second system failure in a DSR system usually causes load loss. In an SSR system it only causes load loss if both systems are part of the same pair. Load management is more difficult for a DSR system over time, because the amount of load that one system can accept from another is only a fraction of the normal system load. Balancing loads between STS is more critical, and it is easier to cause an overload-induced failure on one of the remaining systems when one system fails. In the case of “1 of 3” redundancy, the induced failure will cause more than half of the data center loads to be lost. DSR systems have more complex intersystem connections on the output of the static switches. As a result, they are more subject to human error and are more difficult to expand at a site. They are also more difficult to scale for widely differing capacity requirements while still maintaining a consistent “look and feel.” Increasing the number of systems among which loads must be distributed—1 of 5 instead of 1 of 3—lowers cost, but it also decreases redundancy, increases the difficulty of load management and increases opportunity for human error.
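The difference in exposure to a second failure can be counted directly. This sketch assumes that every system is equally likely to fail, that the groups run at design load, and that overload-induced failures are ignored; the system labels are hypothetical.

```python
from itertools import combinations


def ssr_double_failures(pairs):
    """Count double failures in an SSR layout and how many of them drop load.

    `pairs` lists the paired systems. Load is lost only when both members
    of the same pair are down at the same time.
    """
    systems = [system for pair in pairs for system in pair]
    all_doubles = list(combinations(systems, 2))
    losing = [d for d in all_doubles if any(set(d) == set(p) for p in pairs)]
    return len(losing), len(all_doubles)


losing, total = ssr_double_failures([("A1", "A2"), ("B1", "B2")])
print(f"SSR, two pairs: {losing} of {total} double failures drop load")

# In a "1 of 3" DSR group at design load, any two failures leave one system
# to carry what two are needed for, so every double failure drops load.
dsr_total = len(list(combinations(["S1", "S2", "S3"], 2)))
print(f"DSR, 1 of 3:    {dsr_total} of {dsr_total} double failures drop load")
```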

ISR systems use one “reserve” system as a backup for all of the others. The emergency inputs of all static switches connect to the reserve system, which normally operates with little or no load. The level of redundancy is the same as in a DSR system having the same number of UPS, and ISR systems are easier to expand and scale without affecting the theory of operation or the “look and feel” to the operating personnel. Load management is straightforward, and one primary system cannot be overloaded by failure of another. The step load on the reserve system is large when any primary system fails. However, the reserve system can be designed with higher capacity if step loading is a concern for the equipment being used.

In very large installations, more than one reserve system can be used. In that case, the entire load of each primary system is assigned to a single reserve system. The battery-free UPS is particularly well suited to this configuration because it has excellent step-load and overload characteristics. Otherwise, system considerations for UPS application are the same as detailed in the analysis of the SSR system concept.
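A simple check of the reserve sizing follows. It assumes only one primary system fails at a time, so the worst step load a reserve sees is the largest load among the primaries assigned to it. The system names, loads and the 1,000-kVA reserve rating are hypothetical.

```python
def worst_case_reserve_step(primary_loads_kva, reserve_of):
    """Largest step load each reserve system could see from one primary failure."""
    worst = {}
    for primary, load in primary_loads_kva.items():
        reserve = reserve_of[primary]
        worst[reserve] = max(worst.get(reserve, 0.0), load)
    return worst


# Hypothetical installation: four primaries, two reserve systems.
primary_loads_kva = {"P1": 800.0, "P2": 750.0, "P3": 900.0, "P4": 600.0}
reserve_of = {"P1": "R1", "P2": "R1", "P3": "R2", "P4": "R2"}
RESERVE_CAPACITY_KVA = 1_000.0

for reserve, step in sorted(worst_case_reserve_step(primary_loads_kva, reserve_of).items()):
    status = "ok" if step <= RESERVE_CAPACITY_KVA else "undersized"
    print(f"{reserve}: worst-case step load {step:.0f} kVA ({status})")
```

Because each primary's whole load lands on one reserve, the check is a per-primary comparison rather than the group-wide balancing a DSR system needs.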

From Pure Power, Spring 2003

Mean Time Between Failure

In a single-path system, the probability of failure of every point in the system adds algebraically. For 10 items in series, the probability of failure (λ) becomes:

$$\lambda_{\text{total}} = \lambda_1 + \lambda_2 + \cdots + \lambda_{10}$$

If two power paths, each with the same probability of failure λ, independently supply the load, then, where s is the mean time to restore either path, the probability of power interruption to the load becomes:

$$\lambda_{\text{total}} = 2 \times \lambda^{2} \times s$$

If these power paths are of poor reliability, each with a probability of failing once every thousand hours (mean time between failure (MTBF) = 1,000 hours; λ = 1×10⁻³ per hour) and a mean time to restore of one hour, then λ total becomes 2×10⁻⁶ per hour and the MTBF becomes 500,000 hours.

UPS Power Systems Comparison

Three frequently employed high-reliability UPS system configurations are:

System-plus-system redundancy (SSR)

Distributed-system redundancy (DSR)

Isolated-system redundancy (ISR)

All of these design strategies provide redundancy for the greatest possible portion of the power system. SSR systems have the highest level of redundancy but also require the most equipment and space, have the highest cost, and demonstrate the lowest operating efficiency.

In a DSR system, efficiency is improved because the individual systems operate at higher load levels. Also, when a system fails, the step loading of the remaining systems is less. On the negative side, a second system failure in a DSR system usually causes load loss.

For ISR systems, which use one “reserve” system as a backup for all of the others, the level of redundancy is the same as in a DSR system having the same number of UPS, and ISR systems are easier to expand and scale.