Beyond Mission Critical - to Hypercritical
Six nines of reliability may no longer satisfy today's mission-critical facilities
By David Vandyne, P.E., President IDC Engineering Lima, Ohio -- Consulting-Specifying Engineer, 3/1/2003
It was once thought that electrical reliability had reached its peak at "six nines." And for many facilities, it is still the standard. As a measure of electrical system availability, six nines means that the power supply is available 99.9999% of the time (see "Average System Availability Index" below). This level of power reliability often came at the substantial cost of an uninterruptible power supply (UPS), as well as a backup generator.
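To put the nines in perspective, an availability figure can be converted to expected downtime per year. The short Python sketch below does that arithmetic; the function names are illustrative and a non-leap year of 8,760 hours is assumed. At six nines the result is roughly 31.5 seconds of unavailability per year.

```python
HOURS_PER_YEAR = 8760  # assumption: non-leap year


def nines_to_availability(nines: int) -> float:
    """Availability fraction for a given number of nines (6 -> 0.999999)."""
    return 1.0 - 10.0 ** (-nines)


def annual_downtime_seconds(availability: float) -> float:
    """Expected seconds per year that power is unavailable."""
    return (1.0 - availability) * HOURS_PER_YEAR * 3600


for n in range(3, 7):
    a = nines_to_availability(n)
    print(f"{n} nines ({a:.4%}): {annual_downtime_seconds(a):,.1f} s/year")
```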
At today's modern plant, however, six nines falls woefully short of the need. In fact, many plant maintenance and operations professionals acknowledge that when it comes to reliability, there just aren't enough nines. They want 100% reliability.
In the real world, perfect reliability may be fiction, but properly evaluated, the appropriate level of reliability can be justified and applied based on hard analysis and prudent application of currently available hardware. The key to the evaluation is knowledge of the elements that control reliability and of the actual costs of outages. Ironically, reliability seems to be on the decline at a time when the need for reliability has never been greater.
Growing Need for Nines
In the early seventies, a plant electrical outage was considered a mere nuisance; hence, the expression "nuisance trip" came into use. A circuit breaker tripped off when it wasn't supposed to, and it was considered a "nuisance." In some instances, this may still be the case. In others, the result can be no less than economic catastrophe.
Consider the loss of a well-placed circuit breaker at a bank's central computer center. The inadvertent trip of a single circuit breaker could shut down the data center for just a few minutes and cost the bank billions (yes, billions with a "b"). The cost in penalties, database corruption and halted worldwide trading can be massive.
And that kind of profound loss is not just limited to the banking industry. Manufacturing has seen the proliferation of just-in-time delivery of components. Even though it is often the bane of facility engineers, JIT delivery offers real savings and is a competitive key for many manufacturers. To keep the pipeline full of quality parts, logistics programs have vertically integrated JIT and often track production into the supplier plants. There is a need to track every part in the system at all times, and provide feedback to suppliers for quality control and production rate variation. To further complicate matters, with the globalization of manufacturers and suppliers, these logistic systems are on-line for all time zones and holiday schedules.
A computer "crash" during production could cause extensive corruption to the system database, to the point where manual reconstruction of the data would be necessary. The result could be the simultaneous halt of all production lines for manufacturers and their suppliers until the database can be rebuilt and put back on line. The cost can, and sometimes does, run into the tens of millions for a single event. Fairly recent work by the Electric Power Research Institute (EPRI) and others generally suggests power events cost far below this level. It's important to remember that these values are industry averages and may not represent a specific facility's situation.
Compounding matters is the fact that plant maintenance personnel, who were once able to perform preventative maintenance on the systems that supply these computer systems, can no longer get the planned outages necessary to perform some of the more basic functions. Certainly, while technologies such as infrared scanning have made identifying problems easier, repairs still often require outages or at least a risk of an outage. Neither of these options is acceptable for many of today's facilities. Only five years ago, one could still believe that everyone could take an outage, but even risking the slightest interruption today is unthinkable.
So exactly what level of reliability is necessary, and how do we achieve it? If certain portions of operations can no longer stand outages, what will the situation be five years from now? Most important, what can be put in place now while facility managers have the chance to facilitate reasonable maintenance?
Assessing Reliability
The starting point for any engineering approach can be found in the Institute of Electrical and Electronics Engineers (IEEE) Gold Book, which is IEEE Standard 493-1990. The premise of the Gold Book is to establish the failures per 100 units of each major type of electrical distribution equipment, along with the time to repair. Based on the number of each component and its configuration, a statistical number of hours of outage can be calculated for a given facility, and for each electrical location within that facility. Then, using a cost for such an outage, a cost per year for outages can be calculated.
Once the model is built in this fashion, it can be modified and the annual cost of outage calculated. In this manner, the reduction of outage cost can be compared to the cost of the improvements.
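The model described above can be sketched in a few lines of Python. This is a minimal illustration, not the Gold Book procedure itself: it assumes a simple series arrangement in which any component failure causes an outage, and every name and number in it (component list, failure rates, repair times, cost per hour) is a hypothetical placeholder rather than a published value.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    failures_per_unit_year: float  # failure rate for one unit
    repair_hours: float            # average time to restore service
    count: int = 1                 # number of such units in series


def annual_outage_hours(components):
    """Expected outage hours per year for components in series."""
    return sum(c.count * c.failures_per_unit_year * c.repair_hours
               for c in components)


def annual_outage_cost(components, cost_per_outage_hour):
    """Expected outage cost per year."""
    return annual_outage_hours(components) * cost_per_outage_hour


def net_annual_benefit(baseline, improved, cost_per_outage_hour,
                       annual_cost_of_improvement):
    """Outage-cost reduction minus the yearly cost of the improvement."""
    reduction = (annual_outage_cost(baseline, cost_per_outage_hour)
                 - annual_outage_cost(improved, cost_per_outage_hour))
    return reduction - annual_cost_of_improvement


# Hypothetical example values, for illustration only.
baseline = [Component("utility service", 0.06, 2.0),
            Component("breaker", 0.01, 8.0, count=3)]
improved = [Component("utility service", 0.06, 2.0),
            Component("breaker", 0.003, 8.0, count=3)]
print(net_annual_benefit(baseline, improved, 50_000.0, 5_000.0))
```

A positive result suggests the improvement pays for itself under the stated assumptions; a negative one suggests it does not.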
One of the most significant elements in the model is the reliability of the utility service. IEEE Std 493-1990 breaks this out into several variables, including delivery voltage and length of outage. The shorter the outage, the more likely it is to occur.
For example, for a medium-voltage supply (between 15 kV and 35 kV) with multiple feeds, there is an average of 6.4 failures per hundred services per year that take at least 115 minutes to correct. For short-term outages of less than four minutes, there are around 50 failures per hundred. In other words, there is roughly a 6.4% probability of a long-term outage and a 50% probability of a short-duration outage in any year.
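Reading a failure rate directly as a yearly probability is a convenient approximation. As a rough check, the sketch below assumes outages arrive independently as a Poisson process (an assumption made here for illustration, not one stated in the article) and computes the chance of at least one event in a year. Under that assumption the long-term figure stays close to the rate, at about 6.2%, while the short-term figure drops to about 39%, because a rate of 0.5 per year also counts years with more than one event.

```python
import math

# Rates quoted in the article, expressed per service per year.
LONG_OUTAGE_RATE = 6.4 / 100   # outages of at least 115 minutes
SHORT_OUTAGE_RATE = 50 / 100   # outages of less than four minutes


def prob_at_least_one(events_per_year: float) -> float:
    """P(one or more events in a year), assuming Poisson arrivals."""
    return 1.0 - math.exp(-events_per_year)


print(f"long-term:  {prob_at_least_one(LONG_OUTAGE_RATE):.1%}")
print(f"short-term: {prob_at_least_one(SHORT_OUTAGE_RATE):.1%}")
```

For cost calculations, the expected number of events per year (the rate itself) is usually the more useful quantity, since each event carries its own cost.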
These two figures alone (the improvement gained by no longer being susceptible to short-term utility outages, and the cost of an outage) can be the cornerstone of the justification for a basic UPS system. For a completely accurate assessment, an evaluation should include an analysis of the electrical gear between the utility supply and the critical system, along with the mean time between failures (MTBF) of the UPS. The IEEE Gold Book offers a variety of estimates.
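A first-pass version of that UPS justification might look like the following Python sketch. It uses the short-term outage rate quoted above (about 0.5 events per service per year) and deliberately ignores the refinements just mentioned, namely the intervening electrical gear and the UPS's own MTBF. The cost per event and the annual UPS cost are hypothetical placeholders.

```python
SHORT_OUTAGE_RATE = 0.50  # short utility outages per year (from the article)


def avoided_cost_per_year(events_per_year: float,
                          cost_per_event: float) -> float:
    """Expected yearly cost of the outages the UPS would ride through."""
    return events_per_year * cost_per_event


def ups_net_benefit(cost_per_event: float, ups_annual_cost: float) -> float:
    """Avoided outage cost minus the UPS carrying and maintenance cost."""
    return (avoided_cost_per_year(SHORT_OUTAGE_RATE, cost_per_event)
            - ups_annual_cost)


# Hypothetical figures: $200,000 per short outage, $40,000/year for the UPS.
print(ups_net_benefit(cost_per_event=200_000.0, ups_annual_cost=40_000.0))
```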
The hypercritical system usually demands further treatment beyond the simple UPS. Usually, two UPS units in parallel are necessary to provide adequate system redundancy and to overcome the MTBF of a single UPS. But the typical UPS is not designed for long-term outages. Furthermore, support of data centers often includes air conditioning, lights and telecommunication, which mean high load levels that will call for a standby generator for emergency power. Again, according to the Gold Book, there are reliability values available.
Once started and brought under load, a generator will typically demonstrate high reliability, provided it has been reasonably maintained. For a standby generator, the critical value is the probability that the unit will not start, or will not pick up the load. The value for this application is 1.35 failures per hundred starts. Because this only happens when the primary supply fails, the improvement can be simply derived by multiplying the source outage cost (for long term outages) by 0.0135 and subtracting the result from the original cost. This annual savings is the value of a single standby generator.
Comparing this savings to the generator's annual carrying cost plus maintenance gives an accurate assessment of the investment. For the hypercritical facility, multiple generators can be justified by successively multiplying the annual outage cost by 0.0135, until the improvement no longer justifies adding another generator.
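The successive multiplication can be sketched as a short loop. The Python below is an illustration under stated assumptions: each generator fails to start independently with probability 0.0135 (the figure quoted above), every generator has the same annual carrying-plus-maintenance cost, and the dollar amounts are hypothetical. The function names are likewise illustrative.

```python
START_FAILURE_PROB = 0.0135  # failures per start, as quoted in the article


def residual_outage_cost(base_cost: float, n_generators: int) -> float:
    """Annual long-term outage cost left after n standby generators."""
    return base_cost * START_FAILURE_PROB ** n_generators


def justified_generators(base_cost: float, annual_cost_per_generator: float,
                         max_units: int = 10) -> int:
    """Add generators while each one saves more than it costs per year."""
    n = 0
    while n < max_units:
        saving = (residual_outage_cost(base_cost, n)
                  - residual_outage_cost(base_cost, n + 1))
        if saving <= annual_cost_per_generator:
            break
        n += 1
    return n


# Hypothetical figures: $10M/year long-term outage exposure,
# $50,000/year to carry and maintain each generator.
print(justified_generators(10_000_000.0, 50_000.0))
```

Because each added unit removes only 98.65% of an already small residual, the savings shrink quickly, which is why only the largest outage exposures justify more than one or two generators.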
Once again, the calculated result should include the electrical system required to bring the generator on line: namely, the throwover switch. Here too, the effort to provide reliability can create pitfalls. A typical automatic throwover is little more than two molded-case switches with a common actuator. These are among the least reliable switches. According to the IEEE Gold Book, typical 1,200-amp drawout-type breakers, for example, have a failure rate of 0.0030 per unit year. But the molded-case type has a failure rate of 0.0096, roughly three times that of a drawout-type air circuit breaker.
Furthermore, unless a complex set of bypass switches is installed, such a system cannot be maintained without an outage. Even with bypass switches, reliability during maintenance is severely hampered. Fortunately, for a premium there are throwover switches available with integral bypass options that allow for proper maintenance of the throwover, without significantly degrading the reliability of the overall unit.
Taking Action
An assessment of each facility's power reliability needs is a unique analysis. It depends on a variety of factors, including base utility reliability, configuration of the plant electrical system and the nature of the critical loads themselves. In some cases, critical equipment can be "hardened" using UPS systems, or by employing latch-and-hold type contactors. But today many facilities are simply hypercritical: these plants cannot be shut down under any circumstances.
In any event, the means of assessing system reliability has changed little over the years. What has changed are the enormous costs of power outages. The rapidly expanding costs of power events have substantially changed the math for justifying backup systems, which may include multiple generators and utility supplies, and of course, UPS systems.
From Pure Power, Spring 2003