Beyond Mission Critical – to Hypercritical

By David Vandyne, P.E., President IDC Engineering Lima, Ohio March 1, 2003

It was once thought that electrical reliability had reached its peak at “six nines.” And for many facilities, it is still the standard. As a measure of electrical system availability, six nines means that the power supply is available 99.9999% of the time (see “Average System Availability Index” below). This level of power reliability often came at the substantial cost of an uninterruptible power supply (UPS), as well as a backup generator.

At today’s modern plant, however, six nines falls woefully short of the need. In fact, many plant maintenance and operations professionals acknowledge that when it comes to reliability, there just aren’t enough nines. They want 100% reliability.

In the real world, perfect reliability may be fiction, but properly evaluated, the appropriate level of reliability can be justified and applied based on hard analysis and prudent application of currently available hardware. The key to the evaluation is knowledge of the elements that control reliability and of the actual costs of outages. Ironically, reliability seems to be on the decline at a time when the need for reliability has never been greater.

Growing Need for Nines

In the early seventies, a plant electrical outage was considered a mere nuisance; hence, the expression “nuisance trip” came into use. A circuit breaker tripped off when it wasn’t supposed to, and it was considered a “nuisance.” In some instances, this may still be the case. In others, the result can be no less than economic catastrophe.

Consider the result of the loss of a well-placed circuit breaker at a bank’s central computer center. The inadvertent trip of a single circuit breaker could shut down the data center for just a few minutes, and cost the bank billions (yes, billions with a “b”). The combined cost of penalties, database corruption and halted worldwide trading can be enormous.

And that kind of profound loss is not just limited to the banking industry. Manufacturing has seen the proliferation of just-in-time delivery of components. Even though it is often the bane of facility engineers, JIT delivery offers real savings and is a competitive key for many manufacturers. To keep the pipeline full of quality parts, logistics programs have vertically integrated JIT and often track production into the supplier plants. There is a need to track every part in the system at all times, and provide feedback to suppliers for quality control and production rate variation. To further complicate matters, with the globalization of manufacturers and suppliers, these logistic systems are on-line for all time zones and holiday schedules.

A computer “crash” during production could cause extensive corruption to the system database, to the point where manual reconstruction of the data would be necessary. The result could be the simultaneous halt of all production lines for manufacturers and their suppliers until the database can be rebuilt and put back on line. The cost can, and sometimes does, run into the tens of millions for a single event. Fairly recent work by the Electric Power Research Institute (EPRI) and others generally suggests power events cost far below this level. It’s important to remember that these values are industry averages and may not represent a specific facility’s situation.

Compounding matters is the fact that plant maintenance personnel, who were once able to perform preventative maintenance on the systems that supply these computer systems, can no longer get the planned outages necessary to perform some of the more basic functions. Certainly, while technologies such as infrared scanning have made identifying problems easier, repairs still often require outages or at least a risk of an outage. Neither of these options is acceptable for many of today’s facilities. Only five years ago, one could still believe that everyone could take an outage, but even risking the slightest interruption today is unthinkable.

So exactly what level of reliability is necessary, and how do we achieve it? If certain portions of operations can no longer stand outages, what will the situation be five years from now? Most important, what can be put in place now while facility managers have the chance to facilitate reasonable maintenance?

Assessing Reliability

The starting point for any engineering approach can be found in the Institute of Electrical and Electronics Engineers (IEEE) Gold Book, which is IEEE Standard 493-1990. The premise of the Gold Book is to establish the failures per 100 units of each major type of electrical distribution equipment, along with the time to repair. Based on the number of each component and its configuration, a statistical number of hours of outage can be calculated for a given facility, and for each electrical location within that facility. Then, by assigning a cost to such an outage, a cost per year for outages can be calculated.

Once the model is built in this fashion, it can be modified and the annual cost of outage calculated. In this manner, the reduction of outage cost can be compared to the cost of the improvements.
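The calculation described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the components are treated as a simple series path, so expected outage hours are the sum of failure rate times repair time, and every failure rate, repair time and cost figure shown is hypothetical rather than taken from the Gold Book.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    failures_per_year: float  # failure rate per unit-year (hypothetical values below)
    repair_hours: float       # average downtime per failure, in hours
    count: int = 1            # identical units in the series path


def annual_outage_hours(components):
    """Expected outage hours per year for components in series."""
    return sum(c.count * c.failures_per_year * c.repair_hours for c in components)


def annual_outage_cost(components, cost_per_hour):
    """Expected yearly outage cost at a flat (assumed) cost per outage hour."""
    return annual_outage_hours(components) * cost_per_hour


# Hypothetical single feed to one critical load; numbers are placeholders only.
feed = [
    Component("utility service", failures_per_year=0.5, repair_hours=2.0),
    Component("transformer", failures_per_year=0.01, repair_hours=100.0),
    Component("breaker", failures_per_year=0.003, repair_hours=10.0, count=2),
]

hours = annual_outage_hours(feed)
cost = annual_outage_cost(feed, cost_per_hour=10_000.0)
print(f"Expected outage: {hours:.2f} h/yr, cost: ${cost:,.0f}/yr")
```

Rerunning the same functions on a modified component list gives the annual cost for a proposed improvement, and the difference between the two results is the figure to weigh against the cost of the upgrade.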

One of the most significant elements in the model is the reliability of the utility service. IEEE Std 493-1990 breaks this out into several variables, including delivery voltage and length of outage. The shorter the outage, the more likely it is to occur.

For example, for a medium-voltage supply (between 15 kV and 35 kV) with multiple feeds, there is an average of 6.4 failures per hundred services, per year, that take at least 115 minutes to correct. For short-term outages of less than four minutes, there are around 50 failures per hundred services per year. In other words, there is roughly a 6.4% probability of a long-term outage and a 50% probability of a short-duration outage in any year.

These two figures alone (the improvement inherent in no longer being susceptible to short-term outages from the utility, and the cost of an outage) can be the cornerstone in the justification of a basic UPS system. For a completely accurate assessment, an evaluation should include an analysis of the electrical gear between the utility supply and the critical load, along with the mean time between failures (MTBF) of the UPS. The IEEE Gold Book offers a variety of estimates.
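As a rough illustration of that justification, the short Python sketch below multiplies the short-duration outage rate quoted above (about 50 per hundred services per year) by a cost per event. The cost figures are hypothetical, and the sketch assumes the UPS rides through every short outage and never fails itself, which is exactly the simplification that the MTBF analysis is meant to correct.

```python
# Rate quoted in the text for medium-voltage, multiple-feed services.
SHORT_OUTAGES_PER_YEAR = 50 / 100  # outages shorter than four minutes


def expected_annual_cost(events_per_year, cost_per_event):
    """Expected yearly cost of a class of outage."""
    return events_per_year * cost_per_event


def ups_annual_benefit(cost_per_short_outage):
    """Yearly short-outage cost avoided, assuming an ideal UPS (an assumption)."""
    return expected_annual_cost(SHORT_OUTAGES_PER_YEAR, cost_per_short_outage)


def ups_is_justified(cost_per_short_outage, ups_annual_cost):
    """True when the avoided cost exceeds the yearly cost of owning the UPS."""
    return ups_annual_benefit(cost_per_short_outage) > ups_annual_cost


# Hypothetical figures: $200,000 per short outage, $60,000 per year to own the UPS.
benefit = ups_annual_benefit(200_000.0)
print(f"Avoided cost: ${benefit:,.0f}/yr, justified: {ups_is_justified(200_000.0, 60_000.0)}")
```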

The hypercritical system usually demands further treatment beyond the simple UPS. Usually, two UPS units in parallel are necessary to provide adequate system redundancy and to overcome the MTBF of a single UPS. But the typical UPS is not designed for long-term outages. Furthermore, support of data centers often includes air conditioning, lights and telecommunication, which means high load levels that will call for a standby generator for emergency power. Again, according to the Gold Book, there are reliability values available.

Once started and brought under load, a generator will typically demonstrate high reliability, provided it has been reasonably maintained. For a standby generator, the critical value is the probability that the unit will not start, or will not pick up the load. The value for this application is 1.35 failures per hundred starts. Because this only happens when the primary supply fails, the improvement can be simply derived by multiplying the source outage cost (for long term outages) by 0.0135 and subtracting the result from the original cost. This annual savings is the value of a single standby generator.

Comparing this savings to the generator’s annual carrying cost plus maintenance gives an accurate assessment of the investment. For the hypercritical facility, multiple generators can be justified by successively multiplying the annual outage cost by 0.0135, until the improvement no longer justifies adding another generator.
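The successive multiplication can be sketched as a short loop. The 0.0135 failure-to-start probability comes from the text; the cost figures, the function names and the assumption that each generator fails to start independently of the others are illustrative only.

```python
START_FAILURE_PROBABILITY = 0.0135  # 1.35 failures per hundred starts (from the text)


def residual_outage_cost(base_annual_cost, n_generators):
    """Long-outage cost remaining after n generators, assuming independent starts."""
    return base_annual_cost * START_FAILURE_PROBABILITY ** n_generators


def generators_justified(base_annual_cost, annual_cost_per_generator, max_units=10):
    """Count generators whose yearly savings exceed their yearly cost of ownership."""
    remaining = base_annual_cost
    units = 0
    while units < max_units:
        # Savings from one more unit: the share of remaining cost it eliminates.
        savings = remaining * (1.0 - START_FAILURE_PROBABILITY)
        if savings <= annual_cost_per_generator:
            break
        remaining *= START_FAILURE_PROBABILITY
        units += 1
    return units


# Hypothetical: $10 million per year in long-outage cost, $50,000 per year per generator.
n = generators_justified(10_000_000.0, 50_000.0)
print(n, "generator(s);", f"residual cost ${residual_outage_cost(10_000_000.0, n):,.2f}/yr")
```

With these placeholder figures the second generator still pays for itself and the third does not. A real evaluation would also fold in the failure rate of the transfer equipment, which this sketch ignores.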

Once again, the calculated result should include the electrical system required to bring the generator on line: namely, the throwover switch. Here, too, the effort to provide reliability can create pitfalls. A typical automatic throwover is little more than two molded-case switches with a common actuator. These are among the least reliable switches. According to the IEEE Gold Book, typical 1,200-amp drawout-type breakers, for example, have a failure rate of 0.0030 per unit-year. But the molded-case type has a failure rate of 0.0096, roughly three times that of a drawout-type air circuit breaker.

Furthermore, unless a complex set of bypass switches is installed, such a system cannot be maintained without an outage. Even with bypass switches, reliability during maintenance is severely hampered. Fortunately, for a premium there are throwover switches available with integral bypass options that allow for proper maintenance of the throwover, without significantly degrading the reliability of the overall unit.

Taking Action

An assessment of each facility’s power reliability needs is a unique analysis. It depends on a variety of factors, including base utility reliability, configuration of the plant electrical system and the nature of the critical loads themselves. In some cases, critical equipment can be “hardened” using UPS systems, or by employing latch-and-hold type contactors. But today many facilities are simply hypercritical: these plants cannot be shut down under any circumstances.

In any event, the means of assessing system reliability has changed little over the years. What has changed are the enormous costs of power outages. The rapidly expanding costs of power events have substantially changed the math for justifying backup systems, which may include multiple generators and utility supplies, and of course, UPS systems.

From Pure Power, Spring 2003

Average System Availability Index

The average system availability index (ASAI) is a measure of the amount of time that the system is available to provide power. A reliability of 99.9999%, or six nines, means that of the 31,536,000 seconds in a year, the system was available for about 31,535,968 of them, leaving 31.5 seconds unaccounted for. According to the ITIC curve, an outage of 1/50th of a second or more (20 ms) is all that’s required to crash an electronic device. By these measures, nine 9’s are required to yield an effective 100% reliability.
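The arithmetic behind these figures is easy to reproduce. The following Python sketch converts a number of nines into allowable downtime per year; the function names are illustrative, and a 365-day year is assumed, matching the 31,536,000 seconds used above.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 seconds in a 365-day year


def downtime_seconds_per_year(nines):
    """Allowable downtime per year for an availability of the given number of nines."""
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


def asai(available_seconds):
    """Average system availability index as a fraction of the year."""
    return available_seconds / SECONDS_PER_YEAR


for nines in (6, 9):
    seconds = downtime_seconds_per_year(nines)
    print(f"{nines} nines: {seconds:.4f} s of downtime per year")
```

Six nines works out to roughly 31.5 seconds per year, while nine nines brings the figure down to roughly 31.5 milliseconds, which is the same order of magnitude as the 20 ms threshold cited from the ITIC curve.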

Utility Reliability

Once relegated to the back pages of annual submittals to the regulatory agencies, utility reliability measures are gaining new interest as critical information. Unfortunately, many of the commonly used statistical measures should be very carefully applied.

Most measures, such as SAIFI (system average interruption frequency index) and CAIDI (customer average interruption duration index), often assume a total outage that requires intervention to correct. Brownouts and momentary outages, those that are well within the data loss region of the ITIC or CBEMA curves, typically aren’t included. Where data on momentary outages is available, the indices are sometimes calculated both with and without them.

Their exclusion is due to the inability to economically monitor the entirety of the utility system for such events and calculate an inclusive measure. The utility customers, however, are capable of evaluating these measures themselves, using SARFI (system average RMS variation frequency index) or MAIFI (momentary average interruption frequency index).

Typically SARFI is measured as SARFIXX, where “XX” is the level of undervoltage. For example, SARFI90 refers to the number of voltage sags below 90% of nominal in a given period, such as a typical month. The various levels of SARFI can be graphed and examined, as shown on the accompanying chart. Notice that no events plotted on the chart fall below 70%. These 70% to 80% RMS voltage events wouldn’t reduce the average system availability indices, but would likely cause a significant interruption in plant operations. With this type of detail in hand, additional sources can be evaluated for the hypercritical facility.
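For a single monitored site, a SARFI-style count can be tallied directly from a log of sag events. The sketch below is a simplified, customer-side reading of the index (a plain count of events whose minimum RMS voltage falls below the threshold), and the event data is invented for illustration; it is not the data behind the chart.

```python
def sarfi(event_min_voltages_pct, threshold_pct):
    """Count events whose minimum RMS voltage (percent of nominal) is below the threshold."""
    return sum(1 for v in event_min_voltages_pct if v < threshold_pct)


# Hypothetical one-month log of sag events at one site, in percent of nominal voltage.
month_events = [88.0, 85.0, 79.0, 74.0, 72.0, 86.0]

counts = {level: sarfi(month_events, level) for level in (90, 80, 70)}
for level, count in counts.items():
    print(f"SARFI{level}: {count} event(s)")
```

In this made-up log, several events fall between 70% and 80% of nominal and none fall below 70%, the same pattern the text describes: no effect on the availability indices, but a likely interruption to plant operations.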

With electric deregulation—or where there is poor regulation—much of what contributes to good reliability indices for the utility is under pressure for elimination. This includes good maintenance practices such as tree trimming, deteriorated pole replacement programs, substation maintenance and periodic line re-sagging.

It is up to the plant professional to assess the standard reliability index, and make a determination as to what makes the most sense when evaluating reliability improvement measures.

Questions to Ask

What are the costs of outages, both long and short term?

What equipment is most vulnerable?

Are there secondary loads, such as communications rooms, that support the critical loads and are therefore just as critical?

Is the facility’s utility supply more or less reliable than industry standards?

Is the generation/UPS system adding reliability vis-à-vis the ancillary equipment, such as throwover switches, or actually reducing it?

Are maintenance personnel painting themselves into a corner by not providing adequate bypass options?

Do bypass options actually reduce reliability? (Consider failure rates of switching devices.)

Are any contingencies in the supply path being overlooked (e.g., a single stepdown transformer on the output of a UPS)?

Is the facility prepared to maintain the generation system sufficiently to guarantee its reliability?

Is there capacity for future expansion of the critical loads?

Can one easily add more generation as reliability needs increase?

Does one need a four-wire or three-wire throwover switch? A generator is a separately derived source, whose neutral should be grounded in some fashion; the fourth pole of the throwover switches the neutral in with the phase conductors to prevent ground loops.

Has there been adequate review of the grounding on the UPS outputs? Inadvertent neutral-to-earth bonds have been known to crash entire auto plants by tripping the UPS off line on ground fault.