Reliability: A Critical Mission

By Chuck Ross, Contributing Writer February 1, 2004

The actual cause of last August’s unprecedented power outage remains unknown. A hapless squirrel, a fallen tree branch or a worn-out insulator—all of these have been suggested as possible initiators in an event that cascaded to include much of the northeastern United States and eastern Canada. Even if an exact cause can’t be pinpointed, there are some obvious lessons to be learned from the incident. In the long term, more must be done to protect transmission system reliability. But more immediately, the great blackout of 2003 proved just how important backup and redundancy planning are, both to individual businesses and to the nation as a whole.

Of course, this isn’t a new lesson. The Sept. 11, 2001, terrorist attacks had already spurred many organizations to look at new ways to protect their business operations. And an April 2003 report, authored by representatives from several U.S. government agencies in the wake of the attacks, underscored how important maintaining financial operations is to overall national security. This growing awareness, especially among financial services companies, is resulting in more business for engineers who design mission-critical facilities.

Interest rising

“We’re seeing the dam gates opening, especially in the financial marketplace,” says Mark Welte, CPE, principal with New York-based EYP Mission Critical Facilities. “All the big banks seem to be active. Part of it is the latent demand that’s been put off.”

Welte notes several major drivers behind this increased interest, including growing dependence on information technology, larger cooling demands caused by higher watts-per-sq.-ft. loads and a push toward ever-higher reliability targets. Another factor boosting financial institutions’ interest is a white paper outlining strategies for ensuring recovery and resumption of financial systems following a catastrophic event (see “Feds on a Critical Mission,” below). Co-authored by the Federal Reserve, the Securities and Exchange Commission and the U.S. Treasury Dept., the study emphasizes the crucial role reliable financial data plays in overall national security.

This reinvigorated interest in data center protection is proving beneficial to engineers who specialize in designing mission-critical facilities, a sector of the engineering community hit hard by the dot-com collapse. Speculative, standalone centers providing outsourced server hosting were a growth industry in the late 1990s, but demand never lived up to their developers’ hopes. Proponents predicted that broadband Internet capabilities would create a need for such facilities, which would guarantee uninterrupted service while minimizing in-house maintenance costs, but that prediction hasn’t yet come true.

“It seems that [most companies] can deal with their own in-house resources,” says Martin Konikoff, P.E., a partner at New York City-based engineering firm Robert Derector Assocs., describing the approach most businesses have taken to e-commerce initiatives. However, his firm is one of many seeing increased interest from the financial-services industry, much of it driven by consolidation, in addressing business reliability concerns. Designers are helping newly merged entities bring together multiple data operations into a single, coherent plan.

“You automatically have a proliferation in information technology space, equipment and staff [with a merger],” says Tim Dueck, principal and director of mission-critical consulting for Minneapolis-based Mazzetti & Assocs. “When you do consolidation, you [move] five or 10 facilities into two or three.”

EYP Mission Critical Facilities is currently working on one such project for Bank One, following an acquisition that left the group with some 30 data centers across the organization. The new plan calls for consolidating these functions into three facilities. The company’s announced merger with J.P. Morgan Chase could mean even more rethinking.

“These three data centers for Bank One were in reaction to their consolidation plans at that time,” says Welte. “If Chase has a bunch of facilities, they’re probably facing the same kind of dilemma that Bank One was three to four years ago.”

Wider-reaching market

Financial-services groups might be leading the effort for increased reliability and availability, but engineers in the field are noting a broad interest across corporate America. For example, Konikoff’s firm is seeing law firms pay more attention to the business-continuity advantages that high-reliability design can offer.

“The general approach had been to provide an opportunity for their server rooms to perform an orderly shutdown,” he says, describing how most law firms have approached reliability in the past. Now, however, Konikoff says, these firms are looking at generator backup for server-room operations, to allow at least remote access to company data.

“In effect,” he says, “the legal community can have a little bit of business continuity, so long as their server room has backup.”

Other engineers are also seeing a broader interest in protecting ongoing operations, with companies from a range of industries taking a strategic look at overall needs. This can mean looking at new ways to help ensure that entire locations remain functional, or targeting specific business-critical servers for 24/7 protection and planning for orderly shutdown of non-essential units.

“Since the blackout, lots of people are talking about mission critical,” says Guy Despatis, director of engineering for HOK’s San Francisco office. “They’re not only concerned now about the data center; they’re concerned with keeping their business going. It doesn’t mean they’re all going to go out and buy UPS systems, but they’re at least talking about it.”

And, say others, the inventory of previously empty speculative space is drying up just as an improving economy is forcing companies to look at their own aging data facilities. This combination is raising the opportunity for new construction.

“We’re starting to see another round of building going on for corporate data centers,” Mazzetti’s Dueck says. “A lot of companies rode out the big boom—they’ve got older facilities. There’s been a clampdown on spending, and that’s [now] starting to break loose.”

Education a must

Despite their increased interest in protecting business-critical operations, clients still often require education on the complications and compromises these projects can involve. Servers have gotten smaller and less sensitive to environmental conditions, and many models now allow dual power supplies to make UPS connections easier. Such improvements can mislead clients into underestimating the full range of systems that a resilient data center requires.

“It’s a double-edged sword,” says Konikoff. “Yes, the equipment is getting smaller. Yes, the space to be protected is getting smaller. Now you have to bring a large volume of air distribution to a fairly small area.”

Others support the notion that electrical equipment improvements have had a lulling effect on client budget expectations. “I think there is some tendency to think about electrical first,” says Dave Troup, P.E., director of mechanical engineering at the San Francisco HOK office. “One example: They think about a UPS, but they don’t think about cooling the UPS room.”

And, as HOK’s Despatis notes, cooling requirements are on the rise, as server size has decreased. Smaller servers mean more servers in today’s facilities, boosting power densities and heat-rejection needs.

“I remember when it was 20 watts per sq. ft.,” he says. “But we’ve had labs [recently] that are up to 80 watts per sq. ft.” With such loads, he adds, “you need to bring the air right to the equipment.”
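To put those densities in cooling terms, here is a rough sketch that converts watts per sq. ft. into tons of refrigeration. It assumes that essentially all electrical load ends up as heat, and it uses a made-up 10,000-sq.-ft. room; neither the floor area nor the function name comes from the engineers quoted here.

```python
# One ton of refrigeration is 12,000 BTU/h, or roughly 3,516.85 watts.
WATTS_PER_TON = 3516.85


def cooling_tons(area_sq_ft: float, watts_per_sq_ft: float) -> float:
    """Estimate cooling load in tons, assuming all electrical load becomes heat."""
    heat_watts = area_sq_ft * watts_per_sq_ft
    return heat_watts / WATTS_PER_TON


if __name__ == "__main__":
    area = 10_000  # hypothetical room size in sq. ft. (not from the article)
    for density in (20, 80):  # the two densities Despatis mentions
        tons = cooling_tons(area, density)
        print(f"{density} W/sq. ft. over {area:,} sq. ft. -> about {tons:.0f} tons")
```

Because the relationship is linear, quadrupling the density quadruples the cooling load for the same floor area, which is why air distribution becomes the harder problem as rooms shrink.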

So how does the systems designer help clients balance needs and costs? Although clients may come to the table with a goal of reaching “five-nines” or “six-nines” availability, or of creating a Tier IV design, many engineers prefer to help their clients to first take a step back and focus on the project’s specific needs and budgets.
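For readers unfamiliar with the “nines” shorthand, the arithmetic is simple: each additional nine cuts allowable downtime by a factor of 10. The sketch below is illustrative only; it assumes a 365-day year and treats availability as a plain fraction of total time, which is a simplification of how service agreements are often written.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def annual_downtime_minutes(nines: int) -> float:
    """Allowable downtime per year for an availability of `nines` nines.

    For example, nines=5 means 99.999% availability.
    """
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        minutes = annual_downtime_minutes(n)
        print(f"{n} nines: {minutes:.3f} minutes per year ({minutes * 60:.1f} seconds)")
```

Five-nines works out to a little over five minutes of downtime a year, and six-nines to roughly half a minute, which helps explain why engineers steer the conversation toward what a given project can actually support.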

Troup and Despatis at HOK take a system-by-system approach to analyzing design alternatives. They recognize that any component is susceptible to failure, and they outline the options available if failure occurs. They develop matrices of systems, their potential for failure, estimated downtime in case of failure and the cost of varying levels of redundancy. In one facility dependent on chilled-water cooling, the chillers were considered critical. Following chiller dependencies backward, engineers also identified domestic water supplies, pumps and cooling-tower make-up water, among other elements, as crucial to ongoing operations.

“It didn’t occur to us to look at it from a ‘nines’ point of view,” Troup says. “We looked at the weak links and how to address the weak links.”
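A matrix of this kind can be sketched in a few lines. The version below is a hypothetical illustration, not HOK’s actual tool: the system names echo the chilled-water example, but every failure rate, outage duration and cost is a placeholder chosen only to show the ranking step.

```python
from dataclasses import dataclass


@dataclass
class SystemOption:
    system: str
    failures_per_year: float       # placeholder estimate
    hours_down_per_failure: float  # placeholder estimate
    redundancy_cost: float         # placeholder cost of a redundant unit, dollars

    @property
    def expected_hours_down(self) -> float:
        """Expected annual downtime if the system has no redundancy."""
        return self.failures_per_year * self.hours_down_per_failure


def rank_weak_links(options):
    """Return systems sorted by expected annual downtime, worst first."""
    return sorted(options, key=lambda o: o.expected_hours_down, reverse=True)


# All numbers below are invented for illustration.
MATRIX = [
    SystemOption("chiller", 0.10, 24.0, 250_000),
    SystemOption("chilled-water pump", 0.20, 8.0, 30_000),
    SystemOption("cooling-tower make-up water", 0.05, 12.0, 15_000),
    SystemOption("domestic water supply", 0.02, 48.0, 40_000),
]

if __name__ == "__main__":
    for opt in rank_weak_links(MATRIX):
        print(f"{opt.system:30s} {opt.expected_hours_down:5.2f} h/yr "
              f"redundancy ${opt.redundancy_cost:,.0f}")
```

Sorting by expected downtime puts the weakest links at the top, where the cost of redundancy can be weighed against the hours it would save.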

A holistic approach

Dueck agrees on the need to focus on project specifics instead of abstract definitions when it comes to mission-critical design.

“Fundamentally,” he says, “the levels of reliability are meaningless, unless you have a way of connecting a level of reliability to a particular design or topology.”

Dueck has developed his own, admittedly broad, tiering approach to help clients understand potential cost impacts of varying levels of protection. At the most basic level, non-redundant UPS systems protect connected equipment, but without more sophisticated backup equipment and procedures, the probability of losing processing functionality is high. Maintenance in this design will most likely require all connected loads to be powered down, and its cost, he suggests, will run approximately $100 per sq. ft.

The highest level of protection creates the lowest probability of system downtime possible within a given schedule or budget—a moving target, Dueck says. Such designs focus a great deal of attention on limiting the potential for human error and can run from $600 to $1,000 per sq. ft. The middle level of this approach is defined as essentially everything in between, with costs ranging between $150 and $450 per sq. ft.
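Dueck’s per-sq.-ft. figures lend themselves to a quick budget range. The sketch below simply multiplies them by floor area; the tier labels and the 5,000-sq.-ft. example are assumptions made for illustration, and real projects will vary with the specifics he describes.

```python
# Dollars per sq. ft. as cited by Dueck; the tier labels are ours.
TIER_COST_PER_SQ_FT = {
    "basic": (100, 100),      # approximately $100
    "middle": (150, 450),
    "highest": (600, 1000),
}


def budget_range(area_sq_ft: float, tier: str) -> tuple:
    """Return a (low, high) cost estimate in dollars for a given area and tier."""
    low, high = TIER_COST_PER_SQ_FT[tier]
    return area_sq_ft * low, area_sq_ft * high


if __name__ == "__main__":
    area = 5_000  # hypothetical data center floor area, sq. ft.
    for tier in TIER_COST_PER_SQ_FT:
        low, high = budget_range(area, tier)
        print(f"{tier:8s} ${low:,.0f} to ${high:,.0f}")
```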

Notably missing from this hierarchy is a connection to specific design parameters, such as specific numbers of UPS units or cooling targets. Instead, Dueck recommends a more holistic approach based on project characteristics. For example, in city-center trading operations, where space is at a premium and any downtime is unacceptable, high-density racks featuring integral cooling and multiple layers of system redundancy may be the only option. However, in areas where real estate costs are lower, clients may choose to put more of their budget into the space itself, lowering both equipment density and cooling requirements.

Second future looks bright

The Internet may not have created the much-vaunted “new economy” yet, but it has boosted dependence on interconnected data networks. As individual businesses, and society as a whole, move forward, ensuring the reliability and availability of these networks will only become more important, experts say. “As the return on investment is improved, you’ll see more and more of this,” Dueck says. “Criticality of information technology is never going to go backward.”

Feds on a Critical Mission

A financial institution’s ability to recover and restore operation may be critical to more than just that company’s bottom line, according to a report released last year. Such maintainability, say co-authors from the Federal Reserve System, the U.S. Treasury Dept. and the Securities and Exchange Commission, is crucial to national security.

“The Interagency Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System” (available at

The report addresses means for ensuring rapid recovery and resumption of critical operations following a wide-scale disruption, which might also include the loss or inaccessibility of staff in at least one major operating location. It emphasizes that plans should incorporate either ongoing use or regular testing to ensure that continuity arrangements are effective and compatible.

The report was originally released in the fall of 2002 for comment. The final version is less prescriptive, based on industry reaction, but strongly suggests that core clearing and settlement organizations in critical financial markets aim to recover and resume operations within the business day in which a disruption occurs. Firms playing a large role in critical markets should aim to recover and resume operations as soon as possible after clearing and settlement groups are back up and running, also within the same business day.