Understanding changing data center metrics
There is a better way to assess data center behavior: novel multidimensional metrics have been incorporated into data center standards and best practices.
- Understand the importance of multidimensional data center metrics, comprising performance and risk.
- Recognize how multidimensional metrics enable a holistic understanding of data centers.
- Identify the challenges of measuring data center performance and risks.
Data centers comprise information technology equipment (ITE) and supporting infrastructure such as power, cooling, telecommunications, fire systems, security and automation. A data center’s main task is to process and store information securely and to provide users uninterrupted access to it. These mission-critical facilities are very dynamic: equipment can be upgraded frequently, new equipment can be added, obsolete equipment may be removed, and old and new systems may be in use simultaneously.
Our increasing reliance on data centers has created an urgent need to adequately monitor these energy-intensive facilities. Data centers are responsible for about 1% of total electricity consumption. The environmental impact of data centers varies depending on the energy sources used and the total heat generated. At the same time, the information technology sector has contributed to a reduction in carbon emissions in other sectors.
In the United States, it is estimated that for every kilowatt-hour consumed by the IT sector, 10 kilowatt-hours are saved in other sectors due to the increase in economic productivity and energy efficiency. The growing ubiquity of IT-driven technologies has revolutionized and optimized the relationship among efficiency, productivity and energy consumption across every sector of the economy.
Data center metrics
Metrics are measures of quantitative assessment that communicate important information and allow comparisons or tracking of performance, progress or other parameters over time. Through different metrics, data centers can be evaluated against established goals or against similar data centers. Variations or inconsistencies in measurements can produce a false result for a metric, which is why it is very important to standardize them. Most current metrics, standards and legislation on data centers focus on energy efficiency.
Existing metrics fail to incorporate important factors for a holistic understanding of data center behavior, including the different aspects of performance and the risks that may impact it. As a result, comparing data center scores to identify areas of improvement is not an easy task.
Furthermore, there is currently no widely used metric that examines performance and risk simultaneously. A data center may have high performance indicators yet also a high risk of failure. Risk indicators can serve as an early warning system so that mitigation strategies can be planned and actions taken.
Section 10.5, Metrics and Measurement, of BICSI 009-2019: Data Center Operations and Maintenance Best Practices incorporates the concept of multidimensional data center metrics, comprising performance and risk. Performance is assessed across four sub-dimensions: productivity, efficiency, sustainability and operations. The risks associated with each of those sub-dimensions, as well as external risks, should also be considered.
In addition, Section 6.7, Design by Performance, of ANSI/BICSI 002-2019: Data Center Design and Implementation Best Practices indicates that data center performance can be examined through various factors, including the previously mentioned sub-dimensions. It also notes that many metrics have been developed to measure these areas of performance and that the risks that may affect performance need to be addressed.
Figure 1 illustrates the concept of next-generation multidimensional data center metrics. A correlation between the different elements can exist; for the sake of the explanation, it is assumed that there is no correlation between different performance sub-dimensions.
Given some premises, a data center may be ideal at a certain point in time, but when conditions change that same data center may not be optimal. Following is a more detailed explanation of the metric.
Data center performance
Four sub-dimensions are used to assess data center performance: productivity, efficiency, sustainability and operations.
Productivity: Productivity gives a sense of work accomplished or “useful work,” which can be understood as the sum of weighted tasks carried out in a period of time. Examples of tasks are transactions, amount of information processed or units of production. The weight of each task must be allocated depending on its importance.
A normalization factor should be considered to allow the addition of different tasks. Key productivity indicators include the ratio of useful work accomplished to energy consumption, physical space and costs. Costs usually include capital expenditures and operating expenses.
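The weighted sum described above can be sketched in a few lines of Python. This is only an illustrative sketch: the Task fields, the example weights and the normalization factors are assumptions chosen for the example, not values taken from any standard.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    count: float        # units completed in the period (e.g., transactions)
    weight: float       # relative importance assigned by the operator (assumed)
    norm_factor: float  # converts the task's units to a common scale (assumed)


def useful_work(tasks):
    """Sum of weighted, normalized tasks carried out in the period."""
    return sum(t.weight * t.norm_factor * t.count for t in tasks)


def productivity_per_kwh(tasks, energy_kwh):
    """Ratio of useful work to the energy consumed in the same period."""
    if energy_kwh <= 0:
        raise ValueError("energy_kwh must be positive")
    return useful_work(tasks) / energy_kwh


# Hypothetical workload for one reporting period.
tasks = [
    Task("transactions", count=1_000_000, weight=0.7, norm_factor=1e-6),
    Task("gb_processed", count=500, weight=0.3, norm_factor=1e-3),
]
work = useful_work(tasks)                        # about 0.85 units of useful work
per_kwh = productivity_per_kwh(tasks, 1_000.0)   # useful work per kWh
```

The same pattern applies to the other productivity ratios: replace the energy denominator with physical space or cost.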
Downtime and quality of service must also be considered, as they affect productivity. A measurement within this category may calculate the impact of downtime on productivity, measured as the useful work that was not completed as well as other indirect tangible and intangible costs due to this failure. Quality of service measurements can include variables such as maximum or average waiting time, latency, scheduling and availability of resources.
Efficiency: Efficiency has been given substantial attention due to the high energy consumption of data centers. Many metrics have been proposed to measure efficiency. The most widely used efficiency metric is power usage effectiveness (PUE), which assesses site infrastructure efficiency through the ratio of the total energy used by the data center to the energy consumed by ITE. A PUE closer to 1 indicates that less energy is spent on supporting infrastructure.
There are additional examples of key efficiency indicators. ITE usage metrics (e.g., power, processing capacity, central processing unit, memory, storage, communication) promote efficient operation of IT resources. Physical space usage metrics promote efficient planning of physical space. Other key indicators can gauge how energy efficient ITE, power systems and environmental systems are.
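As a brief illustration, PUE is conventionally computed as total facility energy divided by ITE energy, and its reciprocal is known as data center infrastructure efficiency (DCiE). The function names and the sample readings below are assumptions made for this sketch.

```python
def pue(total_facility_kwh, ite_kwh):
    """Power usage effectiveness: total facility energy / ITE energy."""
    if ite_kwh <= 0:
        raise ValueError("ite_kwh must be positive")
    if total_facility_kwh < ite_kwh:
        raise ValueError("total facility energy cannot be less than ITE energy")
    return total_facility_kwh / ite_kwh


def dcie(total_facility_kwh, ite_kwh):
    """Data center infrastructure efficiency: the reciprocal of PUE."""
    return 1.0 / pue(total_facility_kwh, ite_kwh)


# Hypothetical annual meter readings.
site_pue = pue(total_facility_kwh=1500.0, ite_kwh=1000.0)    # 1.5
site_dcie = dcie(total_facility_kwh=1500.0, ite_kwh=1000.0)  # about 0.67
```

Both energy figures must cover the same measurement period for the ratio to be meaningful.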
Sustainability: Sustainability can be defined as development that addresses current needs without jeopardizing future generations’ capabilities to satisfy their own needs. Nowadays sustainability initiatives are gaining substantial attention. Companies such as Google, Microsoft, Facebook, Amazon and Apple are undertaking significant efforts to reduce greenhouse gas emissions, to become carbon neutral or negative and to at least match electricity consumption with renewable energy.
Examples of key sustainability indicators include the ratio of green energy sources to total energy, the carbon footprint and the water usage. In addition, an evaluation may be conducted on how environmentally friendly the related processes, materials and components are.
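Two of these indicators can be sketched as follows. The energy mix, the set of sources treated as green and the emission factors are all hypothetical values used only to show the arithmetic.

```python
def green_energy_ratio(energy_by_source_kwh, green_sources):
    """Share of total energy that comes from sources classified as green."""
    total = sum(energy_by_source_kwh.values())
    if total <= 0:
        raise ValueError("total energy must be positive")
    green = sum(kwh for source, kwh in energy_by_source_kwh.items()
                if source in green_sources)
    return green / total


def carbon_footprint_kg(energy_by_source_kwh, emission_factors_kg_per_kwh):
    """Energy-related emissions: energy per source times its emission factor."""
    return sum(kwh * emission_factors_kg_per_kwh[source]
               for source, kwh in energy_by_source_kwh.items())


# Hypothetical energy mix and emission factors (kg CO2e per kWh).
energy = {"grid": 600.0, "solar": 300.0, "wind": 100.0}
factors = {"grid": 0.4, "solar": 0.0, "wind": 0.0}

ratio = green_energy_ratio(energy, green_sources={"solar", "wind"})  # about 0.4
co2_kg = carbon_footprint_kg(energy, factors)                        # about 240 kg
```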
Operations: Key operations indicators gauge how well managed data centers are. This incorporates an analysis of the maturity level of operations and processes, including site infrastructure, IT equipment, maintenance, human resources training and security systems. Audits of systems and processes are necessary to collect the required data. This data should cover factors such as documentation, planning, human resources activities and training, status and quality of maintenance, service level agreements and security.
Risks for data centers
Data center performance cannot be completely evaluated if the risks that may impact it are not considered. End-to-end resource optimization must involve risk. Risk can be defined as uncertainties or potential events that, if materialized, could impact the performance of the data center.
For our purposes, the risk level is defined as the probability of occurrence (Po) of an event multiplied by its impact (I), normalized to the desired scale.
Risk = Po x I
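A minimal sketch of this calculation follows. The 0-to-5 impact scale and the 0-to-100 output scale are assumptions for the example; any scale can be substituted.

```python
def risk_level(probability, impact, max_impact=5.0, scale=100.0):
    """Risk = Po x I, normalized so that the maximum possible risk equals `scale`."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if not 0.0 <= impact <= max_impact:
        raise ValueError("impact must be between 0 and max_impact")
    return probability * (impact / max_impact) * scale


# Hypothetical event: 20% probability of occurrence, impact rated 4 out of 5.
level = risk_level(probability=0.2, impact=4.0)  # about 16 on a 0-100 scale
```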
The user may implement actions to achieve optimum performance and later adjust that performance to a tolerable level of risk, which may again move the key indicators away from their optimum. The acceptable level of risk should reflect the organization's risk appetite.
Risks associated with the sub-dimensions of performance, as well as external risks, which are usually independent of performance, are also measured through metrics. A common strategy to reduce the probability of failure is redundancy of resources, but it may affect performance and costs.
Risks associated with performance
Productivity risk: This indicator should consider present and past data for parameters that may affect useful work, such as downtime and quality of service. The impact may include the useful work that was not completed properly, as well as related tangible and intangible costs.
Efficiency risk: If resource usage is close to or at capacity, the risk of not meeting future projections is high. The ratios of processing, IT resource, physical space and power usage to their respective total capacities should be factored in.
Sustainability risk: The historical behavior of the different green energy sources, the composition of each energy source and its probability of failure should be assessed.
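The usage-to-capacity ratios mentioned above can be illustrated with a short sketch. The resource names, sample figures and the 80% warning threshold are assumptions for this example, not prescribed values.

```python
def utilization_ratios(usage, capacity):
    """Ratio of current usage to total capacity for each resource."""
    ratios = {}
    for resource, used in usage.items():
        total = capacity[resource]
        if total <= 0:
            raise ValueError(f"capacity for {resource} must be positive")
        ratios[resource] = used / total
    return ratios


def resources_near_capacity(ratios, threshold=0.8):
    """Resources whose utilization meets or exceeds the (assumed) threshold."""
    return sorted(r for r, ratio in ratios.items() if ratio >= threshold)


# Hypothetical snapshot of usage and installed capacity.
usage = {"power_kw": 450, "rack_units": 600, "cpu_cores": 1700}
capacity = {"power_kw": 500, "rack_units": 1000, "cpu_cores": 2000}

ratios = utilization_ratios(usage, capacity)
at_risk = resources_near_capacity(ratios)  # power and CPU exceed 80% here
```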
Operations risk: Analyses of historical data are needed to estimate the probability of failure due to improper operation in the areas identified and its impact.
External risks: Site risk
A data center site risk metric is a component of the multidimensional data center metric. The methodology of the site risk metric identifies potential threats and vulnerabilities (risk identification), which are divided into four main categories: utilities; natural hazards and environment; transportation and adjacent properties; and regulations, incentives and others. The allocation of weights among the categories is based on the significance of each of these factors to the data center operation.
The methodology quantifies the probability of occurrence of each event and estimates potential impact (risk analysis). It calculates the total risk level associated with the data center location by multiplying the probability of occurrence by the impact of each threat. That product is then multiplied by the respective assigned weight and normalized.
Through this analysis the different threats can be prioritized. Understanding risk concentration by category facilitates analyzing mitigation strategies (risk evaluation). This methodology provides solid guidance for risk assessment of a data center site.
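The steps above can be sketched as follows. The article does not specify the exact aggregation, so this sketch assumes that category weights sum to 1, that impact is rated on a 0-to-5 scale and that each category's risk is the average of its threats' normalized Po x I values. The threats and numbers are hypothetical.

```python
from collections import defaultdict

MAX_IMPACT = 5.0  # assumed impact scale: 0 (none) to 5 (severe)

# (category, threat, probability of occurrence, impact) - hypothetical values
threats = [
    ("utilities", "grid outage", 0.10, 5),
    ("utilities", "water supply loss", 0.05, 3),
    ("natural hazards and environment", "flood", 0.02, 5),
    ("transportation and adjacent properties", "nearby rail line", 0.01, 4),
    ("regulations, incentives and others", "permit change", 0.20, 2),
]

# Assumed category weights; they sum to 1.
weights = {
    "utilities": 0.4,
    "natural hazards and environment": 0.3,
    "transportation and adjacent properties": 0.2,
    "regulations, incentives and others": 0.1,
}


def site_risk(threats, weights, max_impact=MAX_IMPACT):
    """Return (total site risk in [0, 1], weighted risk per category)."""
    scores = defaultdict(list)
    for category, _threat, po, impact in threats:
        scores[category].append(po * impact / max_impact)  # normalized Po x I
    by_category = {c: weights[c] * sum(v) / len(v) for c, v in scores.items()}
    return sum(by_category.values()), by_category


total, by_category = site_risk(threats, weights)
# Category with the highest risk concentration, a candidate for mitigation first.
top_category = max(by_category, key=by_category.get)
```

Sorting the categories by their weighted risk gives a simple prioritization for the risk evaluation step.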
To enable cross-comparability, all the different indicators should be normalized. For key performance indicators, a higher value implies a more positive outcome, so minimum and maximum values correspond to the worst and best possible expected outcomes. Conversely, for key risk indicators, a higher value implies a higher level of risk, therefore a less desirable scenario.
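A simple way to do this is min-max normalization. The sketch below assumes a 0-to-1 target scale and clamps values that fall outside the expected range; the sample bounds are hypothetical.

```python
def normalize(value, zero_at, one_at):
    """Map `value` linearly to [0, 1]: `zero_at` maps to 0 and `one_at` to 1.

    For a performance indicator, pass the worst expected outcome as `zero_at`
    and the best as `one_at`. For a risk indicator, pass the lowest expected
    risk as `zero_at` and the highest as `one_at`.
    """
    if zero_at == one_at:
        raise ValueError("zero_at and one_at must differ")
    score = (value - zero_at) / (one_at - zero_at)
    return min(1.0, max(0.0, score))  # clamp out-of-range measurements


# A lower PUE is better, so the worst expected value (3.0) maps to 0.
pue_score = normalize(1.5, zero_at=3.0, one_at=1.0)    # 0.75
# A higher green energy ratio is better, so 0.0 maps to 0.
green_score = normalize(0.4, zero_at=0.0, one_at=1.0)  # 0.4
```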
Spider graphs allow visual comparisons and trade-off analysis between different scenarios. This is helpful when simulating or forecasting different strategies or reporting to stakeholders. Figure 2 shows an example of a data center comparison. The corners of each diamond show the measurements of the four performance sub-dimensions: productivity (P), efficiency (E), sustainability (S) and operations (O). The larger the diamond, the better the performance. Risks can be analyzed in similar spider graphs.
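As a rough numerical counterpart to the visual comparison, the area enclosed by each diamond can be computed from the normalized scores. This is only an illustrative proxy, not part of the metric: the area depends on the order of the axes, and the scenario scores below are hypothetical.

```python
import math


def spider_area(scores):
    """Area of the polygon drawn on a spider graph with equally spaced axes."""
    n = len(scores)
    if n < 3:
        raise ValueError("at least three axes are required")
    angle = 2 * math.pi / n  # angle between adjacent axes
    # Sum the areas of the triangles formed by each pair of adjacent axes.
    return 0.5 * math.sin(angle) * sum(
        scores[i] * scores[(i + 1) % n] for i in range(n)
    )


# Normalized scores in the order P, E, S, O for two hypothetical scenarios.
scenario_a = [0.8, 0.6, 0.5, 0.7]
scenario_b = [0.6, 0.8, 0.7, 0.6]

area_a = spider_area(scenario_a)  # about 0.845
area_b = spider_area(scenario_b)  # about 0.91, the larger diamond
```

A larger area suggests better overall performance, but the individual axes should still be inspected, since a strong score in one sub-dimension can mask a weak score in another.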
Data center automation
Data centers are evolving toward digital and intelligent infrastructures. With the ubiquitous presence of sensing devices, IoT and new technologies, it is easier to automate the real-time collection of the parameters needed to assess metrics. We must understand which data is relevant rather than simply collecting more data.
Such massive amounts of data can be used for predictive analytics, to visualize trends and behavior across time. New tools, including artificial intelligence and machine learning, help improve the prediction process.
Lastly, the data can be used for prescriptive analytics, to generate prioritized, actionable recommendations. We should not forget that data center end-to-end resource management is an iterative process.
The multidimensional data center metric incorporated in data center best practices allows a comprehensive assessment of the data center, combining performance (productivity, efficiency, sustainability and operations) and risks (associated with performance and site risk). It makes it possible to rank data centers, to compare different data centers, and to measure and compare the same data center before and after a change or under different scenarios. It is also flexible: performance and risk measurements can be selected and updated as needed.
Actions undertaken may impact the metric results in real time. In such cases, when variables are remeasured, the result of the metric should change. That way, implementation of new strategies may lead to the modification of the overall data center performance and risk to a more desirable score.