Should you waste your money on the recovery of your computer center?

by Andrzej Gorecki

[As published in True Blue, August/September 1992]

According to Andrzej Gorecki, disaster recovery plans don't have to cost a small fortune, as long as managers separate their business functions from the computer center and choose the right recovery plan.

One of the few areas that causes little controversy in the world of business is the need to talk about computer system disaster recovery. The topic is fashionable: the large consulting firms would love to prepare a disaster recovery plan for you. Yet few organisations actually do something about it. The reason people usually stop at the talk is simple: preparations for disaster recovery are considered to be too costly.

Without doubt there is a need to plan for computer system emergencies and recovery — but disaster recovery arrangements are often misdirected.

There should be a distinction between preparations for business functions recovery and for computer center recovery, since the latter is often an unnecessary expense.

What is universally considered to be the best solution is available only to the big boys who can afford to fully duplicate their computer systems. The next best things are warm and cold sites, which are still expensive but more affordable. Some organisations opt for such "low cost" solutions, but the costs are still massive.

So, many organisations simply elect to do nothing.

The truth is that preparations for disaster recovery need not be expensive. Every business can and should prepare for a system disaster. The key is to make disaster recovery preparations affordable. To do so one needs to change the disaster recovery paradigm.

When one considers a total loss of a computer system, a typical reaction is to plan for a reconstruction of the system. What can be more logical than that? Surprisingly, for most businesses, this is the wrong reaction.

The reconstruction of the computer system means a simultaneous restoration of all business functions. What may come as a surprise to some people is the fact that in most businesses only a few functions are truly time critical. The majority of business functions can survive for days or even weeks without a computer system. So, why recover them instantaneously (or thereabouts) at a monstrous expense if they can wait?

Labelling Disaster

Business functions fall into one of five categories as far as disaster recovery is concerned:

Category 1 — Instantaneous recovery (no system failure is acceptable). Examples include systems managing life monitoring and support systems (hospitals) and systems controlling nuclear reactors.

Category 2 — Rapid recovery (within minutes). Examples include banking systems (needed for online customer service) and retail point of sale systems.

Category 3 — Fast recovery (within a few hours). Examples include real time warehouse management systems and airline seat reservation systems.

Category 4 — Medium pace recovery (within a few days). Examples include online library management systems and payroll systems.

Category 5 — Slow pace recovery (within a few weeks). Examples include general ledger systems and fixed assets systems.

Obviously some of the above examples can be moved a category up or down, depending on the specifics of the business. A good criterion for assigning business functions to the categories is the maximum time the function can go without the computer system before there is damage beyond repair. By such damage one should understand loss of life (or health), massive loss of property, or monetary loss in excess of three months' net profits.
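
The criterion can be sketched as a simple lookup. This is only an illustration: the article names the time scales (minutes, a few hours, a few days, a few weeks) but gives no exact limits, so the numeric thresholds and the function name below are assumptions.

```python
from datetime import timedelta

# Assumed upper bounds for Categories 1-4. The article gives only the
# time scales, so these exact figures are illustrative.
CATEGORY_LIMITS = [
    (1, timedelta(0)),           # instantaneous: no outage tolerated
    (2, timedelta(minutes=30)),  # rapid: within minutes
    (3, timedelta(hours=8)),     # fast: within a few hours
    (4, timedelta(days=5)),      # medium pace: within a few days
]


def recovery_category(max_tolerable_outage: timedelta) -> int:
    """Return the category (1-5) for the longest outage a business
    function can survive before damage is beyond repair."""
    if max_tolerable_outage < timedelta(0):
        raise ValueError("outage duration cannot be negative")
    for category, limit in CATEGORY_LIMITS:
        if max_tolerable_outage <= limit:
            return category
    return 5  # slow pace: the function can wait for a routine rebuild


print(recovery_category(timedelta(minutes=5)))
print(recovery_category(timedelta(weeks=3)))
```

Each business would substitute its own limits, since the same function can sit a category higher or lower from one organisation to the next.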

The costs of disaster recovery preparations (one-off costs and ongoing costs) vary according to the category. They fall between nil (Category 5) and the total system cost being duplicated or even triplicated (Category 1). But there is also the cost of the system restoration, which comes on top of the preparation cost, irrespective of category.

Making Preparations

Thus, the fundamental issue in preparing for disaster recovery is not how to restore the computer system quickly (ultimately this needs to happen, preferably as per Category 5 to minimise the cost), but how to prepare for the disaster on a function by function basis.

Possible recovery preparations, depending on the Category, are as follows:

  • Category 1 — Instantaneous recovery. Fault free systems and physically separate hot sites. As a standard, triplicated parallel systems are considered to be sufficient for mission critical applications.
  • Category 2 — Rapid recovery (within minutes). Fault tolerant systems. Hot sites. Function-specific specialised small computer systems can move the business function to a lower category. But note that if the business is lost together with the system, rapid recovery is no longer required; for instance, a point-of-sale system needs to be fault tolerant but does not have to be restored quickly if the store burns down together with the computers.
  • Category 3 — Fast recovery (within a few hours). Warm site. Function-specific specialised small computer systems can reduce the recovery requirements of the function and so move it to a lower category.
  • Category 4 — Medium pace recovery (within a few days). Cold site. Bureau services and function-specific specialised small computer systems can move the business function to a lower category.
  • Category 5 — Slow pace recovery (routine system setup) (within a few weeks). No special response. The system is simply rebuilt as it was originally installed within a number of weeks.

Proper disaster recovery must be driven by business functions. For example, there is no need for an instantaneous recovery of a General Ledger system in a nuclear power plant.

The steps to prepare for disaster recovery are as follows:

  1. Evaluate risk factors influencing your computer installation.
  2. Quantify each of the factors (use likely threat frequency statistics).
  3. Establish the combined probability of the total loss of the installation.
  4. Identify all business functions that are computerised.
  5. Determine the length of the penalty intervals for each of the functions in the case of the system being unavailable. The intervals are no penalty (negligible losses for, say, up to 2 days), low penalty, medium penalty, and maximum penalty (loss of the business).
  6. Develop a matrix to determine which business functions fall into the Categories from 1 to 5.
  7. Design and put in place recovery arrangements separately for each of the business functions.
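
Step 3 can be illustrated with a minimal sketch. It assumes the threats are independent of one another, which the article does not state, and the probabilities are placeholder figures rather than real statistics. The function name is hypothetical.

```python
import math


def combined_loss_probability(annual_probabilities):
    """Probability that at least one threat destroys the installation
    within a year, assuming the threats are independent."""
    probabilities = list(annual_probabilities)
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    # P(at least one loss) = 1 - P(no loss from any threat)
    return 1.0 - math.prod(1.0 - p for p in probabilities)


# Placeholder figures for illustration only, not threat statistics.
threats = {"fire": 0.01, "flood": 0.005, "sabotage": 0.002}
print(f"{combined_loss_probability(threats.values()):.4f}")
```

Where threats are correlated (a flood that also causes a power failure, for example), the independence assumption no longer holds and this formula would not give the right figure.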

Every business (as a minimum) must perform the risk assessment. This is needed to manage the risk. The risk can either be borne by the business or it can be contracted out in the form of insurance — but this only provides for monetary loss, not the loss of data and time which is usually translated into dollars for the purpose of risk management. The insurer will usually refuse to provide the cover unless the assessment has been completed.

If the business decides to manage the risk internally then it needs to prepare for the recovery of each business function. Since business functions usually fall into Categories 2 to 4, an effort needs to be made to reduce the recovery requirements of the functions.

Well-designed recovery arrangements bring all business functions into the lowest category possible. This makes it possible to move the recovery requirements of the computer installation itself into a lower Category, equal to the most demanding Category (the smallest category number) remaining amongst the business functions.
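
As a rough sketch of this rule, the installation's category is the smallest category number among the functions that still depend on it. The function name and the "after" arrangement below are assumptions made for illustration. The starting categories follow the examples given earlier in the article.

```python
def installation_category(function_categories):
    """The installation must satisfy its most demanding function,
    i.e. the smallest category number among those that rely on it."""
    if not function_categories:
        return 5  # nothing depends on it: a routine rebuild is enough
    return min(function_categories.values())


before = {"point of sale": 2, "payroll": 4, "general ledger": 5}

# Assumed arrangement: point of sale moves to a function-specific small
# system and payroll to a bureau service, so both become Category 5
# as far as the central installation is concerned.
after = {"point of sale": 5, "payroll": 5, "general ledger": 5}

print(installation_category(before))  # 2
print(installation_category(after))   # 5
```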

Ideally, all business functions should be reduced to Category 5. This can be achieved by use of bureau services, decentralised computer systems, and function-specific specialised small computer systems. Use of the bureau is self-explanatory. Decentralised systems make it possible to use a computer on another site to run the most critical applications, at the expense of those that fall into Category 5. The function-specific small computer systems are developed on PCs and contain a cut-down version of the system they are supposed to replace in an emergency.

By reducing business functions to Category 5 and by using specialised (fault tolerant) equipment for those business functions which belong to Categories 1 and 2, one can practically eliminate the need to develop recovery plans for the computer installation itself. In the case of a disaster it just needs to be rebuilt in the normal course of business.

So, do not waste your money on the recovery of your computer systems; prepare yourself to restore your business functions instead.

Andrzej Gorecki is a Director and principal consultant with Melbourne-based Retail Directions Group, which develops and supplies state-of-the-art software solutions for retailers worldwide.

Copyright (c) 1992 Andrzej Gorecki
