Disaster Recovery Planning – How To
Disaster Recovery is the process by which an organisation recovers its business operations after a disaster. It is often a focus of the IT department, but it is a business responsibility.
Business Continuity Plans (BCPs) set out how an organisation will continue to provide business operations during an incident. It is not an IT responsibility to tell departments how they should do their business during a disaster, but IT may provide services and facilities that the business can use during the disaster. BCPs can also include policies, runbooks and approaches.
Service Resilience is a subset of Business Continuity, in which technologies and services are designed and architected so that they can continue to operate when components are degraded. Clustering, load balancing and distributed systems are common techniques, and “design to fail” is evolving as an approach.
Analysis
Current system usage needs to be analysed to find the key parameters that will determine the most appropriate approach to Disaster Recovery (DR) planning, Business Continuity Planning (BCP), and Service Resilience design.
- Where is data kept? Where are systems and services hosted?
- On-site? Co-Lo/hosted datacentre? Cloud? SaaS? Is there a full inventory?
- What are the most important data assets and systems?
- Are they prioritised and documented? (Gold/Silver/Bronze etc.)
- Is there an RPO (Recovery Point Objective – the maximum amount of data, measured as time, that can be lost) and an RTO (Recovery Time Objective – how quickly systems must be back up) for each system and service?
- The RTO cannot be shorter than the time it actually takes to recover a system, and the RPO cannot be shorter than the interval between backups or replication. If shorter times are required, more investment will be needed to create standby / duplicate systems. Restoring from tape is typically the slowest option.
- Is there a Board / Exec approved MTD (Maximum Tolerable Downtime)?
- Is there a failover or DR site?
- If the idea is to use Development systems during a disaster, what happens with development data and work during the disaster?
- Is there a fail-back plan?
- Are there any redundant systems (clustering, load balancing, standby systems) available?
- Is there an existing DR plan?
- When was it last tested and updated?
- Was it a table-top exercise, or a real system failover test?
- What were the results? Have any improvements been made as a result?
- Is the DR plan accessible off-system (e.g. on paper or off-network)?
- Has the DR plan been tested with non-staff people?
- What is the regulatory environment or external obligation for the business to comply with?
- If there are contracted services, have the contracts been reviewed to confirm they are still current?
- Have 3rd party suppliers verified that their DR plans have been tested?
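The RPO/RTO questions above can be checked mechanically once the figures are recorded. The following is a minimal sketch, not a definitive tool: it assumes the RPO is compared against the backup interval and the RTO against a restore time measured in a real test. The class, field names and example figures are all hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class SystemObjectives:
    name: str
    rpo: timedelta               # maximum acceptable data loss, as time
    rto: timedelta               # maximum acceptable time to restore service
    backup_interval: timedelta   # how often a recoverable copy is taken
    measured_restore: timedelta  # restore time observed in the last real test


def objective_gaps(system: SystemObjectives) -> list[str]:
    """Return the reasons why the stated objectives cannot currently be met."""
    gaps = []
    # Worst case: the failure happens just before the next backup runs,
    # so everything since the previous backup is lost.
    if system.backup_interval > system.rpo:
        gaps.append(
            f"{system.name}: backup interval {system.backup_interval} "
            f"exceeds RPO {system.rpo}"
        )
    # The RTO is only realistic if a real restore has finished within it.
    if system.measured_restore > system.rto:
        gaps.append(
            f"{system.name}: measured restore {system.measured_restore} "
            f"exceeds RTO {system.rto}"
        )
    return gaps


# Hypothetical example figures, for illustration only.
billing = SystemObjectives(
    name="billing",
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
    backup_interval=timedelta(hours=24),
    measured_restore=timedelta(hours=6),
)

for gap in objective_gaps(billing):
    print(gap)
```

Any gap reported here points to a choice for the business: accept a longer objective, or invest in faster recovery (more frequent backups, replication, standby systems).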
Policies and Procedures
Disaster Recovery Planning is not just a technical exercise. Most of the work is in documentation, design/planning, and the creation of policies and procedures.
- All systems and services need to be documented.
- System configuration – in case the system needs to be re-implemented (e.g. SaaS services, re-installed software etc.). Include customisations or special patches
- Systems interactions and integrations – where data is stored, gathered, and provided
- Security and access permissions
- How important is this to the business? This indicates recovery order
- Dependencies and antecedents – what needs to be available first before this system is recovered (such as Active Directory / Entra, databases etc.), and what will depend upon this system (web servers, BI systems, client facing systems)
- Business functions that depend on this system – will this system being off-line affect clients / billing / human lives / business obligations?
- What is the Recovery Point Objective? How much data since the last backup can the business afford to lose?
- What is the Recovery Time Objective? How soon does the business need it back up and running?
- Who is the owner? Who needs to be contacted? Who is the administrator?
- Are there any 3rd parties or contracts in place?
- Business Continuity – how will the business continue to operate during an incident?
- Who needs to be informed?
- Does a website announcement need to be made to customers? A Twitter or LinkedIn announcement?
- Are the contact lists up to date? Are there alternative contacts?
- Who has the authority to “declare a disaster” and initiate a DR plan?
- Is there an escalation hierarchy? If a manager is not available, does a director get notified? When does the CEO need to know?
- How will the business communicate with suppliers and external 3rd parties (imagine email is down)? Is there a phone list, or will an off-system email be used?
- If an off-system email address is used, are suppliers aware that this is a valid DR email address, or will they treat it as a scam?
- Are there any paper or manual processes that the business can follow, to continue to operate?
- How long can the business continue to operate on paper or manually?
- How will data be re-entered into the system when it is available again?
- Are there any obligations or compliance requirements to meet (quality, security, avoiding duplicates etc.)?
- Are fail-back and operating whilst in disaster mode understood?
- If systems and services are in a recovering state, performance and functionality may be affected – does the business know how to cope with this?
- Will all development and testing work stop during a disaster? What happens to developers who are paid by the hour, and is their unfinished work simply deleted?
- If there is a failover to a DR capability, and another disaster hits (think virus re-infection), what happens?
- Does the business understand there will be another impact when systems are moved from the DR capability back to the production systems?
- How will data and transactions created during a disaster be re-integrated?
- Is there a process for lessons-learned and continuous improvement?
- Can quality be improved? What went wrong (something always will)?
- What external support and contracts are available?
- Who needs to be informed?
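The documented dependencies and business priorities can be combined into a recovery order. The following is a minimal sketch, assuming Python 3.9+ (for the standard-library `graphlib` module). The system names, the dependency map and the tier numbers (1 = Gold, 2 = Silver, 3 = Bronze) are hypothetical placeholders for a real inventory.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: each system lists what must be running
# before it can be recovered.
DEPENDS_ON = {
    "directory": set(),
    "database": {"directory"},
    "billing": {"database", "directory"},
    "website": {"database"},
    "bi_reports": {"database"},
}

# Hypothetical priority tiers: 1 = Gold, 2 = Silver, 3 = Bronze.
TIER = {
    "directory": 1,
    "database": 1,
    "billing": 1,
    "website": 2,
    "bi_reports": 3,
}


def recovery_waves(depends_on, tier):
    """Group systems into waves; prerequisites always sit in earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        # Everything returned here has all of its prerequisites recovered.
        # Within a wave, higher-priority tiers are listed first.
        ready = sorted(sorter.get_ready(), key=lambda name: (tier[name], name))
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(recovery_waves(DEPENDS_ON, TIER), start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

A circular dependency raises an error rather than producing an order, which is useful in itself: two systems that each need the other to be up first is something to find during planning, not during a disaster.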
Audit and Review
Technology never stands still. Business never stands still. Changes occur, and documentation is not updated. You need to check, review, test and validate your plans.
- When was the last time this was reviewed?
- Has the organisation-wide Disaster Recovery Plan been tested?
- This can be a virtual table-top run through, where a scenario is provided, and each manager has to explain what they would do.
- This can be a technical run through where systems are failed over or backups restored.
- Have the per-department Business Continuity Plans been tested?
- Has this been overseen by a person who is not in that department?
- Have security and contact details been checked for completeness and accuracy?
- Are paper/external records up to date?
- This can be a “battle box” or folder of all DR / BCP plans, a USB drive, or a cloud site that is not accessible by internal user accounts.
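The "when was this last reviewed?" question is easy to automate if review dates are recorded alongside each plan. The following is a minimal sketch. The 365-day threshold, the plan names and the dates are assumptions for illustration, not recommendations from any standard.

```python
from datetime import date, timedelta


def overdue_reviews(
    last_reviewed: dict[str, date],
    today: date,
    max_age: timedelta = timedelta(days=365),  # assumed review cycle
) -> list[str]:
    """Return the names of plans whose last review is older than max_age."""
    return sorted(
        name
        for name, reviewed in last_reviewed.items()
        if today - reviewed > max_age
    )


# Hypothetical review log, for illustration only.
review_log = {
    "Organisation DR plan": date(2023, 1, 10),
    "Finance BCP": date(2024, 3, 1),
    "Contact lists": date(2022, 6, 30),
}

for plan in overdue_reviews(review_log, today=date(2024, 6, 1)):
    print(f"Review overdue: {plan}")
```

The same check can be applied to test dates, contact-list verification, and the contents of off-system copies.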