9 big mistakes in disaster recovery planning (DRP)
What is the difference between DR and BCP? For that matter, what is the difference between Backup, Disaster Avoidance, Disaster Recovery and Business Continuity? It’s a mistake to lump all of these into the same concept, as that leads to further mistakes such as skipping steps.
Mistake 1: file backup is not recovery.
It’s all OK – we have a backup!
Well, no. Unfortunately, some still make the error of thinking that backing up all their files to tape is enough to ensure that computer systems can be recovered in the event of a problem. The more enlightened and experienced IT Managers will acknowledge that a file-level backup is only one small tool in the overall challenge of being able to recover a business.
A backup to tape has benefits that need to be recognised in their own right:
- Archive and data retention – being able to get back data from weeks, months and years ago
- Individual folder and file recovery
- Large data capacity for relatively low cost
However, not all backups are to tape. A file-level backup to any media – disk, the cloud or tape – will only allow recovery of files. A functioning system is more than the sum of its files.
Simply taking a backup of a system does not mean that it is recoverable. Even with backup agents that allow open files to be backed up, there is no guarantee that those files are in a consistent state that can be restored and used.
Mistake 2: people don’t test their backups.
You need to test recovery of individual files from backups (I recommend trying to restore files that are deep in a folder tree – in paths longer than 260 characters), recovery of database files or other files that were open or in use at backup time (Microsoft Access databases often catch people out), recovery of large files, and recovery of files in areas with strict security permissions. This should be done regularly (weekly), and every restored file should be opened and tested – just seeing that an MDB file has been created after a restore does not mean it is in a usable state.
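As a minimal sketch of what a weekly restore-verification check might look like – the restore path, the manifest file and the use of SHA-256 checksums are all assumptions for illustration, and the actual restore is still done by your backup tool – something like the following could confirm that restored files really match what was backed up:

```python
# Minimal sketch of a weekly restore-verification check (illustrative only).
# Assumptions: files have already been restored by your backup tool into
# RESTORE_ROOT, and checksums.txt holds "sha256hash  relative/path" lines
# recorded when the originals were backed up.
import hashlib
from pathlib import Path

RESTORE_ROOT = Path(r"D:\restore-test")        # hypothetical restore target
MANIFEST = RESTORE_ROOT / "checksums.txt"      # hypothetical checksum manifest

def sha256(path: Path) -> str:
    """Hash a file in 1 MB chunks so large restores don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

failures = []
for line in MANIFEST.read_text().splitlines():
    if not line.strip():
        continue
    expected, rel_path = line.split(maxsplit=1)
    restored = RESTORE_ROOT / rel_path
    if not restored.exists():
        failures.append(f"MISSING  {rel_path}")
    elif sha256(restored) != expected:
        failures.append(f"CORRUPT  {rel_path}")

print("All restored files verified" if not failures else "\n".join(failures))
```

Recording checksums at backup time and comparing them after a test restore catches silent corruption that a simple "the file exists" check would miss.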
Mistake 3: people don’t test system restores.
You need to test whether you can recover an entire system – that is, Operating System, application, data and functionality. You need to do this in an isolated network (unless you want IP conflicts and the like), and for a multi-tier system you will need all of the servers involved – probably including Active Directory too. You should then test that the system actually functions, not just that the servers boot.
It is advisable to do this at least every six months, or after a major system change (such as an upgrade or a change in architecture). VMware has tools to assist with creating isolated networks.
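As a rough sketch of the kind of functional check worth scripting for that isolated test network – the hostnames, ports and health URL below are placeholders, not part of any real environment – you could probe each tier of the restored system rather than just confirming the VMs power on:

```python
# Sketch: post-restore smoke test for a restored multi-tier system.
# Hostnames, ports and the health URL are placeholders for your own
# environment inside the isolated test network.
import socket
import urllib.request

CHECKS = [
    ("dc01.dr.test", 389),    # Active Directory / LDAP reachable?
    ("db01.dr.test", 1433),   # database tier listening?
    ("app01.dr.test", 80),    # application tier listening?
]
HEALTH_URL = "http://app01.dr.test/health"   # hypothetical application check

for host, port in CHECKS:
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"OK    {host}:{port}")
    except OSError as err:
        print(f"FAIL  {host}:{port}  ({err})")

try:
    with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
        print(f"OK    {HEALTH_URL} returned HTTP {resp.status}")
except Exception as err:
    print(f"FAIL  {HEALTH_URL}  ({err})")
```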
Mistake 4: people don’t factor in time for their recovery.
Consider this timeline:
Disaster happens. An engineer identifies the problem, and notifies a manager. The manager investigates or validates that the disaster has occurred and contacts senior management, who decide to declare it as a disaster and to activate the disaster recovery plan. The authorisation is then passed down and engineers are mobilised.
Let’s pause for a second there. The time to decide if it is a disaster could be extended (depending upon the time of day) by needing to contact people and get authorisation – after all, you don’t want an administrator having the ability to make a decision that could be potentially damaging to a business. These authorisation stages can take an extended amount of time – deciding if the issue warrants a recovery or an attempted fix. It may be decided that a fix or workaround should be attempted first – how long will the fix be attempted before it is considered better to restore to a known good state? Who makes that MTO (maximum tolerable outage) decision?
Backup tapes are then obtained and hardware and resources are made available for restoring the systems.
How long will it take to get the off-site backup media into the recovery location? What if the disaster affects roads or the ability to retrieve the media from the off-site location? Do you have sufficient hardware available to commence the restore, and will it support the full load of systems that are needed to run the business? Is there sufficient compute capacity, data storage capacity, power, cooling? If new hardware has to be purchased, do you have financial systems available to pay for the hire or purchase of the new equipment? If old equipment is used, will the restored systems be compatible? Do you have license keys and installation media for the backup software and the hosting hardware (vSphere licenses, Microsoft licenses, etc.)? How long will it take people to get to the recovery site – and if it is a natural or wide-impact disaster, will staff prioritise the business over their family needs? Is documentation up to date and sufficient for people to follow when performing the recovery steps? Does this documentation contain critical information such as passwords, or are these held by other people?
Data starts to be restored on to the disaster recovery systems.
This is the big delay – particularly when using tape. In what order do systems need to be restored? Can your backup media be used to restore multiple systems at the same time? How long does it take to perform a tape restore of one system (bear in mind that the tape needs to be catalogued, rewound, ejected, changed to tape 2, and so on) – and what about several systems at once (optical disk-based restores will be considerably slowed by multiple concurrent accesses)? Is the recovery hardware and backup software at your DR site at the same level as the production systems – are they compatible, and will there be errors relating to missing updates? Do you have SDLT at the DR site, but LTO in production? If your DR site is also running development or test systems, what happens to all that expensively produced development data?
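To make the restore-time question concrete, a back-of-the-envelope estimate like the one below can be useful; every figure in it is an assumption, and the point is to replace them with numbers measured in your own restore tests:

```python
# Back-of-the-envelope restore-time estimate (sketch). Every figure below is
# an assumption - replace it with numbers measured in your own restore tests.
systems_gb = {"AD": 80, "database": 900, "file server": 2500, "app server": 300}

tape_throughput_mb_s = 140    # assumed effective restore rate from tape, MB/s
per_system_overhead_h = 0.5   # assumed catalogue/rewind/eject/swap time

total_h = 0.0
for name, size_gb in systems_gb.items():
    hours = (size_gb * 1024) / tape_throughput_mb_s / 3600 + per_system_overhead_h
    total_h += hours
    print(f"{name:12s} ~{hours:4.1f} h")

# Tape restores are largely sequential, so the systems queue up one after another.
print(f"Total (sequential): ~{total_h:.1f} h - before validation, assuming no failed tapes")
```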
Mistake 5: recovered systems need to be backed up at the DR site too
Ok, so you have managed to restore all your production data. You may even have managed to get your systems up and running and accessible by your users and your customers. So new data is being created, and this needs to be backed up too. The DR site may be running fully, at DR load (that is, core and critical systems) – but now you need to ensure that all new data created at this site is also backed up. This is often missed, and without it, having completed one disaster recovery, you are trusting that there will not be a second one (think: virus, hacker, malicious employee, untargeted hacking attacks, spreading natural disasters).
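A small sketch of an automated check for this – the backup path, file pattern and 24-hour threshold are assumptions for illustration – could simply alert when the newest backup at the DR site is older than expected:

```python
# Sketch: verify that backups are still being taken while running at the DR
# site, by checking the age of the newest backup file. Path, pattern and the
# 24-hour threshold are assumptions.
import time
from pathlib import Path

DR_BACKUP_DIR = Path(r"E:\dr-backups")   # hypothetical backup target at the DR site
MAX_AGE_HOURS = 24

backups = list(DR_BACKUP_DIR.glob("*.bak"))
if not backups:
    print("ALERT: no backups found at the DR site")
else:
    newest = max(backups, key=lambda p: p.stat().st_mtime)
    age_h = (time.time() - newest.stat().st_mtime) / 3600
    status = "OK" if age_h <= MAX_AGE_HOURS else "ALERT: backups are stale"
    print(f"{status}: newest DR-site backup is {newest.name}, {age_h:.1f} h old")
```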
Mistake 6: forgetting that return to normal service means doing the DR all over again – in reverse
Doing a failover to DR hurts. It hurts a lot, and it can take a long time – during which there could be an impact on normal operations. There will come a time when you need to move everything out of the DR location and back to the original production site. How will you do this? Can you ensure that remnants of the original production systems are completely cleared out and will not interfere with the recovery (think: files that cannot be overwritten, or an old pre-disaster database that starts replicating)? Have you documented the steps to fail back?
Mistake 7: your documentation is out of date, incomplete or insufficient.
Notice that I worded it that your documents ARE in need of updating, not that they may be. It’s almost automatic that auditors who investigate DR plans will identify that they need to be improved. DR plans need to be available in hard copy and in a Cloud storage location that is accessible by people after a disaster (that is, not in a folder on top of your servers!). The documents should be updated after every system change (including, but not limited to; architecture changes, password changes, upgrades, backup process changes – the list goes on).
Do your documents have contact details for the system’s subject matter expert? What about the application owner who has responsibility for the quality and output of the system? What about contacts for external/vendor support, including contract numbers and contact details? Is there a comprehensive list of recovery steps and progress checks, including what actions to take if a step fails? What about alternative recovery methods or rebuild steps? How do you know whether the system has been successfully restored, who needs to validate this, and who needs to be notified?
Mistake 8: your skilled staff may not be available.
This one catches people out. If you have done a DR recovery test – was it with the staff who are experienced and skilled in the system? Could you get an [IT] person off the street and have them successfully restore and validate a system? Is your documentation clear enough that it does not leave questions, or opportunities for an invalid or destructive restore? Does your DR documentation assume an understanding of your environment and architecture? What are the criteria for bringing in external people to do the restore – what skills do they need, and will they be available?
Mistake 9: customers are often not communicated with.
Technical people often forget this one. If your customers are unable to use your services during a disaster – who tells them, what do they tell them, and how? If your primary product is a website that is now down, how do you notify customers that services will be returned as soon as possible and that alternatives are available, so the business can still generate revenue and retain customer loyalty? Your website is down, so how do you put up a page in its place? If you have failed over to a new IP address, how do you update DNS fast enough for customers to reach the DR system? Look at BGP, GSLB, short DNS TTLs and other solutions for this.
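As a small sketch of how you might check whether your public DNS records are even capable of a fast failover – this uses the third-party dnspython package, and the hostnames and five-minute threshold are placeholders – you could report the TTL on the records customers rely on:

```python
# Sketch: report the TTL on key public records, since a long TTL means
# customers may keep resolving the dead production IP for hours after a
# failover. Requires the third-party "dnspython" package (pip install dnspython).
import dns.resolver

HOSTNAMES = ["www.example.com", "mail.example.com"]   # placeholders
MAX_ACCEPTABLE_TTL = 300                              # 5 minutes, an assumption

for name in HOSTNAMES:
    try:
        answer = dns.resolver.resolve(name, "A")
    except Exception as err:
        print(f"{name}: lookup failed ({err})")
        continue
    ttl = answer.rrset.ttl
    verdict = "ok" if ttl <= MAX_ACCEPTABLE_TTL else "too long for fast failover"
    ips = ", ".join(r.address for r in answer)
    print(f"{name}: {ips} (TTL {ttl}s) - {verdict}")
```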
Your consideration of customers must also cover internal users of IT systems – if your email system is down, how do people communicate, and how do you retain communication records? Do people need PCs, or can they use VDI – and if so, how do you communicate how to access the DR systems? If your phone system, your ordering and billing systems, or your asset and inventory systems are down, how do you continue to do business?
The difference between Disaster Recovery and Business Continuity
Hopefully by now you will have realised that there are two very different considerations – one is how to recover IT and business systems so that data is not lost and systems can return to service, and the other is how to continue operating your business during and after a disaster (and during the fail back to production).
Disaster Recovery Plans (DRP) need to cover concepts such as RPO (recovery point objective – how much data you can afford to lose) and RTO (recovery time objective – how quickly systems must be up and running again), and all of the above points, but they also need to be set in the context of the business.
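As a worked example (all figures are assumptions), a quick calculation can show whether the stated objectives are even achievable with the current backup schedule and the timeline from Mistake 4:

```python
# Sketch: sanity-check stated RPO/RTO targets against what the current backup
# schedule and restore process can actually deliver. Figures are assumptions.
backup_interval_h = 24            # nightly full backup
decision_and_authorisation_h = 2  # declaring the disaster, mobilising staff
media_retrieval_and_hardware_h = 4
data_restore_h = 10               # from your measured restore tests
validation_h = 2

achievable_rpo_h = backup_interval_h          # worst case: failure just before the next backup
achievable_rto_h = (decision_and_authorisation_h + media_retrieval_and_hardware_h
                    + data_restore_h + validation_h)

target_rpo_h, target_rto_h = 4, 8             # what the business asked for

print(f"Achievable RPO ~{achievable_rpo_h} h vs target {target_rpo_h} h")
print(f"Achievable RTO ~{achievable_rto_h} h vs target {target_rto_h} h")
if achievable_rpo_h > target_rpo_h or achievable_rto_h > target_rto_h:
    print("The plan cannot meet the stated objectives - change the backup "
          "schedule/technology or renegotiate the targets.")
```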
A Business Continuity Plan should not be to simply turn to the DRP and implement it. What about paper/manual processes? How can you continue to operate when systems are not available, and then after everything has returned to normal, how do you enter all the information that you generated or gathered manually? Are your staff trained in the manual process – and was this process updated when your IT systems were updated?
I would love to hear your comments below.