The silver bullet for a perfect DR plan

Posted on: 21 February 2014 (updated 16 September 2022)
Categories: Business, Disaster Recovery, User Guides, War Stories
Tags: Disaster Recovery, Strategy

Without exception, every organisation I have worked at (or for) has been concerned about coping with the disaster of losing data or being unable to use its systems for business. This is not surprising, as it is reported that 60 percent of businesses that suffer data loss fail completely within 6 months. That is why we all need a good DR plan.

Each business has its own tolerance for, and interpretation of, disaster. I have worked in emergency services, where people die if systems are unavailable, and with businesses that decided (or thought) they could be down for 2 days without major impact. However, every audit I have done (or seen) points out that the business has insufficient DR planning documentation and testing.

What I have learnt through being involved in many projects, tests, rehearsals and paper exercises is this – the weak point is people and process, not technology.

Mistakes with the concept of a DR plan

Let’s start with some basic points, or call them mistakes, about a DR plan:

Backup is not Disaster Recovery
Disaster Recovery is not Business Continuity
Business Continuity is not the responsibility of just the IT department
A backup is pointless unless you can actually do a recovery, both partial (files) and complete systems
RAID is not backup, snapshots alone are not backup, replication is not backup
Your documentation is likely to be insufficient (for a layman to follow), incomplete, not completely accessible offline (when your datacentre or authentication is down) or out of date
No software, hardware or outsourced solution will provide all the answers to all the problems (there is no silver bullet)
In a real ‘disaster’, it is highly likely that it will take longer or be harder than expected or planned
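
The point that a backup is pointless without a working recovery can be checked mechanically during a test. As a minimal sketch, assuming a test restore has been written to a separate directory while the original data is still available for comparison, something like the following Python compares the two trees by checksum. The function names and the choice of SHA-256 are illustrative assumptions, not a feature of any particular backup product.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_restore(original: Path, restored: Path) -> list[str]:
    """Compare a restored tree against the original.

    Returns a list of problems; an empty list means every original
    file exists in the restore with identical content.
    """
    problems = []
    for src in sorted(p for p in original.rglob("*") if p.is_file()):
        rel = src.relative_to(original)
        dst = restored / rel
        if not dst.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif file_digest(src) != file_digest(dst):
            problems.append(f"differs: {rel.as_posix()}")
    return problems
```

Running a check like this some minutes after the restore finishes, and not only immediately, would also catch files that are removed again after being restored.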

The anatomy of a disaster

So, what is a disaster? Can you cover it all in your DR plan? Interestingly, the interpretation of a disaster varies between businesses. Here are some incidents and scenarios that you may, or may not, consider a disaster:

Hardware failure that causes system unavailability (i.e. more than losing a redundant component)
Fire, flood, earthquake, extreme weather, power feed loss, or other loss of a building/datacentre
Hacker attack – targeted (e.g. spearphishing) or untargeted (script kiddie, DDOS, malware)
Malicious or accidental data deletion or corruption or over-writing
A bad OS patch or update, or administrator misconfiguration (including malicious activity)
A bad anti-virus patch, definition or application update that disables or removes functionality
A failure or problem with your power supplier or Internet provider or phone provider – or the security system that lets you into your building
A disaster recovery test affects or brings down production systems

War stories from the front line

I will elaborate on some points with some examples and scenarios that may help you think about what unexpected situations can occur.

One of my customers had a loss of 2 shelves of disks due to an accidental release of a fire extinguisher system. There was no fire, and they also lost other redundant components in servers.

One of my colleague’s customers had a datacentre outage when a truck crashed into a local power substation, three streets away from the datacentre – the truck was delivering fresh diesel for the generators for the datacentre (diesel cannot be stored forever, it needs to be ‘fresh’). So, not only was the main supply out, the generator had no fuel…

Impact from hacking attacks is not always targeted – “script kiddies” will scan any available address and look for vulnerabilities, probing them no matter if you are a domestic user or a multi-million dollar business. Don’t think that your obscurity, small size or business type will protect you.

Accidental deletion

In my very first job, I was responsible for preparing a marketing database every month before it was exported and sent to customers. Part of this process was to delete the “notes” tables that were used by the business in the gathering of the data (like contact details and history, internal notes that we would not want customers to see). I accidentally deleted the notes from the live database and not the copy made for export. It was not possible to restore just one table and so the business lost 2 days of work.
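
A mistake like this can be made harder with a guard in the script itself. The following is a minimal sketch, assuming the export copy carries a marker that the live database does not; the `db_meta` table, the `role` key and the function names are hypothetical, and SQLite is used only to keep the example self-contained.

```python
import sqlite3


def database_role(conn: sqlite3.Connection):
    """Read the role marker from the (hypothetical) db_meta table."""
    row = conn.execute(
        "SELECT value FROM db_meta WHERE key = 'role'"
    ).fetchone()
    return row[0] if row else None


def delete_notes(conn: sqlite3.Connection, expected_role: str = "export") -> int:
    """Delete all notes, but only on a database marked as an export copy.

    Returns the number of rows that were deleted.
    """
    role = database_role(conn)
    if role != expected_role:
        raise RuntimeError(
            f"refusing to delete notes: database role is {role!r}, "
            f"expected {expected_role!r}"
        )
    count = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
    with conn:  # commits on success, rolls back if the delete fails
        conn.execute("DELETE FROM notes")
    return count
```

The design choice is that the destructive step has to prove it is pointed at the copy, so running it against the wrong database fails loudly and leaves the data in place.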

I worked at a company where we needed to execute scripts from our core line-of-business (LOB) product, so we turned the scripts into executables with a free online tool. A year later, someone wrote malware using the same free tool, and an anti-virus signature update then caused all our scripts to be deleted, crippling the business. We had no idea it was the A-V software until the third restore failed to re-create the files, as they were being deleted as soon as they were restored.

In another job, an administrator deleted an AD OU that contained 80,000 groups. In Microsoft AD the group membership is stored in the user and not the group, so every user needed to be restored, from multiple other OUs, but not the whole AD. Then they found out that their backup software could not run in Directory Services Restore Mode (DSRM) to recover AD with a partial authoritative restore, because its services could not log on to start up.

Restoring into an invalid location

During a hardware upgrade of a physical Exchange server, the restore of the database to a new disk completed, but the database would not mount. After 6 hours of investigation and working with Microsoft support, it turned out that the drive letter of the new disk (d:) did not match that of the backed-up database (e:).

Similarly, when testing recovery of a Windows 2000 server full backup onto alternative hardware, the backup software required that a fresh copy of Windows be installed on the new hardware, and then the backup agents installed and registered with the backup server. After a full system restore was completed, a reboot resulted in an unbootable and corrupt server. It turned out that the original Windows 2000 installation was an in-place upgrade from Windows NT: the system was installed as c:\winnt, while the fresh install used the default c:\windows directory.
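
Both of these problems could have been caught by a pre-flight comparison of the paths recorded at backup time with the paths on the restore target. The following is a minimal sketch, assuming those paths have been written down somewhere as simple name-to-path mappings; the function name, the mapping keys and the example database path are hypothetical.

```python
from pathlib import PureWindowsPath


def restore_path_mismatches(backed_up: dict[str, str],
                            target: dict[str, str]) -> list[str]:
    """Compare recorded source paths with the paths on the restore target.

    Returns one message per mismatch; an empty list means the
    locations line up.
    """
    problems = []
    for name, src in backed_up.items():
        dst = target.get(name)
        if dst is None:
            problems.append(f"{name}: no path on target")
            continue
        s, d = PureWindowsPath(src), PureWindowsPath(dst)
        if s != d:  # PureWindowsPath comparison ignores case
            problems.append(f"{name}: backed up from {s}, target has {d}")
    return problems


# Hypothetical inventory reflecting the two stories above.
backed_up = {"system_root": r"c:\winnt", "exchange_db": r"e:\db"}
target = {"system_root": r"c:\windows", "exchange_db": r"d:\db"}
for problem in restore_path_mismatches(backed_up, target):
    print(problem)
```

This only works if the original locations are documented before the disaster, which is another argument for keeping that documentation current and available offline.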

Christian Wickham
