The concept of redundancy – is it redundant?
It is often assumed that if you have storage, you need RAID; that if you have moving parts like fans, you need a hot spare (or at least the ability to hot-swap the part); and from there the list grows to redundant network connections, warm standby systems and so on. So much redundancy! But how much should we really be applying? What is really needed?
For many years, IT architects have been designing around the limitations of the underlying systems. RAID was designed around disks that fail or are not fast enough on their own. Servers would have multiple redundant fans to ensure that components had plenty of air passing over them even when some fans failed (after all, they are moving parts…). Networking was often implemented with additional parallel networks for out-of-band management and administration, to enable access to systems that had high security needs or where network bandwidth might be affected by administrative work. Warm standby servers would be specified to be as close to identical to production as possible, and be pre-installed and configured to take over if the production systems failed – because that was quicker than restoring from a backup.
However, why are systems still designed this way? Is it because this is what we have always done, or is it actually required?
Redundancy at the wrong level
What happens when a disk fails in a RAID set? Some of the great articles on BAARF explain what happens while the data is being recovered and rebuilt (spoiler: a massive performance impact and an increased potential for further failure). The problem is that RAID solves a problem that is now much smaller than it was – disks are not only more reliable, they are also considerably faster. A single SSD (with SAS 3 or even SATA 3.2) can easily outperform a multi-disk RAID set used in enterprises only a few years ago in throughput, seek time, lifespan and power consumption. I cringe every time I hear someone say they need to purchase three or more SSDs so they can set up RAID 5 – often for a vFlash Read Cache disk or even just an ESXi boot disk!
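To put a rough number on the "potential for further failure" during a rebuild, here is a minimal sketch in Python. It assumes a RAID 5 rebuild has to read every surviving disk in full, and that unrecoverable read errors (UREs) are independent and occur at a fixed per-bit rate. The function name, the disk sizes and the error rates are illustrative assumptions, not figures from any particular drive or array.

```python
import math


def rebuild_read_error_probability(surviving_disks, disk_capacity_tb, ure_rate_per_bit):
    """Chance of at least one unrecoverable read error during a full rebuild.

    Simplified model (assumption): every bit on every surviving disk is read
    once, and each bit fails independently with probability ure_rate_per_bit.
    """
    bits_to_read = surviving_disks * disk_capacity_tb * 1e12 * 8
    # 1 - (1 - p)^n, computed via log1p/expm1 so tiny p values stay accurate.
    return -math.expm1(bits_to_read * math.log1p(-ure_rate_per_bit))


# Hypothetical 4-disk RAID 5 set of 4 TB disks with one failed disk,
# compared at two assumed error rates.
for rate in (1e-14, 1e-15):
    p = rebuild_read_error_probability(3, 4, rate)
    print(f"URE rate {rate:g} per bit: {p:.1%} chance of a read error during rebuild")
```

The model ignores the rebuild's performance impact and any second whole-disk failure, so treat it as a lower-bound illustration rather than a sizing tool.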
Similarly, I see complex network designs for virtual machines with multiple network cards: one dedicated to data backup traffic, another for management, and two virtual NICs teamed together for “virtual machine network redundancy” – and then I see all that traffic put down a single 10Gbps link in a blade chassis… Way back when we were all in the physical world, it might have been prudent to have a network card dedicated to data backup traffic, a pair of redundant network connections (probably at 100Mbps!) to cope with network link failure, and a network interface dedicated to authorised administrators for managing the machine. In a virtual environment, however, you don’t need a dedicated network for backups (use VADP-aware backup software to back up the data without touching the OS), and you probably don’t need a dedicated management network: not only can you manage the VM without touching the OS at all, you can also access the VM console without being on the same network as the VM (you only need to be able to route to the host) – great for DMZ environments.
Where should redundancy be?
For years, people have been focussed on failure tolerance. Instead, with the new capabilities and technologies available, we should be focussing on failure prediction or failure acceptance.
Prices have plummeted for disks, networks and operating system licenses, while capabilities, speed and reliability have all shot up – even today’s cheapest bog-standard devices are considerably better than enterprise units from 15 or so years ago.
Network redundancy should be defined at the physical level (from the host to the network core), instead of at the OS level. Bandwidth is cheaper, density is more desirable, and virtualisation offers more opportunities.
Disk redundancy should be designed around the capabilities and limitations of the devices (e.g. no RAID 5 for SSDs, unless you want lower performance and lifespan!), leveraging capabilities such as replication, snapshots, and the speed of VMware HA and application clustering.
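The RAID 5 remark can be illustrated with the classic small-write penalty. The sketch below assumes every small random write is a read-modify-write (read old data, read old parity, write new data, write new parity), with no controller write-back cache and no full-stripe writes. The function names and the per-drive IOPS figure are hypothetical.

```python
# Device I/Os per small logical write under the read-modify-write assumption:
# read old data + read old parity + write new data + write new parity.
RAID5_IOS_PER_SMALL_WRITE = 4
# Of those, two are writes that land on flash (new data + new parity).
RAID5_DEVICE_WRITES_PER_SMALL_WRITE = 2


def raid5_random_write_iops(drives, iops_per_drive):
    """Approximate small random write IOPS of a RAID 5 set (simplified model)."""
    return drives * iops_per_drive / RAID5_IOS_PER_SMALL_WRITE


def raid5_total_flash_writes(logical_writes):
    """Total device writes across the set caused by the given logical writes."""
    return logical_writes * RAID5_DEVICE_WRITES_PER_SMALL_WRITE


single_ssd_iops = 100_000  # hypothetical figure for one SSD
array_iops = raid5_random_write_iops(3, single_ssd_iops)
print(f"Single SSD: {single_ssd_iops} write IOPS, 3-SSD RAID 5: {array_iops:.0f} write IOPS")
print(f"Flash writes per 1000 logical writes: {raid5_total_flash_writes(1000)}")
```

Under these assumptions, three SSDs in RAID 5 deliver fewer small random write IOPS than one of them alone, and the set as a whole absorbs twice as many flash writes as the application issued. Real controllers with caching or full-stripe writes will behave better than this model.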
So, the moral of the story: plan for redundancy around the requirements of the user-facing application and the capabilities of the technology, not around a 25-year-old technology “because that is the way things have always been done”.