The Scream Test
We have all been there: something has been around for a long time, no-one knows whether it is still important, and no-one will take ownership and declare that it can be disposed of. This is where the Scream Test comes in.
The Scream Test is simple – remove it and wait for the screams. If someone screams, put it back.
The Scream Test can be applied to any product, service or capability – particularly when there is poor ownership or understanding of its importance. In the server infrastructure world, there are often Zombie Servers, which are alive but probably better off dead – they consume resources and licenses, and may still be patched and monitored, but no-one is using them.
Old servers that no longer deliver services are prime candidates, and the most common recipients of the Scream Test. Typically, the business owner of an application does not really care what happens to the old systems once they are replaced with a newer version (or product), but at the same time cannot decide whether the old system needs to be retained “just in case”. Those old systems are still being backed up and still being patched, to stop them becoming a risk vector for other parts of the network. As the server team / infrastructure group is focused on ensuring maximum uptime for all systems, they won’t switch something off until they are asked to do so.
How to perform a Scream Test
That is why the decision to do a Scream Test is not taken lightly. The process is to switch off / disconnect / disable the system for a few weeks, then progress to standard decommissioning processes (such as removing the system from the domain, un-registering licensed components, etc.). It also relies on the understanding that there is a roll-back (switch it back on or plug it back in) during the Scream Test, and a recovery path (restore from a backup) for some time after the system has been removed.
- Notify any people who may have an interest in the system. A carefully worded email that encourages active investigation of the usefulness of the system will be required. Word it so that the “lazy” option for recipients is to do nothing and let the test proceed, rather than to reply that the system must be kept.
- Validate that there is a backup or rollback available.
- Disable monitoring or active alerts that would be triggered by a system outage.
- Power off or unplug the system. Or stop services. Or change access permissions. Or slightly rename a folder – depending on what you are trying to check.
- Is there a complaint? Has something else stopped working?
- Depending on the type of system and what it might have been used for, wait an appropriate amount of time. If it’s financial, wait for month-end; if it is related to any other business schedule, wait for a full cycle.
- Switch it on and then perform proper decommissioning (remove from domain, un-register from backup or other licensed systems, etc.)
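The waiting step can be sketched as a small date calculation. This is only an illustration: the function names, the three-week minimum and the five-day grace period after month-end are assumptions chosen for the example, not fixed rules, and should be adjusted to the business cycle of the system being tested.

```python
import calendar
from datetime import date, timedelta


def month_end(day: date) -> date:
    """Return the last calendar day of the month containing `day`."""
    last_day = calendar.monthrange(day.year, day.month)[1]
    return day.replace(day=last_day)


def earliest_decommission(
    disabled_on: date,
    financial: bool = False,
    min_weeks: int = 3,   # assumption: "a few weeks" taken as three
    grace_days: int = 5,  # assumption: days allowed for screams after month-end
) -> date:
    """Earliest date to move from 'switched off' to proper decommissioning."""
    earliest = disabled_on + timedelta(weeks=min_weeks)
    if financial:
        # Make sure a month-end has passed while the system was off,
        # plus a few days for anyone affected to notice and complain.
        after_month_end = month_end(disabled_on) + timedelta(days=grace_days)
        earliest = max(earliest, after_month_end)
    return earliest
```

For example, a financial system switched off on 5 January 2024 would not be decommissioned before 5 February 2024, whereas a non-financial one could proceed from 26 January 2024.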
Other Scream Tests
It’s not just servers that can be treated this way to identify if anyone is actively using them. The same applies to a fileshare, files that have not been touched for 5 years, databases that appear not to have been accessed, a printer that keeps jamming – really anything that passively consumes cost or maintenance effort. The principle for testing it is the same: disable access first, before removing it, and then ensure that it can be recovered for a while afterwards (even if the recovery may take a day or two).
With the explosion of virtual sprawl, you may have had a project request a full suite of systems: an exact clone of all production for development, testing, user acceptance testing, integration testing, staging, etc. In my experience, many of these are created and then not used. In the world of Cloud, these systems should be created on demand, and then torn down between each release cycle.
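As a rough way to find candidates on a fileshare, the sketch below lists files that have not been modified for a given number of years. The function name and the five-year default are assumptions for illustration. It uses the modification time rather than the access time, because access times are often not updated reliably, so a file that is read but never changed will still show up here. Treat the result as a list of candidates for a Scream Test, not as proof that nobody uses them.

```python
import time
from pathlib import Path

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def stale_files(root, years=5.0, now=None):
    """Yield (path, age_in_years) for files under `root` not modified in `years` years."""
    now = time.time() if now is None else now
    cutoff = now - years * SECONDS_PER_YEAR
    for path in Path(root).rglob("*"):
        try:
            if not path.is_file():
                continue
            modified = path.stat().st_mtime
        except OSError:
            # Unreadable or vanished entries are skipped rather than aborting the scan.
            continue
        if modified < cutoff:
            yield path, (now - modified) / SECONDS_PER_YEAR
```

Calling `stale_files("/srv/share")` (a hypothetical path) and iterating over the result gives each stale file together with its approximate age in years.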
Outside of the IT arena, it can be done with anything that is considered obsolete and no longer used or needed, but that no-one will give the final approval to remove. A feature or capability in a product? A decorative vase in your bedroom? A menu item? An approval step in a workflow?
The miserable history of Scream Tests
Why do we need to do this at all? It should be an active part of any project to ensure that there is cost recovery and avoidance by removing old systems (particularly when the new system has been moved to the Cloud), but it is frequently not done. The cause is a mixture of the focus being on the new and shiny replacement, and risk aversion about taking responsibility for getting rid of an old system.
When an old system is replaced, often the original engineers, project managers and business leaders have moved on – and there is a concern over the unknowns of what will happen. Is there a legal or financial obligation to retain the old system? What if… just in case… maybe… there might be something important on that system, and no-one wants to be the person who made the decision to get rid of it.
Effective project life cycles should exist, but we all know that intentions do not always lead to actions. It is likely to get worse, with a shift of focus from servers towards business functions and consumption-based services, where there is less consideration of the requirement to remove old services. This can only be improved with more diligence and focus on removal of inactive services. But while an old system costs ‘the business’ very little and the overhead falls on the administration and management of services/systems/infrastructure, the problem of zombie servers that no-one takes ownership of will always exist.
Alternatives to keeping the old system alive
We will always get push-back from people who don’t understand an old system, or don’t want to take responsibility for it. The alternatives need to be spelled out: the server can be recovered from backups if needed, or the files can be copied on to the new system, or the data could even be extracted to a data warehouse / data lake for future reporting and analysis.
Old systems pose multiple threats, not just the overhead of administration and resource use. There are many factors that add to the impact of zombie servers: disk space for the server if on a shared SAN or NAS, plus space used for backups (and the backups of backups), monitoring and log aggregation, operating system patches and anti-virus updates, resource usage and IP address allocation. The threats can include being used as a bot server or a hacker’s man-in-the-middle attack server, having OS or application vulnerabilities in security or stability, and having un-patched and out-of-date software (think of Adobe products). Mitigation steps can include increased scrutiny (checking logs more often), tighter firewall rules (such as only allowing traffic between tiers of a system), and, importantly, more process and procedure on how to react if the system is compromised or becomes a threat.
Sometimes, it is just easier to switch it off and wait to see if anyone screams.