How to design for failures

Posted on: 6 August 2022 8 September 2022
Categories: Culture, Disaster Recovery, HowTo, Opinion, Strategy, Tips & Tricks
Tags: Backup, Change, Disaster Recovery, Leadership, projects

The concept of “design for failure” is often misunderstood, but it is now common in the lexicon of system design and business operations. When you design for failure, it is not because you want to fail. Instead, you accept that failures can and do happen, and you want to be able to identify a failure and rapidly pivot (change direction) to recover and get back on track. When you design for failure, you are not putting your trust and hopes in a single direction and expecting it to work 100% of the time. Instead, you are free to push the boundaries of known experience and knowledge and to try new things that may never have been done before. This article explains a little more about how to design for failures.

Design to fail

In traditional implementation methods, effort goes into upfront planning to ensure that there is a clear and defined way forward. The problem is that things happen, not everything goes to plan, and failures occur. A failure can be partial (of a single component or step) or total (of the entire project or activity). In traditional approaches these failures can be catastrophic, because the plan assumed success and no one planned for what to do if it failed.

By designing with an understanding that there is a chance of failure, and by having a step-back, fail-back or fail-forward plan, you ensure that a plan is already in place when the inevitable problems eventuate. More than just a mitigation technique, designing for failure is a way to ensure that failure is treated as a part of the project, not a blocker. Furthermore, with effective design for failure and planning techniques, you can limit the spread of failure through partitioning, governance, standby measures and degrade options.


Types of designs for failure

There are several ways to build failure into a design, including approaches such as:

Step-back – something has gone wrong, so go back to a previous step and either try again differently or pivot. This is also known as Re-try (not retry). When failure happens, you learn from it and either modify the attempt or go in a new direction, rather than just repeating the same action.
Roll-back – when something has gone wrong, un-do the work that has been done and return to an earlier state. This is also known as Undo. In a failure situation this means un-doing the activity that led to the failure so that you can go back to a known-good state.
Fail-back – drop the work that has caused the failure, and return to a previous known-good state. This can be referred to as a Refuge plan. Designing for failure means that you can do this quickly, and with less pain or approval than if you had expected every action you take to be a 100% success.
Fail-forward – an approach where a failure is taken as a learning point, and you carry on while making sure you do not keep repeating the same mistake. Also known as “well THAT didn’t work”. In your designing for failure, this approach allows and even expects failure, and does not let it hamper your progress.
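
For readers who think in code, the roll-back approach above could be sketched roughly as follows. This is a minimal illustration in Python, not a definitive implementation: the helper run_with_rollback, the step names and the in-memory state list are all hypothetical and exist only for this example. Each step carries its own undo action, and when a step fails the completed steps are un-done in reverse order to get back to the known-good state.

```python
from typing import Callable, List, Tuple

# Hypothetical shape for a step: (name, action, undo action).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_rollback(steps: List[Step]) -> bool:
    """Run steps in order; if one fails, undo the completed ones in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, do, _undo = step
        try:
            do()
        except Exception as exc:
            print(f"step '{name}' failed: {exc}; rolling back")
            for _name, _do, undo in reversed(completed):
                undo()
            return False
        completed.append(step)
    return True


# Illustrative stand-in for real work: a list that records what has been done.
state: List[str] = []


def failing_migration() -> None:
    raise RuntimeError("migration failed")


steps: List[Step] = [
    ("backup", lambda: state.append("backup"), lambda: state.remove("backup")),
    ("deploy", lambda: state.append("deploy"), lambda: state.remove("deploy")),
    ("migrate", failing_migration, lambda: None),
]

ok = run_with_rollback(steps)
```

In this sketch the third step fails, so the first two are un-done and the state list ends up empty again. The point is that the undo path was designed before anything went wrong, not improvised afterwards.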

Limit the spread of failure

There are also measures to ensure that failure does not spread:

Partitioning – also known as decoupling, this ensures that a failure in one area does not affect another. Prepare your partitions to limit the “blast zone” of a failure. An example in software development is to use micro-services, so that the whole system is not affected if the vendor of one service has a failure or issue.
Governance – the bureaucracy that we all hate, but which can put in place controls and gates that recognise that failures can and will happen. The attitude should be that failures are part of normal life, and that the focus should be on learning and on returning to a normal state (or progressing with the project), not on blame and victimisation.
Standby – when there is a component or process that may fail, it normally will, so prepare a standby that is ready to switch over to and utilise when the failure occurs. A rapid switch-over may take pre-planning so that the transition is well understood. An example can be in project planning, where a delayed delivery by a supplier means that resources can be re-allocated to another process.
Degrade / Derate – this approach accepts that when one part fails, the rest of the whole can continue, but at a level of operation that is recognised to be lower. An example can be where the strategy was to move into a larger office, but there is no power on one floor: the plan is to put people onto the other floors at a higher density, or to have them work from home.
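
In software terms, the Standby and Degrade options above might look something like the sketch below. It is only an illustration under stated assumptions: the function call_with_standby, the provider names and the “cached summary” fallback are hypothetical, invented for this example. The idea is to try the primary first, switch to a prepared standby if it fails, and drop to a recognised lower level of service only if everything else is down.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical shape for a provider: (name, function that returns a result).
Provider = Tuple[str, Callable[[], str]]


def call_with_standby(
    providers: Sequence[Provider],
    degraded: Callable[[], str],
) -> Tuple[str, str]:
    """Try each provider in order; fall back to a degraded result if all fail.

    Returns (source, result) so the caller knows which level it is running at.
    """
    for name, provider in providers:
        try:
            return name, provider()
        except Exception as exc:
            print(f"{name} failed: {exc}")
    return "degraded", degraded()


# Illustrative providers: the primary is down, the standby works.
def primary() -> str:
    raise ConnectionError("primary unavailable")


def standby() -> str:
    return "full report"


def cached_summary() -> str:
    return "cached summary"


source, result = call_with_standby(
    [("primary", primary), ("standby", standby)],
    degraded=cached_summary,
)

# With no working standby, the caller still gets a lower level of service.
both_down = call_with_standby([("primary", primary)], degraded=cached_summary)
```

Returning the source alongside the result is a deliberate choice in this sketch: operating in a degraded mode is acceptable, but it should be visible, so that someone knows to restore the normal state.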
