Why Resilience Is Key to Safeguarding Mission-Critical Networks
In this series, Craig Mathias explains why resilience is essential for networks to operate when operational systems are compromised.
Networks (and the rest of the IT infrastructure) are more mission-critical today than ever, yet they also face more threats than ever. Most of the time we think of these threats in terms of security, chiefly unauthorized access to sensitive information, unauthorized access to the network (possibly compromising network management), and all manner of malware.
While the security landscape extends much farther than these essentials, security has an important counterpart that is often lumped into its definition – integrity.
Indeed, integrity is always an element of physical security, whose goal is to prevent physical intrusions into, or physical damage to, critical infrastructure. But integrity has a much broader definition: the ability of a solution to fulfill its mission despite partial compromise of its core implementation.
So it’s not just a question of security alone, but rather of the solution continuing to operate under a defined set of normal – and abnormal – operating conditions. This characteristic is also often called reliability.
Abnormal, though, sometimes seems to be the norm. For example, a data center might be specified to operate under severe adverse weather conditions, but what about the storm of the century? Ask the many IT managers in New York City whose operations were devastated by Superstorm Sandy in 2012.
Sometimes conditions simply overwhelm a solution, causing unrecoverable damage. And the range of possible conditions is enormous – floods, tornadoes, earthquakes, fires, and many other natural and human-caused disasters (the latter usually unintentional but, sadly, sometimes intentional).
And the list goes on – the Challenger and Columbia space-shuttle disasters, bridges collapsing in otherwise seemingly harmless windstorms, ignition keys in cars causing collisions, and many more. Engineers can design solutions that work, but always with fundamental limitations imposed by operating conditions. No solution, therefore, is perfect.
But it is nonetheless more than desirable to design solutions that are fault-tolerant, resistant to failure, and capable of continuing at least partial operations under conditions that might otherwise impair functionality.
What does “resilience” mean?
That property of systems is known as resilience, a topic that should interest IT managers in every organization. With IT operations now critical to the success of any organization (often with an expectation of zero downtime), the ability to keep operating when operational systems are partially degraded or compromised is essential.
Of course, the degree of damage permissible is a key design parameter and one that affects solution architecture, configuration, and especially cost.
Other terms associated with resilience include survivability, redundancy, continuity, and disaster recovery, with the last one here of particular interest. Many resilience solutions to date have been designed around the concept of manually cutting over to a backup solution should a primary capability fail.
These standby capabilities can be hot (ready to go at a moment’s notice), warm (available with some degree of latency), or cold (typically via spares, and thus requiring not just manual intervention but also perhaps a good deal of time).
And none of these strategies is ideal – resilience today demands the ability to operate continuously given a degree of damage or other interruptions to the normal operations of mission-critical solutions, and without manual intervention. In other words, you can hit resilient solutions and they bounce right back with no special effort on the part of operations teams.
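To make the contrast with manual cutover concrete, here is a minimal sketch of automatic failover. It assumes a hypothetical service whose replicas are modeled as plain Python callables; the names call_with_failover, primary, and hot_standby are illustrative only and do not come from any particular product. The point is simply that the caller gets an answer from the standby without anyone on the operations team stepping in.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # assumption: a real system would catch narrower errors
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed")


# Hypothetical replicas: the primary is down, the hot standby is healthy.
def primary(request: str) -> str:
    raise ConnectionError("primary is down")


def hot_standby(request: str) -> str:
    return f"standby handled {request}"


result = call_with_failover([primary, hot_standby], "GET /status")
print(result)
```

A production design would add health checks, timeouts, and limits on retries, but the ordering of replicas already captures the hot-standby idea: the backup is ready to answer the moment the primary fails.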
Resilience is in fact a vital element in a broad array of both natural and artificial systems. In the natural world, we talk about resilience in ecological systems, psychology, and biology. Imagine if a cut finger were universally fatal – there’s no way humans as a whole could survive for very long, as the population would likely dwindle to an unsustainable level.
Consider also bacteria that become resistant to antibiotics – under pressure from a threat, bacterial populations evolve new resistance mechanisms.
There’s a lesson in there for managers of critical IT resources (and what resource isn’t critical these days?) – think about resilience, incorporate it into strategies and plans, and monitor for opportunities to enhance resilience either manually or, as we’ll discuss later in this series, automatically.