Eliminating Single Point Of Failure Key To Network Resilience
In this series, Craig Mathias explains why resilience is essential for networks to operate when operational systems are compromised.
Last time, we noted a few examples of natural resilience (by the way, some people use the term “resiliency”, but that word sounds funny to me, anyway), but it’s really artificial resilience that’s of concern in IT domains. And, interestingly, we have a lot of experience in designing and implementing resilient systems across many domains.
The concept of a wireless mesh, for example, began with military (specifically, battlefield) applications, where communications are essential but subject to both radio jamming and “assets” entering and especially “leaving” the “scenario” (the military loves metaphors).
Wireless meshes can re-route traffic on the fly, and of course are often very useful in less hostile settings. In the automotive world, we’ve built tires that can “run flat” after a puncture. We’ve made some progress in designing, but not necessarily implementing, improvements to the nation’s electrical grid to minimize the opportunity for massive blackouts resulting from seemingly minor failures and even cyber attacks.
And the Internet itself is the result of research into building networks that can continue to operate at least partially, even in the event of damage to, or outright failure of, major components (back to the battlefield example above).
Eliminating Single Point of Failure Key To LAN Resilience
The architecture of the Internet has in fact defined most organizational networks in operation today, so we benefit from awareness and thinking that has become cultural in some quarters.
For example, while individual Ethernet switches can fail (although of course they do so only rarely, and usually as the result of an inadvertent disconnection from AC power), most in-building client-device networking today is via Wi-Fi. And a quality enterprise-grade Wi-Fi solution automatically reconfigures in the event of the failure of a given access point (such failure usually being for that same reason – power).
Wi-Fi is thus the key to LAN resilience, along with redundant access points and backhaul. Distinct backup/failover systems are less important, but still valuable in some settings. It’s most important to eliminate any single point of failure to the greatest degree possible (and feasible), given an analysis of the ever-present cost tradeoffs, of course.
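As a rough illustration of what hunting for a single point of failure can look like, here is a minimal Python sketch. It assumes the network is described as a simple adjacency map; the device names (`router`, `switch-a`, `ap-1`, and so on) and both topologies are hypothetical, invented purely for this example. The idea is to take each device offline in turn and check whether the remaining devices can still reach one another.

```python
from collections import deque


def reachable(graph, start, removed):
    """Return the set of nodes reachable from `start` while `removed` is offline."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_points_of_failure(graph):
    """Return the nodes whose loss cuts off at least one surviving node.

    Assumes the graph is connected to begin with and that links are
    listed in both directions.
    """
    critical = []
    for candidate in graph:
        survivors = [n for n in graph if n != candidate]
        if not survivors:
            continue
        # If a search from any survivor misses another survivor,
        # the candidate was the only path between them.
        if len(reachable(graph, survivors[0], candidate)) < len(survivors):
            critical.append(candidate)
    return sorted(critical)


# Hypothetical topology: every access point hangs off one switch.
topology = {
    "router": ["switch-a"],
    "switch-a": ["router", "ap-1", "ap-2"],
    "ap-1": ["switch-a"],
    "ap-2": ["switch-a"],
}

# The same hypothetical site with a second switch and dual-homed access points.
redundant = {
    "router": ["switch-a", "switch-b"],
    "switch-a": ["router", "switch-b", "ap-1", "ap-2"],
    "switch-b": ["router", "switch-a", "ap-1", "ap-2"],
    "ap-1": ["switch-a", "switch-b"],
    "ap-2": ["switch-a", "switch-b"],
}

print(single_points_of_failure(topology))   # ['switch-a']
print(single_points_of_failure(redundant))  # []
```

This only checks connectivity among the devices that remain. It says nothing about cost, capacity, or power, and a lone uplink router would need its own check against whatever lies beyond it.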
Let’s consider a resilience checklist for the organization, as follows:
- Physical integrity – Temporary – or worse – interruptions to physical infrastructure can be a nightmare. Power failures, severe weather, an errant bus careening through the front door – the range of potential disasters is enormous. The traditional technique for dealing with these challenges has been redundant facilities, and these are still advisable in some cases, primarily critical government operations and highly regulated industries. But, as we’ll explore next time, there are far more convenient and much more cost-effective solutions available today.
- Cybersecurity – As is the case with power failures and weather, issues related to cybersecurity will always be with us. And, what’s worse, the nature of the threats involved is constantly changing and evolving. New releases of critical software (operating systems especially, but any software required for mission-critical services) bring new vulnerabilities. The primary strategy here is to limit software options, use anti-malware and related tools (anti-DoS, anti-virus, etc.), apply appropriate management techniques (such as enterprise mobility management solutions), and carefully stage upgrades to make sure that any potential new problems are contained.
- External dependencies – As is the case with electrical power, all critical external dependencies must be considered and addressed. These most notably include backhaul and Internet connections, which must be redundant – there is no other option. As for power, UPS solutions range from single-computer units to facility-wide systems, as required. Supply chain interruptions involving the delivery of critical supplies must also be considered.
The resilience challenge: How to remain operational no matter what combination of threats materializes. Next time we’ll look at how to construct a strategy to get this done.