IT Infrastructure Business Continuity & Disaster Recovery (BC/DR) – Coronavirus edition

Back in 2014, I wrote about Hardware support contracts & why 24×7 4 hour onsite should no longer be required. For those of you who haven’t read the article, I recommend doing so prior to reading this post.

In short, the post argued that the typical old-school requirement for expensive 24/7, 2 or 4-hour maintenance contracts becomes all but redundant when IT solutions are designed with appropriate levels of resiliency and self-healing capabilities that meet the business continuity requirements.

Some of the key points I made regarding hardware maintenance contracts included:

a) Vendors failing to meet SLA for onsite support.

b) Vendors failing to have the required parts available within the SLA.

c) Replacement HW being refurbished (common practice) and being faulty.

d) The more proprietary the HW, the more likely replacement parts will not be available in a timely manner.

All of these are applicable to all vendors and can significantly impact the ability to get the IT infrastructure back online or back to a resilient state where subsequent failures may be tolerated without downtime or data loss.

With the current Coronavirus pandemic, I thought it was important to revisit this topic and see what we can do to improve the resiliency of our critical IT infrastructure and ensure business continuity no matter what the situation.

Let’s start with “Vendors failing to meet SLA for onsite support.”

At the time of writing, companies the world over are asking employees to work from home and operate on skeleton staff. This will no doubt impact vendor abilities to provide their typical levels of support.

Governments are also encouraging social distancing – that people isolate themselves and avoid unnecessary travel.

We would be foolish to assume this won’t impact vendor abilities to provide support, especially hardware support.

What about Vendors failing to have the required parts available within the SLA?

Currently I’m seeing significantly fewer flights operating, e.g. from the USA to Europe, which will no doubt delay parts shipments and make the target service level agreements harder to meet.

Regarding vendors using potentially faulty refurbished (common practice) hardware, this risk in itself isn’t increased, but if this situation occurs, the shipment of alternative/new parts is likely to be delayed.

Lastly, infrastructure leveraging proprietary HW makes it more likely that replacement parts will not be available in a timely manner.

What are some of the options Enterprise Architects can offer their customers/employers when it comes to delivering highly resilient infrastructure to meet/exceed business continuity requirements?

Let’s start with the assumption that replacement hardware isn’t available for one week, which is likely much more realistic than same-day replacement for the majority of customers considering the current pandemic.

Business Continuity Requirement #1: Infrastructure must be able to tolerate at least one component failure and have the ability to self heal back to a resilient state where a subsequent failure can be tolerated.

By component failure, I’m talking about things like:

a) HDD/SSDs

b) Physical server/node

c) Networking device such as a switch

d) Storage controller (SAN/NAS controllers, or in the case of HCI, a node)

HDDs/SSDs have been traditionally protected by using RAID and Hot Spares, although this is becoming less common due to RAID’s inherent limitations and high impact of failure.

For physical servers/nodes, products like VMware vSphere, Microsoft Hyper-V and Nutanix AHV all have “High Availability” functions which allow virtual machines to recover onto other physical servers in a cluster in the event of a physical server failure.

For networking, typically leaf/spine topologies provide a sufficient level of protection with a minimum of dual connections to all devices. Depending on the criticality of the environment, quad connections may be considered/required.

Lastly, with Storage Controllers, traditional dual controller SAN/NAS have a serious constraint when it comes to resiliency in that they require the HW replacement to restore resiliency. This is one reason why Hyper-Converged Infrastructure (a.k.a. HCI) has become so popular: Some HCI products have the ability to tolerate multiple storage controller failures and continue to function and self-heal thanks to their distributed/clustered architecture.

So with these things in mind, how do we meet our Business Continuity Requirement?

Disclaimer: I work for Nutanix, a company that provides Hyper-Converged Infrastructure (HCI), so I’ll be using this technology as my example of how resilient infrastructure can be designed. With that said, the article and the key points I highlight are conceptual and can be applied to any environment regardless of vendor.

For example, Nutanix uses a Scale Out Shared Nothing Architecture to deliver highly resilient and self healing capabilities. In this example, Nutanix has a small cluster of just 5 nodes. The post shows the environment suffering a physical server failure, and then self healing both the CPU/RAM and Storage layers back to a fully resilient state and then tolerating a further physical server failure.

After the second physical server failure, it’s critical to note the Nutanix environment has self healed back to a fully resilient state and has the ability to tolerate another physical server failure.

In fact the environment has lost 40% of its infrastructure and Nutanix still maintains data integrity & resiliency. If a third physical server failed, the environment would continue to function maintaining data integrity, though it may not be able to tolerate a subsequent disk failure without data becoming unavailable.

So in this simple example of a small 5-node Nutanix environment, up to 60% of the physical servers can be lost and the business would continue to function.

With all these component failures, it’s important to note the Nutanix platform self healing was completed without any human intervention.

For those who want more technical detail, check out my post which shows Nutanix Node (server) failure rebuild performance.

From a business perspective, a Nutanix environment can be designed so that the infrastructure can self heal from a node failure in minutes, not hours or days. The platform’s ability to self heal in a timely manner is critical to reduce the risk of a subsequent failure causing downtime or data loss.

Key Point: The ability for infrastructure to self heal back to a fully resilient state following one or more failures WITHOUT human intervention or hardware replacement should be a firm requirement for any new or upgraded infrastructure.
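As a rough illustration of the capacity side of this key point, the sketch below estimates how many sequential node failures a cluster could absorb while still having room to re-protect all of its data. It is a deliberately simplified, vendor-neutral model and not how Nutanix or any other product calculates resiliency: nodes are assumed identical, `used_fraction` is the share of raw capacity consumed (replicas included), and `min_nodes` is a hypothetical minimum node count needed to remain fully resilient. The function name and parameters are made up for illustration.

```python
def self_healable_failures(nodes: int, used_fraction: float, min_nodes: int = 3) -> int:
    """Count sequential node failures after which all data can still be re-protected.

    Simplified model: every node has the same capacity (1 unit), so the data
    to re-protect is nodes * used_fraction units and must fit on the survivors.
    """
    if nodes < 1 or not 0.0 <= used_fraction <= 1.0:
        raise ValueError("nodes must be >= 1 and used_fraction within 0..1")

    used = nodes * used_fraction  # data expressed in units of one node's capacity
    failures = 0
    # Keep failing nodes while the survivors still meet the minimum node
    # count and still have enough capacity to hold all of the data.
    while nodes - (failures + 1) >= min_nodes and nodes - (failures + 1) >= used:
        failures += 1
    return failures


print(self_healable_failures(nodes=5, used_fraction=0.5))
print(self_healable_failures(nodes=5, used_fraction=0.7))
```

Under these assumptions, five half-full nodes can absorb two sequential node failures without hardware replacement, while the same cluster at 70% used can absorb only one. That is why free capacity for rebuilds belongs in the design rather than being treated as waste.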

So the good news for Nutanix customers is that during this pandemic or future events, assuming the infrastructure has been designed to tolerate one or more failures and self heal, the potential (if not likely) delay in hardware replacements is unlikely to impact business continuity.

For those of you who are concerned after reading this that your infrastructure may not provide the business continuity you require, I recommend you get in touch with the vendor/s who supplied the infrastructure. Go through and document the failure scenarios, the impact each has on the environment, and how the solution is recovered back to a fully resilient state.

Worst case, you’ll identify gaps which will need attention, but think of this as a good thing because this process may identify issues which you can proactively resolve.

Pro Tip: Where possible, choose a standard platform for all workloads.

As discussed in “Things to consider when choosing infrastructure”, choosing a standard platform to support all workloads can have major advantages such as:

  1. Reduced silos
  2. Increased infrastructure utilisation (due to reduced fragmentation of resources)
  3. Reduced operational risk/complexity (due to fewer components)
  4. Reduced OPEX
  5. Reduced CAPEX

The article summarises by stating:

“if you can meet all the customer requirements with a standard platform while working within constraints such as budget, power, cooling, rack space and time to value, then I would suggest you’re doing yourself (or your customer) a dis-service by not considering using a standard platform for your workloads.”

What are some of the key factors to improve business continuity?

  1. Keep it simple (stupid!) and avoid silos of bespoke infrastructure where possible.
  2. Design BEFORE purchasing hardware.
  3. Document BUSINESS requirements AND technical requirements.
  4. Map the technical solution back to the business requirements i.e.: How does each design decision help achieve the business objective/s.
  5. Document risks and how the solution mitigates & responds to the risks.
  6. Perform operational verification i.e.: Validate the solution works as designed/assumed & perform this testing after initial implementation & maintenance/change windows.

Considerations for CIOs / IT Management:

  1. Cost of performance degradation, such as reduced sales transactions/minute and/or employee productivity/morale
  2. Cost of downtime, such as a total outage of IT systems, including lost revenue & impact to your brand
  3. Cost of increased resiliency compared to points 1 & 2
    1. I.e.: It’s often much cheaper to implement a more resilient solution than suffer even a single outage annually
  4. How employees can work from home and continue to be productive
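The comparison in point 3 can be sketched as a small calculation. Every figure below is hypothetical and only there to show the shape of the sum; the function names are made up for illustration, and a real assessment would also include the performance degradation and brand impact from points 1 and 2.

```python
def annual_outage_cost(outages_per_year: float, hours_per_outage: float,
                       cost_per_hour: float) -> float:
    """Expected yearly cost of outages (lost revenue, lost productivity and so on)."""
    return outages_per_year * hours_per_outage * cost_per_hour


def resiliency_pays_off(extra_resiliency_cost_per_year: float,
                        outage_cost_per_year: float) -> bool:
    """True when the added resiliency costs less than the outages it is assumed to prevent."""
    return extra_resiliency_cost_per_year < outage_cost_per_year


# Hypothetical figures: one 8-hour outage a year at 50,000 per hour,
# versus 150,000 a year extra for a more resilient design.
outage_cost = annual_outage_cost(outages_per_year=1, hours_per_outage=8, cost_per_hour=50_000)
print(outage_cost, resiliency_pays_off(150_000, outage_cost))
```

With these made-up numbers a single outage costs more than twice the yearly premium for the more resilient design, which is the pattern point 3 describes.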

Here are a few things to ask of your architect/s when designing infrastructure:

  1. Document failure scenarios and the impact to the infrastructure.
  2. Document how the environment can be upgraded to provide higher levels of resiliency.
  3. Document the Recovery Time (RTO) and Recovery Point Objectives (RPO) and how the environment meets/exceeds these.
  4. Document under what circumstances the environment may/will NOT meet the desired RPO/RTOs.
  5. Design & Document a “Scalable and repeatable model” which allows the environment to be scaled without major re-design or infrastructure replacement to cater for unforeseen workload (e.g. a sudden increase in employees working from home).
  6. Avoid creating unnecessary silos of dissimilar infrastructure
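Points 1, 3 and 4 above lend themselves to a simple, checkable record. The sketch below is one possible shape for it, not a prescribed format: the `FailureScenario` fields, the scenario names and all of the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureScenario:
    name: str
    recovery_minutes: float   # expected time to restore service
    data_loss_minutes: float  # expected window of lost data


def unmet_objectives(scenarios, rto_minutes: float, rpo_minutes: float):
    """Return the names of scenarios whose expected recovery exceeds the RTO or RPO."""
    return [
        s.name
        for s in scenarios
        if s.recovery_minutes > rto_minutes or s.data_loss_minutes > rpo_minutes
    ]


# Hypothetical scenarios for an environment with a 60 minute RTO and 15 minute RPO.
scenarios = [
    FailureScenario("single drive failure", recovery_minutes=0, data_loss_minutes=0),
    FailureScenario("single node failure", recovery_minutes=5, data_loss_minutes=0),
    FailureScenario("site loss, restore from backup", recovery_minutes=480, data_loss_minutes=1440),
]
print(unmet_objectives(scenarios, rto_minutes=60, rpo_minutes=15))
```

Anything the function returns is a circumstance in which the environment will NOT meet the desired RPO/RTO, and it needs to be documented and accepted (or designed out) before purchase.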

Related Articles:

  1. Scale Out Shared Nothing Architecture Resiliency by Nutanix
  2. Hardware support contracts & why 24×7 4 hour onsite should no longer be required.
  3. Nutanix | Scalability, Resiliency & Performance | Index
  4. Nutanix vs VSAN / VxRAIL Comparison Series
  5. How to Architect a VSA , Nutanix or VSAN solution for >=N+1 availability.
  6. Enterprise Architecture and avoiding tunnel vision

The ATO 5-day outage, like most outages, was completely avoidable.

A while back I saw news about the Australian Tax Office (ATO) having a major outage of their storage solution and recently an article was posted titled “ATO reveals cause of SAN failure” which briefly discusses a few contributing factors for the five-day outage.

The article from ITnews.com.au quoted ATO commissioner Chris Jordan as saying:

The failure of the 3PAR SAN was the result of a confluence of events: the fibre optic cables feeding the SAN were not optimally fitted, software bugs on the SAN disk drives meant stored data was inaccessible or unreadable, back-to-base HPE monitoring tools weren’t activated, and the SAN configuration was more focused on performance than stability or resilience, Jordan said.

Before we get into breaking down the issues, I want to start by saying that while this specific incident was with HPE equipment, this is not isolated to HPE and every vendor has had customers suffer similar issues. The major failing in this case, and in the vast majority of failures (especially extended outages), comes back to the enterprise architect/s and operations teams failing to do their job. I’ve seen this time and time again, yet only a very small percentage of so-called architects have a methodology, and an even smaller percentage follow one in any meaningful way on a day-to-day basis.

Now back to the article, let’s break this down to a few key points.

1. The fibre optic cables feeding the SAN were not optimally fitted.

While the statement is a bit vague, cabling issues are a common mistake which can and should be easily discovered and resolved prior to going into production. As per Nutanix Platform Expert (NPX) methodology, an “Operational Verification” document should outline the tests required to be performed prior to a system going into production and/or following a change.

An example of a simple test for a Host (Server) or SAN dual connected to an FC fabric is to disconnect one cable and confirm connectivity remains, then replace the cable, disconnect the other cable and again confirm connectivity.

Another simple test is to remove the power from a FC switch and confirm connectivity via the redundant switch then replace the power and repeat on the other FC switch.
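The two tests above follow the same pattern: take one redundant path down, confirm connectivity survives, restore it, then repeat for the next path. A minimal sketch of that loop is shown below. The `disable`, `enable` and `is_connected` callables are placeholders for whatever manual or automated steps apply in a real environment, and `SimulatedFabric` is a hypothetical stand-in so the example can run without hardware.

```python
from typing import Callable, Dict, Iterable


def verify_redundancy(
    paths: Iterable[str],
    disable: Callable[[str], None],
    enable: Callable[[str], None],
    is_connected: Callable[[], bool],
) -> Dict[str, bool]:
    """Take each path down in turn and record whether connectivity survives."""
    results = {}
    for path in paths:
        disable(path)
        try:
            results[path] = is_connected()
        finally:
            enable(path)  # always restore the path before testing the next one
    return results


class SimulatedFabric:
    """Stand-in for real hardware: connected while any healthy path is up."""

    def __init__(self, healthy_paths):
        self.healthy = set(healthy_paths)
        self.down = set()

    def disable(self, path):
        self.down.add(path)

    def enable(self, path):
        self.down.discard(path)

    def is_connected(self):
        return bool(self.healthy - self.down)


# Path "B" is cabled but faulty, so pulling "A" exposes the problem.
fabric = SimulatedFabric(healthy_paths={"A"})
print(verify_redundancy(["A", "B"], fabric.disable, fabric.enable, fabric.is_connected))
```

A `False` against a path means connectivity was lost while that path was down, i.e. the supposedly redundant path was not actually working. In normal operation nothing would have looked wrong, which is exactly why the test has to be performed deliberately.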

Had an Operational Verification document been created to an NPX standard, and subsequently followed prior to going live and after any changes, this cabling issue would very likely not have been a contributing factor to the outage.

This is an architectural and operational failure. It’s an operational failure because no engineer worth having would complete a change without operational verification document/s to follow to validate a successful implementation/change.

2. Software bugs on the SAN disk drives meant stored data was inaccessible or unreadable.

In my opinion this is where the vendor is likely more at fault than the customer, however customers and their architect/s need to mitigate against these types of risks. Again an Operational Verification document should have tests which confirm functionality (in this case, simple read operations) from the storage, during normal and degraded scenarios such as drive pulls (simulating SSD/HDD failures) and drive shelf loss (i.e. the loss of a bulk number of drives in a shelf, typically between 12 and 24).

Failure scenarios should be clearly documented along with the risk/s, mitigation/s and recovery plan, all of which need to be mapped back to the business requirements, e.g. Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

Again, this is both an architectural and operational failure as the architect should have documented/highlighted the risks as well as mitigation and recovery strategy, while the engineers should never have accepted a solution into BAU (Business as Usual) operations without these documents.

3. “Back-to-base HPE monitoring tools weren’t activated”

There is no excuse for this, and the ATO’s architects and to a lesser extent the operational team need to take responsibility here. While a vendor should continually be nagging customers to enable these tools, any enterprise architect worth having mandates monitoring tools sufficient to ensure continuous operation of the solution they design. The Operational Verification document would also have steps to test monitoring tools and ensure the alerting and call home functionality is working both before going into production and at scheduled intervals to ensure continued operation.

This is yet another architectural and operational failure.
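The “at scheduled intervals” part is easy to automate in principle. The sketch below shows the idea only: it assumes you can obtain the timestamp of the last successful call-home from somewhere, and the 24 hour threshold is an arbitrary example. It does not use any HPE or other vendor API, and the function name is made up.

```python
from datetime import datetime, timedelta


def call_home_is_stale(last_heartbeat: datetime, now: datetime,
                       max_age: timedelta = timedelta(hours=24)) -> bool:
    """True when the last successful call-home is older than the allowed age."""
    return now - last_heartbeat > max_age


# Hypothetical timestamps: the last call-home was three days before the check.
checked_at = datetime(2017, 1, 10, 9, 0)
last_seen = datetime(2017, 1, 7, 9, 0)
print(call_home_is_stale(last_seen, checked_at))
```

A stale result should raise an alert through a channel that does not depend on the system being monitored.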

4. SAN configuration was more focused on performance than stability or resilience.

This not only doesn’t surprise me but highlights a point I have raised for many years: there is a disproportionately high focus on performance, specifically peak performance, compared to data integrity, resiliency and stability.

In 2015 I wrote “Peak Performance vs Real World Performance” after continuously having to have these discussions with customers. The post covers the topic in reasonable depth but some of the key points are:

  1. Peak performance is rarely a significant factor for a storage solution.
  2. Understand and document your storage requirements / constraints before considering products.
  3. Create a viability/success criteria when considering storage which validates the solution meets your requirements within the constraints.

In this case the architect/s who designed the solution had tunnel vision around performance, when the solution likely didn’t need to be configured in such a way to meet the requirements assuming they were well understood and documented/validated.

If the SAN needed to be configured the way it was to meet the performance requirements, then it was simply the wrong solution, because it was not configured to meet the other vastly more important requirements around availability, resiliency and recoverability. The solution was certainly not validated against any meaningful criteria before going into production, or many of these issues would not have occurred. Alternatively, in the unlikely event of multiple concurrent failures, the recoverability requirements were not designed for or understood sufficiently.

This is again an architectural and operational failure.

ATO commissioner Chris Jordan also stated:

While only 12 of 800 disk drives failed, they impacted most ATO systems.

This means the solution was designed/configured with a tolerance for just 1.5% of drives to fail before a catastrophic failure would occur. This in my mind is so far from a minimally viable solution it’s not funny. What’s less funny is that this fact is unlikely to have been understood by the ATO, which means the failure scenarios and associated risks were not documented and mitigated in any meaningful way.

As an example, in even a small four node Nutanix solution with just 24 drives, an entire node’s worth of drives (6) can be lost concurrently (that’s 25%) without data loss or unavailability. In a 5 node Nutanix NX-8150 cluster with RF3, up to 48 drives (of a total 120, which is 40%) can be lost without data loss or unavailability, and the system can even self-heal without hardware replacement to restore resiliency automatically so further failures can be tolerated. This kind of resiliency/recoverability is essential for modern datacenters and something that would have at least mitigated or even avoided this outage altogether.

But this isn’t a product pitch, this is an example of what enterprise architects need to consider when choosing infrastructure for a project, i.e. what happens if X, Y and/or Z fails and how does the system recover (i.e. manually, automatically etc).

Yet another thing which doesn’t surprise me is the fact that failure domains do not appear to have been considered, as the recovery tools were located on the SAN they were required to protect.
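The percentages quoted here and for the ATO are simple ratios, and it is worth working them out for any design under consideration. The helper below is a trivial sketch (the function name is made up) using only the drive counts given in this post.

```python
def failed_drive_percent(drives_lost: int, total_drives: int) -> float:
    """Drives lost as a percentage of all drives in the system."""
    if total_drives <= 0 or not 0 <= drives_lost <= total_drives:
        raise ValueError("drives_lost must be between 0 and total_drives")
    return 100 * drives_lost / total_drives


print(failed_drive_percent(12, 800))   # ATO: drives that failed before most systems were impacted
print(failed_drive_percent(6, 24))     # four node example: one node's worth of drives
print(failed_drive_percent(48, 120))   # five node RF3 example
```

This prints 1.5, 25.0 and 40.0 respectively. The useful question for any design is which of these numbers the architect has actually documented as the point where data becomes unavailable.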

Additionally, some of the recovery tools that were required to restore the equipment were located on the SAN that failed.

It is critical to understand failure scenarios!! Wow I am sounding like a broken record but the message is simply not getting through to the majority of architects.

Recovery/management tools are no use to you when they are offline. If they are on the same infrastructure that requires the tools to be online to be able to recover, then your solution’s recoverability is at high risk.

Yet another architectural failure followed by an operations team failure for accepting the environment and not highlighting the architecture failures.

In most, if not all enterprise environments, a separate management cluster using storage from a separate failure domain is essential. It’s not a “nice to have”, it’s essential. It is very likely the five-day outage would have been reduced, or at least the cause diagnosed much faster, had the ATO had a small, isolated management cluster running the tooling required to diagnose the SAN.
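A failure domain review can start as something as basic as listing each recovery tool against where it runs and what it is needed to recover. The sketch below flags any tool hosted on the very infrastructure it is meant to recover. The `RecoveryTool` fields and all names in the example inventory are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTool:
    name: str
    hosted_on: str  # infrastructure the tool runs on or stores its data on
    recovers: str   # infrastructure the tool is needed to recover


def tools_in_same_failure_domain(tools):
    """Return tools that would be offline exactly when they are needed."""
    return [t.name for t in tools if t.hosted_on == t.recovers]


# Hypothetical inventory
tools = [
    RecoveryTool("san-diagnostics", hosted_on="production-san", recovers="production-san"),
    RecoveryTool("backup-server", hosted_on="management-cluster", recovers="production-san"),
]
print(tools_in_same_failure_domain(tools))
```

A real review also needs to follow indirect dependencies (for example a management VM whose datastore lives on the production SAN), which this simple equality check does not attempt.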

The article concludes with a final quote from ATO commissioner Chris Jordan:

The details are confidential, he said, but the deal recoups key costs incurred by the ATO, and gives the agency new and “higher-grade” equipment to equip it with a “world-class storage network”.

I am pleased the vendor (in this case HPE) has taken at least some responsibility and while the details are confidential, from my perspective higher grade equipment and world class storage network mean nothing without an enterprise architect who follows a proven methodology like NPX.

If the architect/s don’t document the requirements, risks, constraints and assumptions, design a solution with supporting documentation which maps the solution back to these areas, and then document comprehensive Operational Verification procedures for moving into production and for subsequent changes before declaring a change successful, the ATO (and other customers in similar positions) are destined to repeat the same mistakes.

If anyone from the ATO happens to read this, ensure your IT team has a solid methodology for the new deployment, and if they don’t, feel free to reach out and I’ll raise my hand to get involved and lead the project to a successful outcome following NPX methodology.

In closing, everyone involved in a project must take responsibility. If the architect screws up, the ops team should call it out. If the ops team calls it out and the project manager ignores it, the ops team should escalate. If the escalation doesn’t work, document the issues/risks and continue making your concerns known even after somebody accepts responsibility for the risk. After all, a risk doesn’t magically disappear when a person accepts responsibility; it simply creates a CV generating event for that person when things do go wrong, and the customer is still left up the creek without a paddle.

It’s long overdue that so-called enterprise architects live up to the standard at which they are (typically) paid. Every major decision by an architect should be documented to a minimum of the standard shown in my Example Architectural Decision section of this blog as well as mapped back to specific customer requirements, risks, constraints and assumptions.

For the ATO and any other customers, I recommend you look for architects with proven track records, portfolios of project documentation which they can share (even if redacted for confidentiality) as well as certifications like NPX and VCDX which require panel style reviews by peers, not multiple choice exams which are all but a waste of paper (e.g. MCP/VCP/MCSE/CCNA etc). The skills of a VCDX/NPX are transferable to non-VMware/Nutanix environments as it’s the methodology which forms most of the value. The product experience from these certs still has value and is also transferable, as learning new tech is much easier than finding a great enterprise architect!

And remember, when it comes to choosing an enterprise architect…

cheaper