IT Infrastructure Business Continuity & Disaster Recovery (BC/DR) – Coronavirus edition

Back in 2014, I wrote about Hardware support contracts & why 24×7 4 hour onsite should no longer be required. For those of you who haven’t read the article, I recommend doing so prior to reading this post.

In short, the post talked about the concept of the typical old-school requirement to have expensive 24/7, 2 or 4-hour maintenance contracts and how these become all but redundant when IT solutions are designed with appropriate levels of resiliency and have self-healing capabilities capable of meeting the business continuity requirements.

Some of the key points I made regarding hardware maintenance contracts included:

a) Vendors failing to meet SLA for onsite support.

b) Vendors failing to have the required parts available within the SLA.

c) Replacement HW being refurbished (common practice) and being faulty.

d) The more proprietary the HW, the more likely replacement parts will not be available in a timely manner.

All of these are applicable to all vendors and can significantly impact the ability to get the IT infrastructure back online or back to a resilient state where subsequent failures may be tolerated without downtime or data loss.

I thought with the current Coronavirus pandemic, it’s important to revisit this topic and see what we can do to improve the resiliency of our critical IT infrastructure and ensure business continuity no matter what the situation.

Let’s start with “Vendors failing to meet SLA for onsite support.”

At the time of writing, companies the world over are asking employees to work from home and operate on skeleton staff. This will no doubt impact vendor abilities to provide their typical levels of support.

Governments are also encouraging social distancing – asking people to isolate themselves and avoid unnecessary travel.

We would be foolish to assume this won’t impact vendor abilities to provide support, especially hardware support.

What about Vendors failing to have the required parts available within the SLA?

Currently I’m seeing significantly fewer flights operating, e.g. from the USA to Europe, which will no doubt delay parts shipments and make it harder to meet target service level agreements.

Regarding vendors using potentially faulty refurbished hardware (a common practice), this risk in itself isn’t increased, but if a refurbished part does turn out to be faulty, the shipment of alternative/new parts is likely to be delayed.

Lastly, infrastructure leveraging proprietary HW makes it more likely that replacement parts will not be available in a timely manner.

What are some of the options Enterprise Architects can offer their customers/employers when it comes to delivering highly resilient infrastructure to meet/exceed business continuity requirements?

Let’s start with the assumption that replacement hardware isn’t available for one week, which is likely much more realistic than same-day replacement for the majority of customers considering the current pandemic.
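To put that one-week exposure window into perspective, here’s a minimal sketch (with entirely illustrative component counts and failure rates, not vendor figures) estimating the probability of a subsequent component failure while you wait for parts, assuming a simple constant failure rate:

```python
import math

def prob_subsequent_failure(component_count, annual_failure_rate, wait_days):
    """Probability that at least one more component fails while awaiting
    replacement, assuming independent failures at a constant rate."""
    daily_rate = annual_failure_rate / 365.0
    expected_failures = component_count * daily_rate * wait_days
    return 1 - math.exp(-expected_failures)

# Hypothetical fleet: 120 drives at a 2% annual failure rate, 7-day wait for parts
print(f"{prob_subsequent_failure(120, 0.02, 7):.1%}")  # ~4.5%
```

Even with these modest assumed rates, the risk of a second failure during the wait is far from negligible, which is exactly why the requirement below matters.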

Business Continuity Requirement #1: Infrastructure must be able to tolerate at least one component failure and have the ability to self heal back to a resilient state where a subsequent failure can be tolerated.

By component failure, I’m talking about things like:

a) HDD/SSDs

b) Physical server/node

c) Networking device such as a switch

d) Storage controller (SAN/NAS controllers, or in the case of HCI, a node)

HDDs/SSDs have traditionally been protected using RAID and hot spares, although this is becoming less common due to RAID’s inherent limitations and the high impact of failures and rebuilds.

For physical servers/nodes, products like VMware vSphere, Microsoft Hyper-V and Nutanix AHV all have “High Availability” functions which allow virtual machines to recover onto other physical servers in a cluster in the event of a physical server failure.

For networking, typically leaf/spine topologies provide a sufficient level of protection with a minimum of dual connections to all devices. Depending on the criticality of the environment, quad connections may be considered/required.

Lastly, with storage controllers, traditional dual-controller SAN/NAS has a serious constraint when it comes to resiliency: it requires hardware replacement to restore resiliency. This is one reason why Hyper-Converged Infrastructure (a.k.a. HCI) has become so popular: some HCI products can tolerate multiple storage controller failures and continue to function and self-heal thanks to their distributed/clustered architecture.

So with these things in mind, how do we meet our Business Continuity Requirement?

Disclaimer: I work for Nutanix, a company that provides Hyper-Converged Infrastructure (HCI), so I’ll be using this technology as my example of how resilient infrastructure can be designed. With that said the article and the key points I highlight are conceptual and can be applied to any environment regardless of vendor.

For example, Nutanix uses a Scale Out Shared Nothing architecture to deliver highly resilient, self-healing capabilities. In this example, we have a small Nutanix cluster of just 5 nodes. The linked post (see Related Articles) shows the environment suffering a physical server failure, then self healing both the CPU/RAM and storage layers back to a fully resilient state, and then tolerating a further physical server failure.

After the second physical server failure, it’s critical to note the Nutanix environment has self healed back to a fully resilient state and has the ability to tolerate another physical server failure.

In fact the environment has lost 40% of its infrastructure and Nutanix still maintains data integrity & resiliency. If a third physical server failed, the environment would continue to function maintaining data integrity, though it may not be able to tolerate a subsequent disk failure without data becoming unavailable.

So in this simple example of a small 5-node Nutanix environment, up to 60% of the physical servers can be lost and the business would continue to function.
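For readers who want to reason through this themselves, below is a minimal sketch of the arithmetic (not Nutanix sizing logic; the per-node capacity and data figures are my own illustrative assumptions) for checking whether a cluster can fully self heal after losing one or more nodes when data is kept as two copies (RF2):

```python
def can_self_heal(total_nodes, failed_nodes, capacity_per_node_tb,
                  logical_data_tb, replication_copies=2):
    """Rough check: after losing `failed_nodes`, do the survivors have enough
    raw capacity to re-create every lost data copy, and are there still enough
    nodes to keep each copy on a separate node?"""
    surviving_nodes = total_nodes - failed_nodes
    if surviving_nodes < replication_copies:
        return False
    surviving_raw_tb = surviving_nodes * capacity_per_node_tb
    required_raw_tb = logical_data_tb * replication_copies
    return surviving_raw_tb >= required_raw_tb

# Hypothetical 5-node cluster: 20TB raw per node, 30TB of logical data, 2 copies (RF2)
for failures in range(4):
    print(f"{failures} node(s) lost -> can fully self heal: "
          f"{can_self_heal(5, failures, 20, 30)}")
```

With these illustrative numbers the cluster can fully re-protect itself after a first and a second node loss, but a third loss leaves insufficient raw capacity to restore full resiliency, which mirrors the behaviour described above.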

With all these component failures, it’s important to note that the Nutanix platform’s self healing completed without any human intervention.

For those who want more technical detail, check out my post which shows Nutanix node (server) failure rebuild performance.

From a business perspective, a Nutanix environment can be designed so that the infrastructure can self heal from a node failure in minutes, not hours or days. The platform’s ability to self heal in a timely manner is critical to reduce the risk of a subsequent failure causing downtime or data loss.
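As a rough way to think about “minutes, not hours”, here’s a back-of-the-envelope sketch using an assumed per-node rebuild throughput (an illustrative figure, not a measured Nutanix number); the key point is that every surviving node contributes, so rebuild times shrink as the cluster grows:

```python
def rebuild_minutes(data_to_reprotect_tb, surviving_nodes, per_node_mbps=500):
    """Estimate rebuild time when every surviving node contributes
    `per_node_mbps` MB/s towards re-creating the lost data copies."""
    aggregate_mbps = surviving_nodes * per_node_mbps
    seconds = (data_to_reprotect_tb * 1_000_000) / aggregate_mbps  # TB -> MB
    return seconds / 60

# Hypothetical: 4TB of data to re-protect after a node failure
print(f"{rebuild_minutes(4, surviving_nodes=4):.0f} minutes")   # small cluster
print(f"{rebuild_minutes(4, surviving_nodes=31):.0f} minutes")  # larger cluster
```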

Key Point: The ability for infrastructure to self heal back to a fully resilient state following one or more failures WITHOUT human intervention or hardware replacement should be a firm requirement for any new or upgraded infrastructure.

So the good news for Nutanix customers is that during this pandemic, or future events, assuming the infrastructure has been designed to tolerate one or more failures and self heal, the potential (if not likely) delays in hardware replacement are unlikely to impact business continuity.

For those of you who are concerned after reading this that your infrastructure may not provide the business continuity you require, I recommend you get in touch with the vendor/s who supplied the infrastructure, work through and document the failure scenarios, the impact each has on the environment, and how the solution is recovered back to a fully resilient state.

Worst case, you’ll identify gaps which will need attention, but think of this as a good thing because this process may identify issues which you can proactively resolve.

Pro Tip: Where possible, choose a standard platform for all workloads.

As discussed in “Things to consider when choosing infrastructure”, choosing a standard platform to support all workloads can have major advantages such as:

  1. Reduced silos
  2. Increased infrastructure utilisation (due to reduced fragmentation of resources)
  3. Reduced operational risk/complexity (due to fewer components)
  4. Reduced OPEX
  5. Reduced CAPEX

The article summarises by stating:

“if you can meet all the customer requirements with a standard platform while working within constraints such as budget, power, cooling, rack space and time to value, then I would suggest you’re doing yourself (or your customer) a dis-service by not considering using a standard platform for your workloads.”

What are some of the key factors to improve business continuity?

  1. Keep it simple (stupid!) and avoid silos of bespoke infrastructure where possible.
  2. Design BEFORE purchasing hardware.
  3. Document BUSINESS requirements AND technical requirements.
  4. Map the technical solution back to the business requirements i.e.: How does each design decision help achieve the business objective/s.
  5. Document risks and how the solution mitigates & responds to the risks.
  6. Perform operational verification i.e.: Validate the solution works as designed/assumed & perform this testing after initial implementation & maintenance/change windows.

Considerations for CIOs / IT Management:

  1. The cost of performance degradation, such as reduced sales transactions per minute and/or reduced employee productivity/morale
  2. The cost of downtime, e.g. a total outage of IT systems, including lost revenue and damage to your brand
  3. The cost of increased resiliency compared to points 1 & 2 (a simple comparison sketch follows this list)
    1. I.e.: It’s often much cheaper to implement a more resilient solution than to suffer even a single outage annually
  4. How employees can work from home and continue to be productive
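To make point 3 concrete, here is a minimal sketch with entirely hypothetical figures comparing the annualised cost of outages against the incremental cost of a more resilient design; substitute your own revenue, productivity and infrastructure numbers:

```python
def annual_outage_cost(outage_hours_per_year, revenue_per_hour,
                       productivity_cost_per_hour=0):
    """Direct revenue loss plus lost productivity (brand damage excluded)."""
    return outage_hours_per_year * (revenue_per_hour + productivity_cost_per_hour)

# Hypothetical figures only
extra_resiliency_cost_per_year = 60_000  # e.g. an additional N+1 node, amortised
outage_cost = annual_outage_cost(outage_hours_per_year=8,
                                 revenue_per_hour=25_000,
                                 productivity_cost_per_hour=5_000)
print(f"Outage cost: ${outage_cost:,.0f} vs extra resiliency: "
      f"${extra_resiliency_cost_per_year:,.0f}")
# Outage cost: $240,000 vs extra resiliency: $60,000
```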

Here are a few things to ask of your architect/s when designing infrastructure:

  1. Document failure scenarios and the impact to the infrastructure.
  2. Document how the environment can be upgraded to provide higher levels of resiliency.
  3. Document the Recovery Time (RTO) and Recovery Point Objectives (RPO) and how the environment meets/exceeds these.
  4. Document under what circumstances the environment may/will NOT meet the desired RPO/RTOs.
  5. Design & Document a “Scalable and repeatable model” which allows the environment to be scaled without major re-design or infrastructure replacement to cater for unforeseen workload (e.g.: Such as a sudden increase in employees working from home).
  6. Avoid creating unnecessary silos of dissimilar infrastructure

Related Articles:

  1. Scale Out Shared Nothing Architecture Resiliency by Nutanix
  2. Hardware support contracts & why 24×7 4 hour onsite should no longer be required.
  3. Nutanix | Scalability, Resiliency & Performance | Index
  4. Nutanix vs VSAN / VxRAIL Comparison Series
  5. How to Architect a VSA , Nutanix or VSAN solution for >=N+1 availability.
  6. Enterprise Architecture and avoiding tunnel vision

A TCO Analysis of Pure FlashStack & Nutanix Enterprise Cloud

In helping to prepare this TCO with Steve Kaplan here at Nutanix, I’ll be honest and say I was a little surprised at the results.

The Nutanix Enterprise Cloud platform is the leading solution in the HCI space, and while it aims to deliver great business outcomes and minimise CAPEX, OPEX and TCO, the platform is not designed to be “cheap”.

Nutanix is more like a car manufacturer’s top-of-the-range model, with options for different customer requirements: these range from high-end deployments for business-critical applications down to lower-end products for ROBO, such as the Nutanix Xpress model.

Steve and I agreed that our TCO report needed to give the benefit of the doubt to Pure Storage, as we do not claim to be experts in their specific storage technology. We also decided that, as experts in the Nutanix Enterprise Cloud platform and employees of Nutanix, we should minimize the potential for our biases towards Nutanix to come into play.

The way we tried to achieve the most unbiased view possible was to give no benefit of the doubt to the Nutanix Enterprise Cloud solution. While we both know the value of many Nutanix capabilities (such as data reduction), we excluded these benefits and used configurations which could be argued as excessive/unnecessary, such as vSphere or RF3 for data protection:

  1. No data reduction is assumed (Compression or Deduplication)
  2. No advantage for data locality in terms of reduced networking requirements or increased performance
  3. Only 20K IOPS @ 32K IO Size per All Flash Node
  4. Resiliency Factor 3 (RF3) for dual-parity data protection, which is the least capacity-efficient configuration and therefore requires more hardware (see the capacity sketch after this list).
  5. No Erasure Coding (EC-X) meaning higher overheads for data protection.
  6. The CVM is measured as an overhead with no performance advantage assumed (e.g.: Lower latency, Higher CPU efficiency from low latency, Data Locality etc)
  7. Using vSphere which means Nutanix cannot take advantage of AHV Turbo Mode for higher performance & lower overheads
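To show what assumptions 1, 4 and 5 give up in capacity terms, here’s a minimal sketch of usable capacity under RF2 vs RF3, with data reduction and erasure coding treated as simple multipliers; the ratios are illustrative, not Nutanix-published efficiency figures:

```python
def usable_tb(raw_tb, replication_copies=2, data_reduction_ratio=1.0,
              ec_efficiency_gain=1.0):
    """Usable capacity after replication overhead, then data reduction and
    erasure-coding gains (both modelled as simple multipliers)."""
    return raw_tb / replication_copies * data_reduction_ratio * ec_efficiency_gain

raw = 100  # TB of raw flash across the cluster (hypothetical)
print(usable_tb(raw, replication_copies=3))                            # RF3, no savings: ~33.3 TB
print(usable_tb(raw, replication_copies=2))                            # RF2, no savings: 50.0 TB
print(usable_tb(raw, replication_copies=2, data_reduction_ratio=2.0))  # RF2 + 2:1 reduction: 100.0 TB
```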

On the other hand, the benefit of the doubt has been given to Pure Storage at every opportunity in this comparison including the following:

  1. 4:1 data reduction efficiency as claimed
  2. Only 2 x 10Gb NICs required for VM and storage traffic
  3. No dedicated FC switches or cables (same as Nutanix)
  4. 100% of claimed performance (IOPS capability) for the M20, M50 and M70 models
  5. Zero cost for the project/change control/hands on work to swap Controllers as the solution scales
  6. IOPS based on the Pure Storage claimed average I/O size of 32K for all IO calculations

We invited DeepStorage and Vaughn Stewart of Pure Storage to discuss the TCO and help validate our assumptions, pricing, sizing and other details. Both parties declined.

Feedback/corrections regarding the Pure Storage sponsored technical report by DeepStorage were sent via email. DeepStorage declined to discuss the issues, and the report remains online with many factual errors and an array (pun intended) of misleading statements, which I covered in detail in my Response to: DeepStorage.net Exploring the true cost of Converged vs Hyperconverged Infrastructure.

It’s important to note that the Nutanix TCO report is based on the node configuration chosen by DeepStorage with only one difference: Nutanix sized for the same usable capacity but went with an all-flash solution, because comparing hybrid and all-flash is an apples-and-oranges, pointless comparison.

With that said, the configuration DeepStorage chose does not reflect an optimally designed Nutanix solution. An optimally designed solution would likely use fewer nodes, with 14c or 18c processors to match the high RAM configuration (512GB) and lower-capacity SSDs (such as 1.2TB or 1.6TB), which would deliver the same performance, still meet the capacity requirements, and result in a further advantage in CAPEX, OPEX and TCO (Total Cost of Ownership).
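The sizing argument above essentially comes down to taking the maximum of the node counts implied by each resource. Here’s a minimal sketch of that logic with hypothetical per-node specifications (not the DeepStorage or Nutanix configurations from the report):

```python
import math

def nodes_required(requirements, per_node):
    """Node count is driven by whichever resource (capacity, cores, RAM, IOPS)
    needs the most nodes, plus one extra node for N+1 availability."""
    per_resource = {r: math.ceil(requirements[r] / per_node[r]) for r in requirements}
    return max(per_resource.values()) + 1, per_resource

# Hypothetical workload and per-node specifications
requirements = {"usable_tb": 200, "cores": 600, "ram_gb": 12_000, "iops": 400_000}
per_node     = {"usable_tb": 10,  "cores": 28,  "ram_gb": 512,    "iops": 20_000}
total, breakdown = nodes_required(requirements, per_node)
print(breakdown, "->", total, "nodes")
# {'usable_tb': 20, 'cores': 22, 'ram_gb': 24, 'iops': 20} -> 25 nodes
```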

The TCO shows that CAPEX is typically in favour of the Nutanix all-flash solution. We have chosen to show the costs at different stages of scaling, from 4 to 32 nodes – the same as the DeepStorage report. The FlashStack product had slightly lower CAPEX on a few occasions, which is not surprising, nor something we tried to hide in order to make Nutanix always look cheaper.

One thing which was somewhat surprising is that even with the top-of-the-range Pure M70 controllers and a relatively low assumption of 250 IOPS per VM, above 24 nodes the Pure system could not support the required IOPS and an additional M20 needed to be added to the solution. What was not surprising is that when an additional pair of controllers and SSDs is added to the FlashStack solution, the Nutanix solution has vastly lower CAPEX/OPEX and, of course, TCO. However, I also wanted to show what the figures look like if we assume IOPS is not a constraint for Pure FlashStack, as could be the case in some customer environments, since customer requirements vary.

[Figure: Pure FlashStack vs Nutanix cost comparison with the lower IOPS assumption (no IOPS constraint for Pure)]

What we see above is that the difference in CAPEX is still just 14.0863% at 28 nodes and 13.1272% at 32 nodes, in favor of Pure FlashStack.

The TCO, however, is still in favor of Nutanix: by 8.88229% at 28 nodes and 9.70447% at 32 nodes.

If we talk about the system performance capabilities, the Nutanix platform is never constrained by IOPS due to the scale out architecture.

Based on Pure Storage’s advertised performance and a conservative 20K IOPS (@ 32K) per Nutanix node, we see (below) that the Nutanix IO capability is always ahead of Pure FlashStack, with the exception of a 4-node solution under our conservative IO assumptions. In the real world, even if Nutanix were only capable of 20K IOPS per node, the platform vastly exceeds the requirements in this example (and, in my experience, in real-world solutions), even at 4-node scale.

[Figure: Pure FlashStack vs Nutanix IOPS capability as the environment scales]
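The chart above is essentially linear scale-out plotted against a fixed controller ceiling. Here’s a minimal sketch of that comparison: the 250 IOPS per VM and 20K IOPS per node figures come from the assumptions stated earlier, while the VM density and the array ceiling are hypothetical placeholders rather than Pure Storage’s published numbers:

```python
def required_iops(nodes, vms_per_node=50, iops_per_vm=250):
    return nodes * vms_per_node * iops_per_vm       # workload requirement grows with nodes

def scale_out_iops(nodes, iops_per_node=20_000):
    return nodes * iops_per_node                    # every node adds IO capability

def dual_controller_iops(ceiling=300_000):
    return ceiling                                  # fixed until controllers are added/swapped

for nodes in (4, 16, 24, 28, 32):
    need = required_iops(nodes)
    print(f"{nodes} nodes: need {need:,} IOPS | "
          f"scale-out ok: {scale_out_iops(nodes) >= need} | "
          f"fixed array ok: {dual_controller_iops() >= need}")
```

With these placeholder figures, the scale-out platform keeps pace at every size, while the fixed ceiling is exceeded somewhere past the mid-twenties of nodes, which is the shape of the result described above.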

I’ve learned a lot, as well as re-validated some things I’ve previously discovered, from the exercise of contributing to this Total Cost of Ownership (TCO) analysis.

Some of the key conclusions are:

  1. In many real world scenarios, data reduction is not required to achieve a lower TCO than a competing product which leverages data reduction.
  2. Even the latest/greatest dual controller SANs still suffer the same problems of legacy storage when it comes to scaling to support capacity/IO requirements
  3. The ability to scale without rip-and-replace of storage controllers greatly simplifies sizing for customers
  4. Nutanix has a strong advantage in Power, Cooling, Rack Space and therefore helps avoid additional datacenter related costs.
  5. Even the top of the range All Flash array from arguably the top vendor in the market (Pure Storage) cannot match the performance (IOPS or throughput) of Nutanix.

The final point I would like to make is that the biggest factor dictating the cost of any platform, be it CAPEX, OPEX or TCO, is the requirements, constraints, risks and assumptions. Without these, and a detailed TCO, any discussion of cost has no basis and should be disregarded.

In our TCO, we have detailed the requirements, which are in line with the DeepStorage report but go further to give the solution context. The Nutanix TCO report covers the high-level requirements and assumptions in the Use Case Descriptions.

Without further ado, here is the link to the Total Cost of Ownership comparison between Pure FlashStack and Nutanix Enterprise Cloud platform along with the analysis by Steve Kaplan.

Problem: ROBO/Dark Site Management, Solution: XCP + AHV

Problem:

Remote Office / Branch Office sites, commonly referred to as “ROBO”, and dark sites (i.e.: offices without local support staff and/or network connectivity to a central datacenter) are notoriously difficult to design, deploy and manage.

Why have infrastructure at ROBO?

Customers have infrastructure at ROBO and/or dark sites because these sites require services which cannot be provided centrally due to any number of constraints, such as WAN bandwidth/latency/availability or, more frequently, security constraints.

Challenges:

Infrastructure at ROBO and/or dark sites needs to be functional, highly available and performant without complexity. The problem is that the functional requirements of ROBO/dark sites are typically not dissimilar to those of the datacenter/s, so the complexity of these sites can equal, or even exceed, that of the primary datacenter due to the reduced budgets for ROBO.

This means in many cases the same management stack needs to be designed on a smaller scale, deployed, and somehow managed at these remote/secure sites with minimal to no I.T. presence onsite.

Alternatively, management may be run centrally, but this has its own challenges, especially when WAN links are high-latency/low-bandwidth or unreliable/offline.

Typical ROBO deployment requirements.

Typical requirements are in many cases not dissimilar to those of the SMB or enterprise, and include things like High Availability (HA) for VMs, which means a minimum of 2 nodes and some form of shared storage. Customers also want to ensure ROBO sites can be centrally managed without deploying complex tooling at each site.

ROBO and Dark Sites are also typically deployed because in the event of WAN connectivity loss, it is critical for the site to continue to function. As a result, it is also critical for the infrastructure to gracefully handle failures.

So let’s summarise typical ROBO requirements:

  • VM High Availability
  • Shared Storage
  • Be fully functional when WAN/MAN is down
  • Low/no touch from I.T
  • Backup/Recovery
  • Disaster Recovery

Solution:

Nutanix Xtreme Computing Platform (XCP) including PRISM and Acropolis Hypervisor (AHV).

Now let’s dive into why XCP + PRISM + AHV is a great solution for ROBO.

A) Native Cross Hypervisor & Cloud Backup/Recovery & DR

Backup/Recovery and DR are not easy things to achieve or manage for ROBO deployments. Luckily, these capabilities are built into Nutanix XCP. This includes the ability to take point-in-time, application-consistent snapshots and replicate them to local/remote XCP clusters and cloud providers (AWS/Azure). These snapshots can be considered backups once replicated to a second location (ideally offsite), and can also be kept locally on primary storage for fast recovery.
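As a simple way to reason about what a snapshot schedule means for recovery points, here’s a minimal sketch (generic arithmetic, not Nutanix scheduling logic) of the worst-case RPO given a snapshot interval and the time it takes to replicate each snapshot offsite:

```python
def worst_case_rpo_minutes(snapshot_interval_min, replication_min):
    """Worst case: data written just after a snapshot is only protected offsite
    once the *next* snapshot has been taken and fully replicated."""
    return snapshot_interval_min + replication_min

# Hypothetical: hourly snapshots that take ~10 minutes to replicate over the WAN
print(worst_case_rpo_minutes(60, 10), "minutes")  # 70 minutes
```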

ROBO VMs replicated to remote/central XCP deployments can be restored onto either ESXi or Hyper-V via the App Mobility Fabric (AMF) so running AHV at the ROBO has no impact on the ability to recover centrally if required.

This is just another way Nutanix is ensuring customer choice, and it proves the hypervisor is well and truly a commodity.

In addition XCP supports integration with the market leader in data protection, Commvault.

B) Built in Highly Available, Distributed Management and Monitoring

When running AHV, all XCP, PRISM and AHV management, monitoring and even the HTML 5 GUI are built in. The management stack requires no design, sizing, installation, scaling or 3rd-party backend database products such as SQL/Oracle.

For those of you familiar with the VMware stack, XCP + AHV provides capabilities equivalent to vCenter, vCenter Heartbeat, vRealize Operations Manager, the Web Client, vSphere Data Protection and vSphere Replication, and it does so in a highly available and distributed manner.

This means, in the event of a node failure, the management layer does not go down. If the Acropolis Master node goes down, the Master roles are simply taken over by an Acropolis Slave within the cluster.

As a result, the ROBO deployment management layer is self healing, which dramatically reduces complexity and all but removes the requirement for onsite attendance by I.T.

C) Scalability and Flexibility

XCP with AHV ensures that even when ROBO deployments need to scale to meet compute or storage requirements, the platform does not need to be re-architected, re-engineered or optimised.

Adding a node is as simple as plugging it in and turning it on; the cluster can then be expanded non-disruptively via PRISM (locally or remotely) in just a few clicks.

When the equipment becomes end of life, XCP also allows nodes to be non-disruptively removed from clusters and new nodes added, which means after the initial deployment, ongoing hardware replacements can be done without major redesign/reconfiguration of the environment.

In fact, deployment of new nodes can be done by people onsite with minimal I.T knowledge and experience.

D) Built-in One Click Maintenance, Upgrades for the entire stack.

XCP supports one-click, non-disruptive upgrades of:

  • Acropolis Base Software (NDSF layer),
  • Hypervisor (agnostic)
  • Firmware
  • BIOS

This means there is no need for onsite I.T. staff to perform these upgrades, and XCP eliminates potential human error by fully automating the process. All upgrades are performed one node at a time and are only started if the cluster is in a resilient state, to ensure maximum uptime. Once a node is upgraded, it is validated as successful (similar to a power-on self-test, or POST) before the next node proceeds. In the event an upgrade fails, the cluster will remain online, as I have described in this post.
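Conceptually, the rolling upgrade follows the pattern sketched below: one node at a time, only while the cluster is resilient, and validate before moving on. This is my own simplified illustration of that general pattern, not the actual Nutanix upgrade code, and the helper functions are hypothetical stand-ins:

```python
# All helper logic below is a hypothetical stand-in, not Nutanix internals.

def cluster_is_resilient(cluster):
    # Placeholder check: all but at most one node healthy, and at least 3 nodes total
    healthy = sum(node["healthy"] for node in cluster)
    return len(cluster) >= 3 and healthy >= len(cluster) - 1

def upgrade_node(node, version):
    node["version"] = version           # stand-in for the real upgrade work

def validate_node(node, version):
    return node["version"] == version   # stand-in for post-upgrade health checks

def rolling_upgrade(cluster, target_version):
    for node in cluster:
        if not cluster_is_resilient(cluster):
            raise RuntimeError("Cluster not fully resilient; pausing upgrade")
        node["healthy"] = False                       # node temporarily out of service
        upgrade_node(node, target_version)
        if not validate_node(node, target_version):
            raise RuntimeError(f"{node['name']} failed post-upgrade validation")
        node["healthy"] = True                        # node rejoins before the next one starts

cluster = [{"name": f"node{i}", "version": "5.0", "healthy": True} for i in range(4)]
rolling_upgrade(cluster, "5.1")
print([node["version"] for node in cluster])  # ['5.1', '5.1', '5.1', '5.1']
```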

These upgrades can also be done centrally via PRISM Central.

E) Full Self Healing Capabilities

As I have already touched on, XCP + AHV is a fully self healing platform. From the storage (NDSF) layer to the virtualization layer (AHV) through to management (PRISM), the platform can fully self heal without any intervention from I.T. admins.

With Nutanix XCP you do not need expensive hardware support contracts or to worry about potential subsequent failures, because the system self heals and does not depend on hardware replacement as I have described in hardware support contracts & why 24×7 4 hour onsite should no longer be required.

Anyone who has ever managed a multi-site environment knows how much effort hardware replacement takes, and that replacements must be done in a timely manner, which can delay other critical work. This is why Nutanix XCP is designed to be distributed and self healing: we want to reduce the workload for sysadmins.

F) Ease of Deployment

All of the above features and functionality can be deployed quickly and easily: from out of the box to fully operational and ready to run VMs in just minutes.

The Management/Monitoring solutions do not require detailed design (sizing/configuration) as they are all built in and they scale as nodes are added.

G) Reduced Total Cost of Ownership (TCO)

When it comes down to it, ROBO deployments can be critical to the success of a company, and trying to do things “cheaper” rarely ends up actually being cheaper. Nutanix XCP may not be the cheapest (CAPEX), but it will deliver the lowest TCO, which is, after all, what matters.

If you’re a sysadmin and, after reading the above, you don’t think you can get any more efficient than you are today, it’s because you already run XCP + AHV 🙂

In all seriousness, sysadmins should be innovating and providing value back to the business. If they are instead spending significant time “keeping the lights on” for ROBO deployments, then their valuable time is not being well utilised.

Summary:

Nutanix XCP + AHV provides all the capabilities required for typical ROBO deployments while reducing the initial implementation and ongoing operational cost/complexity.

With Acropolis Operating System 4.6 and the cross hypervisor backup/recovery/DR capabilities thanks to the App Mobility Fabric (AMF), there is no need to be concerned about the underlying hypervisor as it has become a commodity.

AHV performance and availability are on par with, if not better than, other hypervisors on the market, as is clear from several points we have discussed.

Related Articles:

  1. Why Nutanix Acropolis hypervisor (AHV) is the next generation hypervisor
  2. Hardware support contracts & why 24×7 4 hour onsite should no longer be required.