IT Infrastructure Business Continuity & Disaster Recovery (BC/DR) – Coronavirus edition

Back in 2014, I wrote about Hardware support contracts & why 24×7 4 hour onsite should no longer be required. For those of you who haven’t read the article, I recommend doing so prior to reading this post.

In short, the post discussed the typical old-school requirement for expensive 24/7, 2 or 4-hour maintenance contracts and how these become all but redundant when IT solutions are designed with appropriate levels of resiliency and self-healing capabilities capable of meeting the business continuity requirements.

Some of the key points I made regarding hardware maintenance contracts included:

a) Vendors failing to meet SLA for onsite support.

b) Vendors failing to have the required parts available within the SLA.

c) Replacement HW being refurbished (common practice) and being faulty.

d) The more proprietary the HW, the more likely replacement parts will not be available in a timely manner.

All of these are applicable to all vendors and can significantly impact the ability to get the IT infrastructure back online or back to a resilient state where subsequent failures may be tolerated without downtime or data loss.

I thought with the current Coronavirus pandemic, it’s important to revisit this topic and see what we can do to improve the resiliency of our critical IT infrastructure and ensure business continuity no matter what the situation.

Let’s start with “Vendors failing to meet SLA for onsite support.”

At the time of writing, companies the world over are asking employees to work from home and operate on skeleton staff. This will no doubt impact vendor abilities to provide their typical levels of support.

Governments are also encouraging social distancing – that people isolate themselves and avoid unnecessary travel.

We would be foolish to assume this won’t impact vendor abilities to provide support, especially hardware support.

What about Vendors failing to have the required parts available within the SLA?

Currently I’m seeing significantly reduced flight schedules, e.g. from the USA to Europe, which will no doubt delay parts shipments and make it harder to meet target service level agreements.

Regarding vendors supplying refurbished hardware (a common practice) which may itself be faulty, this risk isn’t increased by the pandemic, but if a faulty replacement does arrive, the shipment of alternative/new parts is likely to be further delayed.

Lastly, infrastructure leveraging proprietary HW makes it more likely that replacement parts will not be available in a timely manner.

What are some of the options Enterprise Architects can offer their customers/employers when it comes to delivering highly resilient infrastructure to meet/exceed business continuity requirements?

Let’s start with the assumption that replacement hardware isn’t available for one week, which is likely much more realistic than same-day replacement for the majority of customers considering the current pandemic.

Business Continuity Requirement #1: Infrastructure must be able to tolerate at least one component failure and have the ability to self heal back to a resilient state where a subsequent failure can be tolerated.

By component failure, I’m talking about things like:

a) HDD/SSDs

b) Physical server/node

c) Networking device such as a switch

d) Storage controller (SAN/NAS controllers, or in the case of HCI, a node)

HDDs/SSDs have been traditionally protected by using RAID and Hot Spares, although this is becoming less common due to RAID’s inherent limitations and high impact of failure.
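
To illustrate why (a rough back-of-the-envelope sketch only; the drive sizes and rebuild rate below are assumptions, not measurements from any particular product), consider how long a traditional rebuild to a single hot spare takes as drive capacities grow:

```python
# Rough sketch: time to rebuild a single failed drive onto a hot spare.
# The assumed sustained rebuild rate of 100 MB/s is illustrative only;
# real-world rates vary with controller, workload and drive type.
REBUILD_MBPS = 100

def rebuild_hours(drive_tb: float, rebuild_mbps: float = REBUILD_MBPS) -> float:
    """Approximate hours to rebuild a drive of drive_tb terabytes."""
    drive_mb = drive_tb * 1_000_000  # TB -> MB (decimal)
    return drive_mb / rebuild_mbps / 3600

for size_tb in (2, 8, 16):
    print(f"{size_tb} TB drive: ~{rebuild_hours(size_tb):.1f} hours exposed to a second failure")
```

The larger the drive, the longer the window in which a second failure can cause data loss, which is exactly the limitation driving people away from traditional RAID and hot spares.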

For physical servers/nodes, products like VMware vSphere, Microsoft Hyper-V and Nutanix AHV all have “High Availability” functions which allow virtual machines to recover onto other physical servers in a cluster in the event of a physical server failure.
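
A simple way to reason about this requirement (a minimal sketch with made-up figures, not a reimplementation of any hypervisor’s admission control) is to check whether the surviving hosts have enough capacity to restart all VMs after a failure:

```python
# Minimal sketch: can an N-node cluster restart all VMs if one host fails?
# Figures are illustrative; real HA admission control is more sophisticated
# and considers CPU, reservations, fragmentation, etc.
hosts = 5
host_ram_gb = 512
total_vm_ram_gb = 1600  # sum of RAM required by all VMs

def tolerates_host_failures(hosts: int, host_ram_gb: int, total_vm_ram_gb: int, failures: int = 1) -> bool:
    surviving_capacity = (hosts - failures) * host_ram_gb
    return surviving_capacity >= total_vm_ram_gb

print(tolerates_host_failures(hosts, host_ram_gb, total_vm_ram_gb, failures=1))  # True: 2048 GB >= 1600 GB
print(tolerates_host_failures(hosts, host_ram_gb, total_vm_ram_gb, failures=2))  # False: 1536 GB < 1600 GB
```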

For networking, typically leaf/spine topologies provide a sufficient level of protection with a minimum of dual connections to all devices. Depending on the criticality of the environment, quad connections may be considered/required.
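
A quick sanity check here (a sketch with assumed link counts and speeds, not a substitute for a proper network design) is to confirm the bandwidth remaining after a link or switch failure still meets the workload requirement:

```python
# Sketch: does a dual-connected (or quad-connected) host still have enough
# bandwidth after one uplink/switch failure? Figures are illustrative.
def bandwidth_after_failure(links: int, link_gbps: int, failed_links: int = 1) -> int:
    return max(links - failed_links, 0) * link_gbps

required_gbps = 20
print(bandwidth_after_failure(links=2, link_gbps=25) >= required_gbps)  # True: 25 Gbps remains
print(bandwidth_after_failure(links=2, link_gbps=10) >= required_gbps)  # False: only 10 Gbps remains
```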

Lastly with Storage Controllers, traditional dual-controller SAN/NAS have a serious constraint when it comes to resiliency in that they require the HW replacement to restore resiliency. This is one reason why Hyper-Converged Infrastructure (a.k.a HCI) has become so popular: some HCI products have the ability to tolerate multiple storage controller failures and continue to function and self-heal thanks to their distributed/clustered architecture.

So with these things in mind, how do we meet our Business Continuity Requirement?

Disclaimer: I work for Nutanix, a company that provides Hyper-Converged Infrastructure (HCI), so I’ll be using this technology as my example of how resilient infrastructure can be designed. With that said the article and the key points I highlight are conceptual and can be applied to any environment regardless of vendor.

For example, Nutanix uses a Scale Out Shared Nothing architecture to deliver highly resilient and self-healing capabilities. In the example from the linked post (see Related Articles), a small Nutanix cluster of just 5 nodes suffers a physical server failure, self-heals both the CPU/RAM and storage layers back to a fully resilient state, and then tolerates a further physical server failure.

After the second physical server failure, it’s critical to note the Nutanix environment has self healed back to a fully resilient state and has the ability to tolerate another physical server failure.

In fact the environment has lost 40% of its infrastructure and Nutanix still maintains data integrity & resiliency. If a third physical server failed, the environment would continue to function maintaining data integrity, though it may not be able to tolerate a subsequent disk failure without data becoming unavailable.

So in this simple example of a small 5-node Nutanix environment, up to 60% of the physical servers can be lost and the business would continue to function.
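
To make the capacity side of this example a little more concrete (a simplified sketch with assumed capacities; it ignores per-disk placement, metadata overhead and the specifics of any vendor’s implementation), we can check whether the surviving nodes can hold two copies of all data after each successive failure:

```python
# Simplified sketch: can a cluster keeping 2 copies of all data (RF2)
# fully re-protect itself as nodes fail? Capacities are illustrative.
NODES = 5
NODE_CAPACITY_TB = 40   # raw capacity per node
DATA_TB = 50            # logical data stored (before the 2x copies)

def can_self_heal(surviving_nodes: int) -> bool:
    """True if survivors can hold 2 copies of all data on separate nodes."""
    if surviving_nodes < 2:
        return False
    return surviving_nodes * NODE_CAPACITY_TB >= DATA_TB * 2

for failed in range(0, 4):
    survivors = NODES - failed
    print(f"{failed} node(s) failed, {survivors} remain: full re-protection possible = {can_self_heal(survivors)}")
```

With these assumed capacities the cluster can keep re-protecting itself through the first two node failures; after the third failure it keeps running but can no longer fully re-protect, which mirrors the behaviour described above.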

With all these component failures, it’s important to note the Nutanix platform self healing was completed without any human intervention.

For those who want more technical detail, check out my post which shows Nutanix node (server) failure rebuild performance.

From a business perspective, a Nutanix environment can be designed so that the infrastructure can self heal from a node failure in minutes, not hours or days. The platform’s ability to self heal in a timely manner is critical to reduce the risk of a subsequent failure causing downtime or data loss.
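
As a back-of-the-envelope illustration of why distributed rebuilds can complete so much faster (all figures below are assumptions, not Nutanix specifications), compare a rebuild where only a single spare device receives data with one where every remaining node participates:

```python
# Back-of-the-envelope sketch: rebuild time when many nodes participate
# versus a single rebuild target. All throughput figures are assumptions.
def rebuild_minutes(data_tb: float, writers: int, per_writer_mbps: float) -> float:
    data_mb = data_tb * 1_000_000
    return data_mb / (writers * per_writer_mbps) / 60

failed_node_data_tb = 10  # data that must be re-protected after a node failure
print(f"Single rebuild target: ~{rebuild_minutes(failed_node_data_tb, 1, 500):.0f} minutes")   # ~333 min (~5.5 hours)
print(f"15 nodes rebuilding:   ~{rebuild_minutes(failed_node_data_tb, 15, 500):.0f} minutes")  # ~22 min
```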

Key Point: The ability for infrastructure to self heal back to a fully resilient state following one or more failures WITHOUT human intervention or hardware replacement should be a firm requirement for any new or upgraded infrastructure.

So the good news for Nutanix customers is that during this pandemic or future events, assuming the infrastructure has been designed to tolerate one or more failures and self-heal, the potential (if not likely) delay in hardware replacements is unlikely to impact business continuity.

For those of you who are concerned after reading this that your infrastructure may not provide the business continuity you require, I recommend you get in touch with the vendor/s who supplied the infrastructure, then go through and document the failure scenarios, the impact each has on the environment, and how the solution is recovered back to a fully resilient state.

Worst case, you’ll identify gaps which will need attention, but think of this as a good thing because this process may identify issues which you can proactively resolve.

Pro Tip: Where possible, choose a standard platform for all workloads.

As discussed in “Things to consider when choosing infrastructure”, choosing a standard platform to support all workloads can have major advantages such as:

  1. Reduced silos
  2. Increased infrastructure utilisation (due to reduced fragmentation of resources)
  3. Reduced operational risk/complexity (due to fewer components)
  4. Reduced OPEX
  5. Reduced CAPEX

The article summarises by stating:

“if you can meet all the customer requirements with a standard platform while working within constraints such as budget, power, cooling, rack space and time to value, then I would suggest you’re doing yourself (or your customer) a disservice by not considering using a standard platform for your workloads.”

What are some of the key factors to improve business continuity?

  1. Keep it simple (stupid!) and avoid silos of bespoke infrastructure where possible.
  2. Design BEFORE purchasing hardware.
  3. Document BUSINESS requirements AND technical requirements.
  4. Map the technical solution back to the business requirements i.e.: How does each design decision help achieve the business objective/s.
  5. Document risks and how the solution mitigates & responds to the risks.
  6. Perform operational verification i.e.: Validate the solution works as designed/assumed, and perform this testing after initial implementation and after maintenance/change windows (a simple starting point is sketched below).
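
Operational verification can start very simply, for example confirming management endpoints still respond after a change window (the hostnames and ports below are hypothetical placeholders; real checks should exercise the failure scenarios documented in your design):

```python
# Minimal sketch of an operational verification check: confirm management
# endpoints respond after implementation or a maintenance window.
# Hostnames/ports are hypothetical placeholders for your environment.
import socket

CHECKS = [
    ("mgmt-cluster-01.example.internal", 443),
    ("mgmt-cluster-02.example.internal", 443),
]

def endpoint_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, port in CHECKS:
    status = "OK" if endpoint_reachable(host, port) else "FAILED"
    print(f"{host}:{port} {status}")
```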

Considerations for CIOs / IT Management:

  1. Cost of performance degradation, such as reduced sales transactions per minute and/or reduced employee productivity/morale
  2. Cost of downtime, such as a total outage of IT systems, including lost revenue and damage to your brand
  3. Cost of increased resiliency compared to points 1 & 2
    1. I.e.: It’s often much cheaper to implement a more resilient solution than to suffer even a single outage annually (see the sketch after this list)
  4. How employees can work from home and continue to be productive
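
As a simple way to frame point 3 (the dollar figures below are illustrative only; substitute your own revenue numbers and vendor quotes), compare the expected annual cost of downtime against the incremental cost of a more resilient design:

```python
# Illustrative comparison: expected annual downtime cost vs. the extra cost
# of a more resilient design. All dollar figures are made-up examples.
revenue_per_hour = 50_000               # lost revenue while systems are down
expected_outages_per_year = 1
expected_outage_hours = 8
extra_cost_resilient_design = 150_000   # incremental cost for higher resiliency / self-healing

expected_downtime_cost = revenue_per_hour * expected_outages_per_year * expected_outage_hours
print(f"Expected annual downtime cost:  ${expected_downtime_cost:,}")       # $400,000
print(f"Incremental cost of resilience: ${extra_cost_resilient_design:,}")  # $150,000
print("Resilient design pays for itself:", extra_cost_resilient_design < expected_downtime_cost)
```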

Here are a few things to ask your architect/s to provide when designing infrastructure:

  1. Document failure scenarios and the impact to the infrastructure.
  2. Document how the environment can be upgraded to provide higher levels of resiliency.
  3. Document the Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) and how the environment meets/exceeds these (a simple sanity check is sketched after this list).
  4. Document under what circumstances the environment may/will NOT meet the desired RPO/RTOs.
  5. Design & document a “scalable and repeatable model” which allows the environment to be scaled without major re-design or infrastructure replacement to cater for unforeseen workloads (e.g. a sudden increase in employees working from home).
  6. Avoid creating unnecessary silos of dissimilar infrastructure.
Related Articles:

  1. Scale Out Shared Nothing Architecture Resiliency by Nutanix
  2. Hardware support contracts & why 24×7 4 hour onsite should no longer be required.
  3. Nutanix | Scalability, Resiliency & Performance | Index
  4. Nutanix vs VSAN / VxRAIL Comparison Series
  5. How to Architect a VSA , Nutanix or VSAN solution for >=N+1 availability.
  6. Enterprise Architecture and avoiding tunnel vision

Things to consider when choosing infrastructure.

With all the choice in the compute/storage market at the moment, choosing new infrastructure for your next project is not an easy task.

In my experience most customers (and many architects) think about the infrastructure coming up for replacement and look to do a “like for like” replacement with newer/faster technology.

An example of this would be a customer with a FC SAN running Oracle workloads where the customer or architect replaces the end of life Hybrid FC SAN with an All Flash FC SAN and continues running Oracle “as-is”.

Now I’m not saying there is anything wrong with that, however if we consider more than just the one workload, we may be able to achieve our business requirements with a more standardized and cost effective approach than having dedicated infrastructure for specific workloads.

So in this post, I am inviting you to consider the bigger picture.

Let’s take an example customer with the following workload requirements:

  1. Virtual Desktop (VDI)
  2. Virtualized Business Critical Applications (e.g.: SQL / Exchange)
  3. Long Term Archive (High Capacity, low IOPS)
  4. Business Continuity and Disaster Recovery

It is unlikely any one solution from any vendor is going to be the “best” in all areas as every solution has its pros and cons.

Regarding VDI, I would say most people would agree Hyperconverged Infrastructure (HCI) / scale-out type architectures are strong for VDI, however VDI can be successfully deployed on traditional SAN/NAS solutions or using non-shared local storage in the case of non-persistent desktops.

For vBCA, some people believe physical servers with JBOD storage are best for workloads like Exchange, and physical servers with local SSD are best for databases, while many people are realising the benefits of virtualizing vBCA on shared storage such as SAN/NAS or on HCI.

For long term archive, cost per GB is generally one of, if not the most, critical factor. Lots of trays of SATA storage connected to a small dual-controller setup may be the most cost effective option, whereas an All Flash array would be less likely to be considered for this use case.
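
Cost per GB is simple arithmetic, but it’s worth writing down explicitly (the prices and capacities below are made-up examples) because what matters is usable capacity after data protection overhead, not raw capacity:

```python
# Illustrative cost-per-GB comparison for archive storage. Prices, capacities
# and protection overheads are made-up examples and differ per product.
def cost_per_usable_gb(price: float, raw_tb: float, protection_overhead: float) -> float:
    usable_gb = raw_tb * 1000 * (1 - protection_overhead)
    return price / usable_gb

sata_trays = cost_per_usable_gb(price=120_000, raw_tb=800, protection_overhead=0.25)
all_flash  = cost_per_usable_gb(price=400_000, raw_tb=300, protection_overhead=0.25)
print(f"SATA trays: ~${sata_trays:.3f}/GB")  # ~$0.200/GB
print(f"All flash:  ~${all_flash:.3f}/GB")   # ~$1.778/GB
```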

For BC/DR, features such as a Storage Replication Adapter (SRA) for VMware Site Recovery Manager, a stretched cluster capability, and some form of snapshot and replication capability would be typical requirements. Some newer technology can do per-VM snapshots, whereas older style SAN/NAS technology may be per LUN, so newer technology would have an advantage here, but again, this doesn’t mean the older technology should not be considered.

So what product do we choose for each workload type? The best of breed right?

Well, maybe not. Let’s have a look at why you might not want to do that.

The below graph shows an example of 3 vendors being compared across the 4 categories mentioned above: VDI, vBCA, Long Term Archive and BC/DR.

[Example graph: Vendors A, B and C rated across VDI, vBCA, Long Term Archive and BC/DR]

The customer has determined that a score of 3 is required to meet their requirements so a solution failing to achieve a 3 or higher will not be considered (at least for that workload).

As we can see, for VDI Vendor B is the strongest, Vendor A second and Vendor C third, but when we compare BC/DR Vendor C is strongest followed by Vendor A and lastly Vendor B.

We can see for Long Term Archive Vendor A is the strongest with Vendor B and C tied for second place and finally for vBCA Vendor B is the strongest, Vendor A second and Vendor C third.

So if we chose the best vendor for each workload type (or the “Best of breed” solution) we would end up with three different vendors’ equipment.

  • VDI: Vendor B
  • Long Term Archive: Vendor A
  • BC/DR: Vendor C
  • vBCA: Vendor B

Is this a problem? Not necessarily but I would suggest that there are several things to consider including:

1. Having 3 different platforms to design/install/maintain

This means 3 different sets of requirements, constraints, risks and implications need to be considered.

Some large organisations may not consider this a problem, because they have a team for each area, but isn’t the fact the customer has to have multiple teams to manage infrastructure a problem in itself? Sounds like a significant (and potentially unnecessary) OPEX to me.

2. The best BC/DR solution does not meet the minimum requirements for the vBCA workloads.

In this example, the best BC/DR solution (Vendor C) is also the lowest rated for vBCA. As a result, Vendor C is not suitable for vBCA which means it should not be considered for BC/DR of vBCA. If Vendor C was used for BC/DR of the other workloads, then another product would need to be used for vBCA adding further cost/complexity to the environment.

3. Vendor A is the strongest at Long Term Archive, but has no interoperability with Vendor B and C

Due to the lack of interoperability, while Vendor A has the strongest archiving solution, it is not suitable for this environment. In this example, the difference between the strongest Long Term Archive solution and the weakest is very small, so Vendors B and C also meet the customer’s requirements.

4. Multiple Silos of infrastructure may lead to inefficient use.

Just like in the days before virtualization, when the bulk of our servers’ CPU/RAM ran at low utilization levels, our storage capacity was carved up so that we had lots of free space in one RAID pack but very little free space in others, and we spent lots of time migrating workloads from LUN 1 to LUN 2 to free up capacity for new or existing workloads.

If we have 3 solutions, we may have many TB of available capacity in the VDI environment but be unable to share it with the Long Term Archiving. Or we may have lots of spare compute in VDI and be unable to share it with vBCA.

Now getting back to the graph, the below is the raw data.

[Table: raw vendor ratings per category and totals]

What we can see is:

  • Vendor B has the highest total (17.1)
  • Vendor A has the second highest total (14.8)
  • Vendor C has the lowest total (12)
  • Vendor C failed to meet the minimum requirements for VDI & vBCA
  • Vendor A and B met the minimum requirements for all areas

Let’s consider the impact of choosing Vendor B for all 4 workload types.

VDI – It was the highest rated, met the minimum requirements for the customer and is best of breed, so in this case Vendor B would be a solid choice.

vBCA – Again Vendor B was the highest rated, met the minimum requirements for the customer and is best of breed, so Vendor B would be a solid choice.

Long Term Archiving: Vendor B was equal last, but importantly met the customer’s requirements. Vendor A’s solution may have more features and higher performance, but as Vendor B met the requirements, the additional features and/or performance of Vendor A are not required. The difference between Vendor A (Best of Breed) and Vendor B was also minimal (a 0.5 rating difference), so Vendor B is again a solid choice.

BC/DR: Vendor B was the lowest rated solution for BC/DR, but again focusing on the customer’s requirements, the solution comfortably exceeded the minimum requirement of 3 with a rating of 4.2. Choosing Vendor B meets the requirements and likely avoids interoperability and/or support issues, meaning a simpler overall solution.
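
The selection logic used throughout this example can be expressed as a small decision matrix: a vendor is only considered for a workload if it meets the minimum score of 3, and a single standard platform is preferred if one vendor meets the minimum everywhere. The per-category scores below are illustrative placeholders consistent with the totals and ordering discussed above, not the actual raw data.

```python
# Sketch of the decision logic: a vendor must meet the minimum score in
# every category before its total is compared. Per-category scores are
# illustrative placeholders, not the actual raw data from the graph.
MINIMUM = 3.0
scores = {
    "Vendor A": {"VDI": 3.1, "vBCA": 3.0, "Archive": 4.3, "BC/DR": 4.4},  # total 14.8
    "Vendor B": {"VDI": 4.6, "vBCA": 4.5, "Archive": 3.8, "BC/DR": 4.2},  # total 17.1
    "Vendor C": {"VDI": 2.0, "vBCA": 1.6, "Archive": 3.8, "BC/DR": 4.6},  # total 12.0
}

def meets_all_minimums(vendor: str) -> bool:
    return all(score >= MINIMUM for score in scores[vendor].values())

candidates = [v for v in scores if meets_all_minimums(v)]
print("Vendors meeting minimums for every workload:", candidates)   # Vendor A and Vendor B
standard_platform = max(candidates, key=lambda v: sum(scores[v].values()))
print("Standard platform choice:", standard_platform)               # Vendor B
```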

Let’s think about some of the advantages for a customer choosing a standard platform for all workloads in the event a platform meets all requirements.

1. Lower Risk

Having a standard platform minimizes the chance of interoperability and support issues.

2. Eliminating Silos

As long as you can ensure performance meets requirements for all workloads (which can be difficult on centralized SAN/NAS deployments) then using a standard platform will likely lead to better utilization and higher return on investment (ROI).

3. Reduced complexity / Single Pane of Glass Management

Having one platform means not having to have SMEs in multiple technologies, or in larger organisations multiple SMEs per technology (for redundancy and/or workload coverage), meaning reduced complexity, lower operational costs and possibly centralized management.

4. Lower CAPEX

This will largely depend on the vendor and quantity of infrastructure purchased, however many customers I have worked with have excellent pricing from a vendor as a result of standardizing.

Summary:

I am in no way saying “One size fits all” or that “every problem is a Nail” and recommending you buy a hammer. What I am saying is when considering infrastructure for your environment (or your customers), avoid tunnel vision and consider the other workloads or existing infrastructure in the environment.

In many cases the “Best of Breed” solution is not required and in fact implementing that solution may have significant implications in other areas of the environment.

In other cases, workloads may be so mission critical that a best of breed solution may be the only way to meet the business requirements, in which case using a standard platform that may not meet the requirements would not be advised.

However if you can meet all the customer requirements with a standard platform while working within constraints such as budget, power, cooling, rack space and time to value, then I would suggest you’re doing yourself (or your customer) a disservice by not considering using a standard platform for your workloads.

Related Articles:

1. Enterprise Architecture & Avoiding tunnel vision.

2. Peak Performance vs Real World Performance

My VCAP5-CIA Experience

Yesterday (21st July 2014) I sat and passed the VMware Certified Advanced Professional Cloud Infrastructure Administration (VCAP5-CIA) exam at my local test centre here in Melbourne, Australia.

As with the VCAP-DCA which I did as a prerequisite for VCDX back in 2011, the CIA exam is a live lab exam where VMware get you to demonstrate your hands on expertise with their products.

I find the value of the VCDX is in part due to the fact it requires not only “design” but hands-on implementation/administration/troubleshooting experience, as in my opinion a person should not be an architect unless that person has the hands-on experience and ability to implement and support the solution as designed.

So, enough rambling, what did I think of the VCAP-CIA?

As with all VMware certifications, the exams are generally well written and closely aligned to the blueprints which VMware provide. For VCAP-CIA the blueprint and exam registration can be found here.

The VCAP-CIA was no different, and aligned very well to the blueprint.

The exam is 210 minutes and has 32 questions, some of which are simple 1-minute tasks while others require a significant amount of work. One secret to all VCAP exams is that you are challenged not only by the questions, but by the clock, as time is the enemy. This makes time management essential. Do not get caught up on one question; if you’re unsure, do your best and move on.

Be aware that some questions are dependent on successful completion of earlier questions, but in saying that, a lot of questions are not, so don’t be afraid to skip questions if you’re struggling, as you will still be able to complete many other questions.

The actual live lab in the exam consists of seven ESXi hosts, three vCenter Server virtual machines, four VMware vCloud™ Director (vCD) cells plus additional supporting resources. A number of pre-configured vApps and virtual machines are also present for use with certain tasks. It is important to understand the lab environment is based on VMware vCloud Suite 5.1 and vCenter Chargeback Manager 2.5, not vCloud 5.5, so ensure you study and prepare using the correct versions of vCloud/vCB!

At this stage some of you may be thinking I just breached the NDA by telling the world about the exam. Well, I haven’t, and this is the beauty of how VMware does their exam blueprints: the above information is all available in the blueprint, so there is no trickery or secrecy to the lab.

As for the questions in the VCAP-CIA, you will not get a brain dump out of me, but what I can tell you is the questions are in most cases very clear, and what is asked of candidates is largely skills that anyone with significant vCD experience would be familiar with. For example, the blueprint under Objective 1.2 – Configure vCloud Director for scalability, states under skills and abilities:

 Generate vCloud Director response files
 Add vCloud cells to an existing installation using response files
 Set up vCloud Director transfer storage space
 Configure vCloud Director load balancing

It’s safe to say that if you know the blueprint properly, you will be able to complete the tasks in the exam and, as a result, get a passing score.

Now the bad news!

Being based in Melbourne, Australia, the live lab is accessed via RDP to a location in Seattle, USA. So what does this mean? Latency!

I was only able to complete about two-thirds of the questions, in large part due to the delay in the screen refreshing after switching between, for example, the vCD web interface, the product documentation, PuTTY, etc.

On that point, all the PDF and HTML documentation is available in the exam, but I would highly recommend you don’t rely on it, because accessing the doco and searching/scrolling for things is very slow, at least it was for me.

I had numerous occasions where the screen would totally freeze, which was a concern, but I soon accepted this was a latency issue, that the lab was fine, and waited out the freezes (which varied from a few seconds to around 20 seconds, which feels like hours when you’re against the clock!).

I have heard from numerous other VCAP-CIA candidates who sat the exam in the Australia/NZ region that they experienced the same issues, so if you are A/NZ based, or in any location a long way from the USA, be prepared for this.

Now being a live lab, the exam is not scored on the spot, and you have to wait for VMware to score the exam and then you will receive an electronic score report via email. The exam receipt says 15 business days, but I was very impressed that less than 24 hours after sitting the exam, I got my score report. Obviously VMware education have done a great job in automating the scoring process, which is a credit to them!

Overall, the experience of the VCAP-CIA was very good, the exam/questions are a solid test of vCloud related skills and experience, so great work VMware Education!

I am very pleased to have completed this exam and all prerequisites for VCDX-Cloud (VCP-Cloud, VCAP-CID and VCAP-CIA) and I will be submitting my application in the near future.