The value of the hyperscaler + hypervisor model

Public cloud offerings from “hyperscalers” such as AWS EC2, Microsoft Azure & Google GCP provide a lot of value when it comes to standing up and running virtual workloads in a timely manner, and provide various capabilities to create globally resilient solutions.

All of these offerings also boast a wide range of native services which can complement or replace services running in traditional virtual machines.

As I’ve previously stated in a post from August 2022, Direct to Cloud Value – Part 1, the hyperscalers have two major advantages customers can benefit from:

  1. A well understood architecture
  2. Global availability

Designing, deploying and maintaining “on-premises” infrastructure on the other hand is often far less attractive from a time to value perspective and requires significant design efforts by highly qualified, experienced (and paid) individuals in order to get anywhere close to the scalability, reliability and functionality of the hyperscalers.

On-premises infrastructure may not be cost effective for smaller customers/environments who don’t have the quantity of workloads/data to justify it, so at a high level “native” public cloud solutions are often a great choice for customers.

The problem for many customers is they’re established businesses with a wide range of applications from numerous vendors, many of which are not easy to simply migrate to a public cloud provider.

Workload refactoring is often a time consuming and complex task which is not always able to be achieved in a timely manner, and in many cases not at all.

Customers also rarely have the luxury of starting from scratch and building a greenfield environment, due to the overall cost and/or the requirement to get a return on investment (ROI) from existing infrastructure.

Customers often have the requirement to burst during peak periods which isn’t something easily achievable on-premises. Customers often need to significantly oversize their on-premises infrastructure just to be able to support end of month, quarter or peak periods such as “Black Friday” for retailers.

This oversizing does help mitigate risks and deliver business outcomes, but it comes at a high cost (CAPEX).

Enter the “Hyperscaler + Hypervisor” model.

The hyperscaler + hypervisor model is where the hyperscaler (AWS/Azure/Google) provides bare metal servers (a.k.a instances) on which a hypervisor (for example, VMware ESXi) runs along with Virtual SAN (a.k.a “vSAN”) to provide the entire VMware technology stack to run Virtual Machines (VMs).

Nutanix has a similar offering called “Nutanix Cloud Clusters” or “NC2” using their own hypervisor “AHV”.

Both the VMware & Nutanix offerings give customers the same look/feel as they have today on-premises.

The advantages of the hyperscaler + hypervisor model are enormous from both a business and technical perspective. The following are just a few examples.

  • Ease of Migration

A migration of VMware based workloads from an existing on-premises environment can be achieved using a variety of methods, including VMware native tools such as HCX as well as third party tools from backup vendors such as Commvault.

This is achieved without the cost, complexity and delay of refactoring workloads.

  • Consistent look and feel

The hyperscaler + hypervisor options provide customers access to the same management tools they’re used to on-premises, meaning there is minimal adjustment required for IT teams.

  • Built-in Cloud exit strategy / No Cloud Vendor “Lock in”

The hypervisor layer allows customers to quickly move from one hyperscaler to another, again without refactoring, giving customers real bargaining power when it comes to negotiating commercial arrangements.

It also enables a move off public cloud back to on-premises.

  • Faster Time to value

The ability to stand up net new environments typically within a few hours gives customers the ability to respond to unexpected situations as well as new projects without the time/complexity of procurement and designing/implementing new environments from the ground up.

One very important value here is the ability to respond to critical situations such as ransomware by standing up an entirely isolated net new infrastructure to restore known good data. This is virtually impossible to do on-premises.

  • Lower Risk

In the event of a significant commercial/security/technical issue, a hyperscaler + hypervisor environment can be scaled up, migrated to a new environment/provider or isolated.

This model also mitigates against the delays caused by under-sizing or failure scenarios where new hardware needs to be added as this can occur typically within an hour or so as opposed to days/weeks/months.

As in the next example, workloads can simply be “lifted and shifted” minimising the number of changes/risks involved with a public cloud migration.

In the event of hardware failures, new hardware can be added back to the environment/s straight away without waiting for replacement hardware to be shipped/arrive and be installed. This greatly minimises the chance of double/subsequent failures causing an impact to the environment.

In the case of a disaster such as a region failure, a new region can be scaled up to restore production, whereas standing/scaling up a new on-prem environment is unlikely to occur in a timely manner.

  • Avoiding the need to “re-factor” workloads

Simply lifting and shifting workloads “as-is” on the same underlying hypervisor ensures the migration can occur with as few dependencies (and risks) as possible.

  • Provides excellent performance

The hardware provided by these offerings varies but is often all NVMe storage with latest or close to latest generation CPU/Memory, ensuring customers are not stuck with older generation hardware.

Having all workloads share a pool of NVMe storage also avoids the issue where some instances (VMs) are assigned to a lower tier of storage due to commercial cost constraints which can have significant downstream effects on other workloads/applications.

The all NVMe option in hyperscalers + hypervisor solutions becomes cost effective due to the economies of scale and elimination of “Cloud waste” which I will discuss next.

In many cases customers will be moving from hardware & storage solutions which are multiple years old. Simply having an all NVMe storage layer can reduce latency and subsequently make more efficient use of CPU/Memory, often resulting in significant performance improvements, let alone the gains from newer generation CPUs.

  • Economies of scale

In many cases, purchasing on a per instance (VM) basis may be attractive in the beginning, but when you reach a certain level of workloads, it makes more sense to buy in bulk (i.e. a bare metal instance) and run the workloads on top of a hypervisor.

This gives the customer the benefit of the hypervisor’s ability to efficiently and effectively oversubscribe CPU, and with a hyper-converged (HCI) storage layer (Virtual SAN a.k.a vSAN, or Nutanix AOS) customers benefit from the native data reduction capabilities such as Compression, Deduplication and Erasure Coding.
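As a rough illustration of the break-even point described above, the sketch below compares per-VM pricing with bare metal nodes running a hypervisor that oversubscribes CPU. Every number in it (prices, core counts, the 4:1 oversubscription ratio and the three node minimum cluster size) is a hypothetical placeholder rather than a figure from any provider, and memory and storage are ignored to keep it short.

```python
import math


def nodes_needed(total_vcpus, cores_per_node, oversubscription_ratio, min_nodes):
    """Bare metal nodes required once the hypervisor oversubscribes physical cores."""
    usable_vcpus_per_node = cores_per_node * oversubscription_ratio
    return max(min_nodes, math.ceil(total_vcpus / usable_vcpus_per_node))


def monthly_costs(vm_count, vcpus_per_vm, price_per_vm,
                  cores_per_node, oversubscription_ratio, price_per_node,
                  min_nodes=3):
    """Return (per-instance cost, bare metal cost) for the same set of VMs."""
    per_instance = vm_count * price_per_vm
    nodes = nodes_needed(vm_count * vcpus_per_vm, cores_per_node,
                         oversubscription_ratio, min_nodes)
    return per_instance, nodes * price_per_node


# Hypothetical inputs: 4 vCPU VMs at 150/month each versus
# 64 core nodes at 8000/month with 4:1 CPU oversubscription.
for vm_count in (20, 100, 400):
    per_instance, bare_metal = monthly_costs(
        vm_count, vcpus_per_vm=4, price_per_vm=150,
        cores_per_node=64, oversubscription_ratio=4.0, price_per_node=8000)
    cheaper = "bare metal" if bare_metal < per_instance else "per instance"
    print(f"{vm_count} VMs: per instance {per_instance}, "
          f"bare metal {bare_metal}, cheaper: {cheaper}")
```

With these made-up inputs the minimum cluster size makes bare metal the more expensive option for small VM counts, and it only becomes cheaper once enough VMs share the nodes. A real comparison would also need to account for memory, storage and licensing.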

  • Avoids native cloud instance constraints a.k.a “Cloud waste”

Virtual Machine “right-sizing” is to this day one of the most under-rated tasks, yet it can provide not only lower cost but also significant performance improvements for VMs. Cloud Waste occurs when workloads are forced into pre-defined instance sizes where resources such as vCPUs or vRAM are assigned to the VM, but not required/used.

When we have the hypervisor layer, instance sizes can be customised to the exact requirements and eliminate cloud waste which I’ve personally observed in many customer environments to be in the range of 20-40%.

Credit: Steve Kaplan for coining the term “Cloud Waste”.
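To make the idea of cloud waste concrete, here is a minimal sketch that forces a few workloads into the smallest pre-defined instance size that fits and measures how much of the allocated vCPU and RAM goes unused. The instance catalogue and the workload requirements are invented for illustration and do not correspond to any provider's real sizes or to the customer environments mentioned above.

```python
# Hypothetical instance catalogue as (vCPUs, RAM in GiB). Not a real provider's list.
INSTANCE_SIZES = [(2, 4), (2, 8), (4, 16), (8, 32), (16, 64)]


def smallest_fitting_instance(vcpus, ram_gib, catalogue=INSTANCE_SIZES):
    """Pick the smallest catalogue entry that satisfies both vCPU and RAM needs."""
    fits = [size for size in catalogue if size[0] >= vcpus and size[1] >= ram_gib]
    if not fits:
        raise ValueError("no instance size is large enough")
    return min(fits)


def waste_fractions(workloads, catalogue=INSTANCE_SIZES):
    """Return (vCPU waste, RAM waste) as fractions of what was allocated."""
    allocated = [smallest_fitting_instance(c, r, catalogue) for c, r in workloads]
    needed_cpu = sum(c for c, _ in workloads)
    needed_ram = sum(r for _, r in workloads)
    allocated_cpu = sum(c for c, _ in allocated)
    allocated_ram = sum(r for _, r in allocated)
    return 1 - needed_cpu / allocated_cpu, 1 - needed_ram / allocated_ram


# Hypothetical right-sized requirements as (vCPUs, RAM in GiB).
workloads = [(2, 12), (3, 8), (6, 20), (1, 6)]
cpu_waste, ram_waste = waste_fractions(workloads)
print(f"vCPU waste: {cpu_waste:.0%}, RAM waste: {ram_waste:.0%}")
```

On a hypervisor the same four VMs could be created with exactly the sizes listed in `workloads`, so the unused portion would not be allocated in the first place.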

  • Increased Business Continuity / Disaster Recovery options

The cost/complexity involved with building business continuity and disaster recovery (BC/DR) solutions often lead to customers having to accept and try to mitigate significant risks to their businesses.

The hyperscaler + hypervisor model provides a number of options to have very cost effective BC/DR solutions including across multiple providers to mitigate against large global provider outages.

  • An OPEX commercial model

The ability to commit to a monthly minimum spend to get the most attractive rates while having the flexibility to burst when required (albeit at a less attractive price) means customers don’t have to try and fund large CAPEX projects and have the ability to scale in a “just in time” fashion.
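The commit plus burst arrangement can be sketched with some simple arithmetic. The rates and node counts below are hypothetical placeholders, and the billing rule (committed nodes are paid for whether used or not, extra nodes are billed at a higher on-demand rate) is an assumption about how such an agreement might work rather than any provider's actual terms.

```python
def monthly_bill(nodes_used, committed_nodes, committed_rate, on_demand_rate):
    """Committed nodes are always billed; nodes beyond the commit bill on demand."""
    burst_nodes = max(0, nodes_used - committed_nodes)
    return committed_nodes * committed_rate + burst_nodes * on_demand_rate


def annual_bill(monthly_usage, committed_nodes, committed_rate, on_demand_rate):
    """Sum the monthly bills across a year of node usage."""
    return sum(monthly_bill(n, committed_nodes, committed_rate, on_demand_rate)
               for n in monthly_usage)


# Hypothetical year: 4 nodes for ten months, 6 nodes for two peak months.
usage = [4] * 10 + [6, 6]

commit_to_baseline = annual_bill(usage, committed_nodes=4,
                                 committed_rate=6000, on_demand_rate=9000)
commit_to_peak = annual_bill(usage, committed_nodes=6,
                             committed_rate=6000, on_demand_rate=9000)
print(f"commit to baseline and burst: {commit_to_baseline}")
print(f"commit to peak all year:      {commit_to_peak}")
```

With these invented numbers, committing to the baseline and bursting for the two peak months costs less over the year than sizing for the peak permanently, even though the burst nodes are billed at the less attractive rate.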

Cost

This sounds too good to be true, what about cost?

On face value, these offerings can appear expensive compared to on-premises equivalents, but from the numerous assessments I’ve conducted I am confident the true cost is close to or even cheaper than on-premises, especially when a proper Total Cost of Ownership (TCO) assessment is performed.

Compared with “native cloud”, i.e. running workloads without the hypervisor layer, the hyperscaler + hypervisor solution will typically save customers 20-40% while providing equal or better performance and resiliency.

One other area which can make costs higher than necessary is a lack of optimisation with the workloads. I highly recommend for both on-premises and hyperscaler models that customers engage an experienced architect to review their environment thoroughly.

The performance benefits of a right sizing exercise are typically significant AND it saves valuable IT resources (CPU/RAM). It also means less hardware is required to achieve the same or even a better outcome, which lowers costs.

Summary

The hyperscaler + hypervisor model has many advantages both commercially and technically and with the ease of setup, migration to and scaling in public cloud, I expect this model to become extremely popular.

I would strongly recommend anyone looking at replacing their on-premises infrastructure in the near future do a thorough assessment of these offerings against their business goals.

End-2-End Enterprise Architecture (@E2EEA) has multiple highly experienced and certified staff at the highest level with both VMware (VCDX) and Nutanix (NPX) technologies and can provide expert level services to help you assess the hyperscaler + hypervisor options as well as design and deliver the solution.

E2EEA can be reached at sales@e2eea.com

My VCAP5-CIA Experience

Yesterday (21st July 2014) I sat and passed the VMware Certified Advanced Professional Cloud Infrastructure Administration (VCAP5-CIA) exam at my local test centre here in Melbourne, Australia.

As with the VCAP-DCA which I did as a prerequisite for VCDX back in 2011, the CIA exam is a live lab exam where VMware get you to demonstrate your hands on expertise with their products.

I find the value of the VCDX is in part due to the requirement to have not only “Design” but also hands-on implementation/administration/troubleshooting experience. It is my opinion that a person should not be an architect unless they have the hands-on experience and ability to implement and support the solution as designed.

So, enough rambling, what did I think of the VCAP-CIA?

As with all VMware certifications, the exams are generally well written and closely aligned to the blueprints which VMware provide. For VCAP-CIA the blueprint and exam registration can be found here.

The VCAP-CIA was no different, and aligned very well to the blueprint.

The exam is 210 minutes and has 32 questions, some of which are simple 1 min tasks while others require a significant amount of work. One secret to all VCAP exams is you are challenged not only by the questions but also by the clock, as time is the enemy. This makes time management essential. Do not get caught up on one question; if you’re unsure, do your best and move on.

Beware, some questions are dependent on successful completion of earlier questions, but in saying that, a lot of questions are not, so don’t be afraid to skip questions if you’re struggling as you will still be able to complete many other questions.

The actual live lab in the exam consists of seven ESXi hosts, three vCenter Server virtual machines, four VMware vCloud™ Director (vCD) cells plus additional supporting resources. The lab has a number of pre-configured vApps, and virtual machines will also be present for use with certain tasks. It is important to understand the lab environment is based on VMware vCloud Suite 5.1 and vCenter Chargeback Manager 2.5, not vCloud 5.5, so ensure you study and prepare using the correct versions of vCloud/vCB!
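To put the clock in perspective, the figures above work out to a little over six and a half minutes per question on average. The small sketch below does that arithmetic and adds a simple pace check; the `on_pace` helper and the sample checkpoints are purely illustrative and are not part of any official exam guidance.

```python
EXAM_MINUTES = 210
QUESTIONS = 32


def average_minutes_per_question(exam_minutes=EXAM_MINUTES, questions=QUESTIONS):
    """Average time available per question if every question is attempted."""
    return exam_minutes / questions


def on_pace(elapsed_minutes, questions_done):
    """True if at least as many questions are done as the average pace requires."""
    return questions_done >= elapsed_minutes / average_minutes_per_question()


print(f"average per question: {average_minutes_per_question():.2f} min")
# Hypothetical checkpoints during the exam.
print(on_pace(elapsed_minutes=60, questions_done=10))
print(on_pace(elapsed_minutes=60, questions_done=8))
```

Because some tasks take a minute and others take far longer, the average is only a guide, but it does show how quickly a single long task can eat into the remaining time.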

At this stage some of you may be thinking, did I just breach the NDA by telling the world about the exam? Well I haven’t, and this is the beauty of how VMware does their exam blueprints: the above information is all available in the blueprint, so there is no trickery or secrecy to the lab.

As for the questions in the VCAP-CIA, you will not get a brain dump out of me, but what I can tell you is the questions are in most cases very clear and what is asked of the candidates is largely skills that anyone with any significant vCD experience would be familiar with. For example, the blueprint under Objective 1.2 – Configure vCloud Director for scalability, states under skills and abilities:

 Generate vCloud Director response files
 Add vCloud cells to an existing installation using response files
 Set up vCloud Director transfer storage space
 Configure vCloud Director load balancing

It’s safe to say if you know the blueprint properly, you will be able to complete the tasks in the exam, and as a result, get a passing score.

Now the bad news!

Being based in Melbourne, Australia, I accessed the live lab via RDP to a location in Seattle, USA. So what does this mean? Latency!

I was only able to complete about two thirds of the questions, in large part due to the delay in the screen refreshing after switching between, for example, the vCD web interface and product documentation, PuTTY etc.

On that point, all the PDF and HTML documentation is available in the exam, but I would highly recommend you don’t rely on it, because accessing the doco and searching/scrolling for things is very slow, at least it was for me.

I had numerous occasions where the screen would totally freeze which was a concern, but I soon accepted this was a latency issue, and the lab was fine, and waited out the freezes (which varied from a few seconds to around 20 seconds, which feels like hours when you’re against the clock!)

I have heard from numerous other VCAP-CIA candidates who sat the exam in the Australia/NZ region that they experienced the same issues, so if you are A/NZ based, or in any location a long way from the USA, be prepared for this.

Now being a live lab, the exam is not scored on the spot, and you have to wait for VMware to score the exam and then you will receive an electronic score report via email. The exam receipt says 15 business days, but I was very impressed that less than 24 hours after sitting the exam, I got my score report. Obviously VMware Education have done a great job in automating the scoring process, which is a credit to them!

Overall, the experience of the VCAP-CIA was very good, the exam/questions are a solid test of vCloud related skills and experience, so great work VMware Education!

I am very pleased to have completed this exam and all prerequisites for VCDX-Cloud (VCP-Cloud, VCAP-CID and VCAP-CIA) and I will be submitting my application in the near future.

VCDX Defence Essentials – Part 3 – Preparing for the Troubleshooting Scenario

Following on from Part 1 – Preparing for the Design Defence & Part 2 – Preparing for the Design Scenario, Part 3 covers my tips for the final stage of the VCDX defence, the Troubleshooting Scenario.

After completing the 75min Design defence and the 30min Design Scenario, if you’re still standing and haven’t retreated at full speed, your final challenge is the 15min Troubleshooting Scenario.

As mentioned in the previous Parts of this series, I am not an official panellist and I do not know how the scoring works. The below is my advice based on conducting mock panels, the success rate of candidates I have conducted mock panels with and my own success in achieving VCDX on the 1st attempt.

If you have read Part 2, then you should notice several similarities in both the common mistakes and tips below.

Common Mistakes

1. Trying to guess the solution to the issue

Taking pot shot guesses at what the problem/s might be does not prove your expertise. If you don’t methodically work through the issue and just keep making guesses, you’re not doing yourself or the people trying to assess your expertise any good.

2. Not documenting the troubleshooting steps you have completed

Assuming you have not made Mistake #1, and you are methodically working through the troubleshooting scenario, a common mistake I see is a candidate getting confused about what they have or have not investigated.

When candidates repeat the same troubleshooting steps because they have lost track, it does nothing but waste time and does not increase your chance of passing.

15 mins goes by in a flash, you cannot afford to waste time!

3. Going down a rabbit hole

Same as in the design scenario, I have observed many candidates who are clearly very knowledgeable, who have spent the majority of the time troubleshooting one specific area of the environment. eg: Just the vSphere layer

Doing this may demonstrate your expertise in one area really well, but it does not help you eliminate as many potential issues in the scenario as possible within the time constraint.

4. Being Mute!

Again, same as in the design scenario, I have seen candidates who stand staring at the troubleshooting scenario and the whiteboard for minutes at a time.

 

Tips for the Troubleshooting Scenario

1. Do not try to guess the solution to the issue

If you happen to guess the solution (assuming there is one.. hint hint) what expertise have you demonstrated to the panel for them to score you on? The answer is “bugger all” (This is Australian for “none”).

Talk the panel through your troubleshooting methodology. For example, you might choose to go through the OSI model’s layers, or you may choose to start with Networking, then move onto Storage, then Application, then vSphere etc.

The goal of this section of the defence is to demonstrate your troubleshooting skills, so make sure you explain what you’re trying to eliminate. eg: If a VM has lost connectivity you may ask the panel to perform a vMotion of VM1 from host A to host B. You could explain to the panel that if the ping begins to work following the vMotion, you plan to investigate the networking of Host A. If the ping does not start working, you will continue to investigate for a larger networking issue, such as a VLAN specific problem.

2. Documenting your troubleshooting steps & findings

Ensure you methodically address each of the key areas of a vSphere solution by writing on the whiteboard headings like the following:

a) Storage/SAN/Protocol

b) Networking/Firewall

c) Compute HW

d) Application/Guest OS

e) vSphere

Ensure you eliminate several (I’d suggest >=3) potential issues in each section, so you are covering off the entire environment, and record what you have done & the result of each troubleshooting step.

Keep in mind, you only have 15 mins, so 1 item per min is required if you are to cover all areas off thoroughly.
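The time budget above follows directly from the whiteboard headings: five areas with three items each is fifteen items in fifteen minutes. A minimal sketch of that arithmetic, along with a simple log that mirrors the whiteboard, might look like the following. The helper names and the sample log entries are illustrative only and are not part of the VCDX process itself.

```python
AREAS = [
    "Storage/SAN/Protocol",
    "Networking/Firewall",
    "Compute HW",
    "Application/Guest OS",
    "vSphere",
]
SCENARIO_SECONDS = 15 * 60
MIN_ITEMS_PER_AREA = 3


def seconds_per_item(areas=AREAS, items_per_area=MIN_ITEMS_PER_AREA,
                     total_seconds=SCENARIO_SECONDS):
    """Time available per troubleshooting step if every area gets equal coverage."""
    return total_seconds / (len(areas) * items_per_area)


def record(log, area, step, result):
    """Write down a completed step and its result under the area heading."""
    log.setdefault(area, []).append((step, result))


def areas_needing_attention(log, areas=AREAS, min_items=MIN_ITEMS_PER_AREA):
    """Areas where fewer than the suggested number of issues have been eliminated."""
    return [area for area in areas if len(log.get(area, [])) < min_items]


log = {}
# Hypothetical entries, as they might be written on the whiteboard.
record(log, "Networking/Firewall", "vMotion VM1 from host A to host B", "ping still fails")
record(log, "Networking/Firewall", "Check VLAN on the port group", "matches the design")
record(log, "Networking/Firewall", "Check physical uplink status", "all uplinks up")

print(f"{seconds_per_item():.0f} seconds per item")
print("Still to cover:", areas_needing_attention(log))
```

Keeping the log keyed by area makes it obvious at a glance which headings have not yet had three items eliminated, which is the same purpose the whiteboard serves in the room.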

3. Don’t go down a rabbit hole!

Same as in the design scenario, I have observed many candidates who are clearly very knowledgeable, who have spent the majority of the time troubleshooting one specific area of a vSphere environment. eg: Storage

Doing this may demonstrate your expertise in one area really well, but it does not help you eliminate as many potential issues in the scenario as possible within the time constraint.

Once you have looked into 3 potential issues in storage, move onto Networking, or vSphere etc.

Do not spend more than 60-90 seconds on any one troubleshooting step as this is preventing you demonstrating broad expertise which is the purpose of VCDX.

4. Think out Loud!

Again, same as in the design scenario, I have seen candidates who stand staring at the troubleshooting scenario and the whiteboard totally silent for minutes at a time.

Talk the panel through your thought process and expected outcomes for troubleshooting actions.

I cannot give you advice if I don’t know what you’re thinking! Same with the panellists, they can’t score you if you don’t verbalize your thought process.

No matter what, keep thinking out loud. If you’re working through options in your mind, that’s what the panel wants to hear, so let them hear it!

Summary

I hope the above tips help you prepare for the VCDX troubleshooting scenario and best of luck with your VCDX journey. For those who are interested, you can read about My VCDX Journey.

If you have any questions on the VCDX process or the advice given in this series please leave your comments and I will compile a list of questions and do a Q&A post.