Is VAAI beneficial with Virtual Storage Appliance (VSA) based solutions?

I saw a tweet recently (below) which inspired me to write this post as there is still a clear misunderstanding of the benefits VAAI provides (even with Virtual Storage Appliances).

[Image: the tweet in question]

I have removed the identity of the individual who wrote the tweet and the people who retweeted it, as the goal of this post is solely to correct what I believe is misinformation.

My interpretation of the tweet was (and remains) that if a solution uses a Virtual Storage Appliance (VSA) which resides on the ESXi host, then VAAI does not provide any benefit.

My opinion on this topic is:

Compared to a traditional centralised NAS (such as a Netapp or EMC Isilon) providing NFS storage with VAAI-NAS support, a Nutanix or VSA solution has exactly the same benefits from VAAI!

My 1st reply to the tweet was:

[Image: my first reply to the tweet]

The test I was referring to, with NetApp ONTAP Edge, can be found here. It was posted in January 2013, well before I joined Nutanix, back when I was working for IBM, where I had been evangelising VAAI/VCAI based solutions for a long time, as VAAI/VCAI provides significant value to VMware customers.

The following shows the person's initial reply to my tweet.

[Image: the person's initial reply]

I responded with the below, mentioning I would write a blog post, which is what you're reading now.

I went on to provide some brief replies, as shown below.

[Image: replies in detail]

I would summarize the main comments from this person's tweets (rightly or wrongly) as below:

  • VAAI is designed only to offload functions externally (or off the ESXi host)
  • He/She had not seen any proof of performance advantages from VAAI on VSAs
  • It's broken logic to use VAAI with a VSA

Firstly, I would like to comment on VAAI being designed to offload functions externally (or off the ESXi host). I don't disagree that VAAI has some functions designed to offload to the (centralised) array, but VAAI also has numerous functions which are designed to bring other efficiencies to a vSphere environment.

An example of a feature designed to offload to a central array is the “XCOPY” primitive.

A simple example of what "XCOPY" or Extended Copy provides is offloading a Storage vMotion on block based storage (i.e. VMFS over iSCSI, FC or FCoE, not NFS) to the array, so the ESXi host does not have to process the data movement.

This VAAI primitive would likely be of little benefit in a VSA environment where the storage presented is block based and Storage DRS, for example, was used. The data movement would simply be offloaded from ESXi to the VSA running on ESXi, so the host would still be burdened with the Storage vMotion.

However XCOPY is only one of the many primitives of VAAI, and VAAI does a lot more than just offload Storage vMotions.

For the purpose of this post, I will be discussing VAAI with Nutanix, whose software-defined storage solution runs in a VM on every ESXi host in a Nutanix cluster.
Note: This information is also relevant to other VSAs which support VAAI-NAS.

So what benefit does VAAI provide to Nutanix or a VSA solution running NFS?

Nutanix deploys by default with NFS and supports the VAAI-NAS primitives which are:

  • Full File Clone
  • Fast File Clone
  • Reserve Space
  • Extended Statistics

Note: XCOPY is not supported on NFS. Importantly, and speaking specifically for Nutanix, it is not required, as Storage vMotion will rarely if ever be used with Nutanix solutions.

See my post “Storage DRS and Nutanix – To use, or not to use, that is the question?” for more details on why SvMotion is rarely needed when using Nutanix.

For more details of VAAI primitives, Cormac Hogan (@CormacJHogan) wrote an excellent post which can be found here.

Now here is an example of a significant performance benefit of VAAI with Nutanix.

Let's look at a clone of a VM on the Nutanix platform; the VM's details are below.
[Image: Admin01 VM summary]

The VM I have used for this test resides on a datastore called "Management" (as per the above image), which is presented via NFS and has VAAI (Hardware Acceleration) enabled, as shown below.
[Image: Management datastore summary]
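As a side note, if you want to check programmatically whether your datastores report Hardware Acceleration as supported, a minimal pyVmomi sketch along the lines of the below can be used. The vCenter address, credentials and my use of the per-host vStorageSupport field are assumptions for illustration only, not a Nutanix or VMware supplied tool.

# Hypothetical pyVmomi sketch: list each file system mounted on every ESXi host and its
# VAAI ("Hardware Acceleration") status. Connection details are placeholders.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import ssl

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        print(host.name)
        for mount in host.config.fileSystemVolume.mountInfo:
            # vStorageSupport is "vStorageSupported" when VAAI primitives are available
            print("  %-20s %s" % (mount.volume.name, getattr(mount, "vStorageSupport", "unknown")))
    view.Destroy()
finally:
    Disconnect(si)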

Now, if I do a simple clone of a VM (as shown below) while the VM is powered on, VAAI-NAS is bypassed, as the "Fast File Clone" primitive only works on VMs which are powered off.

[Image: Clone Virtual Machine wizard]

So a simple way to test the performance benefits of VAAI on any platform (including a hyper-converged platform such as Nutanix, a Virtual Storage Appliance (VSA) such as NetApp ONTAP Edge, or a traditional centralised SAN or NAS) is to clone a VM while it is powered on, then shut down the VM and clone it again.
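For anyone who prefers to script this comparison rather than clone by hand, below is a minimal pyVmomi sketch of the same test. The vCenter address, credentials and inventory path are placeholders, and the timing simply brackets the clone task; it is a sketch of the methodology, not a benchmark tool.

# Hypothetical pyVmomi sketch: time a clone of the same VM while powered on
# (VAAI-NAS Fast File Clone bypassed) and again after powering it off.
import ssl
import time
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

def timed_clone(vm, clone_name):
    # Clone into the VM's current folder, resource pool and datastore
    relocate = vim.vm.RelocateSpec(pool=vm.resourcePool, datastore=vm.datastore[0])
    spec = vim.vm.CloneSpec(location=relocate, powerOn=False, template=False)
    start = time.time()
    WaitForTask(vm.CloneVM_Task(folder=vm.parent, name=clone_name, spec=spec))
    return time.time() - start

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
try:
    vm = si.content.searchIndex.FindByInventoryPath("DC01/vm/Admin01")  # assumed path
    print("Powered-on clone : %.1f seconds" % timed_clone(vm, "Admin01-clone-hot"))
    WaitForTask(vm.PowerOffVM_Task())  # the post used a guest shutdown; a hard power-off keeps the sketch short
    print("Powered-off clone: %.1f seconds" % timed_clone(vm, "Admin01-clone-cold"))
finally:
    Disconnect(si)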

I performed this test and the first clone with the VM powered on started at 1:17:23 PM and finished at 1:26:12 PM, so a total of 8 mins 49 seconds.

Next I shut down the VM and repeated the clone operation.
[Image: clone results in Recent Tasks]

As we can see in the above screen capture, the 2nd clone started at 1:26:49 PM and finished at 1:26:54 PM, a total of 5 seconds.

The reason for the huge difference in the speed of the two clones is that the VAAI-NAS "Fast File Clone" primitive offloaded the 2nd clone to the Nutanix platform (which runs as a VM on the ESXi host), which intelligently cloned the VM using metadata, resulting in almost zero new data being created. In the 1st clone, where VAAI-NAS was not used, the hypervisor and storage solution had to read 11.18GB of data (the source VM, Admin01) and write a full copy of the same data, resulting in effectively >22GB of data movement in the environment.

Now, from a capacity perspective, a simple way to demonstrate the capacity savings VAAI provides on any platform is to clone a VM multiple times and compare the datastore statistics before and after.
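Continuing the earlier sketch (with the si connection and the Admin01 vm object already retrieved), capturing the before and after statistics could look something like the below. The field names come from the datastore summary object and the provisioned figure is derived the way I understand the vSphere Client presents it; treat it as an illustration rather than a definitive tool.

# Hypothetical pyVmomi sketch: record datastore capacity/provisioned/free space,
# clone the VM 7 times, then report the statistics again.
from pyVim.task import WaitForTask
from pyVmomi import vim

datastore = vm.datastore[0]  # the "Management" datastore in this example

def datastore_stats(ds):
    ds.RefreshDatastoreStorageInfo()  # ensure the summary figures are current
    s = ds.summary
    provisioned = s.capacity - s.freeSpace + (s.uncommitted or 0)
    return s.capacity, provisioned, s.freeSpace  # all values in bytes

def report(label, stats):
    print("%s capacity=%.2fTB provisioned=%.2fTB free=%.2fTB"
          % ((label,) + tuple(v / 1024.0 ** 4 for v in stats)))

report("Before:", datastore_stats(datastore))
for i in range(7):
    spec = vim.vm.CloneSpec(location=vim.vm.RelocateSpec(pool=vm.resourcePool, datastore=datastore),
                            powerOn=False)
    WaitForTask(vm.CloneVM_Task(folder=vm.parent, name="Admin01-clone-%d" % i, spec=spec))
report("After: ", datastore_stats(datastore))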

Before I performed this test I captured a baseline of the Management datastore as shown below.

[Image: Management datastore statistics before cloning]

The above highlighted areas show:

  • Virtual Machines and Templates as 83
  • Capacity 8.49TB
  • Provisioned Space 7.09TB
  • Free Space 7.01TB

I then cloned the Admin01 VM a total of 7 times.
[Image: Recent Tasks showing the 7 clone operations]

Immediately after the last clone completed, I took the below screenshot of the Management datastore's statistics.

[Image: Management datastore statistics after cloning]

The above highlighted areas in the updated datastore summary show:

  • Virtual Machines and Templates INCREASED by 7 to 90 (as I cloned 7 VMs)
  • Capacity remained the same at 8.49TB
  • Provisioned Space INCREASED to 7.29TB as we cloned 7 x ~40GB VMs (a total of ~280GB)
  • Free Space REMAINED THE SAME at 7.01TB due to VAAI-NAS Fast File Clone primitive working with the Nutanix Distributed File System.

So VAAI-NAS allowed a VM with ~11GB of used storage (~40GB provisioned) to be cloned without using any significant additional disk space, and each clone completed in between 5 and 7 seconds.

So some of the benefits VAAI-NAS provides to Nutanix (which some people would term a VSA-type solution) include:

  • Near instant VM cloning via vSphere Client/s (as shown above)
  • Near instant Horizon View Linked Clone deployments (VCAI) – Similar to example shown.
  • Near instant vCloud Director clones (via FAST Provisioning) - Similar to example shown.
  • Major capacity savings by using Intelligent cloning rather than Full Clones (As shown above)
  • Lower CPU overhead for both ESXi hosts AND Nutanix Controller VM (CVM)
  • Ability to create EagerZeroedThick VMDKs on NFS (e.g. to support Fault Tolerance & clustered workloads such as Oracle RAC) – see the sketch after this list
  • Enhanced ability to get statistics on file sizes, capacity usage, etc. on NFS
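To illustrate the EagerZeroedThick point above, here is a hedged pyVmomi sketch of creating such a VMDK on an NFS datastore via the virtual disk manager. The datastore path and datacenter name are placeholders, and my understanding is that on NFS this operation only succeeds when the VAAI-NAS Reserve Space primitive is available (without VAAI-NAS, NFS is limited to thin VMDKs).

# Hypothetical pyVmomi sketch: create an eagerly zeroed thick VMDK on an NFS datastore
# (relies on the VAAI-NAS Reserve Space primitive). Assumes "si" is an existing connection.
from pyVim.task import WaitForTask
from pyVmomi import vim

datacenter = si.content.searchIndex.FindByInventoryPath("DC01")  # assumed datacenter name

disk_spec = vim.VirtualDiskManager.FileBackedVirtualDiskSpec(
    diskType="eagerZeroedThick",   # e.g. for Fault Tolerance or clustered workloads
    adapterType="lsiLogic",
    capacityKb=40 * 1024 * 1024)   # ~40GB

WaitForTask(si.content.virtualDiskManager.CreateVirtualDisk_Task(
    name="[Management] Admin01/Admin01_ezt.vmdk",  # placeholder path on the NFS datastore
    datacenter=datacenter,
    spec=disk_spec))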

In Summary:

Overall, I would say VMware has developed an excellent API in VAAI, and Nutanix, along with the VSA providers which support VAAI, provides major advantages and value to our joint customers with VMware.

It would be broken logic NOT to leverage the advantages of VAAI, regardless of storage type (VSA, Nutanix or traditional centralized SAN/NAS). For the vast majority of vSphere deployments, any storage solution not supporting (or having issues/bugs with) VAAI will have significant downsides.

I am looking forward to ongoing developments from VMware such as vVols and VASA 2.0 to continue to enhance storage of vSphere solutions in the future.

I hope customers and architects now have the correct information to make the most effective design and purchasing recommendations to meet/exceed customer requirements.

VMworld then and now! (2013 vs 2014)

Last year I did an interview with Eric Sloof (@esloof) of VMworld TV (below) where we discussed the basics (or the 101) of Nutanix, which was the theme of questions from attendees throughout the Solutions Exchange.

Meet the team behind Nutanix VMworld 2013 – https://www.youtube.com/watch?v=T56KBaB3OUk

Jump forward to this year's VMworld (2014) and I was lucky enough to get an opportunity to interview with VMworld TV again. Eric and I agreed that we didn't want a simple repeat of last year's interview, but to talk about more benefits of the platform (or the 201 level).

Nutanix speaks to VMworld TV about their exciting new products – https://t.co/brA15Zgcql

The interesting part of VMworld 2014 and my time on the booth was that the theme of questions from attendees was significantly different from last year, being in large part focused on Business Critical Applications and server workloads from both prospective and existing customers.

One of my focuses over the last year has been Business Critical Applications and improving the Nutanix platform for these workloads. I am proud to say we (Nutanix) have made significant improvements in this area and have a strong offering, especially with the new NX-8150 platform which my team was responsible for designing.

I am looking forward to interviewing at next year's VMworld and covering the Advanced/Expert level topics (301 level) with Eric and the fantastic VMworld TV crew.

Virtualizing Business Critical Applications – The Web-Scale Way!

Since joining Nutanix back in July 2013, I have been working on testing the performance and resiliency of a range of virtual workloads including Business Critical Applications on the Nutanix platform. At the time, Nutanix only offered a single form factor (4 nodes in 2RU) which was not always a perfect fit depending on customer requirements.

Fast forward to August 2014 and now Nutanix has a wide range of node types to meet most workload requirements which can be found here.

The only real gap in the node types was a node which would support applications with large capacity requirements and also have a very large active working set which requires consistent low latency and high performance regardless of tier.

So what do I mean when I say "Active Working Set"? I would define this as the data being regularly accessed by the VM/s. For example, a file server may have 10TB of data, but users only access 10% of it on a regular basis; this 10% I would classify as the Active Working Set.

Now back to the topic at hand. The reason I am writing this post is that this has been a project I have been working towards for some time, and I am very excited about this product being released to the market. I have no doubt it will further increase the already fast uptake of Web-scale solutions and provide significant value and opportunities to new and existing customers wanting to simplify their datacenter/s and standardize on Nutanix Web-scale architecture.

Along with many others at Nutanix, we proposed a new node type (being the NX-8150), which has been undergoing thorough testing in my team (Solutions & Performance Engineering) for some time and I am pleased to say is being officially released (very) soon!

[Image: NX-8150]

What is the NX-8150?

A 1 Node per 2RU platform with the following specifications:

* 2 CPU Sockets with two CPU options (E5-2690v2 [20 cores / 3.0 GHz] OR E5-2697v2 [24 cores / 2.7 GHz])
* 4 x Intel 3700 Series SSDs (ranging from 400GB to 1.6TB ea)
* 20 x 1TB SATA HDDs
* Up to 768GB RAM
* Up to 4 x 10Gb NICs
* 4 x 1Gb NICs
* 1 x IPMI (Out of band Management)

What is the use case for the NX-8150?

Simply put, Applications which have high CPU/RAM requirements with large active working sets and/or the requirement for consistent high performance over a large data set.

Some examples of these applications include:

* Microsoft Exchange including DAG deployments
* Microsoft SQL including Always on Availability Groups
* Oracle including RAC
* SAP
* Microsoft Sharepoint
* Mixed Production Server Workloads with varying Capacity & I/O requirements

The NX-8150 is a great platform for the above workloads as it not only has fast CPUs and up to a massive 768GB of RAM to provide substantial compute resources to VMs, but also up to a massive 6.4TB of RAW SSD capacity for Virtual machines with high IO requirements. For workloads where peak performance is not critical the NX-8150 also provides solid consistent performance across the “Cold Tier” provided by the 20 x 1TB HDDs.

As with all Nutanix nodes, Intelligent Life-cycle management (ILM) maximizes performance by dynamically migrating hot data to SSD and cold data to SATA to provide the best of both worlds being high IOPS and high capacity.

One of the many major advantages of Nutanix Web-Scale architecture is simplicity and its ability to remove the requirement for application-specific silos! Now with the addition of the NX-8150, the vast majority of workloads including Business Critical Applications can be run successfully on Nutanix, meaning fewer silos are required, resulting in a simpler, more cost effective, scalable and resilient datacenter solution.

Now with a number of customers already placing advance orders for NX-8150s to deploy Business Critical Applications, it won't be long until the now common "Virtual 1st" policies within many organisations turn into a "Nutanix Web-Scale 1st" policy!

Stay tuned for upcoming case studies for NX-8150 based Web-Scale solutions!

My VCAP5-CIA Experience

Yesterday (21st July 2014) I sat and passed the VMware Certified Advanced Professional Cloud Infrastructure Administration (VCAP5-CIA) exam at my local test centre here in Melbourne, Australia.

As with the VCAP-DCA, which I did as a prerequisite for VCDX back in 2011, the CIA exam is a live lab exam where VMware gets you to demonstrate your hands-on expertise with their products.

I find the value of the VCDX is in part due to the fact that it requires not only "design" but hands-on implementation/administration/troubleshooting experience, as in my opinion a person should not be an architect unless they have the hands-on experience and ability to implement and support the solution as designed.

So, enough rambling, what did I think of the VCAP-CIA?

As with all VMware certifications, the exams are generally well written and closely aligned to the blueprints which VMware provide. For VCAP-CIA the blueprint and exam registration can be found here.

The VCAP-CIA was no different, and aligned very well to the blueprint.

The exam is 210 minutes and has 32 questions, some of which are simple 1 minute tasks while others require a significant amount of work. One secret to all VCAP exams is that you are challenged not only by the questions, but by the clock, as time is the enemy. This makes time management essential. Do not get caught up on one question; if you're unsure, do your best and move on.

Be aware that some questions are dependent on successful completion of earlier questions, but in saying that, a lot of questions are not, so don't be afraid to skip questions if you're struggling, as you will still be able to complete many other questions.

The actual live lab in the exam consists of seven ESXi hosts, three vCenter Server virtual machines and four VMware vCloud™ Director (vCD) cells, plus additional supporting resources. A number of pre-configured vApps and virtual machines will also be present for use with certain tasks. It is important to understand the lab environment is based on VMware vCloud Suite 5.1 and vCenter Chargeback Manager 2.5, not vCloud 5.5, so ensure you study and prepare using the correct versions of vCloud/vCB!

At this stage some of you may be thinking I have just breached the NDA by telling the world about the exam. Well, I haven't, and this is the beauty of how VMware does their exam blueprints: the above information is all available in the blueprint, so there is no trickery or secrecy to the lab.

As for the questions in the VCAP-CIA, you will not get a brain dump out of me, but what I can tell you is that the questions are in most cases very clear, and what is asked of candidates is largely skills that anyone with significant vCD experience would be familiar with. For example, the blueprint under Objective 1.2 – Configure vCloud Director for scalability, states under skills and abilities:

 Generate vCloud Director response files
 Add vCloud cells to an existing installation using response files
 Set up vCloud Director transfer storage space
 Configure vCloud Director load balancing

It's safe to say that if you know the blueprint properly, you will be able to complete the tasks in the exam and, as a result, get a passing score.

Now the bad news!

I am based in Melbourne, Australia, and the live lab is accessed via RDP to a location in Seattle, USA. So what does this mean? Latency!

I was only able to complete about two-thirds of the questions, in large part due to the delay in the screen refreshing after switching between, for example, the vCD web interface, the product documentation, PuTTY, etc.

On that point, all the PDF and HTML documentation is available in the exam, but I would highly recommend you don’t rely on it, because accessing the doco and searching/scrolling for things is very slow, at least it was for me.

I had numerous occasions where the screen would totally freeze, which was a concern, but I soon accepted this was a latency issue, and the lab was fine, and waited out the freezes (which varied from a few seconds to around 20 seconds, which feels like hours when you're against the clock!)

I have heard from numerous other VCAP-CIA candidates who sat the exam in the Australia/NZ region that they experienced the same issues, so if you are A/NZ based, or in any location a long way from the USA, be prepared for this.

Now being a live lab, the exam is not scored on the spot, and you have to wait for VMware to score the exam and then you will receive an electronic score report via email. The exam receipt says 15 business days, but I was very impressed that less than 24 hours after sitting the exam, I got my score report. Obviously VMware education have done a great job in automating the scoring process, which is a credit to them!

Overall, the experience of the VCAP-CIA was very good, the exam/questions are a solid test of vCloud related skills and experience, so great work VMware Education!

I am very pleased to have completed this exam and all prerequisites for VCDX-Cloud (VCP-Cloud, VCAP-CID and VCAP-CIA) and I will be submitting my application in the near future.

Enterprise Architecture & Avoiding tunnel vision.

Recently I have read a number of articles and had several conversations with architects and engineers across various specialities in the industry and I’m finding there is a growing trend of SMEs (Subject Matter Experts) having tunnel vision when it comes to architecting solutions for their customers.

What I mean by "Tunnel Vision" is that the architect only looks at what is right in front of him/her (e.g.: the current task/project), and does not consider the implications of how the decisions being made for this task may impact the wider I.T. infrastructure and the customer from a commercial / operational perspective.

In my previous role I saw this all too often, and it was frustrating to know the solutions being designed and delivered to customers were in some cases quite well designed when considered in isolation, but when taking into account the "Big Picture" (or what I would describe as the customer's overall requirements), the solutions were adding unnecessary complexity, adding risk and increasing costs, when new solutions should be doing the exact opposite.

Let's start with an example:

Customer “ACME” need an enterprise messaging solution and have chosen Microsoft Exchange 2013 and have a requirement that there be no single points of failure in the environment.

The customer engages an Exchange SME who looks at the requirements for Exchange; he then points to a vendor best practice or reference architecture document and says "We'll deploy Exchange on physical hardware, with JBOD & no shared storage, and use Exchange Database Availability Groups for HA."

The SME then attempts to justify his recommendation with "because it's Microsoft's best practice", which most people still seem to blindly accept, but that is a story for another post.

In fairness to the SME, in isolation the decision/recommendation meets the customer's messaging requirements, so what's the problem?

If the customer had no existing I.T., the messaging system was going to be the only I.T. infrastructure and they had no plans to run any other workloads, I would say the proposed solution could be an excellent one, but how many customers only run messaging? In my experience, none.

So let's consider that the customer has an existing virtual environment running Test/Dev, Production and Business Critical applications and adheres to a "Virtual First" policy.

The customer has already invested in virtualization & some form of shared storage (SAN/NAS/Web-Scale) and has operational procedures and expertise in supporting and maintaining this environment.

If we were to add a new "silo" of physical servers, there are many disadvantages to the customer, including but not limited to:

1. Additional operational documentation for new Physical environment.

2. New Backup & Disaster Recovery strategy / documentation.

3. Additional complexity managing / supporting a new Silo of infrastructure.

4. Reduced flexibility / scalability with physical servers vs virtual machines.

5. Increased downtime and/or impact in the event of hardware failures.

6. Increased CAPEX due to having to size for future requirements due to scaling challenges with physical servers.

So what am I getting at?

The cost of deploying the MS Exchange solution on physical hardware could potentially be cheaper (CAPEX) on Day 1 than virtualizing the new workload on the existing infrastructure (which likely needs to be scaled, e.g.: disk shelves / nodes), BUT it would likely result in an overall higher TCO (Total Cost of Ownership) due to the increased complexity and operational costs of creating a new silo of resources.

Both a physical and a virtual solution would likely meet/exceed the customer's basic requirement to serve MS Exchange, but they may have vastly different results in terms of the big picture.

Another example would be a customer with a legacy SAN which needs to be replaced and is causing issues for a large portion of the customer's workloads, but the project being proposed is only to address the new enterprise messaging requirements. In my opinion a good architect should consider the big picture and try to identify where projects can be combined (or a project's scope increased) to ensure a more cost effective yet better overall result for the customer.

If the architect only looked at Exchange and went with physical servers with JBOD, there is zero chance of improvement for the rest of the infrastructure, and the physical equipment for Exchange would likely be oversized and underutilized.

It will in many cases be much more economical to combine two or more projects to enable the purchase of new technology or infrastructure components and consolidate the workloads onto shared infrastructure, rather than building two or more silos which add complexity to the environment and will likely result in underutilized infrastructure and a solution which is inferior to what could have been achieved by combining the projects.

In conclusion, I hope that after reading this article, the next time you or your customers embark on a new project, you as the Architect, Project Manager or Engineer consider the big picture and not just the new requirement, and ensure your customer/s get the best technical and business outcomes, avoiding where possible the use of silos.

Nutanix Tech Notes for VMware vSphere

I thought I would put together a single page which has links to all the current Nutanix Tech Notes relating to VMware vSphere as well as have a bit of a teaser list of upcoming documents.

This will be a living post, and updated regularly as new documents are released.

Tech Notes

1. Nutanix Storage Configuration for VMware vSphere

2. VMware vSphere Networking on Nutanix

3. VMware High Availability Configuration for Nutanix (Coming soon)

4. VMware Distributed Resource Scheduler on Nutanix (Coming soon)

5. VMware Storage configuration on Nutanix (Coming soon)

6. VMware vSphere Cluster design with Nutanix (Coming soon)

7. Optimal Virtual Machine design with Nutanix (Coming soon)

8. Monster VM design with Nutanix (Coming soon)

ESXi Host Isolation Response and custom isolation address configuration.

I was reviewing a vSphere design recently and I came across an interesting design choice which I thought I would share.

The architect selected the isolation response of "Leave Powered On", disabled "das.usedefaultisolationaddress" (which is enabled by default) and configured multiple custom isolation addresses using the "das.isolationaddressX" advanced setting.

The architect explained that this was done to minimize the chance of a false positive isolation event. In many environments such as ones using IP storage or where the ESXi Management VMKernel default gateway is not highly available, this can be a very good idea.

In this environment, the storage was provided via FC and the default gateway was highly available.

So was there a benefit in changing the default setting of “das.usedefaultisolationaddress” and configuring custom isolation addresses?

The short answer is No.

This is because the isolation response is configured with “Leave Powered On” so regardless of the host being isolated or not, the Virtual Machines will remain powered on.

So keep it simple: if your isolation response is "Leave Powered On", there is no need to change either of these advanced settings.
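For completeness, in environments where these settings are warranted (for example IP storage, as covered in the related articles below), they are simply per-cluster HA advanced options. The below is a minimal pyVmomi sketch of how they might be set; the vCenter details, cluster path and isolation addresses are placeholders.

# Hypothetical pyVmomi sketch: set HA advanced options for custom isolation addresses.
# Only relevant where the isolation response is NOT "Leave Powered On".
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
try:
    cluster = si.content.searchIndex.FindByInventoryPath("DC01/host/Cluster01")  # assumed path
    spec = vim.cluster.ConfigSpecEx(
        dasConfig=vim.cluster.DasConfigInfo(option=[
            vim.OptionValue(key="das.usedefaultisolationaddress", value="false"),
            vim.OptionValue(key="das.isolationaddress0", value="192.168.1.10"),  # e.g. NAS/iSCSI VIP
            vim.OptionValue(key="das.isolationaddress1", value="192.168.1.11"),
        ]))
    WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))
finally:
    Disconnect(si)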

The below articles show examples of isolation response and custom isolation addresses configurations for IP Storage, FC storage and Hyper-converged environments.

Related Articles

1. Host Isolation Response for IP Storage
2. Host isolation response for FC based Storage
3. Host Isolation Response for a Nutanix Environment

Example Architectural Decision – Default Virtual Machine Compatibility Configuration

Problem Statement

In a VMware vSphere 5.5 environment, what is the most suitable configuration for Virtual Machine Compatibility setting at the Datacenter and Cluster layers?

Assumptions

1. vSphere Flash Read Cache is not required.
2. VMDKs of greater than 2TB minus 512b are not required.

Motivation

1. Reduce complexity where possible.
2. Maximize supportability.

Architectural Decision

Configure the vSphere Datacenter level "Default VM Compatibility" as "ESXi 5.1 or later" and leave the vSphere Cluster level "Default VM Compatibility" as "Use datacenter setting and host version" (default).
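A minimal pyVmomi sketch of applying this decision is below. The vCenter details and datacenter name are placeholders, and to the best of my knowledge the hardware version key "vmx-09" corresponds to the "ESXi 5.1 or later" compatibility level; verify against your SDK version before relying on it.

# Hypothetical pyVmomi sketch: set the datacenter-level Default VM Compatibility.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
try:
    dc = si.content.searchIndex.FindByInventoryPath("DC01")  # assumed datacenter name
    spec = vim.Datacenter.ConfigSpec(defaultHardwareVersionKey="vmx-09")  # "ESXi 5.1 or later"
    WaitForTask(dc.ReconfigureDatacenter_Task(spec=spec, modify=True))
finally:
    Disconnect(si)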

Justification

1. Avoid limiting management of the environment to the vSphere Web Client.
2. The Default VM Compatibility only needs to be set once at the datacenter layer and then all clusters within the datacenter will inherit the desired setting.
3. Reduce the dependency on the Web Client in the event of a disaster recovery.
4. As vFRC and >2TB VMDKs and vGPU are not required, there is no significant advantage to HW Version 10.
5. Ensures a standard virtual machine compatibility level is maintained throughout the environment, reducing the chance of mismatched VM hardware versions.
6. Simplicity.

Implications

1. Virtual Machine Hardware Compatibility automatic update must be DISABLED to prevent the VM hardware being automatically upgraded following a shutdown.
2. vSphere Flash Read Cache (vFRC) cannot be used.
3. VMDKs will remain limited at 2TB minus 512b.

Alternatives

1. Virtual Machine HW Version 10 (vSphere 5.5 onwards).
2. Virtual Machine HW Version 8 (vSphere 5.0 onwards).
3. Virtual Machine HW Version 7 (vSphere 4.1 onwards).
4. Older Virtual machine HW versions.

 

Virtual Machine Swap File Location & Capacity Usage on Nutanix

The location of the Virtual Machine swap file can be critical when deploying vSphere with traditional centralized storage solutions, or legacy solutions which acknowledge "zeros" or "white space", as the Virtual Machine swap file can be as large as the VM's configured vRAM where memory reservations are not used.

The below shows the default configuration.
[Image: Virtual Machine swap file location setting]

If a VM resides on Tier 1 storage for example, and the VM does not have a memory reservation set (or a reservation of less than 100%), the Swap-file will take up valuable Tier 1 storage capacity.

This can be avoided by specifying a swap-file datastore; however, this introduces complexity, and in the event the swap-file datastore is on a low tier of storage, performance will degrade significantly if swapping occurs.

Some platforms recommend having different datastores for VM swap files to minimize the overhead on de-duplication or replication for environments using SRM, as discussed in Example Architectural Decision – Virtual Machine Swap-file location for SRM Protected VMs.

The Nutanix Distributed File System does not write "white space" to disk; as a result, the impact of Virtual Machine swap files is negligible, which makes swap-file placement much less of an issue.

The only time when Virtual machine swap files will use storage capacity in the Nutanix Distributed File System is when host memory utilization is >100% and swapping needs to occur.

As such, the default vSphere configuration of "Virtual Machine Directory" is ideal for Nutanix environments: valuable storage capacity is not unnecessarily wasted, usable space is increased, and complexity is reduced by removing the requirement for dedicated swap-file datastores, all without compromising the benefits of de-duplication and compression.
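If you want to confirm that VMs are using the default behaviour (or reset one that has been pointed at a dedicated swap-file policy), a small pyVmomi sketch like the below can be used. The connection details and VM path are placeholders, and the swapPlacement values of "default", "vmDirectory" and "hostLocal" reflect my understanding of the API.

# Hypothetical pyVmomi sketch: check a VM's swap-file placement and reset it to the
# VM's working directory if it has been changed.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
try:
    vm = si.content.searchIndex.FindByInventoryPath("DC01/vm/Admin01")  # assumed path
    print("Current swap placement:", vm.config.swapPlacement)
    if vm.config.swapPlacement not in ("default", "vmDirectory"):
        WaitForTask(vm.ReconfigVM_Task(vim.vm.ConfigSpec(swapPlacement="vmDirectory")))
finally:
    Disconnect(si)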

VCDX Defence Essentials – Part 3 – Preparing for the Troubleshooting Scenario

Following on from Part 1 - Preparing for the Design Defence & Part 2 - Preparing for the Design Scenario, Part 3 covers my tips for the final stage of the VCDX defence, the Troubleshooting Scenario.

After completing the 75min Design Defence and the 30min Design Scenario, if you're still standing and haven't retreated at full speed, your final challenge is the 15min Troubleshooting Scenario.

As mentioned in the previous parts of this series, I am not an official panellist and I do not know how the scoring works. The below is my advice based on conducting mock panels, the success rate of candidates I have conducted mock panels with, and my successfully achieving VCDX on the 1st attempt.

If you have read Part 2, then you should notice several similarities in both the common mistakes and tips below.

Common Mistakes

1. Trying to guess the solution to the issue

Taking pot-shot guesses at what the problem/s might be does not prove your expertise. If you don't methodically work through the issue and just keep making guesses, you're not doing yourself, or the people trying to assess your expertise, any good.

2. Not documenting the troubleshooting steps you have completed

Assuming you have not made Mistake #1, and you are methodically working through the troubleshooting scenario, a common mistake I see is a candidate getting confused about what they have or have not investigated.

When candidates repeat the same troubleshooting steps because they have lost track, it does nothing but waste time and does not increase their chance of passing.

15 mins goes by in a flash, you cannot afford to waste time!

3. Going down a rabbit hole

Same as in the design scenario, I have observed many candidates who are clearly very knowledgeable but who have spent the majority of the time troubleshooting one specific area of the environment, e.g. just the vSphere layer.

Doing this may demonstrate your expertise in one area really well, but it does not help eliminate as many potential issues as possible within the time constraint.

4. Being Mute!

Again, same as in the design scenario, I have seen candidates who stand staring at the troubleshooting scenario and the whiteboard for minutes at a time.

 

Tips for the Troubleshooting Scenario

1. Do not try to guess the solution to the issue

If you happen to guess the solution (assuming there is one.. hint hint) what expertise have you demonstrated to the panel for them to score you on? The answer is “bugger all” (This is Australian for “none”).

Talk the panel through your troubleshooting methodology. For example, you might choose to go through the OSI model's layers, or you may choose to start with networking, then move on to storage, then the application, then vSphere, etc.

The goal of this section of the defence is to demonstrate your troubleshooting skills, so make sure you explain what you're trying to eliminate. e.g.: If a VM has lost connectivity, you may ask the panel to perform a vMotion of VM1 from host A to host B. You could explain to the panel that if the ping begins to work following the vMotion, you plan to investigate the networking of host A. If the ping does not start working, you will continue to investigate a larger networking issue, such as a VLAN-specific problem.

2. Documenting your troubleshooting steps & findings

Ensure you methodically address each of the key areas of a vSphere solution by writing on the whiteboard headings like the following:

a) Storage/SAN/Protocol

b) Networking/Firewall

c) Compute HW

d) Application/Guest OS

e) vSphere

Ensure you eliminate several (I'd suggest >=3) potential issues in each section, so you are covering off the entire environment, and record what you have done & the result of each troubleshooting step.

Keep in mind, you only have 15 mins, so roughly one item per minute is required if you are to cover all areas thoroughly.

3. Don’t go down a rabbit hole!

Same as in the design scenario, I have observed many candidates who are clearly very knowledgeable but who have spent the majority of the time troubleshooting one specific area of a vSphere environment, e.g. storage.

Doing this may demonstrate your expertise in one area really well, but it does not help eliminate as many potential issues as possible within the time constraint.

Once you have looked into 3 potential issues in storage, move on to networking, or vSphere, etc.

Do not spend more than 60-90 seconds on any one troubleshooting step as this is preventing you demonstrating broad expertise which is the purpose of VCDX.

4. Think out Loud!

Again, same as in the design scenario, I have seen candidates who stand staring at the troubleshooting scenario and the whiteboard, totally silent, for minutes at a time.

Talk the panel through your thought process and expected outcomes for troubleshooting actions.

I cannot give you advice if I don't know what you're thinking! The same goes for the panellists: they can't score you if you don't verbalize your thought process.

No matter what, keep thinking out loud; if you're working through options in your mind, that's what the panel wants to hear, so let them hear it!

Summary

I hope the above tips help you prepare for the VCDX troubleshooting scenario, and best of luck with your VCDX journey. For those who are interested, you can read about My VCDX Journey.

If you have any questions on the VCDX process or the advice given in this series, please leave your comments and I will compile a list of questions and do a Q&A post.