VCDX Defence Essentials – Part 1 – Preparing for the Design Defence

Even before I achieved my VCDX in May 2012, I had been helping VCDX candidates by doing design reviews and more importantly conducting mock panels.

So over the last couple of years I would estimate I would have been involved with at least 15 candidates, which range from a mock panel and advice over a WebEx, to mentoring candidates through their entire journey.

I would estimate I have conducted easily 30+ mock panels, from which I have decided to put together the most common mistakes candidates make, along with my tips for the defence.

Note: I am not a official panellist and I do not know how the scoring works. The below is my advice based on conducting mock panels, the success rate of candidates I have conducted mock panels with and my successfully achieving VCDX on the 1st attempt.

Common Mistakes

1. Using a Fictitious Design

In all cases where I have done a mock panel for a candidate using a Fictitious design, even in cases where I did not know it was fictitious, it becomes very obvious very quickly.

The reason it is obvious for a mock panellist is due to the lack of depth the candidate can go into about the solution, for example, requirements.

In my experience, candidates using fictitious designs generally take multiple attempts before successfully defending.

This may sound harsh, but if you need to use a fictitious design, you probably don’t have enough architecture experience for VCDX, otherwise you would be able to choose from numerous designs to submit, rather than creating a fictitious design.

Some candidates use fictitious designs for privacy or NDA reasons, in this case, I would strongly recommend you should be able to remove customer specific details and defend a real design.

Note: For those VCDXs who have passed with a fictitious design, I am not in any way taking away from there achievement, if anything, they had more of a challenge that people like me who used a real design.

2. Giving an answer of “I was not responsible for that portion of the design”.

If you give this answer, you are demonstrating that you do not have an expert level understanding of the solution as a whole, which translates to Risk/s as the part of the design you were responsible for may not be compatible with the component/s you were not involved in.

A VCDX candidate may not always be the lead architect on a project, but a VCDX level candidate will always ensure he/she has a thorough understanding of the total solution and will ask the right questions of other architects involved with the project to gain at least a solid level of understanding of all parts of the solution.

With this understanding of other components of the total solution, a candidate should be able to discuss in detail how each component influenced other areas of the design, and what impact (positive or negative) this had on the solution.

3. Not knowing your design! 

This is the one which surprises me the most, if your considering submitting for VCDX, or already have submitted and been accepted, you should already know your design back to front, including the areas which you may not have been responsible for.

You should not be dependant on the power point presentation used in your defence as this is really for the benefit of the panellists, not for you to read word for word.

Think about the VCDX panel this way, You (should) know more about your design that the panel does for the simple reason, its your design.

So you have an advantage over the panellists – ensure you maximize this advantage by knowing your design back to front.

If you cannot comfortably talk about your design for 75mins without referring to reference material, you probably should review your design until you can.

4. Not having clear and concise answers of varying depths for the panellists questions

I hear you all saying, how the hell do I know what the panel will ask me? As a VCDX candidate preparing to defend, your basically saying, I am an Expert in virtualization and I want to come and have my expertise validated.

As an Expert (not a Professional or Specialist, but an EXPERT) you should be able to go through your design, and with your reviewers hat on, write down literally 50+ questions that you would ask if you were reviewing this document for somebody else, or indeed, acting as a real or mock panellist.

In my experience, I predicted approx 80% of the questions the panel ended up asking me, which made my defence a much less stressful experience than it may have been otherwise.

Once you have written down these questions (seriously, 50+), you should ask yourself those questions and ensure you have answers to them. The answers you should have, or prepare should be

a) 1st Level – 30 seconds or less which cover the key points at a high level

b) 2nd Level – A further 30 – 60 seconds which expands on (and does not repeat) the 1st level statements

c) 3rd Level – A further 30-60 seconds which is very detailed and shows your deep understanding of the topic.

I would suggest if you don’t have solid 1st and 2nd Level answers to the questions, your probably not ready for VCDX. The 3rd level questions, in some areas you should be strong and be able to go to this depth, in other areas, you may not, but you should prepare regardless and focus on your weak areas.

5. Giving BS answers

I can’t put this any nicer, if you think you can get away with giving a BS answer to the VCDX panel, or even a good mock panellist, your sorely mistaken.

It never ceases to amaze me, people seem to refuse to admit when they don’t know something – nobody knows everything, don’t be afraid to say, I Don’t know.

Don’t waste the precious time that you have to demonstrate your expertise by giving BS answers that will do nothing to help your chances of passing.

In a mock panel situation, take note of any questions your asked, which you dont know, or dont have strong answers too, and review your design, do some research and ensure you understand the topic in detail and can speak about it.

This may result in you finding a weakness in your design which even if your design has been accepted already, you have the chance to highlight these weaknesses in your defence and discuss the implications and what you would/could to differently – this is a great way to demonstrate expertise.

6. Not knowing about Alternatives to your design

If you work for Vendor X, and Vendor X has a pre packaged converged solution, with a cookie cutter reference architecture which you customize for each client, you could have been successfully deploying solutions for years and be an expert in that solution, but this alone doesn’t make you a VCDX, in fact it could mean quite the opposite.

If your solution is a vBlock, with Cisco UCS, EMC storage (FC/FCoE) and vSphere, how would your solution be impacted if the customer at the last min said, we want to use Netapp storage and NFS or what about if the customer dropped EMC/Netapp and went for Nutanix. How would the solution change, what are the pros and cons and how would this impact your vSphere design choices?

If you can’t talk to this, in detail, for example the Pros and Cons of for example

a) Block vs File based storage

b) Blade vs Rack mount

c) Enterprise Plus verses Standard Edition for your environment

d) Isolation response for Block verses IP storage

Then your not a Design Expert, your at best a Vendor X solution specialist.

If you do use a FlexPod, vBlock or Nutanix type solution with a reference architecture (RA) or best practices, you should know the reasons behind every decision in the reference architecture as if you were the person who wrote the document, not just customized it.

Tips for the defence

1. Answer questions before they are asked

Expanding on Common Mistake #4, as previously mentioned, you should be able to work out the vast majority of questions the panel will ask you, by reviewing your design and having others also review your work.

With this information in mind, as you present your architecture, use statements such as

“I used the following configuration for reasons X,Y,Z as doing so mitigated risks 1,2,3 and met the requirements R01,R02”.

This technique allows you to demonstrate expertise by showing you understand why you made a decision, the risks you mitigated and the requirements you met, without being asked a single question.

So what are the advantages of doing this?

a) You demonstrate expertise while saving time therefore maximizing your chance of a passing mark

b) You can prepare these statements, and potential avoid being interrupted and loosing your train of thought.

2. When asked a question, don’t be too long winded.

As mentioned in Common Mistake #4, preparing short concise answers is critical. Don’t give a 5 min answer to a question as this is likely to be wasting time. Give your level 1 answer which should cover the key concepts and decision points in around 30 seconds, and if the panel drills down further, give your Level 2 answer which expands on the Level 1 answer, and so on.

This means you can maximize your time to maximize your score in other areas. If the panel is not satisfied with your answer, they will ask it again, time permitting.

Which leads us nicely onto the next item:

3. If a question is asked twice, go straight to Level 2 answer

If the panel has asked you a question, and you gave the Level 1 answer, and later on your asked the same question again, its possible you gave an unclear or incorrect answer the first time, so now is your chance to correct a mistake or improve your score.

Think about then answer you gave previously, if you made a mistake for any reason, call it out by saying something like, “Earlier I mistakenly said X, however the fact/s are….”

This will show the panel you know you made a mistake, and you do in fact know the correct answer or the topic in question.

4. Near the end of your Defence give more detailed answers

In the first half of the 75mins, giving your Level 1 and maybe Level 2 answers allows you to save time and maximize your score across all areas.

As you get pass the half way mark and nearing the 20 min remaining mark, at this stage, you should have gone over most areas of your design and now is the chance to maximize your score.

When asked questions at this stage, I would suggest the Level 2/3 answers are what you should be giving. Where you may have given a good Level 1 answer, now is the chance to move from a good answer, to a great answer and maximize your score.

Summary

I hope the above tips help you prepare for the VCDX defence and best of luck with your VCDX journey. For those who are interested, you can read about My VCDX Journey.

In Part 2, I will go through Preparing for the Design Scenario, and how to maximize your 30 mins.

 

Scale Out Shared Nothing Architecture Resiliency by Nutanix

At VMware vForum Sydney this week I presented “Taking vSphere to the next level with converged infrastructure”.

Firstly, I wanted to thank everyone who attended the session, it was a great turnout and during the Q&A there were a ton of great questions.

I got a lot of feedback at the session and when meeting people at vForum about how the Nutanix scale out shared nothing architecture tolerates failures.

I thought I would summarize this capability as I believe its quite impressive and should put everyone’s mind at ease when moving to this kind of architecture.

So lets take a look at a 5 node Nutanix cluster, and for this example, we have one running VM. The VM has all its data locally, represented by the “A” , “B” and “C” and this data is also distributed across the Nutanix cluster to provide data protection / resiliency etc.

Nutanix5NodeCluster

So, what happens when an ESXi host failure, which results in the Nutanix Controller VM (CVM) going offline and the storage which is locally connected to the Nutanix CVM being unavailable?

Firstly, VMware HA restarts the VM onto another ESXi host in the vSphere Cluster and it runs as normal, accessing data both locally where it is available (in this case, the “A” data is local) and remotely (if required) to get data “B” and “C”.

Nutanix5nodecluster1failed

Secondly, when data which is not local (in this example “B” and “C”) is accessed via other Nutanix CVMs in the cluster, it will be “localized” onto the host where the VM resides for faster future access.

It is importaint to note, if data which is not local is not accessed by the VM, it will remain remote, as there is no benefit in relocating it and this reduces the workload on the network and cluster.

The end result is the VM restarts the same as it would using traditional storage, then the Nutanix cluster “curator” detects if any data only has one copy, and replicates the required data throughout the cluster to ensure full resiliency.

The cluster will then look like a fully functioning 4 node cluster as show below.

5NodeCluster1FailedRebuild

The process of repairing the cluster from a failure is commonly incorrectly compared to a RAID pack rebuild. With a raid rebuild, a small number of disks, say 8, are under heavy load re striping data across a hot spare or a replacement drive. During this time the performance of everything on the RAID pack is significantly impacted.

With Nutanix, the data is distributed across the entire cluster, which even with a 5 node cluster will be at least 20 SATA drives, but with all data being written to SSD then sequentially offloaded to SATA.

The impact of this process is much less than a RAID rebuild as all Nutanix controllers in the cluster participate and take a portion of the workload as a result the impact per disk, per controller ,per node and importantly for production VMs running in the cluster, is greatly reduced.

Essentially, the larger the cluster, the faster the cluster can repair itself, and the lower the impact on production workloads.

Now lets talk about a subsequent ESXi host failure, now we have two failed nodes, and three surviving nodes, and only one copy of data “A” , “B” and “C” as shown below.

Nutanix5NodeCluster2failures1copydata

Now the Nutanix “Curator” detects only one copy of data “A”, “B” and “C” exists and starts to replicate copies of “A”, “B” and “C” across the cluster. This results in the below which is a fully functional and redundant cluster, capable of surviving yet another failure as shown below.

Nutanix5NodeCluster2Failures

Even in this scenario, where two ESXi hosts are lost, the environment still has 60% of its storage controllers (and performance), as compared to a typical traditional storage product where the loss of just two (2) controllers can have your environment completely offline, and even if you only lost a single controller, you would only have 50% of the storage controllers (and performance) available.

I think this really highlights what VMware and players like Google, Facebook & Twitter have been saying for a long time, scaling out not up, and shared nothing architecture is the way of the future. The only question is who will be dominant in bringing this technology to the mass market, and I think you know who I have my money on.

Scaling problems with traditional shared storage

At VMware vForum Sydney this week I presented “Taking vSphere to the next level with converged infrastructure”.

Firstly, I wanted to thank everyone who attended the session, it was a great turnout and during the Q&A there were a ton of great questions.

One part of the presentation I got a lot of feedback on was when I spoke about Performance and Scaling and how this is a major issue with traditional shared storage.

So for those who couldn’t attend the session, I decided to create this post.

So lets start with a traditional environment with two VMware ESXi hosts, connected via FC or IP to a Storage array. In this example the storage controllers have a combined capability of 100K IOPS.

50kIOPS

As we have two (2) ESXi hosts, if we divide the performance capabilities of the storage controllers between the two hosts we get 50K IOPS per node.

This is an example of what I have typically seen in customer sites, and day 1, and performance normally meets the customers requirements.

As environments tend to grow over time, the most common thing to expand is the compute layer, so the below shows what happens when a third ESXi host is added to the cluster, and connected to the SAN.

33KIOPS

The 100K IOPS is now divided by 3, and each ESXi host now has 33K IOPS.

This isn’t really what customers expect when they add additional servers to an environment, but in reality, the storage performance is further divided between ESXi hosts and results in less IOPS per host in the best case scenario. Worst case scenario is the additional workloads on the third host create contention, and each host may have even less IOPS available to it.

But wait, there’s more!

What happens when we add a forth host? We further reduce the storage performance per ESXi host to 25K IOPS as shown below, which is HALF the original performance.

25KIOPS

At this stage, the customers performance is generally significantly impacted, and there is no easy or cost effective resolution to the problem.

….. and when we add a fifth host? We continue to reduce the storage performance per ESXi host to 20K IOPS which is less than half its original performance.

20KIOPS

So at this stage, some of you may be thinking, “yeah yeah, but I would also scale my storage by adding disk shelves.”

So lets add a disk shelf and see what happens.

20KIOPSAddDiskShelf

We still only have 100K IOPS capable storage controllers, so we don’t get any additional IOPS to our ESXi hosts, the result of adding the additional disk shelf is REDUCED performance per GB!

Make sure when your looking at implementing, upgrading or replacing your storage solution that it can actually scale both performance (IOPS/throughput) AND capacity in a linear fashion,otherwise your environment will to some extent be impacted by what I have explained above. The only ways to avoid the above is to oversize your storage day 1, but even if you do this, over time your environment will appear to become slower (and your CAPEX will be very high).

Also, consider the scaling increments, as a solutions ability to scale should not require you to replace controllers or disks, or have a maximum number of controllers in the cluster. it also should scale in both small, medium and large increments depending on the requirements of the customer.

This is why I believe scale out shared nothing architecture will be the architecture of the future and it has already been proven by the likes of Google, Facebook and Twitter, and now brought to market by Nutanix.

Traditional storage, no matter how intelligent does not scale linearly or granularly enough. This results in complexity in architecture of storage solutions for environments which grow over time and lead to customers spending more money up front when the investment may not be realised for 2-5 years.

I’d prefer to be able to Start small with as little as 3 nodes, and scale one node at a time (regardless of node model ie: NX1000 , NX3000 , NX6000) to meet my customers requirements and never have to replace hardware just to get more performance or capacity.

Here is a summary of the Nutanix scaling capabilities, where you can scale Compute heavy, storage heavy or a mix of both as required.

ScaingSolution