Sizing infrastructure based on vendor Data Reduction assumptions – Part 1

One of the most common mistakes people make when designing solutions is making assumptions. An assumption, in short, is something an architect has failed to investigate and/or validate, and it puts a project at risk of not delivering the desired business outcome/s.

A great example of a really bad assumption is the data reduction ratio a storage platform will deliver.

But what if a vendor offers a data reduction guarantee and promises to give you as much equipment as required if the ratio is not achieved? You're protected, right? The risk of your assumption being wrong is mitigated with the promise of free storage. Hooray!

Let’s explore this for a minute using an example of one of the more ludicrous guarantees going around the industry at the moment:

A guarantee of 10:1 data reduction!

Let's say we have 100TB of data; at 10:1 that means we'd only need 10TB, right? This might only be, say, 4RU of equipment, which sounds great!

After deployment, we start migrating and we only get a more realistic 2:1 data reduction, at which point the project stalls due to lack of capacity.
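To make the shortfall concrete, here is a minimal back-of-envelope sketch using only the example figures above (100TB of data, an assumed 10:1 ratio, an achieved 2:1 ratio):

```python
# Back-of-envelope capacity sizing under different data reduction ratios.
dataset_tb = 100        # logical data to migrate (example figure from above)
assumed_ratio = 10.0    # vendor "guaranteed" 10:1 data reduction
actual_ratio = 2.0      # more realistic 2:1 achieved after migration

purchased_tb = dataset_tb / assumed_ratio   # capacity purchased: 10TB
required_tb = dataset_tb / actual_ratio     # capacity actually required: 50TB
shortfall_tb = required_tb - purchased_tb   # 40TB short

print(f"Purchased {purchased_tb:.0f}TB, require {required_tb:.0f}TB "
      f"-> {shortfall_tb:.0f}TB short "
      f"({required_tb / purchased_tb:.0f}x what was sized for)")
```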

So we go back to the vendor and, let's say, best case scenario, they agree on the spot (HA!) to provide more equipment; it's still unlikely to be delivered in less than 4 weeks.

So your project is delayed a minimum of 4 weeks until the equipment arrives. You now need to go through your change control process, and if you're doing this properly, it will be documented with detailed steps on how to install the equipment, including appropriate back-out strategies in the event of issues.

Change control typically takes some time to prepare, gain approvals, document, etc., especially in larger, mission-critical environments.

When installing any equipment you should also have documented operational verification steps to ensure the equipment has been installed correctly, is highly available, is performing as expected, and so on.

Now that the new equipment is installed, the project continues and all 100TB of your data has been migrated to the new platform. Hooray!

Now let's talk about the ongoing implications of the assumed 10:1 data reduction turning out to be a much more realistic 2:1 ratio.

We now have 5x more equipment than we expected, so assuming the original 10TB was 4RU, we now have 20RU of equipment, which is taking up valuable real estate in our datacenter, or which may even have required leasing another rack in the datacenter.

If the product you purchased was a SAN/NAS, you now have lower IOPS/GB, as you have just added a bunch more disk shelves to the existing controllers. This is because the controllers have a finite amount of performance, and you've just added more drives for them to manage. More drives on a traditional two-controller SAN/NAS is only a good thing if the controllers are not maxed out, and with flash performance ever increasing, the controllers will become the bottleneck, assuming they are not already.
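To illustrate the dilution (the controller IOPS figure below is a hypothetical placeholder, not a measurement of any specific product): the controllers' performance is fixed, so IOPS per GB falls as usable capacity grows behind them.

```python
# IOPS/GB falls as usable capacity grows behind a fixed controller pair.
controller_iops = 200_000   # hypothetical maximum IOPS for the controller pair

for usable_tb in (10, 50):  # as-sized capacity vs post "free storage" capacity
    usable_gb = usable_tb * 1000
    print(f"{usable_tb}TB usable -> {controller_iops / usable_gb:.1f} IOPS/GB")

# Output: 10TB -> 20.0 IOPS/GB, 50TB -> 4.0 IOPS/GB, i.e. 5x less per GB.
```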

If the product was HCI, you now require considerably more network interfaces. Depending on the HCI platform, you may require more hypervisor licensing, further increasing CAPEX and OPEX.

Depending on the HCI product, can you even utilise the additional storage without changing the virtual machines' configuration? It might sound silly, but some products don't distribute data throughout the cluster, instead using mirrored objects, so you may even need to create more virtual disks or redistribute the VMs to make use of the new capacity.

Then you need to consider if the HCI product has any scale limitations, as these may require you to redesign your solution.

What about operational expenses? We now have 5x more equipment, so our environmental costs such as power & cooling will increase significantly, as will our maintenance windows, since in the case of HCI we now have to patch 5x more hypervisor nodes.

Customers typically no longer size for 3-5 years up front, as HCI is becoming the platform of choice over SAN/NAS. This is great, but when your data reduction assumption is wrong (in this example, off by 5x), the ongoing impact is enormous.

This means as you scale, you need to scale at 5x the rate you originally designed for. That’s 5x more rack units (RU), 5x more Power, 5x more cooling required, potentially even 5x more hypervisor licensing.
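As a rough sketch of that ongoing impact (the baseline figures below are hypothetical placeholders; the 5x multiplier is the gap between the assumed 10:1 and achieved 2:1 ratios):

```python
# Ongoing impact of a data reduction assumption that is off by 5x.
multiplier = 5  # assumed 10:1 vs achieved 2:1

planned = {
    "rack_units": 4,        # hypothetical baseline footprint
    "power_kw": 2.0,        # hypothetical baseline power draw
    "nodes_to_patch": 4,    # hypothetical baseline HCI node count
}

for item, baseline in planned.items():
    print(f"{item}: planned {baseline}, actual {baseline * multiplier}")
```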

What does all of this mean?

Your Total Cost of Ownership (TCO) and Return on Investment (ROI) goes out the window!

Interestingly, Nutanix recently considered offering a data reduction guarantee, and I was one of many who objected and strongly recommended we not drop to the level of other vendors just because it makes the sales cycle easier.

All of the reasons above, and more, were put to Nutanix product management, and they made the right decision: even though Nutanix data reduction (and avoidance) is very strong, we did not want to put customers in a position where their business outcomes were potentially at risk due to assumptions.

Summary:

While data reduction is a valuable part of a storage platform, the benefits (data reduction ratios) can and do vary significantly between customers and datasets. Making assumptions about data reduction ratios, even when vendors provide lots of data showing their averages and offer guarantees, does not protect you from potentially serious problems if the ratios are not achieved.

In Part 2, I will go through an example of how misleading data reduction guarantees can be.

VCDX Defence Essentials – Part 2 – Preparing for the Design Scenario

Following on from Part 1 – Preparing for the Design Defence, Part 2 covers my tips for the Design Scenario part of the VCDX defence.

After a short break following your 75-minute Design defence, you're neck deep in the Design Scenario. You are presented with a scenario in which you need to demonstrate your ability to gather requirements, and while you will not be able to complete a design in 30 mins, you should be able to demonstrate the methodology you use to start the process.

As mentioned in Part 1, I am not an official panellist and I do not know how the scoring works. The below is my advice based on conducting mock panels, the success rate of the candidates I have conducted mock panels with, and my own experience of achieving VCDX on the 1st attempt.

Common Mistakes

1. Not gathering and identifying requirements/constraints & risks

The design scenario is very high level and does not provide you with all the information required to properly start a design. Not identifying and clarifying the requirements/constraints and risks will in most cases prevent a candidate from successfully starting the design process.

Note: The word "start" is the key word! You can't start a design without knowing what you're designing for... so don't make this mistake.

2. Not documenting the requirements/constraints & risks

Assuming you have not made Mistake #1, and you have gathered and clarified the requirements/constraints & risks, the next mistake is not writing them down. I have seen many candidates do an excellent job of gathering the information, only to fall in a heap because they waste time asking the same questions over again, having forgotten the details.

30 mins is not a long time; you cannot afford to waste it repeating questions.

3. Going down a rabbit hole

I have observed many candidates who are clearly very knowledgeable spend 10-15 mins talking about one topic, such as HA, going into admission control options and pros/cons, isolation response, etc. They demonstrated lots of expertise, but this did not help them make as much progress as possible on a design within the time constraint.

The design may be excellent in one key area (eg: HA) but severely lacking in all other areas, which would certainly lead to a low score in the design scenario.

4. Not adjusting to changes

The information given to you in the design scenario may not always be correct and may even change halfway through the design. Just like in a customer meeting, the customer doesn't always know the answers to your questions; they may give you an incorrect answer or simply not know the answer, then later on realise they gave you incorrect information and correct themselves.

I deliberately throw curve-balls into mock design scenarios, and I have several times observed a candidate, say, 25 mins into the design scenario when this happens, fail to adjust for whatever reason/s.

5. Being Mute!

I have seen candidates stand staring at the whiteboard, or drawing away madly, while completely mute. Then, after 5-10 mins of drawing/thinking, they talk about what they came up with.

Do you stand in customer meetings mute? No! (Well, you shouldn’t!)


Tips for the Design Scenario

1. Clarify the Requirements/Constraints

Start by clarifying the information that has been provided to you. The information provided may be contradictory, so get this sorted before going any further.

2. Write the requirements/constraints & risks on the Whiteboard

Once you have clarified a piece of information that has been provided to you, write it on the whiteboard under a section heading, such as:

a) Requirements

b) Constraints

c) Risks

d) Assumptions

Now you can quickly review these items without having to remember everything, and if a curve-ball is thrown at you, you can cross out the incorrect information and write down the correct info; this may assist you in modifying your design to cater for the changed requirement/constraint etc.

As you work through the scenario, you may be able to clarify an assumption, allowing you to remove it as an assumption/risk; this shows you're working towards a quality outcome.

3. Write down your decisions!

Ensure you address each of the key areas of a vSphere solution by writing on the whiteboard headings like the following:

a) Storage

b) Networking

c) Compute

d) Availability

e) Datacenter

Ensure you write down at least 3 items per section, so you are covering off the entire environment.

As you make a design choice, write it down. eg: Under Storage, you may be recommending (or constrained to use) iSCSI, so write it down: iSCSI / Block storage.

So, aim to have 5 section headings like the above examples, and at least 3 items per heading, by the end of the 30 mins. If you do the math, that's only 6 mins per section, or 2 mins per item, so make them count.
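A minimal sketch of that time budget (just the arithmetic above, made explicit):

```python
# Time budget for the 30-minute design scenario.
total_minutes = 30
sections = 5                # eg: Storage, Networking, Compute, Availability, Datacenter
items_per_section = 3

minutes_per_section = total_minutes / sections              # 6 mins per section
minutes_per_item = minutes_per_section / items_per_section  # 2 mins per item
print(f"{minutes_per_section:.0f} mins per section, {minutes_per_item:.0f} mins per item")
```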

eg: Availability does not just mean an N+1 vSphere cluster; what about, say, environmental items such as UPS? A successful VCDX-level design is not just about vSphere.

4. Verbalize your thought process.

I cannot give you advice if I don't know what you're thinking! It's the same with the panellists: they can't score you if you don't verbalize your thought process.

No matter what, keep thinking out loud; if you're working through options in your mind, that's what the panel wants to hear, so let them hear it!

The longer you are mute during the 30 mins, the lower your chances of increasing your score.

5. Show how you adjust to changes in requirements/constraints/assumptions!

As a VCDX candidate, you're most likely an architect day to day, so you will have dealt with this many times in real life; deal with it in the design scenario too!

If you're 25 mins into the design scenario, and the panel suddenly tells you the CIO went out drinking on the weekend with his new buddy at storage vendor X and decided to scrap the old vendor's storage and go with another vendor, deal with it!

Talk about the implications of moving from vendor X to vendor Y, for example FC to NFS: how would this change the design, would it still meet the requirements, or would it be a risk?

6. Don’t be afraid to draw diagrams – but don’t spend all day making it pretty!

Use the whiteboard to draw your solution as it develops, but don't waste time drawing fancy diagrams. A square box with ESXi written in it is a host; it doesn't need to be pretty.

eg: If you're drawing a 16-node cluster, draw three squares labelled ESXi01, ESXi... and ESXi16; don't draw 16 boxes, as this adds no value, wastes time, and makes the diagram harder to draw.


Summary

I hope the above tips help you prepare for the VCDX design scenario and best of luck with your VCDX journey. For those who are interested, you can read about My VCDX Journey.

In Part 3, I will go through Preparing for the Troubleshooting Scenario, and how to maximize your 15 mins.

Example Architectural Decision – Site Recovery Manager Server – Physical or Virtual?

Problem Statement

To ensure the production vSphere environment/s can meet/exceed the required RTOs in the event of a declared site failure, what is the most suitable way to deploy VMware Site Recovery Manager: on a physical or a virtual machine?

Requirements

1. Meet/Exceed RTO requirements

2. Ensure solution is fully supported

3. SRM must be highly available, or able to be recovered rapidly, to ensure management/recovery of the virtual infrastructure

4. Where possible, reduce the CAPEX and OPEX for the solution

5. Ensure the environment can be easily maintained in BAU

Assumptions

1. Sufficient compute capacity in the Management cluster for an additional VM

2. SRM database is hosted on an SQL server

3. vSphere Cluster (ideally the Management cluster) has N+1 availability

Constraints

1. None

Motivation

1. Reduce CAPEX and OPEX

2. Reduce the complexity of BAU maintenance / upgrades

3. Reduce power / cooling / rackspace usage in datacenter

Architectural Decision

Install Site Recovery Manager on a Virtual machine

Justification

1. Ongoing datacenter costs relating to Power / Cooling and Rackspace are avoided

2. Placing Site Recovery Manager on a virtual machine ensures the application benefits from the availability, load balancing, and fault resilience capabilities provided by vSphere

3. The CAPEX of a virtual machine is lower than that of a physical system, especially when taking into consideration the network/storage connectivity the additional hardware would require if a physical server were used

4. The OPEX of a virtual machine is lower than that of a physical system due to no hardware maintenance, minimal/no additional power usage, and no cooling costs

5. Improved scalability and the ability to dynamically add resources (where required) to cater for increased resource consumption by the VM. Note: The guest operating system must support Hot Add / Hot Plug, and the features must be enabled while the VM is shut down (see the sketch after this list). Where these features are not supported, virtual hardware can be added with a short outage.

6. Improved manageability, as the VMware abstraction layer makes day-to-day tasks such as backup/recovery easier

7. The ability to non-disruptively migrate to new hardware where EVC is enabled and configured in a compatible mode between hosts within a vSphere datacenter
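For justification 5, the following is a minimal sketch of how CPU/Memory Hot Add might be enabled using pyVmomi. It is illustrative only: the vCenter address, credentials, and VM name are hypothetical placeholders, and the flags can only be changed while the VM is powered off.

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Hypothetical connection details - replace with your environment's values.
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="password",
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Find the SRM VM by name (hypothetical name) via a container view.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "srm-server-01")

# Hot Add flags can only be changed while the VM is powered off.
spec = vim.vm.ConfigSpec(cpuHotAddEnabled=True, memoryHotAddEnabled=True)
vm.ReconfigVM_Task(spec=spec)

Disconnect(si)
```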

Alternatives

1. Place SRM on a physical server

Implications

1. For some storage arrays, the SRM server needs access to admin LUNs, and using a virtual machine may increase complexity due to the requirement for RDMs

I would like to thank James Wirth VCDX#83 (@jimmywally81) for his contribution to this example architectural decision.

Related Articles

1. Site Recovery Manager Deployment Location

2. Swap file location for SRM protected VMs
