Sizing infrastructure based on vendor Data Reduction assumptions – Part 1

One of the most common mistakes people make when designing solutions is making assumptions. Assumptions, in short, are things an architect has failed to investigate and/or validate, which puts a project at risk of not delivering the desired business outcome(s).

A great example of a really bad assumption is the data reduction ratio a storage platform will deliver.

But what if a vendor offers a data reduction guarantee and promises to provide as much equipment as required if the ratio is not achieved? You're protected, right? The risk of your assumption being wrong is mitigated by the promise of free storage. Hooray!

Let’s explore this for a minute using an example of one of the more ludicrous guarantees going around the industry at the moment:

A guarantee of 10:1 data reduction!

Let's say we have 100TB of data; at 10:1, that means we'd only need 10TB of physical capacity, right? That might only be, say, 4RU of equipment, which sounds great!

After deployment, we start migrating and we only get a more realistic 2:1 data reduction, at which point the project stalls due to lack of capacity.
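
To make the arithmetic explicit, here is a minimal Python sketch of the sizing gap. The 100TB dataset, the 10:1 and 2:1 ratios and the 4RU-per-10TB figure are simply the illustrative numbers from the example above, not vendor data:

```python
# Minimal sizing sketch using the illustrative numbers from the example above.
logical_data_tb = 100          # data to be migrated
assumed_ratio = 10             # vendor-guaranteed data reduction (10:1)
realistic_ratio = 2            # ratio actually achieved after migration (2:1)
ru_per_10tb = 4                # assumed rack units per 10TB of usable capacity

def required_capacity_tb(logical_tb, reduction_ratio):
    """Physical capacity needed for a given logical dataset and reduction ratio."""
    return logical_tb / reduction_ratio

assumed_tb = required_capacity_tb(logical_data_tb, assumed_ratio)    # 10TB
actual_tb = required_capacity_tb(logical_data_tb, realistic_ratio)   # 50TB

print(f"Capacity purchased: {assumed_tb:.0f}TB ({assumed_tb / 10 * ru_per_10tb:.0f}RU)")
print(f"Capacity required:  {actual_tb:.0f}TB ({actual_tb / 10 * ru_per_10tb:.0f}RU)")
print(f"Shortfall factor:   {actual_tb / assumed_tb:.0f}x")
```

The same two-line calculation, done before signing anything, shows exactly how exposed the design is if the guaranteed ratio is not achieved.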

You go back to the vendor and, let's say, best case scenario, they agree on the spot (HA!) to give you more equipment. Even then, it's unlikely to be delivered in less than 4 weeks.

So your project is delayed a minimum of 4 weeks until the equipment arrives. You then need to go through your change control process, and if you're doing this properly it will be documented with detailed steps on how to install the equipment, including appropriate back-out strategies in the event of issues.

Change control typically takes time to prepare, document and push through approvals, especially in larger, mission-critical environments.

When installing any equipment you should also have documented operational verification steps to ensure the equipment has been installed correctly, is highly available and is performing as expected.

Now that the new equipment is installed, the project continues and all 100TB of your data has been migrated to the new platform. Hooray!

Now let's talk about the ongoing implications of assuming 10:1 data reduction and only achieving a much more realistic 2:1 ratio.

We now have 5x more equipment than we expected, so assuming the original 10TB was 4RU, we now have 20RU of equipment taking up valuable real estate in our datacenter, or possibly requiring us to lease another rack.

If the product you purchased was a SAN/NAS, you now have lower IOPS/GB, as you have just added a bunch more disk shelves to the existing controllers. The controllers have a finite amount of performance, and you've just added more drives for them to manage. More drives on a traditional two-controller SAN/NAS are only a good thing if the controllers are not maxed out, and with flash performance ever increasing, the controllers will quickly become the bottleneck, assuming they are not already.

If the product was HCI, you now require considerably more network interfaces and, depending on the HCI platform, more hypervisor licensing, further increasing CAPEX and OPEX.

Depending on the HCI product, can you even utilise the additional storage without changing the virtual machines' configuration? It might sound silly, but some products don't distribute data throughout the cluster, instead storing mirrored objects, so you may need to create more virtual disks or redistribute the VMs to make use of the new capacity.

Then you need to consider if the HCI product has any scale limitations, as these may require you to redesign your solution.

What about operational expenses? We now have 5x more equipment, so environmental costs such as power and cooling will increase significantly, as will our maintenance windows, since in the case of HCI we now have 5x more hypervisor nodes to patch.

Customers typically no longer size for 3-5 years up front, because HCI, rather than SAN/NAS, is becoming the platform of choice and allows incremental scaling. This is great, but when your data reduction assumption is wrong (in this example, off by 5x), the ongoing impact is enormous.

This means as you scale, you need to scale at 5x the rate you originally designed for. That's 5x more rack units (RU), 5x more power, 5x more cooling and potentially even 5x more hypervisor licensing.
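
As a rough illustration of how that 5x flows through to ongoing costs, the sketch below multiplies out rack space, power and hypervisor licensing. The per-RU power draw, power price, node count and per-node licence cost are purely hypothetical placeholders, not vendor figures:

```python
# Rough OPEX illustration only; every unit cost below is a hypothetical placeholder.
shortfall_factor = 5            # designed capacity vs actual requirement (from the example)
designed_ru = 4
designed_nodes = 4              # e.g. 4 HCI nodes in the original design (hypothetical)
watts_per_ru = 350              # hypothetical average draw per RU
power_cost_per_kwh = 0.20       # hypothetical $/kWh; cooling overhead ignored
licence_per_node = 5000         # hypothetical annual hypervisor licence per node

def annual_power_cost(ru):
    """Very rough yearly power cost for a given number of rack units."""
    return ru * watts_per_ru / 1000 * 24 * 365 * power_cost_per_kwh

for label, factor in (("Designed", 1), ("Actual", shortfall_factor)):
    ru = designed_ru * factor
    nodes = designed_nodes * factor
    print(f"{label}: {ru}RU, ~${annual_power_cost(ru):,.0f}/yr power, "
          f"${nodes * licence_per_node:,}/yr licensing")
```

Whatever numbers you plug in for your own environment, every ongoing cost line scales by the same shortfall factor.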

What does all of this mean?

Your Total Cost of Ownership (TCO) and Return on Investment (ROI) go out the window!

Interestingly, Nutanix recently considered offering a data reduction guarantee and I was one of many who objected and strongly recommended we not drop to the levels of other vendors just because it makes the sales cycle easier.

All of the reasons above, and more, were put to Nutanix product management and they made the right decision: even though Nutanix data reduction (and avoidance) is very strong, we did not want to put customers in a position where their business outcomes were potentially at risk due to assumptions.

Summary:

While data reduction is a valuable part of a storage platform, the benefits (data reduction ratio) can and do vary significantly between customers and datasets. Making assumptions about data reduction ratios, even when vendors provide lots of data showing their averages and back them with guarantees, does not protect you from potentially serious problems if those ratios are not achieved.

In Part 2, I will go through an example of how misleading data reduction guarantees can be.

Things to consider when choosing infrastructure.

With all the choice in the compute/storage market at the moment, choosing new infrastructure for your next project is not an easy task.

In my experience most customers (and many architects) think about the infrastructure coming up for replacement and look to do a “like for like” replacement with newer/faster technology.

An example of this would be a customer with a FC SAN running Oracle workloads where the customer or architect replaces the end of life Hybrid FC SAN with an All Flash FC SAN and continues running Oracle “as-is”.

Now I’m not saying there is anything wrong with that, however if we consider more than just the one workload, we may be able to achieve our business requirements with a more standardized and cost effective approach than having dedicated infrastructure for specific workloads.

So in this post, I am inviting you to consider the bigger picture.

Let's take an example customer with the following workload requirements:

  1. Virtual Desktop (VDI)
  2. Virtualized Business Critical Applications (e.g.: SQL / Exchange)
  3. Long Term Archive (High Capacity, low IOPS)
  4. Business Continuity and Disaster Recovery

It is unlikely any one solution from any vendor is going to be the “best” in all areas as every solution has its pros and cons.

Regarding VDI, I would say most people would agree Hyperconverged Infrastructure (HCI) / scale-out type architectures are strong for VDI; however, VDI can be successfully deployed on traditional SAN/NAS solutions, or on non-shared local storage in the case of non-persistent desktops.

For vBCA, some people believe physical servers with JBOD storage are best for workloads like Exchange, and physical servers with local SSD are best for databases, while many people are realising the benefits of virtualizing vBCA on shared storage such as SAN/NAS, or on HCI.

For long term archive, cost per GB is generally one of, if not the, most critical factor. Lots of trays of SATA storage connected to a small dual-controller setup may be the most cost effective, whereas an all-flash array is less likely to be considered for this use case.

For BC/DR, features such as a Storage Replication Adapter (SRA) for VMware Site Recovery Manager, a stretched cluster capability and some form of snapshot and replication capability would be typical requirements. Some newer technology can do per-VM snapshots, whereas older style SAN/NAS technology may be per LUN, so newer technology has an advantage here, but again, this doesn't mean the older technology should not be considered.

So what product do we choose for each workload type? The best of breed right?

Well, maybe not. Let's have a look at why you might not want to do that.

The below graph shows an example of 3 vendors compared across the 4 categories mentioned above: VDI, vBCA, Long Term Archive and BC/DR.

[Image: Example graph comparing Vendors A, B and C across the four categories]

The customer has determined that a score of 3 is required to meet their requirements so a solution failing to achieve a 3 or higher will not be considered (at least for that workload).

As we can see, for VDI Vendor B is the strongest, Vendor A second and Vendor C third, but when we compare BC/DR Vendor C is strongest followed by Vendor A and lastly Vendor B.

We can see for Long Term Archive Vendor A is the strongest with Vendor B and C tied for second place and finally for vBCA Vendor B is the strongest, Vendor A second and Vendor C third.

So if we chose the best vendor for each workload type (the "best of breed" solution), we would end up with three different vendors' equipment.

  • VDI: Vendor B
  • Long Term Archive: Vendor A
  • BC/DR: Vendor C
  • vBCA: Vendor B

Is this a problem? Not necessarily but I would suggest that there are several things to consider including:

1. Having 3 different platforms to design/install/maintain

This means 3 different sets of requirements, constraints, risks and implications need to be considered.

Some large organisations may not consider this a problem, because they have a team for each area, but isn’t the fact the customer has to have multiple teams to manage infrastructure a problem in itself? Sounds like a significant (and potentially unnecessary) OPEX to me.

2. The best BC/DR solution does not meet the minimum requirements for the vBCA workloads.

In this example, the best BC/DR solution (Vendor C) is also the lowest rated for vBCA. As a result, Vendor C is not suitable for vBCA which means it should not be considered for BC/DR of vBCA. If Vendor C was used for BC/DR of the other workloads, then another product would need to be used for vBCA adding further cost/complexity to the environment.

3. Vendor A is the strongest at Long Term Archive, but has no interoperability with Vendor B and C

Due to the lack of interoperability, while Vendor A has the strongest archiving solution, it is not suitable for this environment. In this example, the difference between the strongest and the weakest Long Term Archive solution is very small, so Vendor B and C also meet the customer's requirements.

4. Multiple silos of infrastructure may lead to inefficient use.

Just like in the days before virtualization, when the bulk of our servers' CPU/RAM ran at low utilization levels, our storage capacity was carved up with lots of free space in one RAID pack but very little in others, and we spent lots of time migrating workloads from LUN 1 to LUN 2 to free up capacity for new or existing workloads.

If we have 3 solutions, we may have many TB of available capacity in the VDI environment but be unable to share it with the Long Term Archive environment, or we may have lots of spare compute in VDI and be unable to share it with vBCA.

Now getting back to the graph, the below is the raw data.

[Image: Raw data table for the vendor comparison]

What we can see is:

  • Vendor B has the highest total (17.1)
  • Vendor A has the second highest total (14.8)
  • Vendor C has the lowest total (12)
  • Vendor C failed to meet the minimum requirements for VDI & vBCA
  • Vendor A and B met the minimum requirements for all areas
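
To make the selection logic concrete, the sketch below scores each vendor against the minimum requirement of 3 and totals the ratings. The per-category numbers are illustrative placeholders chosen to be consistent with the totals and ratings quoted in this post, not the actual values from the graph:

```python
# Illustrative per-category scores only; chosen to match the totals quoted above,
# not taken from the actual graph.
MIN_SCORE = 3.0  # customer-defined minimum for a solution to be considered

scores = {
    "Vendor A": {"VDI": 3.2, "vBCA": 3.2, "Long Term Archive": 4.0, "BC/DR": 4.4},
    "Vendor B": {"VDI": 4.8, "vBCA": 4.6, "Long Term Archive": 3.5, "BC/DR": 4.2},
    "Vendor C": {"VDI": 2.0, "vBCA": 1.7, "Long Term Archive": 3.5, "BC/DR": 4.8},
}

# Check each vendor against the minimum requirement and total the ratings
for vendor, ratings in scores.items():
    total = sum(ratings.values())
    failed = [cat for cat, score in ratings.items() if score < MIN_SCORE]
    status = "meets all minimums" if not failed else f"fails minimum for {', '.join(failed)}"
    print(f"{vendor}: total {total:.1f}, {status}")

# "Best of breed" per workload, for comparison with a single standardized choice
for category in ["VDI", "vBCA", "Long Term Archive", "BC/DR"]:
    best = max(scores, key=lambda v: scores[v][category])
    print(f"Best of breed for {category}: {best} ({scores[best][category]})")
```

The point of the exercise is not the exact numbers but the two tests: does a vendor meet the minimum for every workload it will host, and how much do you really gain by adding a second or third platform?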

Let’s consider the impact of choosing Vendor B for all 4 workload types.

VDI – It was the highest rated, met the minimum requirements for the customer and is best of breed, so in this case Vendor B would be a solid choice.

vBCA – Again Vendor B was the highest rated, met the minimum requirements for the customer and is best of breed, so Vendor B would be a solid choice.

Long Term Archiving: Vendor B was equal last, but importantly met the customer requirements. Vendor A’s solution may have more features and higher performance, but as Vendor B met the requirements, the additional features and/or performance of Vendor A are not required. The difference between Vendor A (Best of Breed) and Vendor B was also minimal (0.5 rating difference) so Vendor B is again a solid choice.

BC/DR: Vendor B was the lowest rated solution for BC/DR, but again focusing on the customer's requirements, the solution comfortably exceeded the minimum requirement of 3 with a rating of 4.2. Choosing Vendor B meets the requirements and likely avoids interoperability and/or support issues, meaning a simpler overall solution.

Let's think about some of the advantages for a customer of choosing a standard platform for all workloads, in the event one platform meets all requirements.

1. Lower Risk

Having a standard platform minimizes the chance of interoperability and support issues.

2. Eliminating Silos

As long as you can ensure performance meets requirements for all workloads (which can be difficult on centralized SAN/NAS deployments), using a standard platform will likely lead to better utilization and a higher return on investment (ROI).

3. Reduced complexity / Single Pane of Glass Management

Having one platform means not having to have SMEs in multiple technologies, or, in larger organisations, multiple SMEs per technology (for redundancy and/or workload coverage), meaning reduced complexity, lower operational costs and possibly centralized management.

4. Lower CAPEX

This will largely depend on the vendor and the quantity of infrastructure purchased; however, many customers I have worked with have secured excellent pricing from a vendor as a result of standardizing.

Summary:

I am in no way saying “One size fits all” or that “every problem is a Nail” and recommending you buy a hammer. What I am saying is when considering infrastructure for your environment (or your customers), avoid tunnel vision and consider the other workloads or existing infrastructure in the environment.

In many cases the “Best of Breed” solution is not required and in fact implementing that solution may have significant implications in other areas of the environment.

In other cases, workloads may be so mission critical that a best of breed solution is the only way to meet the business requirements, in which case using a standard platform that does not meet those requirements would not be advised.

However, if you can meet all the customer requirements with a standard platform while working within constraints such as budget, power, cooling, rack space and time to value, then I would suggest you're doing yourself (or your customer) a disservice by not considering a standard platform for your workloads.

Related Articles:

1. Enterprise Architecture & Avoiding tunnel vision.

2. Peak Performance vs Real World Performance

Data Centre Migration Strategies – Part 2 – Lift and Shift

Continuing on from Data Centre Migration Strategies Part 1 – Overview, Part 2 focuses on the “Lift and Shift” method.

I'm sure you're reading this and already thinking, "this is the least interesting migration strategy, tell me about vMSC and SRM!" and, well, you're right. BUT it is important to understand the pros and cons so that if you are ever in a situation where you have to use this method (I have been on numerous occasions), the migration is successful.

So what are the pros and cons of this method?

Pros

1. No need to purchase equipment for the new data centre
2. The environment should perform as it did at the original data centre following relocation
3. The approach is simple from a technical perspective, i.e. no new products are required
4. Low direct cost (Note: Point 8 in Cons)
5. Achieves a Recovery Point Objective (RPO) of zero (0).

Cons

1. The entire environment needs to be fully shut down
2. The outage for the environment starts when the servers are shut down and lasts until completion of operational verification testing at the new datacenter. Note: This may take several days depending on the size of the environment.
3. This method is high risk, as the ability to fail back to the original datacenter requires all equipment to be physically relocated back. This means the Recovery Time Objective (RTO) cannot be low.
4. The lift and shift method cannot be tested until at least a significant amount of equipment has been physically relocated
5. In the event of an issue during operational verification at the new data centre, a decision needs to be made either to proceed and troubleshoot the issues, or to fail back (and at what point).
6. Depending on your environment, a vendor (e.g. storage) may need to revalidate your environment
7. Your migration (and schedule) is heavily dependent on the logistics of the relocation, which involve factors outside your control (e.g. traffic, weather) and may lead to delays or a failed migration.
8. Potentially high indirect costs, e.g. downtime, loss of business, lost productivity etc.

When to use this method?

1. When purchasing equipment for the new data centre is not possible
2. When extended outages to the environment are acceptable
3. When you have no other options

Recommendations when using “Lift and Shift”

1. Ensure you have accurate wiring and rack diagrams of your datacenter
2. Be prepared with your vendor support contact details on hand, as hardware failures are common following relocation of equipment
3. Ensure you have an accurate Operational Verification document which tests every part of your environment from Layer 1 (Physical) all the way to Layer 7 (Application) (see the sketch after this list)
4. Label EVERYTHING as you disconnect it at the original datacenter
5. Prior to starting your data centre migration, discuss and agree on a timeline for the migration, and at what point and under what circumstances you would initiate a fail back.
6. Migrate the minimum amount of physical equipment that is required to get your environment back on-line and do your Operational Verification, then on successful completion of your Operational Verification migrate the remaining equipment. This allows for faster fail-back in the event Operational Verification fails.
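
To illustrate recommendation 3, below is a minimal, hypothetical Python sketch of how an operational verification checklist covering Layer 1 through Layer 7 could be structured and tracked. The layers and checks shown are placeholders only; your actual document should reflect your own environment:

```python
# Hypothetical structure for an operational verification checklist; the specific
# checks are examples only and every environment will need its own.
VERIFICATION_CHECKLIST = [
    ("Layer 1 - Physical",    ["All cables reconnected as per labels and wiring diagrams",
                               "No fault LEDs on any chassis, switch or drive"]),
    ("Layer 2/3 - Network",   ["All uplinks and port-channels up",
                               "Gateways and inter-site routing reachable"]),
    ("Storage",               ["All controllers/nodes online",
                               "No degraded RAID packs or rebuilds in progress"]),
    ("Hypervisor",            ["All hosts joined to the cluster",
                               "HA/DRS (or equivalent) reporting healthy"]),
    ("Layer 7 - Application", ["Application services started",
                               "End-user login and smoke tests pass"]),
]

def run_verification(results):
    """Walk the checklist, report each check and return overall pass/fail."""
    all_passed = True
    for layer, checks in VERIFICATION_CHECKLIST:
        for check in checks:
            passed = results.get(check, False)  # anything not recorded counts as a fail
            all_passed = all_passed and passed
            print(f"[{'PASS' if passed else 'FAIL'}] {layer}: {check}")
    return all_passed

if __name__ == "__main__":
    # Example run: mark every check as passed (replace with real results during the migration)
    print("Overall:", run_verification(
        {c: True for _, checks in VERIFICATION_CHECKLIST for c in checks}))
```

However you record it (spreadsheet, runbook or script), the key is that the pass/fail state of every layer is captured before you decide to continue the migration or fail back.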

In Part 3, we discuss Data centre migrations using VMware Site Recovery Manager. (Coming soon)