Sizing infrastructure based on vendor Data Reduction assumptions – Part 1

One of the most common mistakes people make when designing solutions is making assumptions. Assumptions in short are things an architect has failed to investigate and/or validate which puts a project at risk of not delivering the desired business outcome/s.

A great example of a really bad assumption to make is what data reduction ratio a storage platform will deliver.

But what if a vendor offers a data reduction guarantee and promises to give you as much equipment required if the ratio is not achieved, you’re protected right? The risk of your assumption being wrong is mitigated with the promise of free storage. Hooray!

Let’s explore this for a minute using an example of one of the more ludicrous guarantees going around the industry at the moment:

A guarantee of 10:1 data reduction!

Let’s say we have 100TB of data, that means we’d only need 10TB right? This might only be say, 4RU of equipment which sounds great!

After deployment, we start migrating and we only get a more realistic 2:1 data reduction, at which point the project stalls due to lack of capacity.

I go back to the vendor and lets say, best case scenario they agree on the spot (HA!) to give you more equipment, its unlikely to be delivered in less than 4 weeks.

So your project is delayed a minimum of 4 weeks until the equipment arrives. You now need to go through your change control process and if you’re doing this properly it would be documented with detailed steps on how to install the equipment, including appropriate back out strategies in the event of issues.

Typically change control takes some time to prepare, go through approvals, documentation etc especially in larger mission critical environments.

When installing any equipment you should also have documented operational verification steps to ensure the equipment has been installed correctly and is highly available, performing as expected etc.

Now that the new equipment is installed, the project continues and all 100TB of your data has been migrated to the new platform. Hooray!

Now let’s talk about the ongoing implications of the assumption of 10:1 data reduction only resulting in a much more realistic 2:1 ratio.

We now have 5x more equipment than we expected, so assuming the original 10TB was 4RU, we would now have 20RU of equipment which is taking up valuable real estate in our datacenter, or which may have required you to lease another rack in your datacenter.

If the product you purchased was a SAN/NAS, you now have lower IOPS/GB as you have just added a bunch more disk shelves to the existing controllers. This is because the controllers have a finite amount of performance, and you’ve just added more drives for it to manage. More drives on a traditional two controller SAN/NAS is only a good thing if the controller is not maxed out, and with flash ever increasing in performance, Controllers will be assuming they are not already the bottleneck.

If the product was HCI, now you require considerably more network interfaces. Depending on the HCI platform, you may require more hypervisor licensing, further increasing CAPEX and OPEX.

Depending on the HCI product, can you even utilise the additional storage without changing the virtual machines configuration? It might sound silly but some products don’t distribute data throughout the cluster, rather having mirrored objects so you may even need to create more virtual disks or distribute the VMs to make use of the new capacity.

Then you need to consider if the HCI product has any scale limitations, as these may require you to redesign your solution.

What about operational expenses? We now have 5x more equipment, so our environmental costs such as power & cooling will increase significantly as will our maintenance windows where we now have to patch 5x more hypervisor nodes in the case of HCI.

Typically customers no longer size for 3-5 years due to the fact HCI is becoming the platform of choice compared to SAN/NAS. This is great but when your data reduction assumption is wrong, (in this example off by 5x) the ongoing impact is enormous.

This means as you scale, you need to scale at 5x the rate you originally designed for. That’s 5x more rack units (RU), 5x more Power, 5x more cooling required, potentially even 5x more hypervisor licensing.

What does all of this mean?

Your Total Cost of Ownership (TCO) and Return on Investment (ROI) goes out the window!

Interestingly, Nutanix recently considered offering a data reduction guarantee and I was one of many who objected and strongly recommended we not drop to the levels of other vendors just because it makes the sales cycle easier.

All of the reasons above and more were put to Nutanix product management and they made the right decision, even though Nutanix data reduction (and avoidance) is very strong, we did not want to put customers in a position where their business outcomes were potentially at risk due to assumptions.

Summary:

While data reduction is a valuable part of a storage platform, the benefits (data reduction ratio) can and do vary significantly between customers and datasets. Making assumptions on data reduction ratios even when vendors provide lots of data showing their averages and providing guarantees, does not protect you from potentially serious problems if the data reduction ratios are not achieved.

In Part 2, I will go through an example of how misleading data reduction guarantees can be.