Sizing infrastructure based on vendor Data Reduction assumptions – Part 1

One of the most common mistakes people make when designing solutions is making assumptions. Assumptions, in short, are things an architect has failed to investigate and/or validate, which puts a project at risk of not delivering the desired business outcome/s.

A great example of a really bad assumption to make is what data reduction ratio a storage platform will deliver.

But what if a vendor offers a data reduction guarantee and promises to provide as much equipment as required if the ratio is not achieved? You’re protected, right? The risk of your assumption being wrong is mitigated by the promise of free storage. Hooray!

Let’s explore this for a minute using an example of one of the more ludicrous guarantees going around the industry at the moment:

A guarantee of 10:1 data reduction!

Let’s say we have 100TB of data; at 10:1 that means we’d only need 10TB of usable capacity, right? That might be only, say, 4RU of equipment, which sounds great!

After deployment, we start migrating and we only get a more realistic 2:1 data reduction, at which point the project stalls due to lack of capacity.
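
To put some rough numbers on it, here is a minimal sizing sanity check using the illustrative figures from this example (100TB of data, an assumed 10:1 ratio and an achieved 2:1 ratio):

```python
# Minimal sizing sanity check using the illustrative figures above.
logical_tb = 100        # data to be migrated
assumed_ratio = 10      # vendor-guaranteed 10:1 reduction
achieved_ratio = 2      # more realistic 2:1 reduction

purchased_tb = logical_tb / assumed_ratio    # 10TB of usable capacity purchased
required_tb = logical_tb / achieved_ratio    # 50TB actually required
shortfall_tb = required_tb - purchased_tb    # 40TB short, hence the stalled project

print(f"Purchased: {purchased_tb:.0f}TB, Required: {required_tb:.0f}TB, Shortfall: {shortfall_tb:.0f}TB")
```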

We go back to the vendor and, let’s say, best case scenario they agree on the spot (HA!) to provide more equipment; it’s still unlikely to be delivered in less than four weeks.

So your project is delayed a minimum of four weeks until the equipment arrives. You now need to go through your change control process, and if you’re doing this properly it will be documented with detailed steps on how to install the equipment, including appropriate back-out strategies in the event of issues.

Change control typically takes time to prepare, gain approvals and document, especially in larger, mission-critical environments.

When installing any equipment you should also have documented operational verification steps to ensure the equipment has been installed correctly, is highly available and is performing as expected.

Now that the new equipment is installed, the project continues and all 100TB of your data has been migrated to the new platform. Hooray!

Now let’s talk about the ongoing implications of the assumption of 10:1 data reduction only resulting in a much more realistic 2:1 ratio.

We now have 5x the equipment we expected, so assuming the original 10TB was 4RU, we now have 20RU of equipment taking up valuable real estate in our datacenter, or which may have forced us to lease another rack.
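
A quick back-of-envelope calculation of the footprint impact, assuming the same illustrative density of 4RU per 10TB of usable capacity:

```python
# Footprint impact of the 5x capacity miss, assuming 4RU per 10TB usable as above.
ru_per_10tb_usable = 4
planned_ru = ru_per_10tb_usable * (10 / 10)   # 4RU as originally sized
actual_ru = ru_per_10tb_usable * (50 / 10)    # 20RU once 2:1 is the reality

print(f"Planned: {planned_ru:.0f}RU, Actual: {actual_ru:.0f}RU "
      f"({actual_ru / planned_ru:.0f}x the rack space)")
```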

If the product you purchased was a SAN/NAS, you now have lower IOPS/GB, as you have just added a bunch more disk shelves to the existing controllers. The controllers have a finite amount of performance, and you’ve just given them more drives to manage. More drives on a traditional two-controller SAN/NAS are only a good thing if the controllers are not maxed out, and with flash performance ever increasing, the controllers will become the bottleneck, assuming they aren’t already.
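
To illustrate the IOPS/GB dilution, here is a minimal sketch. The controller IOPS figure is purely an assumption for the example, not any vendor’s specification:

```python
# Illustrative IOPS/GB dilution when capacity grows but the controller pair does not.
controller_pair_iops = 200_000   # assumed fixed ceiling of the dual controllers
planned_usable_gb = 10 * 1024    # capacity sized at the assumed 10:1 ratio
actual_usable_gb = 50 * 1024     # what a 2:1 reduction actually requires

print(f"Planned: {controller_pair_iops / planned_usable_gb:.1f} IOPS/GB")  # ~19.5
print(f"Actual:  {controller_pair_iops / actual_usable_gb:.1f} IOPS/GB")   # ~3.9
```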

If the product was HCI, you now require considerably more network interfaces. Depending on the HCI platform, you may also require more hypervisor licensing, further increasing CAPEX and OPEX.

Depending on the HCI product, can you even utilise the additional storage without changing the virtual machines’ configuration? It might sound silly, but some products don’t distribute data throughout the cluster, instead using mirrored objects, so you may need to create more virtual disks or redistribute the VMs to make use of the new capacity.

Then you need to consider if the HCI product has any scale limitations, as these may require you to redesign your solution.

What about operational expenses? We now have 5x the equipment, so environmental costs such as power and cooling will increase significantly, as will our maintenance windows, since in the case of HCI we now have 5x the hypervisor nodes to patch.

Customers typically no longer size for 3-5 years up front, now that HCI is becoming the platform of choice over SAN/NAS. This is great, but when your data reduction assumption is wrong (in this example, off by 5x), the ongoing impact is enormous.

This means as you scale, you need to scale at 5x the rate you originally designed for. That’s 5x more rack units (RU), 5x more power, 5x more cooling and potentially even 5x more hypervisor licensing.

What does all of this mean?

Your Total Cost of Ownership (TCO) and Return on Investment (ROI) go out the window!

Interestingly, Nutanix recently considered offering a data reduction guarantee and I was one of many who objected and strongly recommended we not drop to the levels of other vendors just because it makes the sales cycle easier.

All of the reasons above, and more, were put to Nutanix product management, and they made the right decision: even though Nutanix data reduction (and avoidance) is very strong, we did not want to put customers in a position where their business outcomes were potentially at risk due to assumptions.

Summary:

While data reduction is a valuable part of a storage platform, the benefits (data reduction ratios) can and do vary significantly between customers and datasets. Making assumptions about data reduction ratios, even when vendors provide plenty of data showing their averages and offer guarantees, does not protect you from potentially serious problems if those ratios are not achieved.

In Part 2, I will go through an example of how misleading data reduction guarantees can be.

The All-Flash Array (AFA) is Obsolete!

Over the last few years, I’ve had numerous customers ask how Nutanix can support bare metal workloads. Until recently, I haven’t had an answer customers wanted to hear.

As a result, some customers have been stuck using their existing SAN or, worse still, have been forced to go out and buy a new SAN.

Many customers who have wanted to use, or have already deployed, hyperconverged infrastructure (HCI) for all other workloads have been stuck managing an all-flash array silo to service a few bare metal workloads.

In June at .NEXT 2016, Nutanix announced Acropolis Block Services (ABS) which now allows bare metal workloads to be serviced by new or existing Nutanix clusters.

[Image: Acropolis Block Services (ABS) overview]

As Nutanix has both hybrid (SSD+SATA) and all-flash nodes, customers can choose the right node type/s for their workloads and present storage externally for bare metal workloads, while also supporting virtual machines, Acropolis File Services (AFS) and containers.

So why would anyone buy an all-flash array? Let’s discuss a few scenarios.

Scenario 1: Bare metal workloads

Firstly, what applications even need bare metal these days? This is an important question to ask yourself. Challenge the requirement for bare metal: are the justifications still valid, and if so, has anything changed which would now allow the applications to be virtualized? But that is a topic for another post.

If a customer only needs new infrastructure for bare metal workloads, deploying Nutanix and ABS means they can start small and scale as required. This avoids one of the major pitfalls of having to size a monolithic, centralised, dual-controller storage array.

While some AFA vendors can/do allow non-disruptive controller upgrades, it’s still not a very attractive proposition, nor is it quick or easy, and it reduces resiliency during the process as one of the two controllers is offline. Nutanix, on the other hand, performs one-click rolling upgrades, which means the larger the cluster, the lower the impact of an upgrade, as it is performed one node at a time without disruption and without the risk of a subsequent failure taking storage offline.

If the environment will only ever be used for bare metal workloads, no problem. Acropolis Block Services offers all the advantages of an All Flash Array, with far superior flexibility, scalability and simplicity.

Advantages:

  1. Start small and scale granularly as required allowing customers to take advantage of newer CPU/RAM/Flash technologies more frequently
  2. Scale performance and capacity by adding node/s
  3. Scale capacity only with storage-only nodes (which come in all flash)
  4. Automatically scale multi-pathing as the cluster expands
  5. Solution can support future workloads including multiple hypervisors / VMs / file services & containers without creating a silo
  6. You can use hybrid nodes to save cost while delivering all-flash performance for the workloads which require it, using VM flash pinning, which ensures all data is stored in flash and can be specified on a per-disk basis.
  7. The same ability as an all-flash array to add only compute nodes.

Disadvantages:

  1. Your all-flash array vendor reps will hound you.

Scenario 2: Mixed workloads including VMs and bare metal

As with scenario 1, deploying Nutanix and ABS means customers can start small and scale as required. This again avoids the major pitfall of having to size a monolithic, centralised, dual-controller storage array and eliminates the need for separate environments.

Virtual machines can run on compute+storage nodes while bare metal workloads can have storage presented by all nodes within the cluster, including storage-only nodes. For those concerned about (potential but unlikely) noisy neighbour situations, specific nodes can also be dedicated to serving bare metal workloads, while maintaining all the advantages of Nutanix one-click, non-disruptive upgrades.

Advantages:

  1. Start small and scale granularly as required allowing customers to take advantage of newer CPU/RAM/Flash technologies more frequently
  2. Scale performance and capacity by adding node/s
  3. Scale capacity only with storage-only nodes (which also come in all flash)
  4. Automatically scale multi-pathing for bare metal workloads as the cluster expands
  5. Solution can support future workloads including multiple Hypervisors / VMs / file services & containers without creating a silo.

Disadvantages:

  1. Your All-Flash array vendor reps will hound you.

What are the remaining advantages of using an all flash array?

In all seriousness, I can’t think of any but for fun let’s cover a few areas you can expect all-flash array vendors to argue.

Performance

Ah the age old appendage measuring contest. I have written about this topic many times, including in one of my most popular posts “Peak performances vs Real world performance“.

The fact is, every storage product has limits, even all-flash arrays and Nutanix. The major difference is that Nutanix limits are per cluster rather than per dual-controller pair, and Nutanix can continue to scale the number of nodes in a cluster to increase performance. So if ultimate performance is actually required, Nutanix can continue to scale to meet any performance/capacity requirements.
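
As a toy model of this difference, the sketch below assumes roughly linear per-node scaling; the per-node and controller-pair IOPS figures are assumptions for illustration, not benchmarks of any product:

```python
# Illustrative only: scale-out cluster vs fixed dual-controller performance ceiling.
iops_per_node = 50_000            # assumed per-node contribution
dual_controller_ceiling = 300_000 # assumed fixed ceiling of an AFA controller pair

for nodes in (4, 8, 16, 32):
    cluster_iops = nodes * iops_per_node   # grows as nodes are added to the cluster
    print(f"{nodes} nodes: {cluster_iops:,} IOPS (AFA pair fixed at {dual_controller_ceiling:,})")
```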

In fact, with ABS the limit for performance is not even at the cluster layer as multiple clusters can provide storage to the same bare metal server/s while maintaining single pane of glass management through PRISM Central.

I recently completed some testing where I demonstrated the performance advantage of storage-only nodes for virtual machines, as well as how storage-only nodes improve performance for bare metal servers using Acropolis Block Services; I will be publishing the results in the near future.

Data Reduction

Nutanix has supported deduplication and compression for a long time and introduced Erasure Coding (EC-X) in mid-2015. Each of these technologies is supported when using Acropolis Block Services (ABS).

When comparing data reduction with all-flash array vendors, while the implementation of these technologies varies between vendors, they all achieve similar data reduction ratios when applied to the same dataset.

Beware of vendors who include things like backups in their deduplication or data reduction ratios; this is very misleading, and most vendors have the same capabilities. For more information see: Deduplication ratios – What should be included in the reported ratio?

Cost

Here we should think about the age-old problems with centralized shared storage (like AFAs). Things like choosing the right controllers, and the fact that when you add more capacity you’re not (or at least rarely) scaling the controller/s at the same time, come to mind immediately.

With Nutanix and Acropolis Block Services you can start your All Flash solution with three nodes which means a low capital expenditure (CAPEX) and then scale either linearly (with the same node types) or non-linearly (with mixed types or storage only nodes) as you need to without having to rip and replace (e.g.: SAN controller head swaps).

Starting small and scaling as required also allows you to take advantage of newer technologies such as newer Intel chipsets and NVMe/3D XPoint to get better value for your money.

Starting small and scaling as required also minimizes, if not eliminates, the risk of oversizing and avoids unnecessary operational expenses (OPEX) such as rack space, power and cooling. It also reduces supporting infrastructure requirements such as networking.

Summary:

As shown below, the Nutanix Acropolis Distributed Storage Fabric (ADSF) can support almost any workload, from VDI to mixed server workloads, file, block, big data, business critical applications such as SAP / Oracle / Exchange / SQL, and bare metal workloads, without creating silos with point solutions.

[Image: Nutanix single fabric supporting all workloads]

In addition to supporting all these workloads, the scalability of Nutanix ADSF, from both a capacity/performance and a resiliency perspective, ensures customers can start small and scale when required to meet their exact business needs without the guesswork.

With these capabilities, the All-Flash array is obsolete.

I encourage everyone to share (constructively) your thoughts in the comments section.

Note: You must sign in to comment using WordPress, Facebook, LinkedIn or Twitter, as anonymous comments will not be approved.

Related Articles:

  1. Things to consider when choosing infrastructure.

  2. Scale out performance testing with Nutanix Storage Only Nodes

  3. What’s .NEXT 2016 – Acropolis Block Services (ABS)

  4. Scale out performance testing of bare metal workloads on Acropolis Block Services (Coming soon)

  5. What’s .NEXT 2016 – Any node can be storage only

  6. What’s .NEXT 2016 – All Flash Everywhere!

The truth about Storage Data efficiency ratios.

We’ve all heard the marketing claims from some storage vendors about how efficient their storage products are. Data efficiency ratios of 40:1 , 60:1 even 100:1 continue to be thrown around as if they are amazing, somehow unique or achieved as a result of proprietary hardware.

Let’s talk about how vendors may try to justify these crazy ratios:

  • Counting snapshots or metadata copies as data efficiency

For many years, storage vendors have been able to take space-efficient copies of LUNs, datastores, virtual machines etc. which rely on snapshots or metadata. These are not full copies, and reporting them as data efficiency is quite misleading in my opinion, as this has been table stakes for many years.

Be wary of vendors encouraging (or requiring) you to configure more frequent “backups” (which are, after all, just snapshots or metadata copies) to achieve the advertised data efficiencies.

  • Reporting VAAI/VCAI clones as full copies

If I have a VMware Horizon View environment, it makes sense to use VAAI/VCAI space-efficient clones as they provide numerous benefits, including faster provisioning and recompose operations, and they use less space, which leads to them being served from cache (making performance better).

So if I have an environment with just 100 desktops deployed via VCAI, I have a 100:1 data reduction ratio; with 1000 desktops, 1000:1. But this is again table stakes… well, sort of, because some vendors don’t support VAAI/VCAI and others have only partial support, as I discuss in Not all VAAI-NAS storage solutions are created equal.
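
As a rough sketch of how this mechanism inflates a reported ratio (all figures below are illustrative assumptions, not measurements from any product):

```python
# How space-efficient (VAAI/VCAI) clones inflate a reported efficiency ratio.
base_image_gb = 40          # assumed golden image size
desktops = 100              # number of clones
delta_gb_per_desktop = 0.4  # assumed per-clone writes since provisioning

logical_gb = desktops * base_image_gb                         # what full copies would consume
physical_gb = base_image_gb + desktops * delta_gb_per_desktop # what is actually stored
print(f"Reported ratio ≈ {logical_gb / physical_gb:.0f}:1")   # ~50:1 here; approaches 100:1 as deltas shrink
```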

Funnily enough, one vendor even offloads what VAAI/VCAI can do (with almost no overhead, I might add) to proprietary hardware. Either way, while VAAI/VCAI clones are fantastic and can add lots of value, claiming high data efficiency ratios as a result is again misleading, especially if done in the context of being a unique capability.

  • Compression of Highly compressible data

Some data, such as logs or text files, is highly compressible, so ratios of >10:1 for this type of data are not uncommon or unrealistic. However, consider that if logs only use a few GB of storage, then even 10:1 isn’t really saving you much space (or money).

For example, a 100:1 data reduction ratio on 10GB of logs is only saving you ~9.9GB, which is good, but not exactly something to make a purchasing decision on.
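
A quick back-of-envelope calculation makes the point, using the illustrative figures above:

```python
# Savings from compression: the ratio matters far less than the size of the dataset.
def savings_gb(logical_gb, ratio):
    """Capacity saved when logical_gb of data compresses at ratio:1."""
    return logical_gb - logical_gb / ratio

print(f"{savings_gb(10, 100):.1f}GB saved")   # 10GB of logs at 100:1 -> 9.9GB saved
print(f"{savings_gb(10, 10):.1f}GB saved")    # the same logs at 10:1 -> 9.0GB saved
```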

Databases with lots of white space also compress very well, so the larger the initial size of the DB, the more it will compress.

The compression technology used by storage vendors is not vastly different, which means for the same data they will all achieve a similar reduction ratio. As much as I’d love to tell you Nutanix has much better ratios than vendors X, Y and Z, it’s just not true, so I’m not going to lie to you and say otherwise.

  • Deduplication of Data which is deliberately duplicated

An example of this would be MS Exchange Database Availability Groups (DAGs). Exchange creates multiple copies of data across multiple physical or virtual servers to provide application and storage level availability.

Deduplication of this is not difficult, and can be achieved (if indeed you want to dedupe it) by any number of vendors.

In a distributed environment such as HCI, you wouldn’t want to deduplicate this data as it would force VMs across the cluster to remotely access more data over the network which is not what HCI is all about.

In a centralised SAN/NAS solution, deduplication makes more sense than for HCI, but still, when an application is creating the duplicate data deliberately, it may be a good idea to exclude it from being deduplicated.

As with compression, for the same data, most vendors will achieve a similar ratio so again this is table stakes no matter how each vendor tries to differentiate. Some vendors dedupe at more granular levels than others, but this provides diminishing returns and increased overheads, so more granular isn’t always going to deliver a better business outcome.

  • Claiming Thin Provisioning as data efficiency

If you have a Thin Provisioned 1TB virtual disk and you only write 50GB to the disk, you would have a data efficiency ratio of 20:1. So the larger you create your virtual disk and the less data you write to it, the better the ratio will be. Pretty silly in my opinion as Thin Provisioning is nothing new and this is just another deceptive way to artificially improve data efficiency ratios.
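
A trivial sketch of how this plays out, using the same figures as the example above:

```python
# Thin provisioning presented as "data efficiency": nothing is actually reduced.
provisioned_gb = 1024   # 1TB thin-provisioned virtual disk
written_gb = 50         # data actually written by the guest

print(f"Claimed efficiency: {provisioned_gb / written_gb:.0f}:1")  # ~20:1 without saving a byte
```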

  • Claiming removal of zeros as data reduction

For example, if you create an Eager Zero Thick VMDK, then use only a fraction, as with the Thin Provisioning example (above), removal of zeros will obviously give a really high data reduction ratio.

However, intelligent storage doesn’t need Eager Zero Thick (EZT) VMDKs to deliver optimal performance, nor will it write the zeros to begin with. Intelligent storage will simply store metadata instead of a ton of worthless zeros. So the data reduction ratio from a more intelligent storage solution will be much lower than from a vendor with less intelligence who has to remove zeros. This is yet another reason why data efficiency (marketing) numbers have minimal value.

Two of the limited use cases for EZT VMDKs are Fault Tolerance (who uses that anyway?) and Oracle RAC, so removal of zeros is essentially a moot point for intelligent storage.

Summary:

Data reduction technologies have value, but they have been around for a number of years so if you compare two modern storage products, you are unlikely to see any significant difference between vendor A and B (or C,D,E,F and G).

The major advantage of data reduction is apparent when comparing new products with 5+ year old technology. If you are in this situation with very old tech, most newer products will give you a vast improvement; it’s not unique to just one vendor.

At the end of the day, there are numerous factors which influence what data efficiency ratio can be achieved by a storage product. When comparing between vendors, if done in a fair manner, the differences are unlikely to be significant enough to sway a purchasing decision as most modern storage platforms have more than adequate data reduction capabilities.

Beware: dishonest and misleading marketing about data reduction is common, so don’t get caught up in long-winded conversations about data efficiency or be tricked into thinking one vendor is amazing and unique in this area; it just isn’t the case.

Data reduction is table stakes and really shouldn’t be the focus of a storage or HCI purchasing decision.

My recommendation is to focus on areas which deliver operational simplicity, remove complexity/dependencies within the datacenter and achieve real business outcomes.

Related Posts:

1. Sizing infrastructure based on vendor Data Reduction assumptions – Part 1

2. Sizing infrastructure based on vendor Data Reduction assumptions – Part 2

3. Deduplication ratios – What should be included in the reported ratio?