Sizing infrastructure based on vendor Data Reduction assumptions – Part 2

In part 1, we discussed how data reduction ratios can, and do, vary significantly between customers and datasets and that making assumptions on data reduction ratios, even when vendors provide guarantees, does not protect you from potentially serious problems if the data reduction ratios are not achieved.

In Part 2 we will go through an example of how misleading data reduction guarantees can be.

One HCI manufacturer provides a guarantee promising 10:1 which sounds too good to be true, and that’s because it, quite frankly, isn’t true. The guarantee includes a significant caveat for the 10:1 data reduction:

The savings/efficiency are based on the assumption that you configure a backup policy to take at least one <redacted> backup per day of every virtual machine on every<redacted> system in a given VMware Datacenter with those backups retained for 30 days.

I have a number of issues with this limitation including:

  1. The use of the word “backup” referring directly/indirectly to data reduction (savings)
  2. The use of the word “backup” when referring to metadata copies within the same system
  3. No actual deduplication or compression is required to achieve the 10:1 data reduction because metadata copies (or what the vendor incorrectly calls “backups”) are counted towards deduplication.

It is important to note, I am not aware of any other vendor who makes the claim that metadata copies ( Snapshots / Point in time copies / Recovery points etc.) are deduplication. They simply are not.

I have previously written about what should be counted in deduplication ratios, and I encourage you to review this post and share your thoughts as it is still a hot topic and one where customers are being oversold/mislead regularly in my experience.

Now let’s do the math on my claim that no actual deduplication or compression is required to achieve the 10:1 ratio.

Let’s use a single 1 TB VM as a simple example. Note: The size doesn’t matter for the calculation.

Take 1 “backup” (even though we all know this is not a backup!!) per day for 30 days and count each copy as if it was a full backup to disk, Data logically stored now equals 31TB  (1 TB + 30 TB).

The actual Size on disk is only a tiny amount of metadata higher than the original 1TB as the metadata copies pointers don’t create any copies of the data which is another reason it’s not a backup.

Then because these metadata copies are counted as deduplication, the vendor reports a data efficiency of 31:1 in its GUI.

Therefore, Effective Capacity Savings = 96.8% (1TB / 31TB = 0.032) which is rigged to be >90% every time.

So the only significant capacity savings which are guaranteed come from “backups” not actual reduction of the customer’s data from capacity saving technologies.

As every modern storage platform I can think of has the capability to create metadata based point in time recovery points, this is not a new or even a unique feature.

So back to our topic, if you’re sizing your infrastructure based on the assumption of the 10:1 data efficiency, you are in for rude shock.

Dig a little deeper into the “guarantee” and we find the following:

It’s the ratio of storage capacity that would have been used on a comparable traditional storage solution to the physical storage that is actually used in the <redacted> hyperconverged infrastructure.  ‘Comparable traditional solutions’ are storage systems that provide VM-level synchronous replication for storage and backup and do not include any deduplication or compression capability.

So if you, for example, had a 5 year old NetApp FAS, and had deduplication and/or compression enabled, the guarantee only applies if you turned those features off, allowed the data to be rehydrated and then compared the results with this vendor’s data reduction ratio.

So to summarize, this “guarantee” lacks integrity because of how misleading it is. It  is worthless to any customer using any form of enterprise storage platform probably in the last 5 –  10 years as the capacity savings from metadata based copies are, and have been, table stakes for many, many years from multiple vendors.

So what guarantee does that vendor provide for actual compression and deduplication of the customers data? The answer is NONE as its all metadata copies or what I like to call “Smoke and Mirrors”.

Summary:

“No one will question your integrity if your integrity is not questionable.” In this case the guarantee and people promoting it have questionable integrity especially when many customers may not be aware of the difference between metadata copies and actual copies of data, and critically when it comes to backups. Many customers don’t (and shouldn’t have too) know the intricacies of data reduction, they just want an outcome and 10:1 data efficiency (saving) sounds to any reasonable person as they need 10x less than I have now… which is clearly not the case with this vendors guarantee or product.

Apart from a few exceptions which will not be applicable for most customers, 10:1 data reduction is way outside the ballpark of what is realistically achievable without using questionable measurement tactics such as counting metadata copies / snapshots / recovery points etc.

In my opinion the delta in the data reduction ratio between all major vendors in the storage industry for the same dataset, is not a significant factor when making a decision on a platform. This is because there are countless other substantially more critical factors to consider. When the topic of data reduction comes up in meetings I go out of my way to ensure the customer understands this and has covered off the other areas like availability, resiliency, recoverability, manageability, security and so on before I, quite frankly waste their time talking about table stakes capability like data reduction.

I encourage all customers to demand nothing less of vendors than honestly and integrity and in the event a vendor promises you something, hold them accountable to deliver the outcome they promised.

NFS Storage and the “Block Dinosaur”

Disclaimer: If you don’t have a sense of humour and/or you just really love block storage, Parental Guidance is recommend.

23-Apr-15 8-42-18 PM

For as long as I can remember it has not been uncommon for I.T “professionals” working in the storage industry or in a storage role to make statements about NFS (Network File System) as if its is a 2nd class citizen in the storage world.

I’ve heard any number of statements such as:

  • NFS is slow(er) than block storage
  • NFS (datastores) don’t honour all SCSI commands
  • NFS is not scalable
  • NFS uses significantly more CPU than block storage
  • NFS does not support <insert your favourite technology here>

People making these statements are known as “Block Dinosaurs

The definition of “Block Dinosaur” is as follows:

“Block Dinosaur”

 Pronounced: [blok] – [dahy-nuh-sawr]

Examples

noun
  1. a homo sapien becoming less common in the wild since the widespread use of NFS with vSphere and Hyper-Converged solutions
  2. a species soon to be extinct, of which attempts to spread Fear Uncertainty and Doubt (FUD) about the capabilities of NFS storage
  3. someone that provides storage which is unwieldy in size, inflexible and requires an outdated technologies such as “LUNs” , “Zoning” & “Masking”.
  4. a person unable to adapt to change who continues to attempt to sell outdated equipment: e.g.: The SAN dinosaur recommended an outdated product that was complicated and cost the company millions to install and operate.
  5. a person who does not understand SCSI protocol emulation and/or has performed little/no practical testing of NFS storage in which to have an informed opinion;
  6. a person who drinks from the fire hose of their respective employer or predominately block storage vendor;

Synonyms for “Block Dinosaur”

  1. SAN zombie
  2. Old-School SAN salesman
  3. SAN hugger
Origin of “Block dinosaur”
Believed to have originated in Hopkinton, MA, USA but quickly spread to Santa Clara, California and onto Armonk, NY before going global after frequent “parroting” of anti NAS or NFS statements.
Recent “Block Dinosaur” sightings:
  holb090203_cmyk-735392
The only cool “Block Dinosaurs” are a different species and can only be found at Lego Land.
brickdinosaurlego-dino-legoland--large-msg-12161441386298IMG_1464[activities]Saurus3_414x2
Final (and more serious) Thought:
I hope this post came across as light hearted as its not meant to upset anyone, at the same time, I would really like the ridiculous debate about Block vs File storage be put to bed, its 2015 people, there is much more important things to worry about.
The fact is there are advantages to both block and file storage and reasons where you may use one over another depending on requirements. At the end of the day both can provide enterprise grade storage solutions which provide business outcomes to customers, so there is no need to bash one or the other.

Calculating Actual Usable capacity? It’s not as simple as you might think! – Part 1 SAN/NAS

Calculating the usable capacity for your next SAN/NAS is easy. Work out the number of drives you have, what RAID config your going to use and your done, right?!

Wrong! There are numerous factors which come into play to understand the ACTUAL or TRUE usable capacity of a SAN/NAS solution.

So let’s take an example of a traditional SAN/NAS using RAID and work out how much space we can actually use.

Note this is a simplified and generic example, which will vary from vendor to vendor.

Let’s say a SAN/NAS has 100 x 1TB drives (Note: The type of drive is not important for this example) and has the requirement to support mixed workloads such as MS SQL , MS Exchange and general server workloads.

As per vendor best practices, RAID 10 is used to maximize IOPS for SQL / Oracle and other storage intensive applications, RAID 5 is used for things like MS Exchange and RAID 6 (or DP) is used for general server workloads.

The vendor also recommends one hot spare drive per 2 disk shelves to ensure when drives fail, there are sufficient hot spares available.

So let’s start with 100TB RAW and see where things end up.

1. Deducting hot spare drives

So assuming 14 drives per shelf, that’s 7 drives (or 7TB RAW) dedicated to hot spares.

100TB – 7TB = 93TB

2. RAID Overhead

Let’s assume 20% of our workloads require RAID 10, so 20 drives are used. RAID 10 has a usable capacity of 50% so 20TB – 50% = 10TB

Next let’s say 40% of our workloads use RAID 5, so 40 drives broken up into 5 x RAID 5s each with 8 drives in a 7+1 Parity configuration. Therefore with 5 x RAID 5s volumes we loose 5 drives (5TB RAW) worth of capacity.

The final 40% of our workloads use RAID6 (or DP), so 40 drives broken up into 5 x RAID 6s each with 8 drives in a 6+2 Parity configuration therefore with 5 x RAID 6s we loose 10 drives (10TB RAW) worth of capacity.

93TB – 10TB (RAID 10) – 5TB (RAID5) – 10TB (RAID6) = 68TB remaining

3. Free Space on the platform required to ensure performance

For most traditional storage solutions, the vendors recommend ensuring a specific percentage of free space to ensure performance remains consistent.

For some vendors this is 20% and others say around 30%.

For this example, I will assume best case scenario of 20%.

68TB – 20% (Free space for performance) = 54.4TB

4. Free space per LUN

Vendors typically recommend having between 10-20% free space per LUN to account for unexpected growth, VM level snapshots etc. This makes perfect sense as if a LUN runs out of space, its a bad day for the I.T dept.

For this example, I will assume only 10% free space per LUN but it could easily be 20% further reducing usable capacity.

54.4TB – 10% (Free space per LUN) = 48.96TB

5. Free space per VMDK

As with physical servers, we don’t want our VMs drives running out of capacity, as a result it is common to size VMDKs well above what is strictly required to make capacity management (operational tasks) easier.

I typically see architects recommending upwards of 10-20% free space per VMDK over and above what is required to account for unexpected growth, OS patching etc. This makes perfect sense for the same reason as we have free space per LUN because if space runs out for a VM, it’s another bad day for I.T.

For this example, I will assume only 10% free space per VMDK.

48.96TB – 10% (Free space per VMDK) = 44.064TB

Now where are we at?

So far, the first 5 points are fairly easy to calculate and if you agree or not with the specific examples or percentage deductions, I’d suggest few would disagree these are factors which reduce usable disk space for traditional SAN/NAS deployments.

Next we will look at various factors which further reduce usable capacity. Each of these factors will vary from customer to customer, which further complicates the sizing exersize and results in lower usable capacity than what you may believe.

6. Silos for Performance

In this example, we have assumed only 20% of our drives are configured for high I/O with RAID 10, but in many cases the drives required for performance could be a much higher percentage.

Now to get the IOPS required for these storage intensive applications, its common to see the capacity utilization of the LUNs be much lower than the usable capacity because the storage is IOPS constrained, not capacity.

This leads to Silos of drives with low utilization, where the remaining capacity cannot (or at least should not) be shared with other VMs as this would likely impact the performance of the IO intensive VMs.

So for example, if our RAID 10 LUNs have 50% free space (which I personally have found to be common) then we’re effectively wasting 5TB (50% of the RAID 10s 10TB usable).

44.064TB – 10% (Wasted Capacity for Performance Silos) = 39.65TB

7. Silos of (or Fragmented) Usable Capacity

In this example, we have assumed 40% of our drives are configured for RAID 5 and the remaining 40% for RAID 6 (DP) to suit the different workloads in this environment, as a result we have 2 “Silos” of usable capacity.

In this post I have described 5 x 8 drive RAID 5s and 5 x 8 Drive RAID 6 volumes. The below diagram is an example of what an environment in this configuration may have with regards to free space per LUN.

LUNsFreeSpace

So we can see the average free space per LUN is 20%, but it varies from one LUN having only 5% free space and another having 35%.

In this case, when creating a new VM, or adding or expanding VMDKs for existing VMs, we have a situation where we will need to be careful about where we place a new VMDK from a capacity perspective but keeping in mind performance as well.

Now not all VMs or VMDKs are the same size, so if a new VMDK needs to be 500GB even though the environment may have well in excess of 500GB available, the fact that the free space is fragmented across multiple LUNs means we cannot create the new VMDK without first migrating VMs across the LUNs.

Now Storage DRS can do a reasonable job of this, but that takes time and impacts performance (during the Storage vMotion) and depending on the size of the VMs in the environment may not always be able to solve the issue.

Best case scenario, in my experience is at least 10% of capacity is wasted simply because of the fact the drives are carved up into RAID groups and VMs don’t fit within the inflexible LUNs.

39.65TB – 10% (Wasted Capacity due to de-fragmented free space) = 35.68TB

Usable space so far from 100TB RAW is only 35.68TB or approx 1/3rd!

Other factors which reduce usable capacity?

8. LUN Provisioning Type

In many cases, especially when talking about high performance applications, storage vendors recommend using Thick Provisioned LUNs.

As a result limited or no overcommitment can be achieved which reduces the usable capacity due to the thick provisioning.

It’s anyone’s guess how much space is wasted as a result.

Summary:

From the 100TB RAW factoring in what I believe to be realistic configuration of RAID, the impact of free space requirements, thick provisioning and capacity fragmentation we end up with only 35.68TB usable capacity or approx 1/3rd of the RAW.

Now most vendors provide some form of data reduction such as compression/de-duplication, others recommend some thin provisioning and these may increase the effective capacity, but this example shows its not as simple as you think to size for SAN/NAS storage and the overhead of RAID is only one of the many factors which impact the effective usable capacity.

In Part 2, I will run through a similar example for Nutanix usable capacity.