The truth about Storage Data efficiency ratios.

We’ve all heard the marketing claims from some storage vendors about how efficient their storage products are. Data efficiency ratios of 40:1 , 60:1 even 100:1 continue to be thrown around as if they are amazing, somehow unique or achieved as a result of proprietary hardware.

Let’s talk about how vendors may try to justify these crazy ratios:

For many years, Storage vendors have been able to take space efficient copies of LUNs, Datastores, Virtual Machines etc which rely on snapshots or metadata. These are not full copies and reporting this as data efficiency is quite mis-leading in my opinion as this is and has been for many years Table stakes.

Be wary of vendors encouraging (or requiring) you configure more frequent “backups” (which are after all just Snapshots or metadata copies) to achieve the advertised data efficiencies.

  • Reporting VAAI/VCAI clones as full copies

If I have a VMware Horizon View environment, It makes sense to use VAAI/VCAI space efficient clones as they provide numerous benefits including faster provisioning, recompose and use less space which leads to them being served from cache (making performance better).

So if I have an environment with just 100 desktops deployed via VCAI, You have a 100:1 data reduction ratio, 1000 desktops and you have 1000:1. But this is again Table stakes… well sort of because some vendors don’t support VAAI/VCAI and others only have partial support as I discuss in Not all VAAI-NAS storage solutions are created equal.

Funnily enough, one vendor even offloads what VAAI/VCAI can do (with almost no overhead I might add) to proprietary hardware. Either way, while VAAI/VCAI clones are fantastic and can add lots of value, claiming high data efficiency ratios as a result is again mis-leading especially if done so in the context of being a unique capability.

  • Compression of Highly compressible data

Some data, such as Logs or text files are highly compressible, so ratios of >10:1 for this type of data are not uncommon or unrealistic. However consider than if logs only use a few GB of storage, then 10:1 isn’t really saving you that much space (or money).

For example a 100:1 data reduction ratio of 100MB of logs is only saving you ~10GB which is good, but not exactly something to make a purchasing decision on.

Also compression of databases which lots of white space also compress very well, so the larger the Initial size of the DB, the more it will compress.

The compression technology used by storage vendors is not vastly different, which means for the same data, they will all achieve a similar reduction ratio. As much as I’d love to tell you Nutanix has much better ratios than Vendors X,Y and Z, its just not true, so I’m not going to lie to you and say otherwise.

  • Deduplication of Data which is deliberately duplicated

An example of this would be MS Exchange Database Availability Groups (DAGs). Exchange creates multiple copies of data across multiple physical or virtual servers to provide application and storage level availability.

Deduplication of this is not difficult, and can be achieved (if indeed you want to dedupe it) by any number of vendors.

In a distributed environment such as HCI, you wouldn’t want to deduplicate this data as it would force VMs across the cluster to remotely access more data over the network which is not what HCI is all about.

In a centralised SAN/NAS solution, deduplication makes more sense than for HCI, but still, when an application is creating the duplicate data deliberately, it may be a good idea to exclude it from being deduplicated.

As with compression, for the same data, most vendors will achieve a similar ratio so again this is table stakes no matter how each vendor tries to differentiate. Some vendors dedupe at more granular levels than others, but this provides diminishing returns and increased overheads, so more granular isn’t always going to deliver a better business outcome.

  • Claiming Thin Provisioning as data efficiency

If you have a Thin Provisioned 1TB virtual disk and you only write 50GB to the disk, you would have a data efficiency ratio of 20:1. So the larger you create your virtual disk and the less data you write to it, the better the ratio will be. Pretty silly in my opinion as Thin Provisioning is nothing new and this is just another deceptive way to artificially improve data efficiency ratios.

  • Claiming removal of zeros as data reduction

For example, if you create an Eager Zero Thick VMDK, then use only a fraction, as with the Thin Provisioning example (above), removal of zeros will obviously give a really high data reduction ratio.

However Intelegent storage doesn’t need Eager Zero Thick (EZT) VMDKs to give optimal performance nor will they write zeros to begin with. Intelligent storage will simply store metadata instead of a ton of worthless zeros. So a data reduction ratio from a more intelligent storage solution would be much lower than a vendor who has less intelligence and has to remove zeros. This is yet another reason why data efficiency (marketing) numbers have minimal value.

Two of the limited use cases for EZT VMDKs is Fault Tolerance (who uses that anyway) and Oracle RAC, so removal of zeros with intelligent storage is essentially moot.

Summary:

Data reduction technologies have value, but they have been around for a number of years so if you compare two modern storage products, you are unlikely to see any significant difference between vendor A and B (or C,D,E,F and G).

The major advantage of data reduction is apparent when comparing new products with 5+ year old technology. If you are in this situation where you have very old tech, most newer products will give you a vast improvement, it’s not unique to just one vendor.

At the end of the day, there are numerous factors which influence what data efficiency ratio can be achieved by a storage product. When comparing between vendors, if done in a fair manner, the differences are unlikely to be significant enough to sway a purchasing decision as most modern storage platforms have more than adequate data reduction capabilities.

Beware: Dishonest and mis-leading marketing about data reduction is common so don’t get caught up in a long winded conversations about data efficiency or be tricked into thinking one vendor is amazing and unique in this area, it just isn’t the case.

Data reduction is table stakes and really shouldn’t be the focus of a storage or HCI purchasing decision.

My recommendation is focus on areas which deliver operational simplicity, removes complexity/dependancies within the datacenter and achieve real business outcomes.

Related Posts:

1. Sizing infrastructure based on vendor Data Reduction assumptions – Part 1

2. Sizing infrastructure based on vendor Data Reduction assumptions – Part 2

3.Deduplication ratios – What should be included in the reported ratio?

Reminder: Copies of data on the same Primary Storage is not a backup solution.

I find it difficult to understand how any Account Manager, Sales Engineer or Consultant can go to a customer, who is at least in part trusting their statements & opinions when considering new product/s and make claims that a product is performing a “backup” function when the data remains on the same primary storage system (failure domain).

Most vendors have metadata or snapshot based options which allow space efficient recovery points to be maintained on primary storage for fast recovery and any vendor worth talking to will also tell you that until a FULL COPY of the data is maintained off the primary storage, it is NOT a backup.

Some vendors will play games and try and differentiate and say they don’t use snapshots and they are somehow amazing and unique. In reality, they can say whatever they like, but if the end result is the data is only maintained on primary storage, then its not a backup and you should not treat it like one.

In the old days, it was fairly common to have Primary data on one set of LUNs/RAID packs and for customers to keep full copies of data on different LUNs and underlying RAID packs before offloading to tape.

While the copy of data remained on primary storage, it at least meant that in the event the RAID pack/s hosting the primary data failed (e.g.: Double disk failure in a RAID 5) then data could be recovered and if not, then the customer could restore form tape.

As storage became more intelligent, keeping the full copy became less popular in favour of snapshot or metadata based copies. This makes a lot of sense as it reduced the overheads significantly while achieving a business outcome which allows for fast recovery in the event the Primary Storage is not impacted.

However, the requirement for data to be kept off the primary storage remains, as no matter what vendor you choose, its possible to have a catastrophic failure which means the snapshot/metadata copies on primary storage may not be available.

Also promoting that snapshots (or any form of metadata copies pointing to the same underlying blocks) are this amazing new data reduction technology which achieves 60:1 or 100:1 data reduction is misleading at best in my opinion.

So let’s cover off a few things:

Question 1: Are snapshots or metadata copies of data stored on primary storage a backup?

Answer: No

A snapshot or metadata based copies simply makes some data at various levels such as a vDisk, Virtual Machine , LUN , Container etc read only and new writes (commonly referred to as delta changes) are written elsewhere.

The data still resides on the same storage, meaning if data loss occurs (say multiple drive failures or storage system software issue) its possible if not probable that the data being referenced by the snapshot/metadata and delta changes will all be lost (or at least unavailable) in some failure scenarios depending on the vendor.

So having snapshot or metadata based copies on primary storage as a backup without at least one full copy in a seperate failure domain is simply asking for trouble.

Snapshots/metadata copies are only the first step in a backup solution which must ensure data is stored in at least two locations (different failure domains) so that data can be recovered in the event the primary storage is lost/unavailable for any reason.

Question 2: Are snapshots data reduction?

Answer: No

Snapshots and metadata copies don’t reduce data, they simply avoid creating and requiring the storage to store more data than is necessary to keep the point in time (or Recovery Point) copies (not backups) of data.

This is Data avoidance, not data reduction which cover this topic in more depth in a previous post: Deduplication ratios – What should be included in the reported ratio?

Now don’t get me wrong, Data avoidance (e.g.: Snapshots, Intelligent Cloning etc) has real value and its something I would recommend customers leverage wherever possible as it generally reduces the overheads on infrastructure significantly which can help achieve business outcomes like more frequent RPOs or faster deployment/maintenance times for VDI.

However making a claim that a customer has 60:1 or 100:1 data efficiency because they are taking frequent snapshots/metadata copies (which in many cases are unnecessary to meet business objectives) in my opinion is misleading customers and worse still, claiming its unique (as in other vendors cant achieve the same business outcome) is just a flat out lie.

Now I work for Nutanix, so let’s use another Vendor as an example, and one which I have lots of experience with from my years at IBM. Take Netapp (a.k.a IBM N-Series), for many years they have supported taking snapshots which are application consistent (via SnapManager) and keeping them on Primary storage. They as with many other vendors (new and legacy) do it in a way which avoids storing multiple copies of data and they redirect on write all delta changes which can be snapped at the next scheduled interval.

This results in the ability to keep lots of point in time copies without storing data multiple times. You could argue this is a ratio of “Insert crazy number here” :1 but the reality is, if the storage you have wasn’t storing 1:1 copies previously (which only a select few legacy products still do), a new solution doing similar isn’t a big step forward even if it could be argued it’s a bit more efficient.

Netapp allows these snapshots on primary storage to then be replicated to secondary storage (SnapVault) which is a different failure domain, with dedicated controller/s and disks. This allows for recovery of all data in the event the primary storage fails or is unavailable. Netapp also allow offload of snapshots to tape.

Many other vendors have similar functionality (and have for a long time) include but are not limited too: Pure Storage, Nutanix, EMC , Dell , IBM, the list goes on.

This functionality is table stakes… Not something unique to any one vendor or something that requires proprietary hardware to achieve.

Any vendor listed above (and others) can achieve the similar levels of data efficiency (if you want to use that term) if they all perform snapshots or metadata based copies at the same frequency. Each vendors implementations vary and each have pros and cons, but from a business outcome perspective (which is the ONLY thing that matters), its table stakes.

Question 3: What are Snapshots/Metadata copies on Primary storage good for?

Answer: They are good for creating recovery points to help achieve Recovery Point Objectives (RPOs) when combined with replication to secondary storage and/or tape/cloud to cater for site loss scenarios. Keeping snapshots on primary storage helps speed up recovery in the event you need to role back to a previous point in time assuming you have not had a storage failure. e.g.: Recovering a file or DB which was accidentally deleted or was corrupted for whatever reason.

So there is value in snapshots/metadata copies on primary storage, but it should not be considered a backup until it is replicated to another location, ideally offsite in a difference failure domain.

Summary:

Snapshots/Metadata based copies (on primary storage) are just the first step of many in an overall backup strategy. If the data is not replicated to another failure domain, it should not be called or considered a backup.

Marketing Claims of 60:1 or 100:1 data efficiency may sound good, but these sorts of numbers have been and can be achieved by many vendors for a long time. Be very careful when considering new infrastructure not to be mislead by these sorts of marketing claims.

Most vendors don’t market numbers like 60:1 or 100:1 because they understand its table-stakes and misleading for customers, and kudos to those vendors!

Snapshots/Metadata copies regardless of data efficiency ratio are USELESS in the event of a primary storage failure unless a full copy of the data is stored off the primary storage and depending on the business requirements, stored offsite.

I encourage the everyone, especially the industry analysts to help clarify this situation for customers as there is A LOT of mis-information being spread currently which puts customers at risk in the event of primary storage failures.

VADP or Agent Based Backups

In light of ongoing bugs with VMware’s API for Data Protection (VADP), I figured it worth re-visiting the topic of VADP or Agent Based backups.

VADP gives backup products the ability to kick off snapshots and use Changed Block Tracking (CBT) to allow incremental style backups which improve the efficiency of backup solutions by reducing the impact (performance, think storage, network and compute overheads) and duration (backup window).

But the problem is, there has now been several instances of VADP bugs in recent years which has meant incremental backups have lacked integrity due to the changed blocks not being correctly reported.

Here is a list of some of the VADP related issues/bugs:

  1. Backups with Changed Block Tracking can return incorrect changed sectors in ESXi 6.0 (2136854)
  2. Backing up a virtual machine with Changed Block Tracking (CBT) enabled fails after upgrading to or installing VMware ESXi 6.0 (2114076)
  3. Changed Block Tracking (CBT) on virtual machines (1020128)
  4. Enabling or disabling Changed Block Tracking (CBT) on virtual machines(1031873)
  5. Changed Block Tracking is reset after a storage vMotion operation in vSphere 5.x (2048201)
  6. When Changed Block Tracking is enabled in VMware vSphere 5.x, vMotion migration fails with error: The source detected that the destination failed to resume (2086670)
  7. QueryChangedDiskAreas API returns incorrect sectors after extending virtual machine VMDK file with Changed Block Tracking (CBT) enabled (2090639)

From the above (albeit a limited list of VADP related issues) we can see that there are issues related to integrity of VADP CBT as well as operational considerations (limitations) when using CBT, such as not being able to Storage vMotion and having vMotion operations fail.

So while VADP in theory has its advantages, should it be used in production environments?

At this stage I am highlighting the risks associated with using VADP with customers and where required/possible mitigating the issue.

But what about good ol’ agent based backups?

Agent based backups have a bad rap in my opinion mainly because of 3-Tier solutions and the fact backup windows take a long time due to the contention in the storage network, controllers and back end disk.

Now people ask me all the time, how can we do backups on Nutanix? The answer is, you have numerous (very good) options without using VADP (or for non vSphere customers).

Using a product like Commvault, In-Guest Agent’s can be deployed and managed centrally, removing much of the administrative overhead (downside) of agent based backups.

Then by configuring incremental forever backups, Commvault manages the change block tracking (regardless of hypervisor) and can even do source side deduplication and compression before sending the delta’s over the network to the Commvault Media Agent (ie.: The backup server).

Now since all new write I/O is written to Nutanix SSD tier, it is very likely that all changes will still be in the SSD tier when a daily incremental backup is started meaning the delta’s will be quickly read and send over the network. Why is this solving the problems of 3-Tier i discussed earlier, well its thanks to data locality and the fact Nutanix XCP is a highly distributed platform.

Because each Nutanix node has a local storage controller with local SSD, AND critically, Data Locality writes new data to the node where the VM is running, most data (under normal situations) will be read locally (without traversing a NIC/HBA or the storage network). This means there is no impact on other nodes from the backup of VMs on each node.

Due to these factors, the only traffic traversing the IP network to the backup server (Commvault Media Agent in this example), are the delta changes in a compressed and deduplicated format.

So a Commvault Agent Based backup solution on Nutanix XCP, on any hypervisor, avoids the dependancy on hypervisor APIs (which have proven in several cases not to be reliable) and ensures backup windows and the impact of backup jobs is minimal due to intelligent incremental forever style backups running on an intelligent distributed storage fabric.

In-Guest agent based backups may just be making a comeback!

Note: In y experience, Agent based backups typically provide more granularity/flexibility compared to VADP backups, for specifics speak with your preferred backup vendor.

Oh BTW, did I mention Nutanix XCP supports Commvault Intellisnap for storage level snapshots on the Distributed Storage Fabric… again just another option for Nutanix customers wanting to avoid further pain with VADP.