VADP or Agent Based Backups

In light of ongoing bugs with VMware’s vStorage APIs for Data Protection (VADP), I figured it was worth revisiting the topic of VADP or Agent Based backups.

VADP gives backup products the ability to kick off snapshots and use Changed Block Tracking (CBT) to allow incremental style backups. These improve the efficiency of backup solutions by reducing both the impact (performance: think storage, network and compute overheads) and the duration (backup window) of backup jobs.

But the problem is, there have now been several instances of VADP bugs in recent years which have meant incremental backups lacked integrity because the changed blocks were not correctly reported.
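To make the integrity risk concrete, here is a minimal sketch in Python of a block-level incremental backup. It is a toy model and does not use the VADP API: the block size, the function names and the set of "changed" block numbers are all assumptions made purely for illustration. The point is simply that the backup copies only what the tracker reports, so a missed block silently ends up stale in the restore.

```python
BLOCK_SIZE = 4  # tiny, hypothetical block size to keep the example readable


def split_blocks(data: bytes) -> list[bytes]:
    """Split a 'disk' into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def incremental_backup(disk: bytes, changed: set[int]) -> dict[int, bytes]:
    """Copy only the blocks the tracker reports as changed."""
    blocks = split_blocks(disk)
    return {index: blocks[index] for index in sorted(changed)}


def restore(full_backup: bytes, increment: dict[int, bytes]) -> bytes:
    """Apply an incremental on top of the last full backup."""
    blocks = split_blocks(full_backup)
    for index, block in increment.items():
        blocks[index] = block
    return b"".join(blocks)


full = b"AAAABBBBCCCCDDDD"     # last full backup
current = b"AAAAXXXXCCCCYYYY"  # blocks 1 and 3 have changed since then

# Tracker reports both changed blocks: the restore matches the live disk.
good = restore(full, incremental_backup(current, {1, 3}))

# Tracker misses block 3: the backup "succeeds" but the restore is stale.
bad = restore(full, incremental_backup(current, {1}))
```

In the second case nothing fails at backup time, which is exactly why incorrectly reported changed blocks are so dangerous: the problem only shows up when you restore.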

Here is a list of some of the VADP related issues/bugs:

  1. Backups with Changed Block Tracking can return incorrect changed sectors in ESXi 6.0 (2136854)
  2. Backing up a virtual machine with Changed Block Tracking (CBT) enabled fails after upgrading to or installing VMware ESXi 6.0 (2114076)
  3. Changed Block Tracking (CBT) on virtual machines (1020128)
  4. Enabling or disabling Changed Block Tracking (CBT) on virtual machines (1031873)
  5. Changed Block Tracking is reset after a storage vMotion operation in vSphere 5.x (2048201)
  6. When Changed Block Tracking is enabled in VMware vSphere 5.x, vMotion migration fails with error: The source detected that the destination failed to resume (2086670)
  7. QueryChangedDiskAreas API returns incorrect sectors after extending virtual machine VMDK file with Changed Block Tracking (CBT) enabled (2090639)

From the above (albeit limited) list of VADP related issues, we can see that there are issues related to the integrity of VADP CBT as well as operational considerations (limitations) when using CBT, such as CBT being reset after a Storage vMotion and vMotion operations failing.

So while VADP in theory has its advantages, should it be used in production environments?

At this stage I am highlighting to customers the risks associated with using VADP and, where required and possible, mitigating the issue.

But what about good ol’ agent based backups?

Agent based backups have a bad rap, in my opinion mainly because of 3-Tier solutions and the fact that backup windows take a long time due to contention in the storage network, controllers and back end disk.

Now people ask me all the time, how can we do backups on Nutanix? The answer is, you have numerous (very good) options without using VADP (or for non vSphere customers).

Using a product like Commvault, In-Guest Agents can be deployed and managed centrally, removing much of the administrative overhead (downside) of agent based backups.

Then by configuring incremental forever backups, Commvault manages the change block tracking (regardless of hypervisor) and can even do source side deduplication and compression before sending the deltas over the network to the Commvault Media Agent (i.e. the backup server).
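As a rough illustration of what source side deduplication and compression means, here is a small Python sketch. This is not how Commvault implements it; the chunk size, the function names `prepare_delta` and `receive`, and the in-memory `store` standing in for the backup server are all hypothetical. The idea is that the source hashes each chunk, sends only chunks the backup server has not seen, and compresses those before they hit the network.

```python
import hashlib
import zlib

CHUNK_SIZE = 8  # hypothetical, unrealistically small chunk size


def chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def prepare_delta(data: bytes, known_hashes: set[str]):
    """Runs on the source (the guest).

    Returns (recipe, payload): the recipe is the ordered list of chunk
    hashes, and the payload holds compressed copies of only those chunks
    the backup server does not already have.
    """
    recipe, payload = [], {}
    for chunk in chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        recipe.append(digest)
        if digest not in known_hashes and digest not in payload:
            payload[digest] = zlib.compress(chunk)
    return recipe, payload


def receive(recipe: list[str], payload: dict[str, bytes],
            store: dict[str, bytes]) -> bytes:
    """Runs on the backup server: store new chunks, rebuild the data."""
    for digest, blob in payload.items():
        store[digest] = zlib.decompress(blob)
    return b"".join(store[digest] for digest in recipe)


store = {}  # stands in for the backup server's chunk store

day1 = b"config=1" * 3 + b"logline1"
recipe1, payload1 = prepare_delta(day1, set(store))
restored1 = receive(recipe1, payload1, store)

day2 = day1 + b"logline2"  # only one new chunk since day 1
recipe2, payload2 = prepare_delta(day2, set(store))
restored2 = receive(recipe2, payload2, store)
```

On day 1 only two unique chunks cross the network even though the data contains four, and on day 2 only the single new chunk is sent.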

Now since all new write I/O is written to the Nutanix SSD tier, it is very likely that all changes will still be in the SSD tier when a daily incremental backup is started, meaning the deltas will be quickly read and sent over the network. Why does this solve the problems of 3-Tier I discussed earlier? Well, it’s thanks to data locality and the fact Nutanix XCP is a highly distributed platform.

Because each Nutanix node has a local storage controller with local SSD, AND critically, Data Locality writes new data to the node where the VM is running, most data (under normal situations) will be read locally (without traversing a NIC/HBA or the storage network). This means there is no impact on other nodes from the backup of VMs on each node.

Due to these factors, the only traffic traversing the IP network to the backup server (Commvault Media Agent in this example) is the delta changes in a compressed and deduplicated format.

So a Commvault Agent Based backup solution on Nutanix XCP, on any hypervisor, avoids the dependency on hypervisor APIs (which have proven in several cases not to be reliable) and ensures backup windows and the impact of backup jobs are minimal, thanks to intelligent incremental forever style backups running on an intelligent distributed storage fabric.

In-Guest agent based backups may just be making a comeback!

Note: In my experience, Agent based backups typically provide more granularity/flexibility compared to VADP backups. For specifics, speak with your preferred backup vendor.

Oh BTW, did I mention Nutanix XCP supports Commvault IntelliSnap for storage level snapshots on the Distributed Storage Fabric… again, just another option for Nutanix customers wanting to avoid further pain with VADP.

Data Centre Migration Strategies – Part 1 – Overview

After a recent Twitter discussion, I felt Data Centre migration strategies would make a good blog series to help people understand what the options are, along with the pros and cons of each strategy.

This guide is not intended to be a step by step on how to set up each of these solutions, but a guide to assist you in making the best decision for your environment when considering a data centre migration.

So what are some of the options when migrating virtual machines from one data centre to another?

1. Lift and Shift

Summary: Shut down your environment and physically relocate all the required equipment to the new location.

2. VMware Site Recovery Manager (SRM)

Summary: Using SRM with either Storage Replication Adapters (SRAs) or vSphere Replication (VR) to perform both test and planned migration/s between the data centres.

3. vSphere Metro Storage Cluster (vMSC)

Summary: Using an existing vMSC or by setting up a new vMSC for the migration, vMotion virtual machines between the sites.

4. Stretched vSphere Cluster / Storage vMotion

Summary: Present your storage at one or both sites to ESXi hosts at one or both sites and use vMotion and Storage vMotion to move workloads between sites.

5. Backup & Restore

Summary: Take a full backup of your virtual machines, transport the backup data to a new data centre (physically or by data replication) and restore the backup onto the new environment.

6. Vendor Specific Solutions

Summary: There are countless vendor specific solutions which range from Storage layer, to Application layer and everything in between.

7. Data Replication and re-register VMs into vCenter (or ESXi) inventory

Summary: The poor man’s SRM solution. Set up data replication at the storage layer and, manually or via scripts, re-register VMs into the inventory of vCenter, or of ESXi for sites with no vCenter.

Each of the above topics will be discussed in detail over the coming weeks, so stay tuned. If you work for a vendor with a specific solution you would like featured, please leave a comment and I will get back to you.

Example Architectural Decision – Storage DRS Configuration for VMFS Datastores in a vCloud Environment

Problem Statement

In a production, self service vCloud Director environment, what is the most suitable Storage DRS configuration to improve storage utilization and performance, as well as reduce administrative effort for BAU staff?

Requirements

1. Make the most efficient use of the available storage capacity
2. Maintain consistent level of storage performance
3. Reduce the risk and overhead of capacity management
4. Reduce the risk of an unintentional (or otherwise) DoS event caused by self service

Assumptions

1. vSphere 5.0 or later
2. VMFS 5 Datastores which are Thick Provisioned
3. Deduplication is not in use
4. VAAI is supported by the array and enabled across the vSphere environment
5. All datastores in each respective datastore cluster reside on the same RAID type with similar spindle types and counts
6. All datastores are presented to all hosts within the cluster
7. Array level snapshots are not in use
8. IBM SVC Storage is being used
9. vCloud Director 5.1 or later
10. Storage I/O Control is enabled and set to 30ms

Constraints

1. IBM SVC storage does not currently support VASA (VMware API for Storage Awareness)

Motivation

1. Ensure production storage performance is not negatively impacted
2. Minimize the vSphere administrator’s workload where possible

Architectural Decision

Set the Storage DRS automation setting to “Fully Automated”

  • Set “Utilized Space” threshold to 80%
  • Set “I/O latency” to 15ms
  • I/O Metric Inclusion – Enabled

Advanced Options

  • No recommendations until utilization difference between source and destination is: 10%
  • Evaluate I/O load every 8 Hours
  • I/O Imbalance threshold: 4
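The settings above can be captured as data, which makes it easy to reason about how the two space related values interact. The following Python sketch is a simplified model only and is not VMware’s actual Storage DRS algorithm: the class name, field names and the `space_move_worthwhile` function are hypothetical. It assumes a space driven move is only worth recommending when the source datastore is over the “Utilized Space” threshold and the destination is less utilized by at least the configured difference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SdrsSettings:
    """The decision's values; the field names are made up for this sketch."""
    automation: str = "Fully Automated"
    utilized_space_threshold_pct: float = 80.0
    io_latency_threshold_ms: float = 15.0
    io_metric_inclusion: bool = True
    min_utilization_difference_pct: float = 10.0
    io_evaluation_interval_hours: int = 8
    io_imbalance_threshold: int = 4


def space_move_worthwhile(source_used_pct: float,
                          dest_used_pct: float,
                          settings: SdrsSettings = SdrsSettings()) -> bool:
    """Simplified check: is a space driven move worth recommending?"""
    over_threshold = source_used_pct > settings.utilized_space_threshold_pct
    enough_difference = (
        source_used_pct - dest_used_pct
        >= settings.min_utilization_difference_pct
    )
    return over_threshold and enough_difference


print(space_move_worthwhile(85, 70))  # over 80% and 15% apart
print(space_move_worthwhile(85, 80))  # over 80% but only 5% apart
print(space_move_worthwhile(75, 40))  # big difference but under 80%
```

Only the first case would qualify in this model. The second shows the utilization difference setting doing its job: a move between two almost equally full datastores achieves very little.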

Justification

1. Setting Storage DRS to “Fully Automated” ensures that the administrator does not need to be concerned with initial placement of virtual machines as this will be dynamically and intelligently determined and executed.

2. “XCOPY” is fully supported for Block based storage, as such, any Storage vMotion activity is offloaded to the array therefore removing the I/O overhead on the compute and storage fabric.

3. Where a significant I/O imbalance is detected by SDRS, the vSphere administrator is not required to take any action; Storage DRS can automatically identify and remediate issues which fall outside the parameters determined by the VMware Architect. This improves the efficiency of the environment, and reduces the involvement of BAU.

4. SDRS provides valuable “initial placement” for new virtual machines which will help avoid a situation where datastores are unevenly balanced from a capacity perspective in the first place, therefore reducing the chance of virtual machines requiring migration.

5. Setting the “No recommendations until utilization difference between source and destination is” to 10% ensures that SDRS does not move virtual machines around where significant benefit is not realized. This prevents unnecessary Storage vMotion activity on the disk system. Although this activity is offloaded from the host to the array, the I/O may still impact production performance for workloads on the same disk system.

6. Setting the “I/O Imbalance threshold” (5 Aggressive / 1 Conservative) to “4” (2nd most aggressive) ensures that I/O imbalance should be addressed before significant imbalance is experienced by the end users. This level of “aggressiveness” is acceptable as the Storage vMotion can be offloaded (via the VAAI “XCOPY” primitive) and has almost zero impact on the host. Setting this to “5” may result in minor I/O imbalances being corrected at the cost of a Storage vMotion, and as a result the impact of the more frequent Storage vMotion activity may negate the benefit of the I/O balancing.

7. Storage DRS will address I/O imbalance across the datastore cluster if the latency meets or exceeds the set value of 15ms (the default), and in the event of latency increasing during peak times to >=30ms, Storage I/O Control will ensure fair access to the storage.
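The two latency values in this decision work as layers, which the small Python sketch below tries to show. It is a deliberate simplification: the function and constant names are hypothetical, and in practice Storage DRS evaluates I/O load over its evaluation interval rather than reacting to a single latency sample.

```python
SDRS_LATENCY_MS = 15  # Storage DRS "I/O latency" threshold from this decision
SIOC_LATENCY_MS = 30  # Storage I/O Control threshold from the assumptions


def mechanisms_engaged(latency_ms: float) -> list[str]:
    """Return which mechanisms would act at a given datastore latency."""
    engaged = []
    if latency_ms >= SDRS_LATENCY_MS:
        engaged.append("Storage DRS")
    if latency_ms >= SIOC_LATENCY_MS:
        engaged.append("Storage I/O Control")
    return engaged


for latency in (10, 15, 30):
    print(latency, mechanisms_engaged(latency))
```

Below 15ms nothing acts, from 15ms Storage DRS considers rebalancing, and at 30ms or above Storage I/O Control also steps in to enforce fair access while the latency persists.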

Alternatives

1. Use “No Automation (Manual Mode)”
2. Not use Storage DRS

Implications

1. When selecting datastores for the datastore cluster, having VASA enabled allows the “System Capability” column to be populated in the “New Datastore Cluster” wizard to ensure suitable datastores of similar performance, RAID type and features are grouped together. VASA is currently NOT supported by SVC; as such, the datastore naming convention needs to accurately reflect the capabilities of the LUN/s to ensure suitable datastores are grouped together.
