Data Centre Migration Strategies – Part 2 – Lift and Shift

Continuing on from Data Centre Migration Strategies Part 1 – Overview, Part 2 focuses on the “Lift and Shift” method.

I’m sure your reading this and already thinking, “this is the least interesting migration strategy, tell me about vMSC and SRM!” and well, your right, BUT it is important to understand the pros and cons so if you are ever in a situation where you have to use this method (I have on numerous occasions) that the migration is successful.

So what are the pros and cons of this method.

Pros

1. No need to purchase equipment for the new data centre
2. The environment should perform as it did at the original data centre following relocation
3.The approach is simple from a technical perspective ie: No new products are required
4. Low direct cost (Note: Point 8 in Cons)
5. Achieves a Recovery Point Objective (RPO) of zero (0).

Cons

1. The entire environment needs to be fully shut-down
2. The outage for the environment starts from when the servers are shut-down, until completion of operational verification testing at the new datacenter. Note: This may take several days depending on the size of the environment.
3. This method is high risk as the ability to fail back to the original datacenter requires all equipment be physically relocated back. This means the Recovery Time Objective (RTO) cannot be low.
4. The Lift and shift method cannot be tested until at least a significant amount of equipment has been physical relocated
5. In the event of an issue during operational verification at the new data centre, a decision needs to be made to proceed and troubleshoot the issues, OR at what point to fail back.
6. Depending on your environment, a vendor (eg: Storage) may need to revalidate your environment
7. Your migration (and schedule) are heavily dependant on the logistical side of the relocation which may have many factors (eg: Traffic / Weather) which are outside your control which may lead to delays or failed migration.
8. Potentially high indirect cost eg: Downtime, Loss of Business , productivity etc

When to use this method?

1. When purchasing equipment for the new data centre is not possible
2. When extended outages to the environment are acceptable
3. When you have no other options

Recommendations when using “Lift and Shift”

1. Ensure you have accurate wiring and rack diagrams of your datacenter
2. Be prepared with your vendor support contact details on hand as it is common following relocation of equipment to have hardware failures
3. Ensure you have an accurate Operational Verification document which tests every part of your environment from Layer 1 (Physical) all the way to Layer 7 (Application)
4. Label EVERYTHING as you disconnect it at the original datacenter
5. Prior to starting your data centre  migration, discuss and agree on a timeline for the migration and at what point and under what situation do you initiate a fail back.
6. Migrate the minimum amount of physical equipment that is required to get your environment back on-line and do your Operational Verification, then on successful completion of your Operational Verification migrate the remaining equipment. This allows for faster fail-back in the event Operational Verification fails.

In Part 3, we discuss Data centre migrations using VMware Site Recovery Manager. (Coming soon)

 

Data Centre Migration Strategies – Part 1 – Overview

After a recent twitter discussion, I felt a Data Centre migration strategies would be a good blog series to help people understand what the options are, along with the Pros and Cons of each strategy.

This guide is not intended to be a step by step on how to set-up each of these solutions, but a guide to assist you making the best decision for your environment when considering a data centre migration.

So what’s are some of the options when migrating virtual machines from one data centre to another?

1. Lift and Shift

Summary: Shut-down your environment and Physically relocate all the required equipment to the new location.

2. VMware Site Recovery Manager (SRM)

Summary: Using SRM with either Storage Replication Adapters (SRAs) or vSphere Replication (VR) to perform both test and planned migration/s between the data centres.

3. vSphere Metro Storage Cluster (vMSC)

Summary: Using an existing vMSC or by setting up a new vMSC for the migration, vMotion virtual machines between the sites.

4. Stretched vSphere Cluster / Storage vMotion

Summary: Present your storage at one or both sites to ESXi hosts at one or both sites and use vMotion and Storage vMotion to move workloads between sites.

5. Backup & Restore

Summary: Take a full backup of your virtual machines, transport the backup data to a new data centre (physically or by data replication) and restore the backup onto the new environment.

6. Vendor Specific Solutions

Summary: There are countless vendor specific solutions which range from Storage layer, to Application layer and everything in between.

7. Data Replication and re-register VMs into vCenter (or ESXi) inventory

Summary: The poor man’s SRM solution. Setup data replication at the storage layer and manually or via scripts re-register VMs into the inventory of vCenter or ESXi for sites with no vCenter.

Each of the above topics will be discussed in detail over the coming weeks so stay tuned, and if you work for a vendor with a specific solution you would like featured please leave a comment and I will get back to you.

Example Architectural Decision – BC/DR Solution for vCloud Director

Problem Statement

What is the most suitable BC/DR solution for a vCloud director environment?

Requirements

1. Ensure the vCloud solution can tolerate a site failure in an automated manner
2. Ensure the vCloud solution meets/exceeds the RTO of 4hrs
3. Comply with all requirements of the Business Continuity Plan (BCP)
4. Solution must be a supported vSphere / vCloud Configuration
5. Ensure all features / functionality of the vCloud solution are available following a DR event

Assumptions

1. Datacenters are in an Active/Active configuration
2. Stretched Layer 2 network across both datacenters
3. Storage based replication between sites
4. vSphere 5.0 Enterprise Plus or later
5. VMware Site Recovery Manager 5.0 or later
6, vCloud Director 1.5 or later
7. There is no requirement for workloads proposed to be hosted in vCloud to be at one datacenter or another

Constraints

1. The hardware for the solution has already been chosen and purchased. 6 x 4 Way, 32 core Hosts w/ 512GB RAM and 4 x 10GB
2. The storage solution is already in place and does not support a Metro Storage Cluster (vMSC) configuration

Motivation

1. Meet/Exceed availability requirements
2. Minimize complexity

Architectural Decision

Use the vCloud DR solution as described in the “vCloud Director Infrastructure Resiliency Case Study” (By Duncan Epping @duncanyb and Chris Colotti @Ccolotti )

In Summary, Host the vSphere/vCloud Management virtual machines on an SRM protected cluster.

Use a dedicated cluster for vCloud compute resources.

Configure the vSphere cluster which is dedicated to providing compute resources to the vCloud environment (Provider virtual data center – PvDC) to have four (4) compute nodes  located at Datacenter A for production use and two (2) compute nodes located at Datacenter B (in ”Maintenance mode”) dedicated to DR.

Storage will not be stretched across sites; LUNs will be presented locally from “Datacenter A” shared storage to the “Datacenter A” based hosts. The “Datacenter A” storage will be replicated synchronously to “Datacenter B” and presented from “Datacenter B” shared storage to the two (2) “Datacenter B” based hosts. (No stretched Storage between sites)

In the event of a failure, SRM will recover the vSphere/vCloud Management virtual machines bringing back online the Cloud, then a script as the last part of the SRM recovery plan, Mounts the replicated storage to the ESXi hosts in “Datacenter B” and takes the two (2) hosts at “Datacenter B” out of maintenance mode. HA will then detect the virtual machines and power on them on.

Justification

1. Stretched Clusters are more suited to Disaster Avoidance than Disaster Recovery
2. Avoids complex and manual  intervention in the case of a disaster in the case of a stretched cluster solution
3. A Stretched cluster provides minimal control in the event of a Disaster where as in this case, HA simply restarts VMs once the storage is presented (automatically) and the hosts are taken out of Maintenance mode (also automated)
4. Having  two (2) ESXi hosts for the vCloud resource cluster setup in “Datacenter B” in “Maintenance Mode” and the storage mirrored as discussed  allows the virtual workloads to be recovered in an automated fashion as part of the VMware Site Recovery Manager solution.
5. Removes the management overhead of managing a strecthed cluster using features such as DRS affinity rules to keep VMs on the hosts on the same site as the storage
6. vSphere 5.1 backed resource clusters support >8 host clusters for “Fast provisioning”
7. Remove the dependency on the Metropolitan Area Data and Storage networks during BAU and the potential impact of the latency between sites on production workloads
8. Eliminates the chance of a “Split Brain” or a “Datacenter Partition” scenario where VM/s can be running at both sites without connectivity to each other
9. There is no specific requirement for non-disruptive mobility between sites
10. Latency between sites cannot be guaranteed to be <10ms end to end

Alternatives

1. Stretched Cluster between “Datacenter A” and “Datacenter B”
2. Two independent vCloud deployments with no automated DR
3. Have more/less hosts at the DR site in the same configuration

Implications

1. Two (2) ESXi hosts in the vCloud Cluster located in “Datacenter B” will remain unused as “Hot Standby” unless there is a declared site failure at “Datacenter A”
2. Requires two (2) vCenter servers , one (1) per Datacenter
3. There will be no non-disruptive mobility between sites (ie: vMotion)
4. SRM protection groups / plans need to be created/managed Note: This will be done as part of the Production cluster
5. In the event of a DR event, only half the compute resources will be available compared to production.
6. Depending on the latency between sites, storage performance may be reduced by the synchronous replication as the write will not be acknowledged to the VM at “Datacenter A” until committed to the storage at “Datacenter B”

CloudXClogo