Example Architectural Decision – vSphere 5.1 Single Sign On (SSO) deployment mode across Active/Active Datacenters

Problem Statement

What is the most suitable deployment mode for vCenter Single-Sign On (SSO) in an environment where there are two (2) physical datacenters running in an Active/Active configuration?

Requirements

1. The solution must be a fully supported configuration
2. Meet/Exceed RTO of 4 hours
3. Environment must support SRM failover between Datacenter A and Datacenter B where an entire datacenter is lost

Assumptions

1. Three (3) vCenter servers will be used: one (1) at Datacenter A and two (2) at Datacenter B
2. The environment has two (2) Production clusters (one per datacenter) and one (1) vCloud cluster at Datacenter B, each with a dedicated vCenter
3. Stretched clusters are not used
4. All vSphere Infrastructure servers (including SSO) are protected by SRM and vSphere HA
5. Inter-site Metropolitan Area Network is high bandwidth (>10Gb), low latency (<5ms) and highly available (99.999%)
6. The average number of authentications per second for each SSO instance is <30 (Configuration Maximum)

Constraints

1. The environment uses a traditional agent-based backup solution which may not meet RPO/RTO requirements

Motivation

1. Future proof the environment

Architectural Decision

1. Use “Multi-site” SSO deployment mode
2. Do not use SSO “High Availability” clusters
3. The Primary SSO server will be at Datacenter B
4. The remaining vCenter servers will be “Secondaries” and point to the Datacenter B Primary SSO instance
5. Each SSO instance will be on a dedicated Windows 2008 R2 x64 instance
6. Each SSO instance will use the bundled SQL database
7. (Optional) For greater availability, vCenter Heartbeat will be used to protect each SSO instance

Justification

1. The environment is being designed (where possible) to sustain a Metropolitan Area Network failure between the two (2) datacenters

2. If “High Availability” mode is used, at least one (1) vCenter would be accessing SSO across the MAN link which introduces an unnecessary dependency on the MAN links

3. “High Availability” currently requires manual intervention which can be complicated and problematic

4. “Basic” mode prevents the use of Linked Mode which will make management of the environment more difficult

5. Using Multisite mode allows faster access to authentication services as each SSO instance is configured with Active Directory servers located at the same datacenter.

6. Multisite mode is required for the use of Linked Mode, and Linked Mode will make day-to-day management easier

7. If one SSO instance goes offline for any reason, this will not impact production virtual machines. It will simply prevent any authentication to the affected vCenter server.

8. Having the SSO Primary at Datacenter B ensures only traffic from one vCenter (Datacenter A vCenter) traverses the MAN link as the third vCenter (for vCloud Director) is at Datacenter B

9. In the event of Datacenter B suffering a full datacenter-wide failure for any reason, the Primary SSO instance being offline will not impact the management of Datacenter A or the ability of the environment to be recovered by SRM.

10. During an SSO upgrade, multiple vCenters cannot co-exist, and using a centralized (or shared) SSO instance would overly complicate the upgrade process and lead to extended impact on the vSphere environments.

Alternatives

1. Use “Basic” Mode, resulting in a standalone version of SSO for each vCenter server

2. Use “High Availability Cluster” (sharing the same SSO database and identity sources) with one SSO server per physical datacenter

3. Use “Multisite” deployment with “High Availability Clusters” per datacenter

4. Host SSO database on a SQL Server

5. Run SSO on the vCenter server with or without the SSO database locally

6. Run a single SSO instance shared by all three (3) vCenters and use vCenter Heartbeat running across the MAN to protect SSO

Implications

1. Without a “High Availability Cluster” or SSO being protected by vCenter Heartbeat at each datacenter, the SSO instance for each site is a single point of failure; if it fails, authentication to the affected vCenter will fail

2. In the event of one (1) SSO server failing at Datacenter A, the SSO role does not fail over to Datacenter B, or vice versa. In this case, all authentication requests on the site where SSO has failed will fail.

3. Requires the installable version of SSO, which is Windows only. The vCenter Server Appliance (VCSA) cannot be used.

4. Additional Windows 2008 licenses are required for the SSO servers
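
The failure behaviour described above can be sketched as a small model. This is only an illustration, not a VMware tool: the instance names are hypothetical, and it assumes one dedicated SSO instance per vCenter in the chosen design, compared against Alternative 6 (a single shared SSO instance, assumed here to live at Datacenter B).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class VCenter:
    name: str          # hypothetical vCenter name
    site: str          # datacenter where the vCenter runs ("A" or "B")
    sso_instance: str  # SSO instance this vCenter authenticates against


# Hypothetical SSO instances and the datacenter each one runs in.
SSO_SITE: Dict[str, str] = {
    "sso-a": "A",
    "sso-b-primary": "B",
    "sso-b-vcloud": "B",
    "sso-shared": "B",  # assumption: the shared instance would sit at Datacenter B
}

# Chosen design: Multi-site mode, each vCenter uses an SSO instance at its own site.
MULTISITE = [
    VCenter("vc-prod-a", "A", "sso-a"),
    VCenter("vc-prod-b", "B", "sso-b-primary"),
    VCenter("vc-vcloud-b", "B", "sso-b-vcloud"),
]

# Alternative 6: one SSO instance shared by all three vCenters.
SHARED = [
    VCenter("vc-prod-a", "A", "sso-shared"),
    VCenter("vc-prod-b", "B", "sso-shared"),
    VCenter("vc-vcloud-b", "B", "sso-shared"),
]


def affected_by_sso_failure(topology: Iterable[VCenter], failed: Set[str]) -> List[str]:
    """vCenters that cannot authenticate users while the given SSO instances are down."""
    return sorted(vc.name for vc in topology if vc.sso_instance in failed)


def cut_off_by_man_failure(topology: Iterable[VCenter]) -> List[str]:
    """vCenters whose SSO instance is at the other site, so a MAN outage blocks logins."""
    return sorted(vc.name for vc in topology if SSO_SITE[vc.sso_instance] != vc.site)


if __name__ == "__main__":
    print(affected_by_sso_failure(MULTISITE, {"sso-b-primary"}))
    print(affected_by_sso_failure(SHARED, {"sso-shared"}))
    print(cut_off_by_man_failure(MULTISITE))
    print(cut_off_by_man_failure(SHARED))
```

In this model, losing one SSO instance in the Multi-site design affects a single vCenter, while the shared design loses authentication everywhere and also depends on the MAN link for Datacenter A.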

Related Articles

1. Disabling Single Sign On – Dont Do It! – LongWhiteClouds

2. vSphere 5.1 Single Sign On (SSO) Configuration – Architectural Decision flowchart

I would like to Thank Michael Webster VCDX#66 (@vcdxnz001) for his contribution to this example architectural decision.


Example Architectural Decision – BC/DR Solution for vCloud Director

Problem Statement

What is the most suitable BC/DR solution for a vCloud Director environment?

Requirements

1. Ensure the vCloud solution can tolerate a site failure in an automated manner
2. Ensure the vCloud solution meets/exceeds the RTO of 4hrs
3. Comply with all requirements of the Business Continuity Plan (BCP)
4. Solution must be a supported vSphere / vCloud Configuration
5. Ensure all features / functionality of the vCloud solution are available following a DR event

Assumptions

1. Datacenters are in an Active/Active configuration
2. Stretched Layer 2 network across both datacenters
3. Storage based replication between sites
4. vSphere 5.0 Enterprise Plus or later
5. VMware Site Recovery Manager 5.0 or later
6. vCloud Director 1.5 or later
7. There is no requirement for workloads proposed to be hosted in vCloud to be at one datacenter or another

Constraints

1. The hardware for the solution has already been chosen and purchased. 6 x 4 Way, 32 core Hosts w/ 512GB RAM and 4 x 10GB
2. The storage solution is already in place and does not support a Metro Storage Cluster (vMSC) configuration

Motivation

1. Meet/Exceed availability requirements
2. Minimize complexity

Architectural Decision

Use the vCloud DR solution as described in the “vCloud Director Infrastructure Resiliency Case Study” (By Duncan Epping @duncanyb and Chris Colotti @Ccolotti )

In summary, host the vSphere/vCloud Management virtual machines on an SRM protected cluster.

Use a dedicated cluster for vCloud compute resources.

Configure the vSphere cluster which is dedicated to providing compute resources to the vCloud environment (Provider virtual data center – PvDC) to have four (4) compute nodes located at Datacenter A for production use and two (2) compute nodes located at Datacenter B (in “Maintenance mode”) dedicated to DR.

Storage will not be stretched across sites; LUNs will be presented locally from “Datacenter A” shared storage to the “Datacenter A” based hosts. The “Datacenter A” storage will be replicated synchronously to “Datacenter B” and presented from “Datacenter B” shared storage to the two (2) “Datacenter B” based hosts.

In the event of a failure, SRM will recover the vSphere/vCloud Management virtual machines, bringing the Cloud back online. Then a script, run as the last part of the SRM recovery plan, mounts the replicated storage to the ESXi hosts in “Datacenter B” and takes the two (2) hosts at “Datacenter B” out of maintenance mode. HA will then detect the virtual machines and power them on.
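
The ordering of that final script can be sketched as below. This is a minimal sketch of the sequencing only: the two step functions are hypothetical placeholders, and a real script would call whatever storage array and vSphere tooling the environment uses.

```python
from typing import Callable, List, Tuple

# A step is a description plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_recovery_steps(steps: List[Step]) -> List[str]:
    """Run the steps in order and stop at the first failure.

    Returns the descriptions of the steps that completed, so a later step
    (exiting maintenance mode) never runs before storage is mounted.
    """
    completed: List[str] = []
    for description, action in steps:
        if not action():
            break
        completed.append(description)
    return completed


# Hypothetical placeholders: replace with calls to the real storage/vSphere tooling.
def mount_replicated_storage() -> bool:
    return True


def exit_maintenance_mode() -> bool:
    return True


DATACENTER_B_STEPS: List[Step] = [
    ("mount replicated storage on Datacenter B hosts", mount_replicated_storage),
    ("take Datacenter B hosts out of maintenance mode", exit_maintenance_mode),
]

if __name__ == "__main__":
    for done in run_recovery_steps(DATACENTER_B_STEPS):
        print("completed:", done)
```

Stopping at the first failed step keeps the hosts in maintenance mode if the storage could not be mounted, so HA does not try to restart virtual machines without their datastores.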

Justification

1. Stretched Clusters are more suited to Disaster Avoidance than Disaster Recovery
2. Avoids the complex, manual intervention that a stretched cluster solution requires in the case of a disaster
3. A stretched cluster provides minimal control in the event of a disaster, whereas in this case HA simply restarts VMs once the storage is presented (automatically) and the hosts are taken out of Maintenance mode (also automated)
4. Having two (2) ESXi hosts for the vCloud resource cluster set up in “Datacenter B” in “Maintenance Mode” and the storage mirrored as discussed allows the virtual workloads to be recovered in an automated fashion as part of the VMware Site Recovery Manager solution.
5. Removes the management overhead of managing a stretched cluster using features such as DRS affinity rules to keep VMs on the hosts on the same site as the storage
6. vSphere 5.1 backed resource clusters support >8 host clusters for “Fast provisioning”
7. Remove the dependency on the Metropolitan Area Data and Storage networks during BAU and the potential impact of the latency between sites on production workloads
8. Eliminates the chance of a “Split Brain” or a “Datacenter Partition” scenario where VM/s can be running at both sites without connectivity to each other
9. There is no specific requirement for non-disruptive mobility between sites
10. Latency between sites cannot be guaranteed to be <10ms end to end

Alternatives

1. Stretched Cluster between “Datacenter A” and “Datacenter B”
2. Two independent vCloud deployments with no automated DR
3. Have more/fewer hosts at the DR site in the same configuration

Implications

1. Two (2) ESXi hosts in the vCloud Cluster located in “Datacenter B” will remain unused as “Hot Standby” unless there is a declared site failure at “Datacenter A”
2. Requires two (2) vCenter servers, one (1) per Datacenter
3. There will be no non-disruptive mobility between sites (i.e. vMotion)
4. SRM protection groups / plans need to be created/managed. Note: This will be done as part of the Production cluster
5. In a DR event, only half the compute resources will be available compared to production.
6. Depending on the latency between sites, storage performance may be reduced by the synchronous replication as the write will not be acknowledged to the VM at “Datacenter A” until committed to the storage at “Datacenter B”
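
Implications 5 and 6 can be checked with some simple arithmetic. This sketch assumes 32 cores and 512GB RAM per host (from the hardware constraint) and uses a deliberately simplified write model in which the array issues the local and remote writes in parallel; real array behaviour is vendor dependent, and the latency figures in the example are illustrative, not measurements.

```python
CORES_PER_HOST = 32     # from the hardware constraint (4 way, 32 core hosts)
RAM_GB_PER_HOST = 512   # from the hardware constraint


def cluster_capacity(hosts: int) -> dict:
    """Raw compute capacity of a group of identical hosts."""
    return {"cores": hosts * CORES_PER_HOST, "ram_gb": hosts * RAM_GB_PER_HOST}


def sync_write_latency_ms(local_commit_ms: float,
                          inter_site_rtt_ms: float,
                          remote_commit_ms: float) -> float:
    """Simplified model: the write is acknowledged only once both copies are committed.

    Assumes local and remote writes start at the same time, so the slower
    of the two paths determines when the VM sees the acknowledgement.
    """
    return max(local_commit_ms, inter_site_rtt_ms + remote_commit_ms)


production = cluster_capacity(4)  # four hosts at Datacenter A
dr = cluster_capacity(2)          # two hosts at Datacenter B
dr_fraction = dr["cores"] / production["cores"]

if __name__ == "__main__":
    print("production:", production)
    print("DR:", dr)
    print("DR fraction of production:", dr_fraction)
    # Illustrative values only: 1 ms commits, 5 ms round trip between sites.
    print("write latency (ms):", sync_write_latency_ms(1.0, 5.0, 1.0))
```

With four production hosts and two DR hosts the DR side has 64 cores and 1024GB RAM against 128 cores and 2048GB in production, which is the "half the compute resources" noted above.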


Example Architectural Decision – Guest OS Page File Storage in vSphere

Problem Statement

In a vSphere environment using deduplication and an array snapshot based backup solution, Guest OS page files are currently stored on the OS drive (VMDK). This reduces the effectiveness of deduplication and places an overhead on the storage controllers, which have to scan data that cannot be deduplicated.

As the Guest OS Paging files are being included in the snapshot process (with the guest OS) this also demands additional capacity for both primary and secondary disk storage for disk to disk backups.

How can this overhead be minimized or eliminated?

Requirements

1. Make the most efficient use of the available storage capacity
2. Maintain consistent level of virtual machine / storage performance
3. Minimize the storage required for primary and secondary snapshot based backups
4. Maintain the array level snapshot based backup solution as it is required to meet RPO/RTOs
5. Maintain the use of deduplication, as this has proven to decrease storage requirements and improve performance

Assumptions

1. vSphere 5.0 or later
2. VMFS 5 Datastores which are Thin Provisioned
3. Deduplication is in use for Volumes where Guest OS virtual disks are stored
4. VAAI is supported by the array and enabled across the vSphere environment
5. All datastores are presented to all hosts within the cluster
6. Snapshot based backup solution is being used
7. Virtual Machines are right sized
8. Disk to disk backup data is replicated offsite

Constraints

1. None

Motivation

1. Optimize the storage performance
2. Ensure Tier 1 storage is not wasted with transient files
3. Minimize storage required for snapshot based backups

Architectural Decision

Separate OS page files onto a dedicated VMDK, which will be located on a datastore (or datastore cluster) which is:
1. Not protected by the array level snapshot backup solution
2. Not running deduplication
3. Not running data compression

Justification

1. Allows page files to be stored on different underlying storage including (optionally) high capacity, lower cost, SATA disk
2. Relocating Guest OS page files to another datastore (or datastore cluster) not protected by snapshots dramatically reduces the amount of data being protected by the snapshot based backup solution
3. Reduces the amount of data being replicated to secondary disk backup location/s thus minimizing the bandwidth requirements between datacenters
4. (Optionally) Ensures Tier 1 storage is only used for high performance guests
5. As the virtual machines are right sized, the performance impact/frequency of paging should be minimal
6. Reduces the CPU cycles required for deduplication as data which cannot be deduplicated will not be scanned
7. Reduces the CPU cycles on the storage controllers by not attempting to compress page file data
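
To get a feel for the capacity involved in justifications 2 and 3, the data removed from the protected datastores can be estimated as below. This is a rough sketch only: the VM count, page file size and snapshot retention are made-up illustrative inputs, and the snapshot figure is an upper bound that assumes every retained snapshot could hold a fully changed copy of each page file.

```python
from typing import Iterable


def excluded_pagefile_gb(pagefile_sizes_gb: Iterable[float]) -> float:
    """Primary capacity no longer sitting on snapshot-protected, deduplicated datastores."""
    return sum(pagefile_sizes_gb)


def worst_case_snapshot_gb(pagefile_sizes_gb: Iterable[float], retained_snapshots: int) -> float:
    """Upper bound on snapshot space avoided.

    Assumes each retained snapshot could contain one fully changed copy of
    every page file; actual usage depends on how much the guests page.
    """
    return excluded_pagefile_gb(pagefile_sizes_gb) * retained_snapshots


if __name__ == "__main__":
    # Illustrative inputs: 100 VMs with a 4GB page file each, 14 retained snapshots.
    pagefiles = [4.0] * 100
    print("page file data moved off protected datastores (GB):",
          excluded_pagefile_gb(pagefiles))
    print("worst-case snapshot space avoided (GB):",
          worst_case_snapshot_gb(pagefiles, 14))
```

The same worst-case figure also bounds the page file data that would otherwise be replicated to the secondary disk backup location.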

Alternatives

1. Leave page files within the virtual machine's primary VMDK and accept the overhead on the backup solution
2. Turn off paging within the Guest OS (No Page File)

Implications

1. Additional steps are required to create a dedicated VMDK for each VM and configure the Guest OS to use the alternate location
2. Templates need to be updated to the above configuration
3. For environments using Site Recovery Manager, some manual steps are required when setting up protected virtual machines for the first time. This increases the work required during setup; however, as this is a one-time overhead, it is believed the benefit of reduced backup storage and replication traffic (for SRM) outweighs it
