Competition Example Architectural Decision Entry 3 – Scalable network architecture for VXLAN

Name: Prasenjit Sarkar
Title: Senior Member of Technical Staff
Company: VMware
Twitter: @stretchcloud
Profile: VCAP-DCD4/5,VCAP-DCA4/5,VCAP-CIA,vExpert 2012/2013

Problem Statement

You are moving towards scalable network architecture for your large scale Virtualized Datacenter and want to configure VXLAN in your environment. You want to make sure that Teaming Policy for VXLAN transport is configured optimally for better performance and reduce operational complexity around it.


1. vSphere 5.1 or greater
2. vCloud Networking & Security 5.1 or greater
3. Core & Edge Network topology is in place


1. Should have switches that support Static Etherchannel or LACP (Dynamic Etherchannel)
2. Have to use only IP Hash Load balancing method if using vSphere 5.1
3. Cannot use Beacon Probing as Failure Detection mechanism


1. Optimize performance for VXLAN

2. Reduce complexity where possible

3. Choosing best teaming policy for VXLAN Traffic for future scalability

Architectural Decision

LACP – Passive Mode will be chosen as the teaming policy for the VXLAN Transport.

At least two or more physical links will be aggregated using LACP in the upstream Edge switches.

Two Edge switches will be connected to each other.

ESXi host will be cross connected to these two Physical upstream switches for forming a LACP group.

LACP will be configured in Passive mode in Edge switches so that the participating ports responds to the LACP packets that it receives but does not initiate LACP negotiation.


1. Use LACP – Active Mode and make sure you are using IP Hash algorithm for the load balancing in your vDS if using vSphere 5.1.

2. Use LACP – Active Mode and use any of the 22 available load balancing algorithm in your vDS if using vSphere 5.5.

3. Use LACP – Active Mode and use Cisco Nexus 1000v virtual switch and use any of the 19 available load balancing algorithm.

4. Use Static Etherchannel and make sure you are using IP Hash *Only* algorithm in your vDS.

5. If using Failover then have at least one 10G NIC to handle the VXLAN traffic.


1. Fail Over teaming policy for VXLAN vmkernel NIC uses only one uplink for all VXLAN traffic. Although redundancy is available via the standby link, all available bandwidth is not used.
2. Static Etherchannel requires IP Hash Load Balancing be configured on the switching infrastructure, which uses a hashing algorithm based on source and destination IP address to determine which host uplink egress traffic should be routed through.

3. Static Etherchannel and IP Hash Load Balancing is technically very complex to implement and has a number of prerequisites and limitations, such as, you can’t use beacon probing, you can’t configure standby or unused link etc.

4. Static Etherchannel does not do pre check both the terminating ends before forming the Channel Group. So, if there are issues within two ends then traffic will never pass and vSphere will not see any acknowledgement back in it’s Distributed Switches

5. Active LACP mode places a port into an active negotiating state, in which the port initiates negotiations with other ports by sending LACP packets. If using vSphere prior to 5.5 where only IP Hash algorithm is supported then LACP will not pass any traffic if vSphere uses any other algorithm other than IP Hash (such as Virtual Port ID)

6. The operational complexity is reduced

7. If using vSphere 5.5 then can use 22 different algorithm for load balancing and also Beacon Probing can be used for Failure Detection.


1. Initial setup has a small amount of additional complexity however this is a one time task (Set & Forget)

2. Only IP Hash algorithm is supported if using vSphere 5.1

3. Only one LAG can be supported for the entire vSphere Distributed Switches if using vSphere 5.1

4. IP Hash calculation if not done manually by taking VM’s vNIC and Physical NIC then there is no guarantee that it will balance the traffic across physical links

Back to Competition Main Page or Competition Submissions

Competition Example Architectural Decision Entry 2 – Use of RDMs in Standard IaaS Clusters

Name: Chris Jones
Title: Virtualization Architect
Twitter: @cpjones44
Profile: VCP5 / VCAP5-DCD

Problem Statement

VMs require more than 1.9TB in a single disk. The existing virtual environment has LUNs provisioned that are 2TB in size. As these VMs have virtual data disks (VMDKs) that are > 1.9TB in size, alarms are being triggered by the infrastructure monitoring solution and raising Incident tickets to the Virtual Infrastructure support queue.


1. Data within the OSI must reside within the VM and not on some kind of IP based store (like a NAS share).

2. vSphere datastores are presented through FC and not IP based stores (ie. NFS).

3. vSphere Hypervisor is ESXi 4.1.

4. There is no requirement for the VMs to be performing SAN specific functionality or running SCSI target-based software.


1. The implemented monitoring solution cannot be customised with triggers and monitoring policies for individual objects within the environment (ie. having one monitoring policy per individual or sub-group of datastores).

2. Maximum vSphere datastore size in version 4.1 is 2TB minus 512 bytes.

3. Unable to upgrade beyond ESXi 4.1 Update 3.


1. Reduce the number of incident tickets being raised, thus improving SLA posture.

2. Reduce the requirement to span single Windows logical volumes across multiple VMDKs.

Architectural Decision

Turn the disk into an RDM (Virtual Compatibility Mode) to remove the level of monitoring from the vSphere layer.


1. Create smaller VMDKs (ie. 1-1.5TB disks) and create a RAID0 volume within the guest OS.

2. Change the level of alerting so that tickets are not raised for alerts that trigger beyond 90%.

3. Turn the disk into an RDM to remove the level of monitoring from the vSphere layer.

4. Thin Provision the virtual disks

5. Store the data within the guest on some kind of IP based storage (NAS/iSCSI target).


1. Option 5 goes against the assumption that data must be local to the VM, so was ruled out.

2. Whilst thin provisioning (Option 4) is an attractive solution, this option is ruled out based on a wider infrastructure decision to thick provision all disks in the environment to reduce the risk to datastores filling up and critical business VMs stopping.

3. Option 1 via smaller VMDKs spread across multiple vSphere datastores will result in these alerts disappearing, however it will create issues when trying to execute a DR recovery for either the individual disks (Active/Passive) or the whole VM (Active/Cold). All that’s needed is for one VMDK not to be replicated and the whole Windows volume will be corrupted, or for the VMDKs to be mounted in the wrong order. Multiple VMDKs to one Windows volume also complicates the recovery of snapshot array-based backups (eg. via SMVI or NetBackup).

4. Option 2 goes against the constraint of the infrastructure monitoring solution not being able to creating individual alerting policies for either a single or sub-group of datastores in the inventory. Should individualised policies be created, we would need to ensure that the affected VMDKs that consume 90-95% of a datastore remain on that datastore as moving from one to another (ie. from Tier 2 to Tier 1) will require a change to the monitoring that has been configured. At this stage, the monitoring solution has no way to track these customised policies, which is most of the reason why global environment wide policies exist.

5. Option 3 and the use of RDMs in Virtual Compatibility Mode will allow the VM to benefit from the features of VMFS, such as advanced file locking for data protection and vSphere snapshotting. The use of RDMs will also allow for VMs to be managed by DRS (ie. can be vMotion’ed) and protected by vSphere HA.


  1. The RDM mapping will need to be recorded clearly to avoid the lengthy process of discovering from scratch what physical LUN is presented to the virtual machine.

An example of how to map these will be to:

A)    Record the name of the VM that has the RDM.

B)    Record the NAA number of the physical LUN(s) that are presented to the VM.

C)    Record the virtual device node on the virtual disk controller as to where the RDM is mounted.

D)    Record the Windows drive letter that this RDM is mounted to.

2. Additional paths will be consumed, reducing the total number of vSphere datastores that can be presented to the cluster.

Back to Competition Main Page or Competition Submissions

Example Architectural Decision Competition – Submissions

All suitable Example architectural decision submissions will be posted here, please vote for your favourite decision by leaving a comment on this page with the example decision number.


1. TSM backup configuration for PureFlex environment?

2. Use of RDMs in Standard IaaS Clusters

3. Scalable network architecture for VXLAN

4. vCloud Allocation Pool Usable Memory

5. New vSphere 5.x environment

6. Improve Performance for BCAs on Cisco UCS

7. (More Coming Soon)

WINNER ROUND ONE: Use of RDMs in Standard IaaS Clusters by Chris Jones @cpjones44

This design decision works around some fairly strict constraints, such as no >2TB LUNs, no IP based storage & the inability for monitoring solution to be customized.

While the decision is ultimately fairly straight forward, the decision documents the issue well and justifies the decision and discusses in depth the implications of the decision.

This is an example of a fairly obvious decision (considering the constraints) but shows even where a decision may be obvious, or the only option, that understanding the implications is important. Documenting even obvious decisions is also important so in the event of movement within the team, the solution can be understood by people not involved in the original design process.

RUNNER UP ROUND ONE: TSM backup configuration for PureFlex environment? By Ash Simpson @Yipikaye1

Not unlike Chris Jones’ decision, Ash’s submission works within the constraints of an existing environment, where hardware and software has already been purchased. This is a common issue, where Hardware / Software is purchased before a detailed design phase. This is a huge problem in the industry and I encourage you all to ensure this trend does not continue. Without a detailed design phase, it is not possible to confirm what hardware/software is required, as such hardware/software should only be purchased after the design to completed.

Again this decision is fairly obvious given the constraints, but the decision explains the benefits of this method of configuration and discusses the implications which is important.

The constraints did not list anything preventing purchasing of a different backup solution, although this is somewhat implied by the assumptions.

Congratulations to Chris Jones @cpjones44 & Ash Simpson @Yipikaye1!

Thank you to everyone who submitted design decisions, and I encourage you all to submit new decisions for Round 2 and am looking forward to new competition participants.

SUBMISSIONS FOR ROUND 2 (Closing 31st October 2013)

1. (More Coming Soon)

2. (More Coming Soon)

3. (More Coming Soon)