Problem: ROBO/Dark Site Management, Solution: XCP + AHV

Problem:

Remote Office / Branch Office sites, commonly referred to as “ROBO”, and dark sites (i.e. sites without local support staff and/or network connectivity to a central datacenter) are notoriously difficult to design, deploy and manage.

Why have infrastructure at ROBO?

Customers have infrastructure at ROBO and/or dark sites because these sites require services which cannot be provided centrally, due to any number of constraints such as WAN bandwidth, latency and availability or, more frequently, security requirements.

Challenges:

Infrastructure at ROBO and/or dark sites needs to be functional, highly available and performant without complexity. The problem is that because the functional requirements of ROBO/dark sites are typically not dissimilar to those of the datacenter/s, the complexity of these sites can equal, if not exceed, that of the primary datacenter, especially given the reduced budgets for ROBO deployments.

This means in many cases the same management stack needs to be designed on a smaller scale, deployed and somehow managed at these remote/secure sites with minimal to no I.T presence onsite.

Alternatively, management may be run centrally, but this has its own challenges, especially when WAN links are high latency/low bandwidth or unreliable/offline.

Typical ROBO deployment requirements:

Typical requirements are in many cases not dissimilar to those of SMB or enterprise environments and include things like High Availability (HA) for VMs, which means a minimum of two nodes and some shared storage. Customers also want to ensure ROBO sites can be centrally managed without deploying complex tooling at each site.

ROBO and dark sites are also typically deployed because it is critical for the site to continue to function in the event of WAN connectivity loss. As a result, it is also critical for the infrastructure to gracefully handle failures.

So let’s summarise typical ROBO requirements:

  • VM High Availability
  • Shared Storage
  • Be fully functional when WAN/MAN is down
  • Low/no touch from I.T
  • Backup/Recovery
  • Disaster Recovery

Solution:

Nutanix Xtreme Computing Platform (XCP) including PRISM and Acropolis Hypervisor (AHV).

Now let’s dive into why XCP + PRISM + AHV is a great solution for ROBO.

A) Native Cross Hypervisor & Cloud Backup/Recovery & DR

Backup/Recovery and DR are not easy things to achieve or manage for ROBO deployments. Luckily these capabilities are built into Nutanix XCP. This includes the ability to take point-in-time, application consistent snapshots and replicate those to local/remote XCP clusters & Cloud Providers (AWS/Azure). These snapshots can be considered backups once replicated to a 2nd location (ideally offsite), and can also be kept locally on primary storage for fast recovery.

ROBO VMs replicated to remote/central XCP deployments can be restored onto either ESXi or Hyper-V via the App Mobility Fabric (AMF) so running AHV at the ROBO has no impact on the ability to recover centrally if required.
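
To give a feel for how this protection could be scripted rather than configured by hand in PRISM, below is a minimal Python sketch that creates a protection domain, protects a couple of VMs with application consistent snapshots and adds an hourly replication schedule to a central cluster. The cluster address, credentials, VM names, endpoint paths and payload fields are assumptions for illustration only and should be verified against the Prism REST API documentation for your AOS version.

```python
import requests

# Illustrative values only: adjust the cluster address, credentials and names
# for your environment. Endpoint paths/fields are assumptions and may differ
# between AOS versions.
PRISM = "https://prism-element.example.local:9440"
AUTH = ("admin", "password")
PD_NAME = "ROBO-Site1-PD"

session = requests.Session()
session.auth = AUTH
session.verify = False  # lab only; use proper certificates in production

# Create a protection domain to group the ROBO VMs.
session.post(f"{PRISM}/api/nutanix/v2.0/protection_domains",
             json={"value": PD_NAME})

# Protect specific VMs with application consistent snapshots.
session.post(f"{PRISM}/api/nutanix/v2.0/protection_domains/{PD_NAME}/protect_vms",
             json={"names": ["robo-fileserver", "robo-app01"],
                   "app_consistent_snapshots": True})

# Add an hourly schedule which keeps snapshots locally for fast recovery and
# replicates them to a central/remote cluster as backups.
session.post(f"{PRISM}/api/nutanix/v2.0/protection_domains/{PD_NAME}/schedules",
             json={"type": "HOURLY",
                   "every_nth": 1,
                   "remote_site_list": ["Central-DC"],
                   "retention_policy": {"local_max_snapshots": 24,
                                        "remote_max_snapshots": 168}})
```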

This is just another way Nutanix is ensuring customer choice, and it proves the hypervisor is well and truly a commodity.

In addition XCP supports integration with the market leader in data protection, Commvault.

B) Built in Highly Available, Distributed Management and Monitoring

When running AHV, all XCP, PRISM and AHV management, monitoring and even the HTML 5 GUI are built in. The management stack requires no design, sizing, installation, scaling or 3rd party backend database products such as SQL/Oracle.

For those of you familiar with the VMware stack, XCP + AHV provides the capabilities delivered by vCenter, vCenter Heartbeat, vRealize Operations Manager, the Web Client, vSphere Data Protection and vSphere Replication, and it does this in a highly available and distributed manner.

This means, in the event of a node failure, the management layer does not go down. If the Acropolis Master node goes down, the Master roles are simply taken over by an Acropolis Slave within the cluster.

As a result, the ROBO deployment management layer is self healing, which dramatically reduces complexity and all but removes the requirement for onsite attendance by I.T.
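
To picture the takeover behaviour described above, here is a deliberately simplified Python sketch. It is conceptual only and is not how Acropolis is implemented internally (real clusters use a distributed leadership/consensus service); it simply shows that when the CVM holding the Master roles fails, a surviving CVM picks them up and management stays online.

```python
from dataclasses import dataclass

# Conceptual sketch only: NOT Nutanix internals, just the "any surviving CVM
# can take over the Master roles" idea described above.

@dataclass
class Cluster:
    cvms: list          # healthy CVM identifiers
    master: str = ""    # CVM currently holding the Acropolis Master roles

    def elect_master(self):
        # Deterministically pick a healthy CVM; a real system would use a
        # distributed leadership service rather than a simple sort.
        self.master = sorted(self.cvms)[0] if self.cvms else ""

    def handle_cvm_failure(self, failed_cvm: str):
        self.cvms = [c for c in self.cvms if c != failed_cvm]
        if failed_cvm == self.master:
            # Master roles are re-hosted on a surviving CVM, so the
            # management layer does not go down.
            self.elect_master()

cluster = Cluster(cvms=["cvm-a", "cvm-b", "cvm-c"])
cluster.elect_master()
cluster.handle_cvm_failure(cluster.master)
print(cluster.master)  # a surviving CVM now holds the Master roles
```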

C) Scalability and Flexibility

XCP with AHV ensures that even when ROBO deployments need to scale to meet compute or storage requirements, the platform does not need to be re-architected, engineered or optimised.

Adding a node is as simple as plugging it in and turning it on; the cluster can then be expanded non-disruptively via PRISM (locally or remotely) in just a few clicks.

When equipment reaches end of life, XCP also allows nodes to be non-disruptively removed from clusters and new nodes added, which means that after the initial deployment, ongoing hardware replacements can be done without major redesign/reconfiguration of the environment.

In fact, deployment of new nodes can be done by people onsite with minimal I.T knowledge and experience.

D) Built-in One-Click Maintenance and Upgrades for the Entire Stack

XCP supports one-click, non-disruptive upgrades of:

  • Acropolis Base Software (NDSF layer)
  • Hypervisor (agnostic)
  • Firmware
  • BIOS

This means there is no need for onsite I.T staff to perform these upgrades, and XCP eliminates potential human error by fully automating the process. All upgrades are performed one node at a time and are only started if the cluster is in a resilient state, to ensure maximum uptime. Once a node is upgraded, it is validated as successful (similar to a Power On Self Test, or POST) before the next node proceeds. In the event an upgrade fails, the cluster will remain online as I have described in this post.
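
The workflow above can be summarised in a short sketch: only proceed while the cluster is in a resilient state, upgrade one node, validate it, then move on to the next. This is purely illustrative; the cluster_is_resilient, upgrade_node and validate_node helpers are hypothetical placeholders, not a Nutanix API.

```python
# Conceptual rolling-upgrade workflow mirroring the behaviour described above.
# The helper callables are hypothetical placeholders, not a real Nutanix API.

def rolling_upgrade(nodes, cluster_is_resilient, upgrade_node, validate_node):
    """Upgrade one node at a time, and only while the cluster can tolerate it."""
    for node in nodes:
        if not cluster_is_resilient():
            # The cluster cannot currently tolerate taking a node out of
            # service, so stop rather than risk availability.
            raise RuntimeError("Cluster not in a resilient state; upgrade halted")

        upgrade_node(node)

        if not validate_node(node):
            # Validation failed (similar to a failed POST); remaining nodes are
            # left untouched so the cluster stays online on the old version.
            raise RuntimeError(f"Validation failed on {node}; upgrade paused")
```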

These upgrades can also be done centrally via PRISM Central.

E) Full Self Healing Capabilities

As I have already touched on, XCP + AHV is a fully self healing platform. From the storage (NDSF) layer to the virtualization layer (AHV) through to management (PRISM), the platform can fully self heal without any intervention from I.T admins.

With Nutanix XCP you do not need expensive hardware support contracts, nor do you need to worry about potential subsequent failures, because the system self heals and does not depend on hardware replacement, as I have described in hardware support contracts & why 24×7 4 hour onsite should no longer be required.

Anyone who has ever managed a multi-site environment knows how much effort hardware replacement is, as well as the fact that replacements must be done in a timely manner which can delay other critical work. This is why Nutanix XCP is designed to be distributed and self healing as we want to reduce the workload for sysadmins.

F) Ease of Deployment

All of the above features and functionality can be deployed quickly and easily, going from out of the box to fully operational and ready to run VMs in just minutes.

The Management/Monitoring solutions do not require detailed design (sizing/configuration) as they are all built in and they scale as nodes are added.

G) Reduced Total Cost of Ownership (TCO)

When it comes down to it, ROBO deployments can be critical to the success of a company, and trying to do things “cheaper” rarely ends up actually being cheaper. Nutanix XCP may not be the cheapest (CAPEX), but it will deliver the lowest TCO, which is, after all, what matters.

If you’re a sysadmin and, after reading the above, you don’t think you can be any more efficient than you are today, it’s because you already run XCP + AHV 🙂

In all seriousness, sysadmins should be innovating and providing value back to the business. If they are instead spending any significant time “keeping the lights on” for ROBO deployments, then their valuable time is not being well utilised.

Summary:

Nutanix XCP + AHV provides all the capabilities required for typical ROBO deployments while reducing the initial implementation and ongoing operational cost/complexity.

With Acropolis Operating System 4.6 and the cross hypervisor backup/recovery/DR capabilities thanks to the App Mobility Fabric (AMF), there is no need to be concerned about the underlying hypervisor as it has become a commodity.

AHV performance and availability are on par with, if not better than, other hypervisors on the market, as is clear from several points we have discussed.

Related Articles:

  1. Why Nutanix Acropolis hypervisor (AHV) is the next generation hypervisor
  2. Hardware support contracts & why 24×7 4 hour onsite should no longer be required.

Why Nutanix Acropolis hypervisor (AHV) is the next generation hypervisor – Part 8 – Analytics (Performance / Capacity Management)

Acropolis provides a powerful yet simple-to-use Analysis solution which covers the Acropolis Platform, Compute (Acropolis Hypervisor / Virtual Machines) and Storage (Distributed Storage Fabric).

Unlike other Analysis solutions, Acropolis requires no additional software licensing, management infrastructure or virtual machines/applications to design, deploy or configure. The Nutanix Controller VM includes built-in Analysis which has no external dependencies. There is no need to extract/import data into another product or virtual appliance, meaning lower overheads, e.g. less data needs to be stored and there is less impact on storage.

Not only is this capability built in day one, but as the environment grows over time, Acropolis automatically scales the analytics capability; there is never a tipping point where you need to deploy additional instances, increase compute/storage resources assigned to Analytics Virtual Appliances or deploy additional back end databases.
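
Because the statistics live in the cluster itself, they can also be pulled out programmatically if required. The Python sketch below queries an hour of per-VM controller IOPS from the Prism REST API; the endpoint path, metric name and parameters are assumptions for illustration and should be verified against the API Explorer for your AOS version.

```python
import time
import requests

# Illustrative values only; adjust for your own cluster, credentials and VM.
PRISM = "https://prism-element.example.local:9440"
AUTH = ("admin", "password")
VM_UUID = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

end_usecs = int(time.time() * 1_000_000)
start_usecs = end_usecs - 3600 * 1_000_000  # last hour of samples

resp = requests.get(
    f"{PRISM}/api/nutanix/v2.0/vms/{VM_UUID}/stats/",
    params={
        "metrics": "controller_num_iops",      # assumed metric name
        "start_time_in_usecs": start_usecs,
        "end_time_in_usecs": end_usecs,
        "interval_in_secs": 30,
    },
    auth=AUTH,
    verify=False,  # lab only; use proper certificates in production
)
resp.raise_for_status()
print(resp.json())
```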

For a demo of the Analysis UI see the following YouTube Video from 4:50 onwards.

Summary:

  1. In-Built analysis solution
  2. No additional licensing required
  3. No design/implementation or deployment of VMs/appliances required
  4. Automatically scales as the XCP cluster/s grow
  5. Lower overheads due to being built into Acropolis and utilizing the Distributed Storage Fabric

Back to the Index

Why Nutanix Acropolis hypervisor (AHV) is the next generation hypervisor – Part 6 – Performance

When talking about performance, it’s easy to get caught up in comparing unrealistic speeds and feeds such as 4k I/O benchmarks. But, as any real datacenter technology expert knows, IOPS are just a small piece of the puzzle which, in my opinion, get far too much attention, as I discussed in my article Peak Performance vs Real World Performance.

When I talk about performance, I am referring to all the components within the datacenter including the Management components, Applications/VMs, Analytics, Data Resiliency and everything in between.

Let’s look at a few examples of how Nutanix XCP running Acropolis Hypervisor (AHV) ensures consistent high performance for all components:

Management Performance:

The Acropolis management layer includes the Acropolis Operating System (formerly NOS), Prism (the HTML 5 GUI) and the Acropolis Hypervisor (AHV) management stack, which is made up of “Master” and “Slave” instances.

This architecture ensures all CVMs actively and equally contribute to keeping all areas of the platform running smoothly. This means there is no central application, database or component which can cause a bottleneck; being fully distributed is key to delivering a web-scale platform.

Each Controller VM (CVM) runs the components required to manage the local node and contribute to the distributed storage fabric and management tasks.

For example, while there is a single Acropolis “Master”, it is not a single point of failure, nor is it a performance bottleneck.

The Acropolis Master is responsible for the following tasks:

  1. Scheduler for HA
  2. Network Controller
  3. Task Executors
  4. Collector/Publisher of local stats from Hypervisor
  5. VNC Proxy for VM Console connections
  6. IP address management

Each Acropolis Slave is responsible for the following tasks:

  1. Collector/Publisher of local stats from Hypervisor
  2. VNC Proxy for VM Console connections

Regardless of being a Master or Slave, each CVM performs the two heaviest tasks: the collection and publishing of hypervisor stats and, when in use, the VM console connections.

The distributed nature of the XCP platform allows it to achieve consistently high performance. Sending stats to a central location such as a central management VM and its associated database server can not only become a bottleneck but, without introducing some form of application-level HA (e.g. a SQL Always On Availability Group), it could also be a single point of failure, which is unacceptable for most customers.

The roles which are performed by the Acropolis Master are all lightweight tasks such as the HA scheduler, Network Controller, IP address management and Task Executor.

The HA scheduler task is only active in the event of a node failure, which makes it a very low overhead for the Master. The Network Controller task is only active when tasks such as new VLANs are being configured, and Task Execution simply keeps track of all tasks and distributes them for execution across all CVMs. IP address management is essentially a DHCP service, which is also an extremely low overhead.
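
As a rough mental model of the Task Execution role (purely conceptual, not Nutanix internals), the Master only keeps track of tasks and hands them out, while the work itself is shared across all CVMs. The round-robin policy below is a hypothetical simplification used only to illustrate the idea.

```python
from itertools import cycle

# Conceptual sketch only: the "Master" tracks tasks and distributes them, while
# every CVM shares the actual execution. Round-robin is a stand-in policy.

def distribute_tasks(tasks, cvms):
    """Assign tasks across all CVMs so no single node does the heavy lifting."""
    assignment = {cvm: [] for cvm in cvms}
    for task, cvm in zip(tasks, cycle(cvms)):
        assignment[cvm].append(task)
    return assignment

tasks = [f"clone-vm-{i}" for i in range(10)]
cvms = ["cvm-a", "cvm-b", "cvm-c", "cvm-d"]
for cvm, work in distribute_tasks(tasks, cvms).items():
    print(cvm, work)
```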

In part 8, we will discuss more about Acropolis Analytics.

Data Locality

Data locality is a unique feature of XCP where new write I/O is written to the local node where the VM is running as well as replicated to other node/s within the cluster. Data locality eliminates the requirement to service subsequent read I/O by traversing the network and utilizing a remote controller.

As VMs migrate around a cluster, Write I/O is always written locally and remote reads will only occur if remote data is accessed. If data is remote and never accessed, no remote I/O will occur. As a result, it is typical for >90% of I/O to be serviced locally.
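
As a rough illustration of the write and read path just described (assuming a Replication Factor of 2), the conceptual sketch below places one copy on the node hosting the VM and one copy remotely, and services reads locally whenever a local copy exists. It is not DSF code, just a model of the behaviour; the node names and data are placeholders.

```python
import random

# Conceptual RF2 write/read path illustrating data locality (not DSF code).

def write_extent(vm_host, cluster_nodes, data, replication_factor=2):
    """Write one copy locally (on the VM's node) and the remainder remotely."""
    placements = [vm_host]  # the first copy always lands on the local node
    remote_candidates = [n for n in cluster_nodes if n != vm_host]
    placements += random.sample(remote_candidates, replication_factor - 1)
    return {node: data for node in placements}

def read_extent(vm_host, placements):
    """Reads are serviced locally whenever a local copy exists."""
    if vm_host in placements:
        return placements[vm_host], "local"
    # Only when no local copy exists does the read traverse the network.
    remote_node = next(iter(placements))
    return placements[remote_node], f"remote from {remote_node}"

nodes = ["node-1", "node-2", "node-3", "node-4"]
copies = write_extent("node-2", nodes, b"block-of-vm-data")
print(read_extent("node-2", copies))  # served locally
```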

Currently, bandwidth and latency across a well-designed 10Gb network may not be an issue for some customers; however, as flash performance increases exponentially, the network could quite easily become a major bottleneck without moving to expensive 40Gb (or higher) networking. Data locality helps minimize the dependency on the network by servicing the majority of read I/O locally, and because one copy of each write is placed locally, only the remaining replica/s traverse the network, reducing the write I/O overhead. As a result, data locality allows customers to run lower-cost networking without compromising performance.

While data locality works across all supported hypervisors, AHV is unique as it supports data-aware virtual machine placement: virtual machines are powered onto the node with the highest percentage of local data for that VM, which minimizes the chance of remote I/O and reduces the overhead involved in servicing I/O for each VM following failures or maintenance.
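
The placement rule described above boils down to a simple selection: power the VM on wherever the largest share of its data already lives. The sketch below is a conceptual illustration of that rule, not the actual AHV scheduler, and the per-node byte counts are invented example data.

```python
# Conceptual sketch of data-aware VM placement (not the actual AHV scheduler).

def place_vm(local_data_bytes_by_node):
    """Pick the node holding the highest share of the VM's data."""
    total = sum(local_data_bytes_by_node.values())
    if total == 0:
        # No data written yet for this VM; any node is equally suitable.
        return next(iter(local_data_bytes_by_node))
    return max(local_data_bytes_by_node, key=local_data_bytes_by_node.get)

# Example: after a failure or maintenance the VM's data is spread unevenly.
data_per_node = {"node-1": 120_000, "node-2": 45_000, "node-3": 15_000}
print(place_vm(data_per_node))  # -> "node-1", minimising remote reads
```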

In addition, data locality also applies to the collection of back-end data for Analysis, such as hypervisor and virtual machine statistics. As a result, statistics are written locally, with a second copy (or a third, for environments configured with RF3) written remotely. This means stats data, which can amount to a significant volume, has the lowest possible impact on the Distributed File System and the cluster as a whole.

Summary:

  1. Management components scale with the cluster to ensure consistent performance
  2. Data locality ensures data is as close to the Compute (VM) as possible
  3. Intelligent VM placement based on Data location
  4. All Nutanix Controller VMs work as a team (not in pairs) to ensure optimal performance of all components and workloads (VMs) in the cluster

Back to the Index