Problem: ROBO/Dark Site Management, Solution: XCP + AHV

Problem:

Remote Office / Branch Office, commonly referred to as “ROBO”, and dark sites (i.e. sites without local support staff and/or network connectivity to a central datacenter) are notoriously difficult to design, deploy and manage.

Why have infrastructure at ROBO?

Customers have infrastructure at ROBO and/or dark sites because these sites require services which cannot be provided centrally due to any number of constraints, such as WAN bandwidth/latency/availability or, more frequently, security.

Challenges:

Infrastructure at ROBO and/or dark sites needs to be functional, highly available and performant without complexity. The problem is that the functional requirements of ROBO/dark sites are typically not dissimilar to those of the datacenter(s), so the complexity of these sites can equal, if not exceed, that of the primary datacenter, especially given the reduced budgets for ROBO.

This means in many cases the same management stack needs to be designed on a smaller scale, deployed and somehow managed at these remote/secure sites with minimal to no I.T presence onsite.

Alternatively, management may be run centrally, but this has its own challenges, especially when WAN links are high latency, low bandwidth, or unreliable/offline.

Typical ROBO deployment requirements:

Typical requirements are in many cases not dissimilar to those of the SMB or enterprise and include things like High Availability (HA) for VMs, meaning a minimum of 2 nodes and some shared storage. Customers also want to ensure ROBO sites can be centrally managed without deploying complex tooling at each site.

ROBO and Dark Sites are also typically deployed because in the event of WAN connectivity loss, it is critical for the site to continue to function. As a result, it is also critical for the infrastructure to gracefully handle failures.

So let’s summarise typical ROBO requirements:

  • VM High Availability
  • Shared Storage
  • Be fully functional when WAN/MAN is down
  • Low/no touch from I.T
  • Backup/Recovery
  • Disaster Recovery

Solution:

Nutanix Xtreme Computing Platform (XCP) including PRISM and Acropolis Hypervisor (AHV).

Now let’s dive into why XCP + PRISM + AHV is a great solution for ROBO.

A) Native Cross Hypervisor & Cloud Backup/Recovery & DR

Backup/Recovery and DR are not easy things to achieve or manage for ROBO deployments. Luckily these capabilities are built into Nutanix XCP. This includes the ability to take point-in-time, application-consistent snapshots and replicate them to local/remote XCP clusters and cloud providers (AWS/Azure). These snapshots can be considered backups once replicated to a 2nd location (ideally offsite), and can also be kept locally on primary storage for fast recovery.
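The snapshot, replication and retention flow described above can be sketched as a simple loop. This is a toy Python simulation only; the `Site`/`protect` names are hypothetical illustrations, not the Nutanix API:

```python
from collections import deque
from datetime import datetime, timedelta

class Site:
    """Hypothetical stand-in for an XCP cluster or cloud target (AWS/Azure)."""
    def __init__(self, name, retention):
        self.name = name
        self.retention = retention   # number of snapshots to keep
        self.snapshots = deque()

    def add(self, snap):
        self.snapshots.append(snap)
        while len(self.snapshots) > self.retention:
            self.snapshots.popleft()  # expire the oldest snapshot

def protect(vm, when, local, remotes):
    """Take a point-in-time snapshot locally, then replicate it."""
    snap = (vm, when)
    local.add(snap)           # kept on primary storage for fast recovery
    for site in remotes:
        site.add(snap)        # once offsite, this copy acts as a backup

local = Site("ROBO", retention=24)      # e.g. hourly snapshots, keep a day
dr = Site("Central-DC", retention=7)    # e.g. daily replicas, keep a week
start = datetime(2016, 1, 1)
for hour in range(48):
    remotes = [dr] if hour % 24 == 0 else []   # replicate once per day
    protect("FileServer01", start + timedelta(hours=hour), local, remotes)

print(len(local.snapshots), len(dr.snapshots))  # 24 2
```

The same snapshot object serves both purposes: a local copy for fast recovery and a remote copy that acts as the backup.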

ROBO VMs replicated to remote/central XCP deployments can be restored onto either ESXi or Hyper-V via the App Mobility Fabric (AMF) so running AHV at the ROBO has no impact on the ability to recover centrally if required.

This is just another way Nutanix is ensuring customer choice and proves the hypervisor is well and truly a commodity.

In addition, XCP supports integration with the market leader in data protection, Commvault.

B) Built-in Highly Available, Distributed Management and Monitoring

When running AHV, all XCP, PRISM and AHV management, monitoring and even the HTML 5 GUI are built in. The management stack requires no design, sizing, installation, scaling or 3rd party backend database products such as SQL/Oracle.

For those of you familiar with the VMware stack, XCP + AHV provides the capabilities of vCenter, vCenter Heartbeat, vRealize Operations Manager, Web Client, vSphere Data Protection and vSphere Replication. And it does this in a highly available and distributed manner.

This means, in the event of a node failure, the management layer does not go down. If the Acropolis Master node goes down, the Master roles are simply taken over by an Acropolis Slave within the cluster.

As a result, the ROBO deployment management layer is self healing, which dramatically reduces complexity and all but removes the requirement for onsite attendance by I.T.

C) Scalability and Flexibility

XCP with AHV ensures that even when ROBO deployments need to scale to meet compute or storage requirements, the platform does not need to be re-architected, re-engineered or optimised.

Adding a node is as simple as plugging it in and turning it on; the cluster can then be expanded non-disruptively via PRISM (locally or remotely) in just a few clicks.

When the equipment becomes end of life, XCP also allows nodes to be non-disruptively removed from clusters and new nodes added, which means after the initial deployment, ongoing hardware replacements can be done without major redesign/reconfiguration of the environment.

In fact, deployment of new nodes can be done by people onsite with minimal I.T knowledge and experience.

D) Built-in One-Click Maintenance and Upgrades for the entire stack.

XCP supports one-click, non-disruptive upgrades of:

  • Acropolis Base Software (NDSF layer)
  • Hypervisor (agnostic)
  • Firmware
  • BIOS

This means there is no need for onsite I.T staff to perform these upgrades, and XCP eliminates potential human error by fully automating the process. All upgrades are performed one node at a time and only started if the cluster is in a resilient state, to ensure maximum uptime. Once a node is upgraded, it is validated as successful (similar to a power-on self-test, or POST) before the next node proceeds. In the event an upgrade fails, the cluster will remain online as I have described in this post.

These upgrades can also be done centrally via PRISM Central.
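The one-node-at-a-time flow can be sketched as follows. This is purely an illustrative Python loop; the node dictionaries and checks are hypothetical stand-ins, not the actual XCP orchestration:

```python
def cluster_resilient(nodes):
    """Simplified stand-in for XCP's resiliency check before each step."""
    return all(n["healthy"] for n in nodes)

def rolling_upgrade(nodes, target):
    """Upgrade one node at a time, validating each before proceeding."""
    for node in nodes:
        if not cluster_resilient(nodes):
            raise RuntimeError("cluster not resilient, halting upgrade")
        node["version"] = target              # upgrade this node only
        if not node["healthy"]:               # validation step, akin to a POST
            raise RuntimeError(f"{node['name']} failed validation, halting")

nodes = [{"name": f"node{i}", "healthy": True, "version": "4.5"}
         for i in range(4)]
rolling_upgrade(nodes, "4.6")
print(all(n["version"] == "4.6" for n in nodes))  # True
```

The key property is that a failure stops the rollout at one node, leaving the rest of the cluster untouched and online.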

E) Full Self Healing Capabilities

As I have already touched on, XCP + AHV is a fully self healing platform. From the storage (NDSF) layer to the virtualization layer (AHV) through to management (PRISM), the platform can fully self heal without any intervention from I.T admins.

With Nutanix XCP you do not need expensive hardware support contracts, nor do you need to worry about potential subsequent failures, because the system self heals and does not depend on hardware replacement, as I have described in Hardware support contracts & why 24×7 4 hour onsite should no longer be required.

Anyone who has ever managed a multi-site environment knows how much effort hardware replacement is, as well as the fact that replacements must be done in a timely manner which can delay other critical work. This is why Nutanix XCP is designed to be distributed and self healing as we want to reduce the workload for sysadmins.

F) Ease of Deployment

All of the above features and functionality can be quickly and easily deployed, going from out of the box to fully operational and ready to run VMs in just minutes.

The Management/Monitoring solutions do not require detailed design (sizing/configuration) as they are all built in and they scale as nodes are added.

G) Reduced Total Cost of Ownership (TCO)

When it comes down to it, ROBO deployments can be critical to the success of a company, and trying to do things “cheaper” rarely ends up actually being cheaper. Nutanix XCP may not be the cheapest (CAPEX), but it delivers the lowest TCO, which is, after all, what matters.

If you’re a sysadmin and, after reading the above, you don’t think you can get any more efficient than what you’re doing today, it’s because you already run XCP + AHV 🙂

In all seriousness, sysadmins should be innovating and providing value back to the business. If they are instead spending any significant time “keeping the lights on” for ROBO deployments, then their valuable time is not being well utilised.

Summary:

Nutanix XCP + AHV provides all the capabilities required for typical ROBO deployments while reducing the initial implementation and ongoing operational cost/complexity.

With Acropolis Operating System 4.6 and the cross hypervisor backup/recovery/DR capabilities thanks to the App Mobility Fabric (AMF), there is no need to be concerned about the underlying hypervisor as it has become a commodity.

AHV performance and availability are on par with, if not better than, other hypervisors on the market, as is clear from the points discussed above.

Related Articles:

  1. Why Nutanix Acropolis hypervisor (AHV) is the next generation hypervisor
  2. Hardware support contracts & why 24×7 4 hour onsite should no longer be required.

Nutanix Data Protection Capabilities

There is a lot of misinformation being spread in the HCI space about Nutanix data protection capabilities. One such example (below) was published recently on InfoStore.

Evaluating Data Protection for Hyperconverged Infrastructure

When I see articles like this, it really makes me wonder about the accuracy of content on these types of websites, as it seems articles are published without so much as a brief fact check from InfoStore.

Nonetheless, I am writing this post to confirm the data protection capabilities Nutanix provides.

  • Native In-Built Data protection

Prior to my joining Nutanix in mid-2013, Nutanix already provided a hypervisor-agnostic, integrated backup and disaster recovery solution with centralised, consumer-grade management through our PRISM GUI, which is HTML 5 based.

The built-in capabilities provide flexible, VM-centric policies to protect virtualized applications with different RPOs and RTOs, with or without application consistency.

The solution also supports local, remote and cloud-based backups, as well as synchronous and asynchronous replication-based disaster recovery.

Currently supported cloud targets include AWS and Azure.

The video below shows in real time how to create application-consistent snapshots from the Nutanix PRISM GUI.

Nutanix can also perform One to One, One to Many and Many to One replication of application consistent snapshots to onsite or offsite Nutanix clusters as well as Cloud providers (AWS/Azure), ensuring choice and flexibility for customers.

Nutanix native data protection can also replicate between and recover VMs to clusters of different hypervisors.

  • CommVault Intellisnap Integration

Nutanix also provides integration with Commvault Intellisnap which allows existing Commvault customers to continue leveraging their investment in the market leading data protection product and to take advantage of other features where required.

The below shows how agentless backups of virtual machines are supported with Acropolis Hypervisor (AHV). Note: Commvault is also fully supported with Hyper-V and ESXi.

Because Commvault calls the Nutanix Distributed Storage Fabric (NDSF) directly, snapshots are taken quickly and efficiently, without dependency on the hypervisor.

  • Hypervisor specific support such as VMware vStorage APIs for Data Protection (VADP)

Nutanix also supports solutions which leverage VADP, allowing customers with existing investment in products such as Veeam & Netbackup to continue with their existing strategy until such time as they want to migrate to Nutanix native data protection or solutions such as Commvault.

  • In-Guest Agents

Nutanix supports the use of in-guest agents, which are typically very inefficient with centralised SAN/NAS storage. However, thanks to data locality and NDSF being a truly distributed platform, in-guest incremental-forever backups perform extremely well on Nutanix, as the traditional choke points such as the network, storage controllers and RAID packs have been eliminated.

Summary:

As one size does not fit all in the world of I.T, Nutanix provides customers with choice to meet a wide range of market segments and requirements, with strong native data protection capabilities as well as 3rd party integration.

The truth about Storage Data efficiency ratios.

We’ve all heard the marketing claims from some storage vendors about how efficient their storage products are. Data efficiency ratios of 40:1, 60:1, even 100:1 continue to be thrown around as if they are amazing, somehow unique, or achieved as a result of proprietary hardware.

Let’s talk about how vendors may try to justify these crazy ratios:

  • Counting snapshots and metadata copies as data reduction

For many years, storage vendors have been able to take space-efficient copies of LUNs, datastores, virtual machines etc. which rely on snapshots or metadata. These are not full copies, and reporting them as data efficiency is quite misleading in my opinion, as this capability has been table stakes for many years.

Be wary of vendors encouraging (or requiring) you to configure more frequent “backups” (which are, after all, just snapshots or metadata copies) to achieve the advertised data efficiencies.

  • Reporting VAAI/VCAI clones as full copies

If I have a VMware Horizon View environment, it makes sense to use VAAI/VCAI space-efficient clones as they provide numerous benefits, including faster provisioning and recompose operations, and they use less space, which means they can be served from cache (improving performance).

So if I have an environment with just 100 desktops deployed via VCAI, you have a 100:1 data reduction ratio; with 1000 desktops you have 1000:1. But this is again table stakes… well, sort of, because some vendors don’t support VAAI/VCAI and others only have partial support, as I discuss in Not all VAAI-NAS storage solutions are created equal.

Funnily enough, one vendor even offloads what VAAI/VCAI can do (with almost no overhead, I might add) to proprietary hardware. Either way, while VAAI/VCAI clones are fantastic and can add lots of value, claiming high data efficiency ratios as a result is again misleading, especially if done in the context of being a unique capability.

  • Compression of Highly compressible data

Some data, such as logs or text files, is highly compressible, so ratios of >10:1 for this type of data are not uncommon or unrealistic. However, consider that if logs only use a few GB of storage, then 10:1 isn’t really saving you that much space (or money).

For example, a 100:1 data reduction ratio on 10GB of logs stores them in roughly 100MB, saving you ~9.9GB, which is good, but not exactly something to make a purchasing decision on.

Databases with lots of white space also compress very well, so the larger the initial size of the DB, the more it will compress.

The compression technology used by storage vendors is not vastly different, which means for the same data, they will all achieve a similar reduction ratio. As much as I’d love to tell you Nutanix has much better ratios than Vendors X,Y and Z, its just not true, so I’m not going to lie to you and say otherwise.
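A quick experiment with Python’s zlib (standing in for any vendor’s compression engine) makes the point that the ratio is driven by the data, not the vendor: repetitive log lines compress dramatically, while data with no redundancy barely shrinks at all.

```python
import os
import zlib

# Repetitive log lines compress extremely well...
log = b"2016-02-14 10:00:01 INFO health-check OK node=NTNX-1\n" * 2000
log_ratio = len(log) / len(zlib.compress(log))

# ...while incompressible (random) data of the same size barely shrinks.
rand = os.urandom(len(log))
rand_ratio = len(rand) / len(zlib.compress(rand))

print(f"logs {log_ratio:.0f}:1, random {rand_ratio:.2f}:1")
```

Run this against any general-purpose compressor and you will see the same shape of result, which is exactly why similar data yields similar ratios across vendors.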

  • Deduplication of Data which is deliberately duplicated

An example of this would be MS Exchange Database Availability Groups (DAGs). Exchange creates multiple copies of data across multiple physical or virtual servers to provide application and storage level availability.

Deduplication of this is not difficult, and can be achieved (if indeed you want to dedupe it) by any number of vendors.

In a distributed environment such as HCI, you wouldn’t want to deduplicate this data as it would force VMs across the cluster to remotely access more data over the network which is not what HCI is all about.

In a centralised SAN/NAS solution, deduplication makes more sense than for HCI, but still, when an application is creating the duplicate data deliberately, it may be a good idea to exclude it from being deduplicated.

As with compression, for the same data, most vendors will achieve a similar ratio so again this is table stakes no matter how each vendor tries to differentiate. Some vendors dedupe at more granular levels than others, but this provides diminishing returns and increased overheads, so more granular isn’t always going to deliver a better business outcome.
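The granularity trade-off above can be illustrated with a toy deduplication pass (the block sizes are hypothetical; SHA-1 hashes identify duplicate chunks): finer blocks find more duplicates, but at the cost of proportionally more metadata to track.

```python
import hashlib
import random

def dedupe(data, block_size):
    """Chunk at fixed granularity; count unique chunks by content hash."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    unique = {hashlib.sha1(c).digest() for c in chunks}
    return len(chunks) / len(unique), len(chunks)  # (ratio, metadata entries)

# A workload built from four distinct 4K blocks in random order.
random.seed(42)
pool = [bytes([b]) * 4096 for b in range(4)]
data = b"".join(random.choice(pool) for _ in range(256))

fine_ratio, fine_meta = dedupe(data, 4096)       # 4K granularity
coarse_ratio, coarse_meta = dedupe(data, 32768)  # 32K granularity

print(fine_ratio > coarse_ratio, fine_meta > coarse_meta)  # True True
```

The finer pass wins on ratio because duplicates rarely align at larger chunk boundaries, but it must track eight times as many entries, which is the overhead referred to above.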

  • Claiming Thin Provisioning as data efficiency

If you have a Thin Provisioned 1TB virtual disk and you only write 50GB to the disk, you would have a data efficiency ratio of 20:1. So the larger you create your virtual disk and the less data you write to it, the better the ratio will be. Pretty silly in my opinion as Thin Provisioning is nothing new and this is just another deceptive way to artificially improve data efficiency ratios.
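The arithmetic is trivially easy to game, as this two-line sketch shows (the function name is just for illustration): provision a larger disk and the “ratio” inflates with no efficiency technology involved at all.

```python
def thin_ratio(provisioned_gb, written_gb):
    """'Efficiency' as sometimes reported for thin-provisioned disks."""
    return provisioned_gb / written_gb

print(f"{thin_ratio(1024, 50):.1f}:1")   # 1TB disk, 50GB written  -> 20.5:1
print(f"{thin_ratio(10240, 50):.1f}:1")  # provision 10TB instead -> 204.8:1
```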

  • Claiming removal of zeros as data reduction

For example, if you create an Eager Zero Thick VMDK, then use only a fraction of it, as with the Thin Provisioning example (above), removal of zeros will obviously give a really high data reduction ratio.

However, intelligent storage doesn’t need Eager Zero Thick (EZT) VMDKs to give optimal performance, nor will it write zeros to begin with. Intelligent storage will simply store metadata instead of a ton of worthless zeros. So the data reduction ratio from a more intelligent storage solution would be much lower than from a vendor with less intelligence who has to remove zeros. This is yet another reason why data efficiency (marketing) numbers have minimal value.

Two of the limited use cases for EZT VMDKs are Fault Tolerance (who uses that anyway?) and Oracle RAC, so removal of zeros with intelligent storage is essentially moot.
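A simplified sketch of why zero removal inflates the numbers: a storage layer that detects zero blocks stores only metadata for them, so an eager-zeroed disk that is mostly empty “reduces” almost entirely. The block handling here is illustrative, not any vendor’s actual implementation.

```python
def store_with_zero_detection(disk_image, block_size=4096):
    """Store only non-zero blocks; zero blocks become metadata (offsets)."""
    zero_block = bytes(block_size)
    stored, zero_offsets = [], []
    for off in range(0, len(disk_image), block_size):
        block = disk_image[off:off + block_size]
        if block == zero_block:
            zero_offsets.append(off)   # record metadata only, write nothing
        else:
            stored.append(block)
    return stored, zero_offsets

# An 'eager zeroed' 1MB disk with only the first 8KB actually in use.
image = b"\xab" * 8192 + bytes(1024 * 1024 - 8192)
stored, zeros = store_with_zero_detection(image)
print(sum(len(b) for b in stored), len(zeros))  # 8192 254
```

A vendor counting the 1MB logical size against the 8KB stored would report a ~128:1 ratio, while a platform that never wrote the zeros in the first place has nothing to “reduce”.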

Summary:

Data reduction technologies have value, but they have been around for a number of years so if you compare two modern storage products, you are unlikely to see any significant difference between vendor A and B (or C,D,E,F and G).

The major advantage of data reduction is apparent when comparing new products with 5+ year old technology. If you are in this situation where you have very old tech, most newer products will give you a vast improvement, it’s not unique to just one vendor.

At the end of the day, there are numerous factors which influence what data efficiency ratio can be achieved by a storage product. When comparing between vendors, if done in a fair manner, the differences are unlikely to be significant enough to sway a purchasing decision as most modern storage platforms have more than adequate data reduction capabilities.

Beware: dishonest and misleading marketing about data reduction is common, so don’t get caught up in long-winded conversations about data efficiency or be tricked into thinking one vendor is amazing and unique in this area, because it just isn’t the case.

Data reduction is table stakes and really shouldn’t be the focus of a storage or HCI purchasing decision.

My recommendation is to focus on areas which deliver operational simplicity, remove complexity/dependencies within the datacenter and achieve real business outcomes.

Related Posts:

1. Sizing infrastructure based on vendor Data Reduction assumptions – Part 1

2. Sizing infrastructure based on vendor Data Reduction assumptions – Part 2

3. Deduplication ratios – What should be included in the reported ratio?