Solving Oracle & SQL Licensing challenges with Nutanix

The Nutanix platform has and will continue to evolve to meet/exceed the ever increasing customer and application requirements while working within constraints such as licensing.

Two of the most common workloads which I work frequently with customers to design solutions around real or perceived licensing constraints are Oracle and SQL.

In years gone by, Nutanix solutions were constrained to being built around a limited number of node types. When I joined in 2013 only one type existed (NX-3450) which limited customers flexibility and often led to paying more for licensing than a traditional 3-tier solution.

With that said, the ROI and TCO for the Nutanix solutions back then were still more often than not favourable compared to 3-tier but these days we only have more and more good news for prospective and existing customers.

Nutanix has now rounded out the portfolio with the introduction of “Compute Only” nodes to target a select few niche workloads with real or perceived licensing and/or political constraints.

Compute only nodes compliment the traditional HCI nodes (Compute+Storage) as well as our unique Storage Only Nodes which were introduced in mid 2015.

So how do Compute Only nodes help solve these licensing challenges?

In short, Oracle leads the world in misleading and intimidating customers into paying more for licensing than what they need to. One of the most ridiculous claims is “You must license every physical CPU core in your cluster because Oracle could run or have ran on it”.

The below tweet makes fun of Oracle and shows how ridiculous their claim that customers need to license every node in a cluster (which I’ve never seen referenced in any actual contract) is.

So let’s get to how you can design a Nutanix solution to meet a typical Oracle customer licensing constraint while ensuring excellent Scalability, Resiliency and Performance.

At this stage we now assume you’ve given your first born child and left leg to Oracle and have subsequently been granted for example 24 physical core licenses from Oracle, what next?

If we we’re to use HCI nodes, some of the CPU would be utilised by the Nutanix Controller VM (CVM) and while the CVM does add a lot of value (see my post Cost vs Reward for the Nutanix Controller VM) you may be so constrained by licensing that you want to maximise the CPU power for just Oracle workloads.

Now in this example, we have 24 licensed physical cores, so we could use two Compute Only nodes using an Intel Gold 6128 [6 cores / 3.4 GHz] / 12 cores per server for 24 total physical cores.

Next we would assess the storage capacity, resiliency and performance requirements and decide how many and what configuration storage only nodes are required.

Because Virtual Machines cannot run on storage only nodes, the Oracle Virtual Machines cannot and will never run on any other CPU cores other than the two Compute Only nodes therefore you would be in compliance with your licensing.

The below is an example of what the environment could look like.

2CO_4SOnodes

SQL has ever changing CPU licensing models which in some cases are licensed by server or vCPU count, Compute Only can be used in the same way I explained above to address any SQL licensing constraints.

What about if I need to scale storage capacity and/or performance?

You’re in luck, without any modifications to the Oracle workloads, you can simply add one or more storage only nodes to the cluster and it will almost immediately increase capacity, performance and resiliency!

I’ve published an example of the performance improvement by adding storage only nodes to a cluster in an article titled Scale out performance testing with Nutanix Storage Only Nodes which I wrote back in 2016.

In short, the results show by doubling the number of nodes from 4 to 8, the performance almost exactly doubled while delivering low read and write latency.

What if you’ve already invested in Nutanix HCI nodes (example below) and are running Oracle/SQL or any other workloads on the cluster?

TypicalHCIcluster

Nutanix provides the ability to convert a HCI node into a Storage Only node which results in preventing Virtual Machines from running on that node. So all you need to do is add two or more Compute Only nodes to the cluster, then mark the existing HCI nodes as Storage Only and the result is shown below.

CO_PlusConvertedHCI

This is in fact the minimum supported configuration for Compute Only Environments to ensure minimum levels of resiliency and performance. For more information, check out my post “Nutanix Compute Only Minimum requirements“.

Now we have two nodes (Compute Only) which can run Virtual Machines and four nodes (HCI nodes converted to Storage Only) which are servicing the storage I/O. In this scenario, if the HCI nodes have unused CPU and/or RAM the Nutanix Controller VM (CVM) can also be scaled up to drive higher performance & lower latency.

Compute Only is currently available with the Nutanix Next Generation Hypervisor “AHV”.

Now let’s cover off a few of the benefits of running applications like Oracle & SQL on Nutanix:

  1. No additional Virtualization licensing (AHV is included when purchasing Nutanix AOS)
  2. No rip and replace for existing HCI investment
  3. Unique scale out distributed storage fabric (ADSF) which can be easily scaled as required
  4. Storage Only nodes add capacity, performance and resiliency to your mission critical workloads without incurring additional hypervisor or application licensing costs
  5. Compute Only allows scale up and out of CPU/RAM resources where applications are constrained by ONLY CPU/RAM and/or application software licensing.
  6. Storage Only nodes can also provide functions such as Nutanix Files (previously known as Acropolis File Services or AFS)

As a result of Nutanix now having HCI, Storage Only and Compute Only nodes, we’re now entering the time where Nutanix can truely be the standard platform for almost any workload including those with non technical constraints such as political or application licensing which have traditionally been at least perceived to be an advantage for legacy SAN products.

The beauty of the Nutanix examples above is while they look like a traditional 3-tier, we avoid the legacy SAN problems including:

1. Rip and Replace / High Impact / High Risk Controller upgrades/scalability
2. Difficulty in scaling performance with capacity
3. Inability to increase resiliency without adding additional Silos of storage (i.e.: Another dual controller SAN)

With Compute Only being supported by AHV, we also help customers avoid the unnecessary complexity and related operational costs of managing ESXi deployments which have become increasingly more complex over time without significantly improving value to the average customer who simply wants high performance, resilient and easy to manage virtualisation solution.

But what about VMware ESXi customers?

Obviously moving to AHV would be ideal but for those who cannot for whatever reasons can still benefit from Storage Only nodes which provide increased storage performance and resiliency to the Virtual machines running on ESXi.

Customers can run ESXi on Nutanix (or OEM / Software Only) HCI nodes and then scale the clusters performance/capacity with AHV based storage only nodes, therefore eliminating the need to license both ESXi and Oracle/SQL since no virtual machine will run on these nodes.

How does Nutanix compare to a leading all flash array?

For those of you who would like to see a HCI only Nutanix solution have better TCO as well as performance and capacity than a leading All Flash Array, checkout A TCO Analysis of Pure FlashStack & Nutanix Enterprise Cloud where even with giving every possible advantage to Pure Storage, Nutanix still comes out on top without data reduction assumptions.

Now consider that Nutanix the TCO as well as performance and capacity was better than a leading All Flash Array with only HCI nodes, imagine the increased efficiency and flexibility by being able to mix/match HCI, with Storage Only and Compute only.

This is just another example of how Nutanix is eliminating even the corner use cases for traditional SAN/NAS.

For more information about Nutanix Scalability, Resiliency and Performance, checkout this multi-part blog series.

Nutanix Scalability – Part 2 – Compute (CPU/RAM)

Following on from Part 1 of the Scalability series where we discussed how Nutanix can scale storage capacity seperate to compute, the next obvious topic is to talk about scaling CPU and Memory resources at both the workload and cluster level.

Let’s first recap the problems with scaling compute with traditional shared storage.

HCInotHCI

Yuk! That looks like old school 3-tier stuff to me!

Non HCI workloads on compute only nodes would therefore:

  • Be running in the same setup as traditional 3-tier infrastructure
  • Have different performance than HCI based workloads
  • Loose the advantage of having compute + storage close together
  • Increase dependency on Network
  • Impact network utilization of HCI node/s
  • Impact benefits of HCI for the native HCI workloads and much more.

The industry has accepted HCI as they way of the future and while adding compute only nodes might sound nice at a high level, its just re-introducing the classic 3-tier complexity and problems of the past when if we review the actual requirements it’s very rare to see a Nutanix node have insufficient resources when sized/configured correctly.

Customers are often surprised when they show me their workloads and I don’t seem surprised by the CPU/RAM or storage IO or capacity requirements. I can’t tell you how many times I’ve made statements like “You’re applications requirements are not that high, I’ve seen much worse!”.

Examples of scaling compute with Nutanix

Example 1: Scaling up a Virtual Machines compute resources:

SQL/Oracle DBA: Our application is growing/running slowly, We need more CPU/RAM!

Nutanix: You have several options:

a) Scale up the virtual machine’s vCPUs and vRAM to match the size of the NUMA node.
b) Scale up the virtual machine’s vCPUs and vRAM to be the same number of pCore’s as the host minus the Nutanix CVM vCPUs and do the same with the RAM.

The first option is the optimal as it will ensure maximum memory performance as the CPU will be assessing memory within the NUMA boundary, however the second option is still viable and for applications such as SQL, the impact of insufficient memory can be higher than the penalty of crossing a NUMA boundary.

BUT MY WORKLOAD IS UNIQUE, IT NEEDS A PHYSICAL SERVER!!

Despite hearing these type of statements by prospective and existing customers, Very few workloads actually need more CPU/RAM that a modern Nutanix (or OEM/Software only) node can provide even if you remove resources for the Controller VM (CVM). I find that it’s usually only a perceived requirement for physical servers and in reality, a reasonably sized VM on a standard node will deliver the desired business outcome/s comfortably.

Currently Nutanix NX nodes support Intel Platinum 8180 processors which have 28 physical cores @ 2.5 GHz per socket for a total of 56 physical cores (112 threads).

If you had say an existing physical server using a fairly modern Intel Broadwell E5-2699 v4 with dual 22 physical core processors, you have a total SPECint_rate of 1760 or 40 per core.

Compare that to the Intel Platinum 8180 processor and you have a total SPECint_rate of 2720 or 48.5 per core.

This is an increase per core of 21.25%.

So if you’re moving from that physical server using Intel Broadwell E5-2699 v4 CPUs (44 cores) and you move that workload to Nutanix with ZERO CPU overcommitment (vCPU:pCore ratio 1:1) using the Intel Platinum 8180 processor, assuming we reserve 8 pCores for the CVM we still have 48 pCores for the SQL VM.

That’s a SpecIntRate of 2328 which is higher than the physical server using all cores.

That’s over 32% more CPU performance for the Virtual Machine compared to the dedicated physical server.

The reality is the Nutanix CVM and Acropolis Distributed Storage Fabric (ADSF) provides high performance, low latency storage which also drives further CPU efficiency by eliminating CPU WAIT (CPU cycles wasted waiting for I/O to complete).

As you can see from this simple example, a Virtual Machine on Nutanix can easily replace even a modern physical server and even provide better performance with only one generation newer CPU. Think about how your 3-5 year old physical servers will feel when they jump multiple generations of CPU and get scale out flash based storage.

Example 2: A VM (genuinely) needs more CPU/RAM than Nutanix nodes have.

SQL/Oracle DBA: Our application/s is needs more CPU/RAM than our biggest node/s can provide.

Nutanix: You have several options:

a) Purchase one or more larger node (e.g.: NX-8035-G6 w/ Intel Platinum or Gold Processors, add them to the existing cluster and live migrate your VM/s to that/those nodes. Use affinity rules to keep critical VMs on the highest performance nodes.

Nutanix supports mixing different hardware types/generations in the same cluster and this can be a preferred option over creating a dedicated cluster for several reasons.

  • Larger clusters provide more targets for replication traffic (i.e.: RF2 or RF3) meaning lower average write latency
  • Larger clusters provide higher resiliency as they can potentially tolerate more failures and rebuild follow a drive/node or nodes failing faster.
  • Larger clusters help ensure the impact of a failure is lower as a lower percentage of cluster resources are lost

b) Purchase one or more larger node (e.g.: NX-8035-G6 w/ Intel Platinum or Gold Processors, and create a new cluster and migrate your VM/s to that cluster.

A dedicated cluster may sound attractive, but in most cases I recommend mix workload clusters as they ultimately provide higher performance, resiliency and flexibility.

c) Scale out your workloads

Applications like MS Exchange, MS SQL and Oracle RAC can (and arguably should) be scaled out rather than scaled up as doing so provides increased performance, resiliency and reduces overall infrastructure costs (e.g.: More cheaper/smaller processors can be used as opposed to premium processors like Intel Platinum series).

One large VM hosting dozens of databases is rarely a good idea, so scale out and run more VMs, distributed across your Nutanix cluster and spread the workload across all the VMs.

For 99% of workloads, I do not see the real world value of compute only nodes. But there are always exceptions to every rule.

Potential Exceptions:

Example 3: Re-using existing hardware

SQL DBA: I love my Nutanix gear (duh!) but I have some physical servers which wont be end of life for 12 months, can I continue using them with Nutanix?

Nutanix: We have several options:

a) If the hardware is on our Software-only hardware compatibility list (HCL), it’s possible you can purchase SW-only licenses and deploy Nutanix on your existing hardware.

b) Use Nutanix Acropolis Block Services (ABS) to provide highly available scale out storage to your physical server via iSCSI.

ABS was released in 2015 and supports SCSI-3 persistent reservations for shared storage-based Windows clusters, which are commonly used with Microsoft SQL Server and clustered file servers.

ABS supports several use cases, including:

  • iSCSI for Microsoft Exchange Server.
  • Shared storage for Linux-based clusters
  • Windows Server Failover Clustering (WSFC).
  • SCSI-3 persistent reservations for shared storage-based Windows clusters
  • Shared storage for Oracle RAC environments.
  • Bare-metal environments.

Therefore ABS allows you to re-use your existing hardware to maximise your return on investment (ROI) while getting the benefits of ADSF. Once the hardware is end of life, the storage already on the Nutanix cluster can be quickly presented to a VM so the workload will benefit from the full Nutanix HCI experience.

Future Capabilities:

In late 2017, Nutanix announced Nutanix Acropolis Compute Cloud (AC2) which will provide the ability to have true compute-only nodes in a Nutanix cluster as shown below.

I reluctantly mention this upcoming capability because I do not want to see customers go back to a 3-teir model or think that HCI isn’t the way forward because it is. That’s not what compute-only is about.

This capability is specifically designed to work around the niche circumstances where a software vendor such as Oracle, are extorting customers from a licensing perspective and it’s desirable to maximise the CPU cores the application can use.

Let me have a quick rant and put an end to the nonsense before it gets out of hand:

IT IS NOT FOR GENERAL VM USE!

NO ITS NOT FOR PERFORMANCE REASONS.

NO NUTANIX IS NOT MOVING BACK TO A 3-TIER COMPUTE+STORAGE MODEL.

HCI WITH NUTANIX IS STILL THE WAY FORWARD

Summary:

Nutanix provides excellent scalability at the CPU/RAM level for both virtual and physical workloads. In rare circumstances where physical servers are a real (or likely just a perceived) requirement, ABS can be used while Nutanix will soon also provide Compute-only for AHV customers to ensure licensing value is maximised for those rare cases.

Back to the Scalability, Resiliency and Performance Index.

Sizing infrastructure based on vendor Data Reduction assumptions – Part 2

In part 1, we discussed how data reduction ratios can, and do, vary significantly between customers and datasets and that making assumptions on data reduction ratios, even when vendors provide guarantees, does not protect you from potentially serious problems if the data reduction ratios are not achieved.

In Part 2 we will go through an example of how misleading data reduction guarantees can be.

One HCI manufacturer provides a guarantee promising 10:1 which sounds too good to be true, and that’s because it, quite frankly, isn’t true. The guarantee includes a significant caveat for the 10:1 data reduction:

The savings/efficiency are based on the assumption that you configure a backup policy to take at least one <redacted> backup per day of every virtual machine on every<redacted> system in a given VMware Datacenter with those backups retained for 30 days.

I have a number of issues with this limitation including:

  1. The use of the word “backup” referring directly/indirectly to data reduction (savings)
  2. The use of the word “backup” when referring to metadata copies within the same system
  3. No actual deduplication or compression is required to achieve the 10:1 data reduction because metadata copies (or what the vendor incorrectly calls “backups”) are counted towards deduplication.

It is important to note, I am not aware of any other vendor who makes the claim that metadata copies ( Snapshots / Point in time copies / Recovery points etc.) are deduplication. They simply are not.

I have previously written about what should be counted in deduplication ratios, and I encourage you to review this post and share your thoughts as it is still a hot topic and one where customers are being oversold/mislead regularly in my experience.

Now let’s do the math on my claim that no actual deduplication or compression is required to achieve the 10:1 ratio.

Let’s use a single 1 TB VM as a simple example. Note: The size doesn’t matter for the calculation.

Take 1 “backup” (even though we all know this is not a backup!!) per day for 30 days and count each copy as if it was a full backup to disk, Data logically stored now equals 31TB  (1 TB + 30 TB).

The actual Size on disk is only a tiny amount of metadata higher than the original 1TB as the metadata copies pointers don’t create any copies of the data which is another reason it’s not a backup.

Then because these metadata copies are counted as deduplication, the vendor reports a data efficiency of 31:1 in its GUI.

Therefore, Effective Capacity Savings = 96.8% (1TB / 31TB = 0.032) which is rigged to be >90% every time.

So the only significant capacity savings which are guaranteed come from “backups” not actual reduction of the customer’s data from capacity saving technologies.

As every modern storage platform I can think of has the capability to create metadata based point in time recovery points, this is not a new or even a unique feature.

So back to our topic, if you’re sizing your infrastructure based on the assumption of the 10:1 data efficiency, you are in for rude shock.

Dig a little deeper into the “guarantee” and we find the following:

It’s the ratio of storage capacity that would have been used on a comparable traditional storage solution to the physical storage that is actually used in the <redacted> hyperconverged infrastructure.  ‘Comparable traditional solutions’ are storage systems that provide VM-level synchronous replication for storage and backup and do not include any deduplication or compression capability.

So if you, for example, had a 5 year old NetApp FAS, and had deduplication and/or compression enabled, the guarantee only applies if you turned those features off, allowed the data to be rehydrated and then compared the results with this vendor’s data reduction ratio.

So to summarize, this “guarantee” lacks integrity because of how misleading it is. It  is worthless to any customer using any form of enterprise storage platform probably in the last 5 –  10 years as the capacity savings from metadata based copies are, and have been, table stakes for many, many years from multiple vendors.

So what guarantee does that vendor provide for actual compression and deduplication of the customers data? The answer is NONE as its all metadata copies or what I like to call “Smoke and Mirrors”.

Summary:

“No one will question your integrity if your integrity is not questionable.” In this case the guarantee and people promoting it have questionable integrity especially when many customers may not be aware of the difference between metadata copies and actual copies of data, and critically when it comes to backups. Many customers don’t (and shouldn’t have too) know the intricacies of data reduction, they just want an outcome and 10:1 data efficiency (saving) sounds to any reasonable person as they need 10x less than I have now… which is clearly not the case with this vendors guarantee or product.

Apart from a few exceptions which will not be applicable for most customers, 10:1 data reduction is way outside the ballpark of what is realistically achievable without using questionable measurement tactics such as counting metadata copies / snapshots / recovery points etc.

In my opinion the delta in the data reduction ratio between all major vendors in the storage industry for the same dataset, is not a significant factor when making a decision on a platform. This is because there are countless other substantially more critical factors to consider. When the topic of data reduction comes up in meetings I go out of my way to ensure the customer understands this and has covered off the other areas like availability, resiliency, recoverability, manageability, security and so on before I, quite frankly waste their time talking about table stakes capability like data reduction.

I encourage all customers to demand nothing less of vendors than honestly and integrity and in the event a vendor promises you something, hold them accountable to deliver the outcome they promised.