Expanding Capacity on a Nutanix environment – Design Decisions

I recently read an article about design decisions around expanding capacity for an HCI platform, which went through the various considerations and made recommendations on how to proceed in different situations.

Reading the article made me think about how much simpler this process is with Nutanix, and how these areas are commonly overlooked when choosing a platform.

Let’s start with a few basics:

The Nutanix Acropolis Distributed Storage Fabric (ADSF) is made up of all the drives (SSD/SAS/SATA etc.) in all nodes in the cluster. Data is written locally on the node where the VM performing the write resides, and replicas are distributed throughout the cluster based on numerous factors, i.e. with no constructs such as disk pairing, HA pairs or preferred nodes.

In the event of a drive failure, regardless of which drive (SSD, SAS or SATA) fails, only that drive is impacted, not a disk group or RAID pack.

This is key as it limits the impact of the failure.

It is important to note that ADSF does not store large objects, nor does the file system require tuning to stripe data across multiple drives/nodes. By default, ADSF distributes data (at 1MB granularity) in the most efficient manner throughout the cluster while keeping the hottest data local, ensuring the lowest overheads and the highest read I/O performance.
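
To make that write path concrete, here is a minimal, purely illustrative Python sketch (my own toy model, not the actual ADSF/Stargate placement code; the least-used-node heuristic and all names are assumptions) of keeping one copy of each 1MB extent local while distributing replicas across the cluster:

    EXTENT_SIZE = 1024 * 1024  # 1MB granularity, as described above

    def place_extents(write_bytes, local_node, cluster_usage):
        """Toy model: keep one copy of each 1MB extent local and distribute the
        replica across the rest of the cluster (RF2). Illustrative only; the
        least-used remote node heuristic is an assumption, not the ADSF logic."""
        placements = []
        extents = (write_bytes + EXTENT_SIZE - 1) // EXTENT_SIZE
        for _ in range(extents):
            # One copy always lands on the node running the VM (data locality).
            cluster_usage[local_node] += EXTENT_SIZE
            # The replica goes to another node; here simply the least-used one.
            remote = min((n for n in cluster_usage if n != local_node),
                         key=cluster_usage.get)
            cluster_usage[remote] += EXTENT_SIZE
            placements.append((local_node, remote))
        return placements

    usage = {"A": 0, "B": 0, "C": 0, "D": 0}
    print(place_extents(4 * EXTENT_SIZE, "A", usage))
    # [('A', 'B'), ('A', 'C'), ('A', 'D'), ('A', 'B')]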

Let’s go through a few scenarios, which apply to both All Flash and Hybrid environments.

  1. Expanding capacity

    When adding a node or nodes to an existing cluster, without moving any VMs, changing any configuration or making any design decisions, ADSF will proactively send replicas from write I/O to all nodes within the cluster, therefore improving performance, while reactively performing disk balancing where a significant imbalance exists within a cluster.

    This might sound odd, but with other HCI products new nodes are not used unless you change the stripe configuration or create new objects (e.g. VMDKs), which means you can have lots of spare capacity in your cluster but still experience an out-of-space condition.

    This is a great example of why ADSF has a major advantage, especially when considering environments with large I/O and/or capacity requirements.

    The node addition process only requires the administrator to enter the IP addresses and it's basically one click; capacity is available immediately and there is no mass movement of data. There is also no need to move data off and recreate disk groups or similar, as these legacy concepts and complexities do not exist in ADSF.

    Nutanix is also the only platform to allow expanding capacity via Storage Only nodes, and it supports VMs which have larger capacity requirements than a single node can provide. Both are supported out of the box with zero configuration required.

    Interestingly, adding storage only nodes also increases performance and resiliency for the entire cluster, as well as for the management stack, including PRISM.

  2. Impact & implications to data reduction of adding new nodes

    With ADSF, there are no considerations or implications. Data reduction is truly global throughout the cluster; regardless of hypervisor, and whether you're adding Compute+Storage or Storage Only nodes, the benefits, particularly of deduplication, continue to apply across the environment.

    The net effect of adding more nodes is better performance, higher resiliency, faster rebuilds from drive/node failures and, with global deduplication, a higher chance of duplicate data being found and not stored unnecessarily on physical storage, resulting in a better deduplication ratio.

    No matter what size node/s are added, and no matter which hypervisor, data reduction features such as deduplication and compression work at a global level.

    What about Erasure Coding? Nutanix EC-X creates the most efficient stripe based on the cluster size: if you start with a small 4 node cluster your stripe would be 2+1; if you expand the cluster to 5 nodes, the stripe automatically becomes 3+1; and if you expand further to 6 nodes or more, the stripe becomes 4+1, which is currently the largest stripe supported (see the sketch after this list).

  3. Drive Failures

    In the event of a drive failure (SSD/SAS or SATA), as mentioned earlier, only that drive is impacted. Therefore to restore resiliency, only the data on that drive needs to be repaired, as opposed to something like an entire disk group being marked as offline.

    It’s crazy to think a single commodity drive failure in a HCI product could bring down an entire group of drives, causing a significant impact to the environment.

    With Nutanix, a rebuild is performed in a distributed manner throughout all nodes in the cluster, so the larger the cluster, the lower the per node impact and the faster the configured resiliency factor is restored to a fully resilient state.
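
As referenced in point 2, the following is a purely illustrative sketch (my own toy function, not the actual EC-X code) of how the stripe width described above scales with cluster size, and what that means for capacity overhead relative to plain RF2:

    def ecx_stripe(cluster_nodes):
        """Toy mapping of cluster size to EC-X stripe (data, parity) for RF2,
        as described in point 2 above. Not the actual EC-X implementation."""
        if cluster_nodes < 4:
            raise ValueError("this example assumes at least 4 nodes")
        data = min(cluster_nodes - 2, 4)  # 4+1 is currently the largest stripe
        return data, 1

    for n in (4, 5, 6, 8):
        d, p = ecx_stripe(n)
        print(f"{n} nodes -> {d}+{p} stripe, {(d + p) / d:.2f}x overhead vs 2.00x for RF2")
    # 4 nodes -> 2+1 (1.50x), 5 nodes -> 3+1 (1.33x), 6+ nodes -> 4+1 (1.25x)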

At this point you’re probably asking: are there any decisions to make?

When adding any node, compute+storage or storage only, ensure you consider what the impact of a failure of that node will be.

For example, if you add one 15TB storage only node to a cluster of nodes which have only 2TB usable each, then you would need to ensure 15TB of available space to allow the cluster to fully self heal from the loss of the 15TB node. As such, I recommend ensuring your N+1 (or N+2) node/s are equal to the size of the largest node in the cluster from a capacity, performance and CPU/RAM perspective.

So if your biggest node is an NX-8150 with 44c / 512GB RAM and 20TB usable, you should have an N+1 node of the same size to cover the worst case failure scenario of an NX-8150 failing, OR have the equivalent resources available within the cluster.
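
A minimal sketch of that sizing check (the function and the numbers are illustrative assumptions, not a Nutanix sizing tool) could look like this:

    def can_self_heal(nodes):
        """nodes: list of (usable_tb, used_tb) per node.
        Toy check: for the worst-case single node failure, can the data on the
        failed node be rebuilt into the free space of the surviving nodes?
        Illustrative only, not a Nutanix sizing tool."""
        for k, (_, used_k) in enumerate(nodes):
            surviving_free = sum(u - d for i, (u, d) in enumerate(nodes) if i != k)
            if surviving_free < used_k:
                return False
        return True

    # Example from above: one large storage only node amongst small nodes.
    small = [(2.0, 1.2)] * 6        # six nodes: 2TB usable, 1.2TB used each
    big = [(15.0, 10.0)]            # one 15TB node holding 10TB of data
    print(can_self_heal(small + big))                   # False: only 4.8TB free remains
    print(can_self_heal(small + big + [(15.0, 0.0)]))   # True: with an equally sized N+1 node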

By following this one simple rule, your cluster will always be able to fully self heal in the event of a failure, and VMs will fail over and be able to perform at levels comparable to before the failure.

Simple as that! No RAID, Disk group, deduplication, compression, failure, or rebuild considerations to worry about.

Summary:

The above are just a few examples of the advantages Nutanix ADSF provides compared to other HCI products. The operational and architectural complexity of other products can lead to additional risk, inefficient use of infrastructure, misconfiguration and, ultimately, an environment which does not deliver the business outcome it was originally designed to deliver.

What’s .NEXT 2016 – Any node can be storage only

Nutanix has had a lot of success with our storage only nodes since they were released back at .NEXT 2015, and they have made their way into deals of all shapes and sizes.

For those of you not familiar with this capability, Nutanix provides the ability to scale the Acropolis Distributed Storage Fabric (ADSF) for both storage capacity and performance as well as increase the resiliency of the management layer without the need to scale CPU/RAM (and license the servers for vSphere if that is your chosen hypervisor).

[Image: Compute+Storage nodes alongside Storage Only nodes]

Storage only nodes run AHV and are interoperable with Hyper-V and vSphere environments while being managed centrally through PRISM.

One common request I have heard is “Can we turn our existing storage heavy nodes into storage only?”.

The first time I heard this request I was a little surprised, and asked why.

The customer responded with something to the effect of: “We want to reduce our vSphere licensing and we only purchased these nodes for additional capacity”.

Reducing licensing costs for vSphere, as well as for applications like Oracle and SQL, is of course a fairly common request these days, so Nutanix went away and investigated a number of options.

In addition to storage only nodes (NX-6035C) being very successful, customers have also asked if they could have different sized storage only nodes where performance was more of a requirement than capacity, for example an NX-8150 with 4 x SSDs and 20 x HDDs, or NX-3060 nodes with 2 x SSDs and 4 x SATA. Again, this was not an option, so we took this feedback on board.

The more recent request has been from our All Flash (NX-9000) customers who want to scale capacity and performance but not compute. While adding NX-9000 nodes to the cluster achieved the technical outcome required (increasing capacity and performance), it did so, at least for vSphere customers, at the additional cost of vSphere licensing, which is a pain point for many customers who are yet to convert to AHV.

I am pleased to say all these problems have now been solved!

Customers can now convert any existing node/s into storage only nodes (which run AHV).

Some of the advantages of this include:

  • Removing the requirement for vSphere customers to license nodes unnecessarily when scaling the storage layer
  • Allowing capacity & performance scalability for high performance (inc. All Flash) environments
  • Allowing increased capacity/performance and resiliency with homogeneous clusters.

It is also possible to deploy any new nodes, regardless of node type as a storage only node to scale clusters out further, again without vSphere licensing.

Also announced at .NEXT 2016 were Acropolis Block Services (ABS) and Acropolis File Services (AFS), which provide highly resilient, performant and scalable block and file services to both virtual and physical servers.

Combine ABS and AFS with any node being able to be deployed or converted to a storage only node and you have a very comprehensive platform which helps eliminate the need for silos of infrastructure for different use cases.

For Business Critical Applications such as Oracle and MS SQL, it is now possible to scale capacity/performance and resiliency even for All-Flash solutions without incurring additional software licensing costs from Oracle and Microsoft.

It’s really simple, purchase and license the nodes you require from a CPU/RAM perspective with the storage configuration you want (e.g.: Hybrid or All-Flash), then use CTO to purchase storage only nodes (either hybrid or all-flash) and add those to the cluster.

The following shows an example of what this may look like.

[Image: Oracle/SQL nodes alongside Storage Only nodes]

Storage only nodes help reduce the RF overhead on the nodes running VMs, which frees up more local space so the applications benefit more from data locality. The additional storage only nodes also increase write performance, as more Nutanix CVMs and high performance SSD drives reside in the cluster, making for a higher performance and more consistent outcome.

The image below shows how a cluster may have, for example, eight nodes running and licensed for vSphere, with the remaining nodes providing performance/capacity/resiliency running AHV, free of licensing costs.

[Image: vSphere licensed nodes alongside AHV Storage Only nodes]

Summary:

  1. Supports Acropolis Block Services (ABS) & Acropolis File Services (AFS), allowing non-disruptive scaling of storage performance/capacity for VMs and bare metal servers
  2. Enables All Flash environments to scale performance/capacity
  3. Reduces application licensing costs (e.g.: Oracle/SQL) while scaling storage capacity/performance
  4. For vSphere customers who have more compute nodes than they need, one or more of these nodes can be converted to AHV to provide storage only and reduce vSphere licensing costs
  5. Allows vSphere customers to only license 2 nodes in minimum 3 node Nutanix configurations without reducing the resiliency of the ADSF cluster


NOS 4.5 Delivers Increased Read Performance from SATA

In a recent post I discussed how NOS 4.5 increases the effective SSD tier capacity by performing up-migrations on only the local extent as opposed to both RF copies within the Nutanix cluster. In addition to this significant improvement in usable SSD tier, in NOS 4.5 the read performance from the SATA tier has also received lots of attention from Nutanix engineers.

The Solutions and Performance Engineering team have been investigating and testing how we can improve SATA performance. Ideally, the active working set for VMs will fit within the SSD tier, and the changes discussed in my previous post dramatically improve the chances of that happening.

But there are situations when reads of cold data still need to be serviced by the slower SATA drives. Nutanix uses Data Locality to ensure hot data remains close to the application, delivering the lowest latency and overheads and therefore the best performance. However, because data on SATA is infrequently accessed, reading from remote SATA drives can actually improve performance, especially where the number of local SATA drives is limited (in some cases to only 2 or 4 drives).

Most Nutanix nodes have 2 x SSD and 4 x SATA, so in the best case you will only see a few hundred IOPS from SATA, as that is all those drives are physically capable of.

To get around this issue, NOS 4.5 introduces changes to the way in which we select a replica to read an egroup from the HDD tier. Periodically, NOS (re)calculates the average I/O latencies of all of a vdisk's replicas (the replicas which hold the vdisk's egroups). We use this information to choose a replica as follows (a simplified sketch follows the list):

  1. If the latency of the local replica is less than a configurable threshold, read from the local replica.
  2. If the latency of the local replica is more than a configurable threshold, and the latency of the remote replica is more than that of the local replica, prefer the local replica.
  3. If the latency of the local replica is more than a configurable threshold, and the latency of the remote replica is lower than the configurable threshold OR lower than that of the local copy, prefer the remote replica.
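
As a simplified sketch of that decision logic (the threshold value and function are my own illustration, not the actual NOS code):

    def choose_replica(local_latency_ms, remote_latency_ms, threshold_ms=20.0):
        """Simplified sketch of the three rules above. threshold_ms stands in for
        the configurable threshold; the value is an assumption, not a NOS default."""
        if local_latency_ms < threshold_ms:
            return "local"                  # rule 1: local replica is healthy
        if remote_latency_ms >= local_latency_ms:
            return "local"                  # rule 2: remote is no better
        return "remote"                     # rule 3: remote is faster (or under threshold)

    print(choose_replica(5, 30))    # local
    print(choose_replica(40, 60))   # local
    print(choose_replica(40, 15))   # remote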

The diagram below shows an example where the VM on Node A performs random reads to data A and shortly thereafter to data C. When reading data A the latency of the local copy is below the threshold, but when it requests data C, NOS detects that the latency of the local copy is higher than that of the remote copy and selects the remote replica to read from. As the diagram shows, one possible outcome when reading multiple pieces of data is that one read is served locally and the other is serviced remotely.

[Diagram: remote SATA reads, with one read served locally and one served remotely]

Now the obvious next question is: "What about Data Locality?"

Data Locality is maintained for the hot data which resides in the SSD tier, because reads from SSD are faster and have lower CPU/network overheads when served locally. For SATA reads, which are typically >5ms, the SATA drive itself is the bottleneck, not the network, so distributing reads across more SATA drives, even if they are not local, results in better overall performance and lower latency.

If the SSD tier has not reached 75% utilisation, all data will be within the SSD tier and will be served locally. The above feature is for situations where the SSD tier is 75% full and data is being tiered down to SATA, AND random reads are occurring to cold data OR to data which will not fit in the SSD tier, such as very large databases.

In addition, NOS 4.5 detects whether the read I/O is random or sequential, and if it is sequential (which SATA performs much better at) the data must meet a higher threshold before being up-migrated to SSD.
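
As a purely hypothetical sketch of that idea (the thresholds and function below are mine, not the actual NOS heuristics):

    def should_upmigrate(recent_reads, is_sequential,
                         random_threshold=3, sequential_threshold=10):
        """Hypothetical illustration only. Sequential reads are served well by SATA,
        so they must be 'hotter' before data is promoted to the SSD tier.
        The threshold values are made up, not actual NOS settings."""
        threshold = sequential_threshold if is_sequential else random_threshold
        return recent_reads >= threshold

    print(should_upmigrate(4, is_sequential=False))  # True: random and warm enough
    print(should_upmigrate(4, is_sequential=True))   # False: sequential, leave it on SATA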

The result of these algorithm improvements, combined with the increased effective SSD tier capacity discussed earlier and Nutanix in-line compression, is higher performance over larger working sets, including those which exceed the capacity of the SSD tier.

Effectively, NOS 4.5 delivers a truly scale out solution for read I/O from the SATA tier, which means one VM can potentially be reading from all nodes in the cluster, ensuring SATA performance for things like Business Critical Applications is both high and consistent. Combined with NX-6035C storage only nodes, this means SATA read I/O can be scaled out as shown in the diagram below without scaling compute.

[Diagram: scale out remote SATA reads with NX-6035C storage only nodes]


As we can see above, the storage only nodes (NX-6035C) deliver additional performance for read I/O from the SATA tier (as well as from the SSD tier).