Expanding Capacity on a Nutanix environment – Design Decisions

I recently saw an article about design decisions around expanding capacity for an HCI platform, which went through the various considerations and made some recommendations on how to proceed in different situations.

While reading the article, it really made me think how much simpler this process is with Nutanix and how these types of areas are commonly overlooked when choosing a platform.

Let’s start with a few basics:

The Nutanix Acropolis Distributed Storage Fabric (ADSF) is made up of all the drives (SSD/SAS/SATA etc.) in all nodes in the cluster. Data is written locally where the VM performing the write resides, and replicas are distributed throughout the cluster based on numerous factors, i.e. there is no pairing of nodes, no HA pairs, no preferred nodes etc.

In the event of a drive failure, regardless of what drive (SSD,SAS,SATA) fails, only that drive is impacted, not a disk group or RAID pack.

This is key as it limits the impact of the failure.

It is important to note that ADSF does not store large objects, nor does the file system require tuning to stripe data across multiple drives/nodes. ADSF by default distributes the data (at a 1MB granularity) in the most efficient manner throughout the cluster while maintaining the hottest data locally to ensure the lowest overheads and highest performance read I/O.

Let’s go through a few scenarios, which apply to both All Flash and Hybrid environments.

  1. Expanding capacity

    When adding a node or nodes to an existing cluster, without moving any VMs, changing any configuration or making any design decisions, ADSF will proactively send replicas from write I/O to all nodes within the cluster, therefore improving performance, while reactively performing disk balancing where a significant imbalance exists within the cluster.

    This might sound odd but with other HCI products new nodes are not used unless you change the stripe configuration or create new objects e.g.: VMDKs which means you can have lots of spare capacity in your cluster, but still experience an out of space condition.

    This is a great example of why ADSF has a major advantage especially when considering environments with large IO and/or capacity requirements.

    The node addition process only requires the administrator to enter the IP addresses and it’s basically a one-click operation: capacity is available immediately and there is no mass movement of data. There is also no need to move data off and recreate disk groups or similar, as these legacy concepts & complexities do not exist in ADSF.

    Nutanix is also the only platform to allow expanding of capacity via Storage Only nodes and supports VMs which have larger capacity requirements than a single node can provide. Both are supported out of the box with zero configuration required.

    Interestingly, adding storage only nodes also increases performance and resiliency for the entire cluster as well as the management stack including PRISM.

  2. Impact & implications to data reduction of adding new nodes

    With ADSF, there are no considerations or implications. Data reduction is truly global throughout the cluster, and regardless of hypervisor or whether you’re adding Compute+Storage or Storage Only nodes, the benefits, particularly of deduplication, continue to apply across the environment.

    The net effect of adding more nodes is better performance, higher resiliency, faster rebuilds from drive/node failures and again with global deduplication, a higher chance of duplicate data being found and not stored unnecessarily on physical storage resulting in a better deduplication ratio.

    No matter what size node/s are added & no matter what Hypervisor, the benefits from data reduction features such as deduplication and compression work at a global level.

    What about Erasure Coding? Nutanix EC-X creates the most efficient stripe based on the cluster size, so if you start with a small 4 node cluster your stripe would be 2+1 and if you expand the cluster to 5 nodes, the stripe will automatically become 3+1 and if you expand further to 6 nodes or more, the stripe will become 4+1 which is currently the largest stripe supported.

  3. Drive Failures

    In the event of a drive failure (SSD/SAS or SATA), as mentioned earlier, only that drive is impacted. Therefore to restore resiliency, only the data on that drive needs to be repaired, as opposed to something like an entire disk group being marked as offline.

    It’s crazy to think a single commodity drive failure in an HCI product could bring down an entire group of drives, causing a significant impact to the environment.

    With Nutanix, a rebuild is performed in a distributed manner throughout all nodes in the cluster, so the larger the cluster, the lower the per node impact and the faster the configured resiliency factor is restored to a fully resilient state.
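The EC-X stripe sizing described in scenario 2 can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this post (4 nodes gives 2+1, 5 nodes gives 3+1, 6 or more gives 4+1); the function names are hypothetical and are not part of any Nutanix tool or API, and the behaviour for clusters smaller than 4 nodes is not covered here, so the sketch simply refuses them.

```python
def ecx_stripe(node_count: int) -> tuple[int, int]:
    """Return (data_blocks, parity_blocks) using the stripe sizes quoted in this post."""
    if node_count < 4:
        # The post only describes clusters of 4 nodes or more.
        raise ValueError("stripe size for fewer than 4 nodes is not described in this post")
    # 4 nodes -> 2 data blocks, 5 nodes -> 3, 6 or more -> 4 (largest stripe mentioned)
    data_blocks = min(node_count - 2, 4)
    return data_blocks, 1


def raw_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw capacity consumed per unit of usable data for a given stripe."""
    return (data_blocks + parity_blocks) / data_blocks


if __name__ == "__main__":
    for nodes in (4, 5, 6, 8):
        data, parity = ecx_stripe(nodes)
        overhead = raw_overhead(data, parity)
        print(f"{nodes} nodes: {data}+{parity} stripe, {overhead:.2f}x raw capacity per usable unit")
```

For comparison, keeping two full copies of the data consumes 2x raw capacity, so the wider the stripe, the larger the saving.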

At this point you’re probably asking: are there any decisions to make?

When adding any node, compute+storage or storage only, ensure you consider what the impact of a failure of that node will be.

For example, if you add one 15TB storage only node to a cluster of nodes which are only 2TB usable, then you would need to ensure 15TB of available space to allow the cluster to fully self heal from the loss of the 15TB node. As such, I recommend ensuring your N+1 (or N+2) node/s are equal to the size of the largest node in the cluster from a capacity, performance and CPU/RAM perspective.

So if your biggest node is an NX-8150 with 44c / 512GB RAM and 20TB usable, you should have an N+1 node of the same size to cover the worst case failure scenario of an NX-8150 failing OR have the equivalent resources available within the cluster.

By following this one, simple rule, your cluster will always be able to fully self heal in the event of a failure and VMs will failover and be able to perform at comparable levels to before the failure.
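As a rough illustration of the capacity side of this rule, the check below asks whether the data currently stored would still fit on the surviving nodes if the largest node were lost. It is a deliberately simplified sketch: the function name and inputs are hypothetical, it looks at usable capacity only, and it ignores CPU/RAM, performance and any per-node placement constraints.

```python
def can_self_heal(node_usable_tb: list[float], used_tb: float) -> bool:
    """True if the used capacity still fits after losing the largest node.

    node_usable_tb: usable capacity of each node in the cluster (TB)
    used_tb: usable capacity currently consumed across the cluster (TB)
    """
    largest = max(node_usable_tb)
    surviving_capacity = sum(node_usable_tb) - largest
    return used_tb <= surviving_capacity


if __name__ == "__main__":
    # Four 2TB nodes plus one 15TB storage only node, with 10TB in use.
    mixed = [2, 2, 2, 2, 15]
    print(can_self_heal(mixed, used_tb=10))          # losing the 15TB node leaves only 8TB

    # Adding a second 15TB node as the N+1 node covers the worst case.
    with_n_plus_1 = mixed + [15]
    print(can_self_heal(with_n_plus_1, used_tb=10))  # 23TB remains after the worst-case loss
```

The same idea extends to N+2 by subtracting the two largest nodes instead of one.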

Simple as that! No RAID, Disk group, deduplication, compression, failure, or rebuild considerations to worry about.

Summary:

The above are just a few examples of the advantages the Nutanix ADSF provides compared to other HCI products. The operational and architectural complexity of other products can lead to additional risk, inefficient use of infrastructure, misconfiguration and ultimately an environment which does not deliver the business outcome it was originally designed to deliver.

The All-Flash Array (AFA) is Obsolete!

Over the last few years, I’ve had numerous customers ask about how Nutanix can support bare metal workloads. Up until recently, I haven’t had an answer the customers have wanted to hear.

As a result, some customers have been stuck using their existing SAN or, worse still, have been forced to go out and buy a new SAN.

As a result many customers who have wanted to use or have already deployed hyperconverged infrastructure (HCI) for all other workloads are stuck managing an all flash array silo to service some bare metal workloads.

In June at .NEXT 2016, Nutanix announced Acropolis Block Services (ABS) which now allows bare metal workloads to be serviced by new or existing Nutanix clusters.

[Figure: Acropolis Block Services (ABS) overview]

As Nutanix has both hybrid (SSD+SATA) and all-flash nodes, customers can choose the right node type/s for their workloads and present the storage externally for bare metal workloads while also supporting Virtual Machines, Acropolis File Services (AFS) and containers.

So why would anyone buy an all-flash array? Let’s discuss a few scenarios.

Scenario 1: Bare metal workloads

Firstly, what applications even need bare metal these days? This is an important question to ask yourself. Challenge the requirement for bare metal and see if the justifications are still valid and if so, has anything changed which would allow virtualization of the applications. But this is a topic for another post.

If a customer only needs new infrastructure for bare metal workloads, deploying Nutanix and ABS means they can start small and scale as required. This avoids one of the major pitfalls of having to size a monolithic centralised, dual controller storage array.

While some AFA vendors can/do allow for non-disruptive controller upgrades, it’s still not a very attractive proposition, nor is it quick or easy, and it reduces resiliency during the process as one of the two controllers is offline. Nutanix on the other hand performs one-click rolling upgrades, which means the larger the cluster, the lower the impact of an upgrade, as it is performed one node at a time without disruption and can also be done without risk of a subsequent failure taking storage offline.
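To put a rough number on the upgrade impact, the sketch below compares the share of storage controllers that is offline at any one time during a rolling upgrade. It assumes, purely for illustration, that every controller or node contributes equally and that exactly one is upgraded at a time; the function name is hypothetical.

```python
def offline_fraction(controller_count: int) -> float:
    """Fraction of controllers offline while one is being upgraded."""
    if controller_count < 1:
        raise ValueError("controller_count must be at least 1")
    return 1 / controller_count


if __name__ == "__main__":
    print(f"Dual controller array: {offline_fraction(2):.1%} of controllers offline")
    for nodes in (4, 8, 16):
        print(f"{nodes} node cluster: {offline_fraction(nodes):.1%} of controllers offline")
```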

If the environment will only ever be used for bare metal workloads, no problem. Acropolis Block Services offers all the advantages of an All Flash Array, with far superior flexibility, scalability and simplicity.

Advantages:

  1. Start small and scale granularly as required allowing customers to take advantage of newer CPU/RAM/Flash technologies more frequently
  2. Scale performance and capacity by adding node/s
  3. Scale capacity only with storage-only nodes (which come in all flash)
  4. Automatically scale multi-pathing as the cluster expands
  5. Solution can support future workloads including multiple hypervisors / VMs / file services & containers without creating a silo
  6. You can use Hybrid nodes to save cost while delivering All Flash performance for workloads which require it by using VM flash pinning which ensures all data is stored in flash and can be specified on a per disk basis.
  7. The same ability as an all flash array to only add compute nodes.

Disadvantages:

  1. Your all-flash array vendor reps will hound you.

Scenario 2: Mixed workloads inc VMs and bare metal

As with scenario 1, deploying Nutanix and ABS means customers can start small and scale as required. This again avoids the major pitfall of having to size a monolithic centralised, dual controller storage array and eliminates the need for separate environments.

Virtual machines can run on compute+storage nodes while bare metal workloads can have storage presented by all nodes within the cluster, including storage-only nodes. For those who are concerned about (potential but unlikely) noisy neighbour situations, specific nodes can also be specified while maintaining all the advantages of Nutanix one-click, non-disruptive upgrades.

Advantages:

  1. Start small and scale granularly as required allowing customers to take advantage of newer CPU/RAM/Flash technologies more frequently
  2. Scale performance and capacity by adding node/s
  3. Scale capacity only with storage-only nodes (which also come in all flash)
  4. Automatically scale multi-pathing for bare metal workloads as the cluster expands
  5. Solution can support future workloads including multiple Hypervisors / VMs / file services & containers without creating a silo.

Disadvantages:

  1. Your All-Flash array vendor reps will hound you.

What are the remaining advantages of using an all flash array?

In all seriousness, I can’t think of any but for fun let’s cover a few areas you can expect all-flash array vendors to argue.

Performance

Ah the age old appendage measuring contest. I have written about this topic many times, including in one of my most popular posts, “Peak performances vs Real world performance”.

The fact is, every storage product has limits, even all-flash arrays and Nutanix. The major difference is that Nutanix limits are per cluster rather than per Dual Controller Pair, and Nutanix can continue to scale the number of nodes in a cluster and continue to increase performance. So if ultimate performance is actually required, Nutanix can continue to scale to meet any performance/capacity requirements.

In fact, with ABS the limit for performance is not even at the cluster layer as multiple clusters can provide storage to the same bare metal server/s while maintaining single pane of glass management through PRISM Central.

I recently completed some testing where I demonstrated the performance advantage of storage only nodes for virtual machines, as well as how storage-only nodes improve performance for bare metal servers using Acropolis Block Services. I will be publishing the results in the near future.

Data Reduction

Nutanix has had support for deduplication and compression for a long time and introduced Erasure Coding (EC-X) in mid 2015. Each of these technologies is supported when using Acropolis Block Services (ABS).

As a result, when comparing data reduction with all-flash array vendors, while the implementation of these data reduction technologies varies between vendors, they all achieve similar data reduction ratios when applied to the same dataset.

Beware of some vendors who include things like backups in their deduplication or data reduction ratios. This is very misleading, and most vendors have the same capabilities. For more information on this see: Deduplication ratios – What should be included in the reported ratio?

Cost

Here we should think about what the age old problems are with centralized shared storage (like AFAs). Things like choosing the right controllers, and the fact that when you add more capacity to the storage you’re not (or at least rarely) scaling the controller/s at the same time, come to mind immediately.

With Nutanix and Acropolis Block Services you can start your All Flash solution with three nodes which means a low capital expenditure (CAPEX) and then scale either linearly (with the same node types) or non-linearly (with mixed types or storage only nodes) as you need to without having to rip and replace (e.g.: SAN controller head swaps).

Starting small and scaling as required also allows you to take advantage of newer technologies such as newer Intel chipsets and NVMe/3D XPoint to get better value for your money.

Starting small and scaling as required also minimizes – if not eliminates – the risk of oversizing and avoids unnecessary operational expenses (OPEX) such as rack space, power, cooling. This also reduces supporting infrastructure requirements such as networking.

Summary:

As shown below, the Nutanix Acropolis Distributed Storage Fabric (ADSF) can support almost any workload from VDI to mixed server workloads, file, block, big data, business critical applications such as SAP / Oracle / Exchange / SQL and bare metal workloads without creating silos with point solutions.

[Figure: Nutanix single fabric for all workloads]

In addition to supporting all these workloads, Nutanix ADSF scalability both from a capacity/performance and resiliency perspective ensures customers can start small and scale when required to meet their exact business needs without the guesswork.

With these capabilities, the All-Flash array is obsolete.

I encourage everyone to share (constructively) your thoughts in the comments section.



Scale out performance testing with Nutanix Storage Only Nodes

At Nutanix’s inaugural user conference in 2015, Storage Only nodes were announced, which allowed customers for the first time to scale capacity without having to add compute nodes. This gives customers more flexibility and eliminates the need to license the storage nodes for vSphere, as storage only nodes run Acropolis Hypervisor (AHV) and are managed entirely through PRISM.

A common question from prospective and existing Nutanix customers is: what if my VM’s storage exceeds the capacity of a Nutanix node? The answer is detailed in this blog post but in short, as the Acropolis Distributed Storage Fabric (ADSF) distributes data throughout the cluster at a 1MB granularity, a VM’s storage can exceed the local node and performance even improves, including reads from the capacity (SAS/SATA) tier.

Storage only nodes were previously limited to the NX-6035C (and Dell XC/Lenovo HX equivalents) but at the Nutanix .NEXT conference in Las Vegas in 2016, it was announced that any node (including all-flash) can be a storage only node.

This means even for high performance and/or high capacity environments, Nutanix clusters can be scaled without the need to add compute nodes or purchase additional licensing if you are running vSphere as the hypervisor.

However to date Nutanix are yet to publish any performance data showing the value of storage only nodes, so I decided to run a few tests and demonstrate the value of the Acropolis Distributed Storage Fabric (ADSF) and Storage Only Nodes.

Before we get to the performance data, to avoid competitors’ inevitable attempts to create FUD about Nutanix performance, I will not be publishing the exact specifications of the node types, drive or Jetstress configurations. I will be publishing the IOPS/latency and database creation, duplication and checksumming durations of the direct comparisons, which clearly show the performance advantage of storage only nodes.

Jetstress was not configured to demonstrate maximum performance of the underlying Nutanix solution, it was configured to achieve around 1000 IOPS which is typically higher than even a large Exchange deployment requires per instance. This also allows this test to demonstrate how performance improves when the cluster is performing real world levels of IO (at least in the case of Exchange for this example).

The performance advantage will vary between node types and based on how many storage only nodes are added to the cluster. But the point of this example is to show that ADSF is a truly distributed storage fabric, and that the storage only nodes and additional Nutanix Controller VMs (CVMs) servicing replication (RF) traffic and remote reads significantly improve performance for VMs residing on the Compute+Storage nodes.

Test Overview:

The first test will be performed using four Jetstress VMs running on a four node cluster. The second test will be performed after an additional four storage only nodes are added to the cluster to form an eight node cluster. Before the second test the cluster will be wiped of all data with the exception of the Windows 2012 R2 template and all Jetstress DBs will be created from scratch so we can compare DB creation as well as performance and DB checksumming durations. Wiping all data also ensures there is no pre-warming of the extent cache (in memory read cache) or metadata cache.

Test Preparation:

I performed a cluster stop / cluster destroy / cluster create to ensure the cluster is totally clean and that we have a fair baseline for the test. The cluster was made up of four nodes.

I then created a base Windows 2012 R2 virtual machine with 4 PVSCSI adapters and 9 vDisks: one for the OS, 4 for the DBs and 4 for the logs. DB drives were formatted with a 64k allocation size and log drives with 4k, as the different allocation sizes and separate virtual disks have shown approx 25% performance improvement in my testing. In addition, I recommend In-Line compression and Erasure Coding (EC-X) for Exchange databases and no data reduction for logs.

Jetstress was configured to use 80% of the vDisks’ capacity, which resulted in approx 80% of the Nutanix storage pool capacity being utilised for the test. I will point out these were not low capacity nodes such as NX3060s, so the database creation time is significant because there was lots of data to create.

I then cloned the VM 3 times and spread the 4 VMs across 4 Nutanix Nodes running ESXi 5.5 Update 3.

Test 1: Create Databases and run 2hr test

The database creation phase creates one database, then Jetstress duplicates the database (in this case 3 times), and immediately after creation the performance test begins.

Note: No data reduction was used for this test as it will result in unrealistic data reduction and performance results as I described in the post Jetstress Testing with Intelligent Tiered Storage Platforms.

I configured Jetstress in this way to ensure the extent cache (in memory read cache) was not pre-warmed and so the results of the test would be fair and repeatable.

I waited for the performance test to complete on every VM before allowing the database checksum validation task to run. (This is done by using the Multi-host option in Jetstress).

The results for each of the four Jetstress VMs are shown below, including the average across the VMs for each of the different metrics.

[Figure: Jetstress results summary, four node cluster]

Observations from Test 1:

  1. We achieved the desired >1000 IOPS per VM
  2. Performance was consistent across all Jetstress instances
  3. Log writes were in the 1ms range as they were serviced by the ADSF Oplog (persistent write buffer)
  4. Database reads were on average just under 10ms which is well below the Microsoft recommended 20ms
  5. The Database creation time averaged 2hrs 24mins
  6. The duplication of 3 databases averaged 4hrs 17mins
  7. The database checksum took on average around 38mins

Test 2: Delete all data, Add four nodes to the cluster & repeat test 1

All Jetstress VMs were deleted and a full curator scan manually initiated to ensure all data was fully removed from disk prior to beginning the next test which ensured a fair baseline.

Four Jetstress VMs were then deployed from the same template, powered on and the saved Jetstress configuration was applied before beginning the test.

Note: The Jetstress thread count was not changed and remains the same as for Test 1.

As with Test 1, the database creation phase created one database, then Jetstress duplicated the database 3 times, and immediately after creation the performance test began and ran for the same 2hr duration.

The results for each of the four Jetstress VMs are shown below, including the average across the VMs for each of the different metrics.

[Figure: Jetstress results summary, eight node cluster]

Observations from Test 2:

  1. Achieved IOPS jumped by almost 2x
  2. Log writes average latency was lower by 13%
  3. Database write latency dropped by >20%
  4. Database read latency dropped by almost 2x
  5. The Database creation time was just under 15 mins faster
  6. The duplication of 3 databases improved by almost 35 mins
  7. The database checksum was 40 seconds faster.

Without changing the Jetstress thread count, the achieved IOPS jumped by almost 2x due to the improved performance of the cluster!
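To put the duration improvements in proportion, the short calculation below expresses them as a percentage of the Test 1 averages. It uses only the figures quoted in the two observation lists (2hrs 24mins, 4hrs 17mins and 38mins as baselines; 15 mins, 35 mins and 40 seconds as savings). Because the savings were reported as "just under" and "almost", the results are upper bounds rather than exact values, and the function name is hypothetical.

```python
def percent_saved(baseline_seconds: float, saved_seconds: float) -> float:
    """Time saved as a percentage of the baseline duration."""
    return 100 * saved_seconds / baseline_seconds


# (phase, Test 1 average in seconds, approximate time saved in Test 2 in seconds)
phases = [
    ("Database creation", (2 * 60 + 24) * 60, 15 * 60),
    ("Duplication of 3 databases", (4 * 60 + 17) * 60, 35 * 60),
    ("Database checksum", 38 * 60, 40),
]

if __name__ == "__main__":
    for name, baseline, saved in phases:
        print(f"{name}: at most {percent_saved(baseline, saved):.1f}% faster")
```

This works out to at most about 10%, 14% and 2% respectively, so the largest gains from the additional nodes in these tests were in IOPS and latency rather than in the creation and checksum phases.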

Summary:

These tests are a clear demonstration of the scalability advantage of the Acropolis Distributed Storage Fabric (ADSF) and storage only nodes for customers wanting to increase performance and/or capacity in their HCI environment.

The ability of ADSF to distribute write IO across all nodes within a cluster means write performance improves significantly with the addition of nodes (including storage only) to the cluster while reducing read and write latency due to the decreased workload on the compute + storage nodes servicing the VMs.

But data locality is lost with storage only nodes, right?

Wrong! Storage only nodes actually improve (yes, improve!) data locality by maximising the amount of available space on the compute+storage nodes. This is as a direct result of storage only nodes accepting replication data for write IO and storing the 2nd or 3rd copies (in the case of RF3) on the storage only nodes. This is also demonstrated by the lower read latency observed during this test.

Storage only nodes not only improve the performance and capacity for Virtual machines, but also for physical servers using Acropolis Block Services (ABS) and users of Acropolis File Services (AFS) both of which had enhancements announced at .NEXT 2016 this year.