Expanding Capacity on a Nutanix environment – Design Decisions

I recently saw an article about design decisions around expanding capacity for an HCI platform, which went through the various considerations and made recommendations on how to proceed in different situations.

While reading the article, I kept thinking how much simpler this process is with Nutanix and how commonly these areas are overlooked when choosing a platform.

Let’s start with a few basics:

The Nutanix Acropolis Distributed Storage Fabric (ADSF) is made up of all the drives (SSD/SAS/SATA etc.) in all nodes in the cluster. Data is written locally on the node where the VM performing the write resides, and replicas are distributed throughout the cluster based on numerous factors, with no concept of pairing, HA pairs or preferred nodes.

In the event of a drive failure, regardless of which drive (SSD, SAS or SATA) fails, only that drive is impacted, not a disk group or RAID pack.

This is key as it limits the impact of the failure.

It is important to note that ADSF does not store large objects, nor does the file system require tuning to stripe data across multiple drives/nodes. ADSF by default distributes data (at 1MB granularity) in the most efficient manner throughout the cluster while keeping the hottest data local to ensure the lowest overheads and highest performance read I/O.

Let’s go through a few scenarios, which apply to both All Flash and Hybrid environments.

  1. Expanding capacity: When adding a node or nodes to an existing cluster, without moving any VMs, changing any configuration or making any design decisions, ADSF proactively sends replicas from write I/O to all nodes within the cluster, improving performance, while reactively performing disk balancing where a significant imbalance exists within the cluster.

    This might sound odd, but with other HCI products new nodes are not used unless you change the stripe configuration or create new objects (e.g. VMDKs), which means you can have lots of spare capacity in your cluster and still experience an out of space condition.

    This is a great example of why ADSF has a major advantage, especially in environments with large I/O and/or capacity requirements.

    The node addition process only requires the administrator to enter the IP addresses and it's basically one click; capacity is available immediately and there is no mass movement of data. There is also no need to move data off and recreate disk groups or similar, as these legacy concepts and complexities do not exist in ADSF.

    Nutanix is also the only platform that allows capacity to be expanded via Storage Only nodes and supports VMs with larger capacity requirements than a single node can provide. Both are supported out of the box with zero configuration required.

    Interestingly, adding Storage Only nodes also increases performance and resiliency for the entire cluster, as well as for the management stack including PRISM.

  2. Impact & implications to data reduction of adding new nodes: With ADSF, there are no considerations or implications. Data reduction is truly global throughout the cluster and, regardless of hypervisor or whether you're adding Compute+Storage or Storage Only nodes, the benefits, particularly of deduplication, continue to apply across the environment.

    The net effect of adding more nodes is better performance, higher resiliency and faster rebuilds from drive/node failures, and, with global deduplication, a higher chance of duplicate data being found and not stored unnecessarily on physical storage, resulting in a better deduplication ratio.

    No matter what size node/s are added and no matter which hypervisor, data reduction features such as deduplication and compression work at a global level.

    What about Erasure Coding? Nutanix EC-X creates the most efficient stripe based on the cluster size: if you start with a small 4 node cluster your stripe is 2+1; if you expand the cluster to 5 nodes, the stripe automatically becomes 3+1; and if you expand further to 6 nodes or more, the stripe becomes 4+1, which is currently the largest stripe supported.

  3. Drive Failures: In the event of a drive failure (SSD, SAS or SATA), as mentioned earlier, only that drive is impacted. Therefore, to restore resiliency, only the data on that drive needs to be repaired, as opposed to something like an entire disk group being marked offline.

    It's crazy to think a single commodity drive failure in an HCI product could bring down an entire group of drives, causing a significant impact to the environment.

    With Nutanix, a rebuild is performed in a distributed manner throughout all nodes in the cluster, so the larger the cluster, the lower the per node impact and the faster the configured resiliency factor is restored to a fully resilient state.

At this point you're probably asking: are there any decisions to make?

When adding any node, compute+storage or storage only, ensure you consider what the impact of a failure of that node will be.

For example, if you add one 15TB Storage Only node to a cluster of nodes which each have only 2TB usable, then you would need to ensure 15TB of available space to allow the cluster to fully self heal from the loss of the 15TB node. As such, I recommend ensuring your N+1 (or N+2) node/s are equal to the size of the largest node in the cluster from a capacity, performance and CPU/RAM perspective.

So if your biggest node is an NX-8150 with 44c / 512GB RAM and 20TB usable, you should have an N+1 node of the same size to cover the worst case failure scenario of an NX-8150 failing, OR have the equivalent resources available within the cluster.

By following this one simple rule, your cluster will always be able to fully self heal in the event of a failure, and VMs will fail over and be able to perform at comparable levels to before the failure.

Simple as that! No RAID, Disk group, deduplication, compression, failure, or rebuild considerations to worry about.

Summary:

The above are just a few examples of the advantages Nutanix ADSF provides compared to other HCI products. The operational and architectural complexity of other products can lead to additional risk, inefficient use of infrastructure, misconfiguration and ultimately an environment which does not deliver the business outcome it was originally designed to.

vSphere | PVSCSI Adapters & striped/spanned NTFS volumes

A little while ago I wrote a post titled “Splitting SQL datafiles across multiple VMDKs for optimal VM performance” where I talked about how SQL databases can be split with minimal/no interruption to production to give better performance by spreading the IO load across multiple PVSCSI adapters and virtual machine disks (VMDKs).

In a follow up post titled “SQL & Exchange performance in a Virtual Machine” I mentioned the above article and concluded:

If the DBA is not confident doing this, you can also just add multiple virtual disks (connected via multiple PVSCSI controllers) and create a stripe in guest (via Disk Manager) and this will also give you the benefit of multiple vdisks.

Both posts have been very popular, and one of the comments I got via Twitter was that creating striped or spanned NTFS volumes in guest was not supported by VMware when using PVSCSI.

This is stated in VMware KB “Configuring disks to use VMware Paravirtual SCSI (PVSCSI) adapters (1010398)” as shown below:

[Screenshot: excerpt from VMware KB 1010398 stating spanned volumes are not supported]

Prior to writing both posts I was aware of this KB, but after comprehensively testing this numerous times on different platforms over the years, and more recently on Nutanix, and after liaising with many VMware experts (including several VCDXs), I concluded that this was either a legacy recommendation which needed to be updated, or simply a mistake by the author of the KB (which can happen, as we're all human).

As such, I followed up with VMware by raising an SR on August 14th 2016.

After following up several times I had given up waiting for an answer, but I am pleased to say that today (2nd November 2016) I finally got a reply.

[Screenshot: VMware GSS response to the SR]

In summary, spanned (and striped volumes, which were not mentioned in the KB) are supported and, to quote VMware GSS, "will have no issues".

One strong recommendation I have is: DO NOT use VMDKs hosted in different failure domains (e.g. different LUNs or SAN/NAS systems) in the one spanned/striped volume, as this increases the size of the failure domain and the chances of the volume going offline.

So there you have it: if you need to increase performance for an application and you are not confident splitting databases at the application level, you can (typically) get increased I/O performance by using striped volumes in guest, which are quick and easy to set up. The only downside is you will need to take your DB offline to copy it to the new volume before bringing it back online.

Hope this puts people's minds at ease about striped volumes with PVSCSI.

Splitting SQL datafiles across multiple VMDKs for optimal VM performance

After recently helping multiple customers resolve performance issues with vBCA workloads by configuring multiple PVSCSI adapters and spreading workloads across multiple VMDKs, I wrote: SQL and Exchange performance in a virtual machine.

The post talked about how you should use multiple PVSCSI adapters with multiple VMDKs spread evenly across the adapters to achieve optimal performance and reduce overheads.

But what if you only have a single SQL database? Can we split it across multiple VMDKs and, importantly, can we do this without downtime?

The answer to both, thankfully, is yes!

The below is an example of a worst case scenario for a SQL Server database: a single VMDK (on a single SCSI controller) hosting the operating system, database and logs, especially when it's a business critical application.

In the above scenario the single virtual SCSI controller and/or the single VMDK could both result in lower than expected performance.

We learned earlier that using multiple PVSCSI adapters and VMDKs is the best way to deploy a high performance solution. The below is an example deployment where the OS, pagefile and SQL binaries use one virtual controller and VMDK, four VMDKs for database files are hosted by a further two PVSCSI controllers, and the logs are hosted by a fourth PVSCSI controller and VMDK.

In the above diagram the C:\ is using an LSI Logic controller, which in most cases does not constrain performance; however, since it's very easy to change to a PVSCSI controller and there are no significant downsides, I recommend standardizing on PVSCSI.

Now if we look at our current database, we can see it has one database file and one log file as shown below.

The first step is to update the virtual machine's disk layout as described in the aforementioned article, which should end up looking like the below:

Next we go into Disk Management to rescan for the new storage devices, mark the drives as online, then format them with a 64K allocation unit size, which is optimal for databases. Once this is done you should check My Computer and see something similar to the below:

Next I recommend creating a directory for the database and log files rather than using the root directory, so each drive should have a new folder as per the example below.

The next step is to create the new database files on each of the new drives as shown below.

If the original database is, for example, 10GB with say 2GB free space (i.e. roughly 8GB of actual data, or about 2GB per file across four files) and you plan to split the database across four drives, then each of the new database files should be sized at no more than 2GB each to begin with. This prepares us to shrink the original DB and helps ensure the data is evenly spread across the new database files.

In the above screenshot, we can see the database files are limited to 2000MB. This is on purpose, as we don't want the database files expanding, which can result in an uneven spread of data during the redistribution process I will cover later.
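For those who prefer scripting over the SSMS GUI, the same step can be done with T-SQL along the lines of the sketch below. The database name (tpcc, as used in the space report script later in this post), the logical file names, drive letters and folder names are assumptions for illustration; adjust them to your environment.

-- Hypothetical example: add three new 2000MB data files to the PRIMARY
-- filegroup, one per new drive, with autogrow disabled (FILEGROWTH = 0)
-- so the data is redistributed evenly. Names and paths are assumptions.
USE master;
GO
ALTER DATABASE tpcc
    ADD FILE (NAME = tpcc_data2, FILENAME = 'D:\tpcc\tpcc_data2.ndf',
              SIZE = 2000MB, MAXSIZE = 2000MB, FILEGROWTH = 0)
    TO FILEGROUP [PRIMARY];
ALTER DATABASE tpcc
    ADD FILE (NAME = tpcc_data3, FILENAME = 'E:\tpcc\tpcc_data3.ndf',
              SIZE = 2000MB, MAXSIZE = 2000MB, FILEGROWTH = 0)
    TO FILEGROUP [PRIMARY];
ALTER DATABASE tpcc
    ADD FILE (NAME = tpcc_data4, FILENAME = 'F:\tpcc\tpcc_data4.ndf',
              SIZE = 2000MB, MAXSIZE = 2000MB, FILEGROWTH = 0)
    TO FILEGROUP [PRIMARY];
GO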

Switch the recovery model of the database to SIMPLE.
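If you prefer T-SQL over the GUI, this is the equivalent (database name assumed to be tpcc):

-- Switch the database to the SIMPLE recovery model before redistributing data
ALTER DATABASE tpcc SET RECOVERY SIMPLE;
GO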

Now go to the database, navigate to Tasks, Shrink and select “Files”

Now select the “Empty File by migrating data to other files in the same filegroup” option and press “Ok”.
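The equivalent T-SQL is a DBCC SHRINKFILE with the EMPTYFILE option; the logical name of the original data file (tpcc_data1 here) is an assumption:

-- Migrate data out of the original data file into the other files in the
-- same filegroup. The primary file cannot be fully emptied, so expect the
-- error discussed below once SQL Server has moved everything it can.
USE tpcc;
GO
DBCC SHRINKFILE (N'tpcc_data1', EMPTYFILE);
GO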

Depending on the size of the database and the speed of the storage this may take some time, and it will have at least some impact on the performance of the server. As such, I recommend performing this process outside of peak hours if possible.

The error below is expected, as we do not want to empty out the first *.mdf file completely. It is also an indication that the empty file operation has completed up to the limit we set earlier.

Once the task has completed you should see a roughly even distribution of data across the four database files by running the script below in a query window.

-- Report the current size and free space of each file in the database
USE tpcc
GO
SELECT DB_NAME() AS DbName,
       name AS FileName,
       size/128.0 AS CurrentSizeMB,
       size/128.0 - CAST(FILEPROPERTY(name, 'SpaceUsed') AS INT)/128.0 AS FreeSpaceMB
FROM sys.database_files;

[Screenshot: query output showing a roughly even distribution of data across the four database files]

Next we want to configure autogrow on our database files so they can grow during business as usual operations.

The above shows the database files are configured to autogrow by 100MB up to a limit of 2048MB each. The amount a database file should autogrow will vary based on the rate of growth in your database, as will the file size limit, so consider these values carefully.
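If you are scripting the change rather than using the GUI, the equivalent looks something like the sketch below; the logical file names are assumptions carried over from the earlier example.

-- Re-enable autogrow at 100MB increments and cap each file at 2048MB
-- (one property is changed per MODIFY FILE statement)
ALTER DATABASE tpcc MODIFY FILE (NAME = tpcc_data2, FILEGROWTH = 100MB);
ALTER DATABASE tpcc MODIFY FILE (NAME = tpcc_data2, MAXSIZE = 2048MB);
ALTER DATABASE tpcc MODIFY FILE (NAME = tpcc_data3, FILEGROWTH = 100MB);
ALTER DATABASE tpcc MODIFY FILE (NAME = tpcc_data3, MAXSIZE = 2048MB);
ALTER DATABASE tpcc MODIFY FILE (NAME = tpcc_data4, FILEGROWTH = 100MB);
ALTER DATABASE tpcc MODIFY FILE (NAME = tpcc_data4, MAXSIZE = 2048MB);
GO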

Once you have set these settings it's now time to shrink the original file to the same size as the other database files, as shown below:

This process cleans up white space (empty space) within the database.
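A hedged T-SQL equivalent for this step, again assuming the original data file's logical name is tpcc_data1:

-- Shrink the original data file to approximately 2000MB, releasing the
-- white space left behind after the data was migrated to the new files
USE tpcc;
GO
DBCC SHRINKFILE (N'tpcc_data1', 2000);
GO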

So far we have achieved the following:

  1. Updated the VM with additional PVSCSI controllers and more VMDKs
  2. Initialized the VMDKs and formatted them in the guest OS
  3. Created three new database files
  4. Balanced the database across the four database files (including the original file)

We have achieved all of this without taking the database offline.

At this stage the virtual machine and SQL can be left as is until such time as you can schedule a short maintenance window to perform the following:

  1. Copy the original DB file from C: to the remaining new database VMDK
  2. Copy the original Logs file from C: to the new logs VMDK

This process only takes a few minutes plus the time to copy the database and logs. The duration of the file copy will depend on the size of your database and the performance of the underlying storage. The good news is that, with the virtual machine having already been partially optimized with more PVSCSI controllers and VMDKs, the read (copy) process will be served by one SCSI controller/VMDK and the paste (write) process by another, which will minimize the downtime required.

Once you have locked in your maintenance window, all you need to do is ensure all users and applications dependent on the database are shut down, then detach the database, select the "Drop Connections" and "Update Statistics" options, and press Ok.
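The detach can also be scripted; below is a sketch assuming the database is named tpcc. Setting @skipchecks to 'false' performs the same statistics update as ticking "Update Statistics" in the GUI.

-- Drop existing connections, then detach the database
USE master;
GO
ALTER DATABASE tpcc SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
GO
EXEC master.dbo.sp_detach_db @dbname = N'tpcc', @skipchecks = N'false';
GO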


The next steps are very simple; we need to copy (or rather move/cut) the database from the original location as shown below:

Now we paste the database file to the new data1 drive.

Then we copy the log file and paste it into the new log drive.

Now we simply reattach the database, specifying the new location of the *.mdf file. You will note the message highlighted below, which indicates the log file was not found; this is expected since we have just relocated it.

[Screenshot: Attach Databases dialog showing the log file not found message]

To resolve this, simply update the path to the log file as shown below and press Ok.
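The attach step can likewise be scripted; the drive letters and file names below are assumptions for illustration. Only the relocated files need to be specified, since the remaining data files have not moved:

-- Attach the database from its new location, pointing SQL Server at the
-- relocated primary data file and log file (paths are assumptions)
CREATE DATABASE tpcc
ON (FILENAME = N'D:\tpcc\tpcc_data1.mdf'),
   (FILENAME = N'L:\tpcc_log\tpcc_log.ldf')
FOR ATTACH;
GO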

And we’re done! Simple as that.

Adjust the maximum growth of the data files to an appropriate size. If you set it to unlimited, ensure that you monitor the volumes and manage them according to the growth rate of the database.

Lastly, don't forget to change the database recovery model back to FULL.
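Or via T-SQL:

-- Return the database to the FULL recovery model now the work is complete
ALTER DATABASE tpcc SET RECOVERY FULL;
GO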

Now you have your OS separated from your SQL database and logs, and all of the drives are spread across four virtual SCSI controllers.

Summary:

If you have an existing SQL Server and storage performance is considered a problem, before buying new storage (Nutanix or otherwise), ensure you optimize the virtual machine's storage layout, as the constraint may not be the underlying storage.

As this post explains, most of this optimization can be done without taking the database offline, so you don't really have anything to lose by following this process. The worst case scenario is that performance does not improve, but you have eliminated the VM storage layout as the constraining factor, and when you do implement new Nutanix nodes or any other underlying storage, you will get the most out of it. Also follow other best practices such as RAM to vCPU balancing, SQL memory optimization, trace flags and database compression, be it row or page.

Acknowledgements:

A huge thank you to Kasim Hansia from the Nutanix Business Critical Applications (vBCA) team for documenting this process and allowing me to publish this post using his screenshots. It’s a pleasure working with such a talented group at Nutanix both in the vBCA team and in the broader organization.

Related Articles:

  1. SQL and Exchange performance in a virtual machine
  2. How to successfully virtualize Microsoft Exchange
  3. MS support for SQL on NFS datastores