NOS 4.5 Delivers Increased Effective SSD Tier Capacity

In addition to the increased effective SSD (and SATA) tier capacity gained by using Erasure Coding (EC-X), which was announced at the Nutanix .NEXT conference earlier this year, the upcoming NOS (Nutanix Operating System) 4.5 provides yet another effective capacity increase for the SSD tier.

Here’s how it works:

The below 4 node cluster has 3 VMs actively using data (known as extents) represented by the A,B,C blocks. This is a very simplified example as VMs will have potentially hundreds or thousands of extents distributed throughout a cluster.

AllHotDataSSD

What we can see in the above diagram is two copies of each piece of data as this is an RF2 deployment. The VM on Node A is using extent A, the VM on Node B is using extent B and the VM on Node C is using extent C.

Because the VMs are using Extents A,B and C, they all remain within the SSD tier including the replicas distributed throughout the cluster. When these extents become cold they will be dynamically moved to the SATA tier.

What is changing in NOS 4.5 is that the Nutanix tiering solution, called ILM (Intelligent Lifecycle Management), now performs up-migrations (from SATA to SSD) on a per extent basis, which means replicas are treated independently of each other. As a result, the hot extents will up-migrate to SSD on the node where the VM is running (via Data Locality), giving all flash performance, while the replicas distributed throughout the cluster will remain in the SATA tier as shown below:

PerExtentUpMigrations

As we can see in the above diagram, all copies of A, B, C and D were in the SATA tier. Then the VM on node A started frequently reading from data A, and the local extent was therefore up-migrated to SSD.

The VM on node B started frequently accessing data D and B. Data D was up-migrated from local SATA, and data B was up-migrated AND localized as it was residing on a remote node. The VM on node C also up-migrated its data from local SATA, the same as the VM on node A.

Now we can see that out of the 8 extents, we have 4 which have been up-migrated and localized (where required) and 4 which remain in the low cost SATA tier.

As a result, the SSD tier’s effective capacity is doubled for RF2 and tripled for RF3. This means that for customers using RF2, the active working set can potentially double while still providing all flash performance.
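The arithmetic behind this claim can be sketched as follows. This is a simplified model and not Nutanix code: it assumes every hot extent is read-hot, so only the local copy needs to live in SSD. The function name and the 1.6 TB figure are purely illustrative.

```python
def effective_ssd_capacity(raw_ssd_tb, rf, per_extent_ilm):
    """Return the amount of unique hot data (TB) the SSD tier can hold.

    Assumption: all hot extents are read-hot. Without per-extent
    up-migration every replica of a hot extent sits in SSD (rf copies);
    with it, only the local copy does (1 copy).
    """
    copies_in_ssd = 1 if per_extent_ilm else rf
    return raw_ssd_tb / copies_in_ssd


raw = 1.6  # hypothetical cluster-wide SSD capacity in TB
for rf in (2, 3):
    before = effective_ssd_capacity(raw, rf, per_extent_ilm=False)
    after = effective_ssd_capacity(raw, rf, per_extent_ilm=True)
    print(f"RF{rf}: {before:.2f} TB -> {after:.2f} TB ({after / before:.0f}x)")
```

Under this assumption the gain is simply the replication factor, which is where the "doubled for RF2, tripled for RF3" figures come from.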

If data is frequently being overwritten, NDFS will detect this and up-migrate both the local and remote copy/copies to ensure write I/O is always serviced by the SSD tier. The below diagram shows Data A being up-migrated to the SSD tier on node C, ready to service the redundant replicas for any write I/O.

PerExtentUpMigrationsWriteIO

As typical mixed workload environments have a higher Read vs Write ratio (e.g. 70/30), up-migrating only one extent when it becomes hot is effective for a large percentage of the I/O.

Even in the event the Read vs Write ratio is reversed (e.g. 30/70), which is typical for VDI environments, the new ILM process will still provide a significant effective increase of the SSD tier by only up-migrating one out of two extents. It should be noted that for VDI solutions, VAAI-NAS already provides huge data reduction savings thanks to intelligent cloning, and as a result it is not uncommon to find large VDI deployments on Nutanix using only the SSD tier.
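To get a feel for how overwrite-heavy data changes the picture, the sketch below extends the idea to a mix of read-hot and write-hot extents. It is a back-of-the-envelope model under stated assumptions, not a description of how ILM is implemented: `write_hot_fraction` is a hypothetical parameter for the share of hot extents that are frequently overwritten (and so keep all replicas in SSD), which is not the same thing as the read/write I/O ratio.

```python
def ssd_capacity_multiplier(rf, write_hot_fraction):
    """Effective SSD capacity gain versus keeping every replica in SSD.

    Assumptions (illustrative only):
      * write-hot extents keep all `rf` copies in SSD
      * read-hot extents keep only the local copy in SSD
    """
    if not 0.0 <= write_hot_fraction <= 1.0:
        raise ValueError("write_hot_fraction must be between 0 and 1")
    avg_copies_in_ssd = write_hot_fraction * rf + (1.0 - write_hot_fraction)
    return rf / avg_copies_in_ssd


for fraction in (0.0, 0.3, 0.7, 1.0):
    gain = ssd_capacity_multiplier(rf=2, write_hot_fraction=fraction)
    print(f"RF2, {fraction:.0%} write-hot extents: {gain:.2f}x")
```

In this model the gain shrinks as more of the hot extents are overwritten, and only disappears when every hot extent is write-hot.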

Summary:

NOS 4.5 delivers double (for RF2) or triple (for RF3) the effective SSD tier capacity, in addition to data reduction savings from technologies such as deduplication, compression and Erasure Coding (EC-X). This feature, like most things with Nutanix, is hypervisor agnostic!

Not bad for a free software upgrade huh!

Deduplication and MS Exchange

Virtualization and Storage always seem to be hot topics with regard to Exchange deployments, and many of you will have seen my post Virtualizing Exchange on vSphere with NFS backed storage a while back.

This post was motivated by a tweet from a fellow VCDX which stated:

dedupe not supported for Exchange, no we can’t turn it off.

Later in the Twitter conversation he went on to say:

To be clear not an MS employee, another integrator MS “master” certified. It’s the whole NFS thing again

I have heard similar statements over the years, and for me the disappointing thing is that the support statement is unclear, as are the motivations behind support statements for Exchange in general (e.g. support for VMDK on NFS).

The only support statement I am aware of regarding Exchange and deduplication is in the TechNet article “Exchange 2013 storage configuration options” under the section “Volume configurations for the Exchange 2013 Mailbox server role”, where it states:

storageexchange

The above statement, which specifically refers to “a new technique to optimize storage utilization for Windows Server 2012”, states that for Stand-alone or High availability solutions de-duplication is not supported for Exchange database files unless the DB files are completely offline and used for backups or archives.

So the first question is “Is array level deduplication supported”?

There is nothing that says that it isn’t supported that I am aware of, so if you are aware of such a statement please let me know in the comments and I will update this post.

My interpretation of the support statement is that array level deduplication is supported and MS have simply called out that the deduplication in Windows 2012 is not. Regardless of whether you agree or disagree with my interpretation, I think it’s safe to say the support statement should be clarified with justification.

The next question I would like to discuss is “Should deduplication be used with Exchange”?

Firstly, we should discuss the fact that Exchange can be deployed with Database Availability Groups (DAGs), which create multiple copies of Exchange databases across up to 16 Exchange Mailbox (or Multi-Role) servers.

The purpose of a DAG is to provide high availability for the application and data.

So if the application is by design making duplicate copies, should the storage be undoing this work?

Before I give my opinion on deduplicating DAG copies, I want to be clear on two things:

1. Deduplication is a well proven technology which many different vendors implement either in-line or post process or in some cases both.

2. As array level deduplication is abstracted from the Guest OS and Application, there is no risk to the application such as data corruption or anything like that.

So back to deduplicating DAG copies.

I work for Nutanix and I wrote our best practice guide for Exchange, which can be found below. In the guide, I recommended compression but not deduplication. In an upcoming update of the document, the recommendation remains to use compression but adds a further recommendation to use Erasure Coding (EC-X) for data reduction.

Nutanix Best Practices Guide: Virtualizing Microsoft Exchange on Web-Scale Converged Infrastructure.

The reason for these recommendations is threefold:

1. Compression + EC-X give excellent data reduction savings for Exchange which generally result in usable capacity higher than RAW capacity while still providing data protection at the storage layer.

2. Deduplicating data which is deliberately written multiple times is a huge overhead on any infrastructure, as data is still processed multiple times by the Guest OS, Storage Network and storage controller even if duplicate copies are not written to disk. To be clear, the Guest OS (CPU) and Storage Network overheads are not eliminated by dedupe.

3. Nutanix recommends the use of hybrid nodes for Exchange with a small percentage of capacity provided by SSD (for all write I/O and hot data) and a large percentage of capacity provided by SATA. As a result the bulk of the data is stored on low cost SATA so the commercial benefit ($ per GB) of deduplication is minimal especially after compression and EC-X.

In my opinion deduplicating everything regardless of its profile is not the answer, so data reduction such as deduplication, compression and Erasure Coding should be able to be turned off for workloads which give minimal benefit.

For Exchange DAGs, deduplication should give excellent data reduction results in line with the number of DAG copies. So if an Exchange DAG has 4 copies, then approx 4:1 data reduction should be achieved right off the bat. Now this sounds great, but when running a DAG on highly available shared storage (SAN/NAS/HCI) it is unnecessary to have 4 copies of data.

In reality, I recommend 2 copies when running on Nutanix because the shared storage provided by Nutanix keeps at least 1 additional copy (if using EC-X) or, where using RF2 or RF3, 2 or 3 copies of data. This means that in the event of a drive or node failure, the data is still available to the application without requiring a DAG failover. The same is true when running Exchange on SAN/NAS/HCI solutions with some form of RAID or replication for data protection.

So the benefit of deduplication would therefore reduce from possibly 4:1 down to 2:1, because only 2 DAG copies are really required if the storage is highly available.
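As a rough illustration of that arithmetic, the sketch below treats the best-case deduplication ratio for DAG copies as simply the number of copies, and also counts how many physical copies would sit on disk without dedupe once storage-layer replication is included. The function names are hypothetical, and the model assumes every DAG copy deduplicates perfectly against the others while ignoring compression and EC-X.

```python
def best_case_dag_dedupe_ratio(dag_copies):
    """Best-case dedupe ratio, assuming all DAG copies dedupe perfectly."""
    if dag_copies < 1:
        raise ValueError("a DAG needs at least one database copy")
    return float(dag_copies)


def physical_copies_without_dedupe(dag_copies, storage_rf):
    """Copies on disk when each DAG copy is itself replicated by the storage."""
    return dag_copies * storage_rf


for copies in (4, 2):
    ratio = best_case_dag_dedupe_ratio(copies)
    on_disk = physical_copies_without_dedupe(copies, storage_rf=2)
    print(f"{copies} DAG copies: up to {ratio:.0f}:1 dedupe, "
          f"{on_disk} physical copies on RF2 without dedupe")
```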

Considering the data reduction from compression and storage solutions supporting Erasure Coding, I think deduplication is only commercially viable/required when using expensive all flash storage which, let’s face it, is not required for Exchange.

If you have chosen an all flash solution and you want to run all workloads on it and eliminate having silos of infrastructure for different workloads, then by all means deduplicate Exchange DAGs, otherwise it will be a super expensive solution. But in my opinion hybrid is still the best solution overall, with the only real advantage of all flash being potentially higher and more consistent performance, depending on many factors.

Summary:

I hope that Microsoft clarify their position regarding support for array level data reduction technologies including deduplication with detailed justifications.

I would be disappointed to see Microsoft come out and update the support policy stating deduplication (for arrays) is not supported, as there is no technical reason it should not be supported (happy to be corrected if credible evidence can be provided), regardless of whether you think it’s a good idea or not.

Having worked in the storage industry for a long time, I have seen many different deduplication solutions used successfully with MS Exchange and I am yet to see any evidence that it is not a totally viable and enterprise grade option for Exchange databases.

The question which remains is, do you need to deduplicate Exchange databases? – My thinking is only where you’re using all flash systems and need to lower cost per GB.

My position is that, when eliminating silos, the better solution is to choose a hybrid solution which gives you the best of all worlds: applications requiring all flash can have all flash, and other workloads can use flash for hot data and lower cost SATA for cold storage or data which doesn’t require SSD (like Exchange).

Peak Performance vs Real World Performance

In this post I will be discussing Real World Performance of Storage solutions compared to peak performance. To make my point I will be using some car analogies which will hopefully assist in getting my point across.

Starting with the Bugatti Veyron Super Sport (below). This car has a W16 engine with 4 turbochargers and produces 1183BHP (~880kW) and has a top speed (peak performance) of 267MPH (431KPH).

bugatti-veyron-super-sport-

The Veyron achieved the world record 267MPH at Volkswagen’s Ehra-Lessien test track in Germany. The test track has a 5.6 mile long straight. This is one of the very few places on earth where the Veyron can actually achieve its peak performance.

Now for the Veyron to achieve the 267MPH, not only do you need a 5.6 mile long straight, but the Veyron’s rear spoiler must NOT be deployed. Rear spoilers provide down-force to maintain stability, so having the spoiler down means the car has a reduced ability to, for example, take corners.

bugatti-veyron-super-sport_100315491_l

In addition to requiring a 5.6 mile long straight and the rear spoiler being down, the Veyron can also only maintain its top speed (peak performance) for 12 minutes before its 26.4-gallon fuel tank is emptied, which is lucky because the Veyron’s specially designed tyres only last 15 minutes at >250MPH.

veyron-tires-2-thumb-550x336

So in reality, while the Bugatti Veyron is one of the fastest (if not the fastest) production cars in the world, even when you have all your ducks in a row, you can still only achieve its peak performance for a very short period of time (in this example <12 mins) and with several constraints, such as reduced ability to corner (due to reduced aerodynamics from the spoiler being down).

Now what about Fuel Economy? The Veyron is rated as follows:

City Driving: 29 L/100 km; 9.6 mpg

Highway Driving: 17 L/100 km; 17 mpg

Top Speed: 78 L/100 km; 3.6 mpg

As you can see, vastly different figures depending on how the Veyron is being used.

There are numerous other factors which can limit the Veyron’s performance, such as weather. For example if the test track is wet, or has strong head winds, the Veyron would not be able to perform at its peak.

bugatti-veyron-wallpaper-7

So while the Veyron can achieve the 267MPH, in the real world its average (or Real World) performance will be much lower and will vary significantly from owner to owner.

At this stage you’re probably asking “What has this got to do with Storage”?

Any Storage solution, be it SAN/NAS or Hyper-Converged, can be configured and benchmarked to achieve really impressive Peak Performance (IOPS), much like the Veyron.

But these “Peak Performance” numbers can rarely (if at all) be achieved with “Real World” workloads, especially over an extended duration.

To quote two great guys in the Storage industry (Vaughn Stewart & Chad Sakac):

Absolute performance more often than not, is NOT the only design consideration.

I couldn’t agree with this more. The storage vendors are to blame for advertising unrealistic IOPS numbers based on 4K 100% read, and now customers expect the same number of IOPS from SQL or Oracle.

The MPG of the Veyron is like the number of IOPS a Storage array can achieve. It depends on how the Car or Storage Array is used! The car will get higher MPG if used only on the highway, just like a Storage Array will get higher IOPS if only used for one I/O profile.

As the IO size and profile of workloads like SQL & Oracle are vastly different from the peak performance benchmarks using 4K 100% Read IOPS, expecting the same IOPS number for the benchmark and SQL/Oracle is as unrealistic as expecting the Veyron to do 267MPH in heavy traffic.

heavy-traffic-beirut-saidaonline
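One reason the numbers differ so much is simple arithmetic: throughput is IOPS multiplied by I/O size, so a system limited by bandwidth delivers far fewer IOPS at larger I/O sizes. The sketch below illustrates this. The 400 MB/s ceiling is a made-up figure, and real storage is also limited by latency, read/write mix and randomness, all of which this ignores.

```python
def max_iops(bandwidth_mb_s, io_size_kb):
    """Upper bound on IOPS when bandwidth is the only bottleneck."""
    if io_size_kb <= 0:
        raise ValueError("io_size_kb must be positive")
    return bandwidth_mb_s * 1024 / io_size_kb


bandwidth = 400  # hypothetical bandwidth ceiling in MB/s
for io_kb in (4, 8, 32, 64):
    print(f"{io_kb:>2}K I/O: at most {max_iops(bandwidth, io_kb):,.0f} IOPS")
```

The same hypothetical system that looks like a 100K IOPS array at 4K would top out at a small fraction of that with 32-64K I/O, before any other bottleneck is considered.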

But like I said, it’s the storage vendors’ fault for failing to educate customers on real world performance, so many customers have the impression that peak IOPS is a good measurement. As a result, customers regularly waste time comparing Peak Performance of Vendor A and Vendor B, instead of focusing on their requirements and Real World performance.

In the real world, (at least in the vast majority of cases) customers don’t have dedicated storage solutions for one application where peak performance can be achieved, let alone sustained for any meaningful length of time.

Customers generally run numerous mixed workloads on their storage solutions: everything from Active Directory, DNS, DHCP etc., which have low capacity/IOPS requirements, to Database, Email and Application servers, which may have higher capacity/IOPS requirements, and backups, which are low IOPS but high capacity.

Each of these workloads has a different IO profile and, depending on the storage architecture, may share storage controllers / SSDs / HDDs / storage networking, all of which can result in congestion / contention which leads to reduced performance.

Before you start considering which vendor’s storage solution is best, you need to first understand (and document) your requirements along with success criteria which you can validate storage solutions against.

If your requirements are for example:

  • Host 10TB of Exchange Mailboxes for 2000 users (~400 random Read/Write 32-64k IOPS)
  • Host 20TB Windows DFS solution
  • Host 50TB of Backups
  • Support 1TB active working set SQL Database
  • Host 10TB of misc low IO random workload
  • Have Per VM snapshot / backup / replication capabilities

Then there is no point having (or testing) a solution for 100k Random Read 4k IOPS, as your requirement may be less than 10K IOPS of varying sizes and profile.

Consider this:

If the storage solution/s you’re considering can achieve the 10K IOPS with the I/O profile of your workloads and can be easily scaled, then a solution able to achieve 20K IOPS on day 1 is of little/no advantage over a solution which can achieve 12K IOPS, since 10K IOPS is all that you need.

Now if your Constraints are:

  • 12RU rack space
  • 4kW Power
  • $200k

Anything that’s larger than 12RU, uses more than 4kW of power or costs more than $200k is not something you should spend your time looking at / benchmarking etc., since it’s not something you can purchase.
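The requirement and constraint checks above can be sketched as a simple filter. Everything in the candidate list (names, sizes, power, prices, IOPS) is invented for illustration; only the 10K IOPS requirement and the 12RU / 4kW / $200k constraints come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    rack_units: int
    power_kw: float
    cost_usd: int
    workload_iops: int  # IOPS measured with YOUR I/O profile, not 4K reads


# Constraints and requirement taken from the example in the text.
MAX_RU, MAX_KW, MAX_COST, REQUIRED_IOPS = 12, 4.0, 200_000, 10_000


def worth_a_poc(c: Candidate) -> bool:
    """True only if the candidate fits every constraint and the requirement."""
    return (c.rack_units <= MAX_RU
            and c.power_kw <= MAX_KW
            and c.cost_usd <= MAX_COST
            and c.workload_iops >= REQUIRED_IOPS)


# Hypothetical candidates; none of these figures describe real products.
candidates = [
    Candidate("Solution A", 8, 3.2, 180_000, 12_000),
    Candidate("Solution B", 16, 5.5, 350_000, 100_000),
    Candidate("Solution C", 10, 3.8, 195_000, 20_000),
    Candidate("Solution D", 6, 2.0, 90_000, 6_000),
]

shortlist = [c.name for c in candidates if worth_a_poc(c)]
print(shortlist)
```

Note that the hypothetical "Solution B" has by far the highest IOPS figure yet never makes the shortlist, because it breaks every constraint.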

So to quote Vaughn and Chad again, “Don’t perform Absurd Testing”.

absurdtesting

In my opinion, customers should value their own time enough not to waste it doing proofs of concept (PoCs) on multiple different products when in reality only 2 meet their requirements.

An example of Absurd testing would be taking a Toyota Corolla on a test drive to a drag strip and testing its 1/4 mile performance when you plan to use the car to pick-up the shopping and drop the kids off at school.

school crossingcarshopping

It’s equally absurd to test 100% Random Read 4k IOPS or consider/test/compare a storage solution’s <insert your favourite feature here> when it’s not required or applicable to your use case.

Summary:

  1. Peak performance is rarely a significant factor for a storage solution.
  2. Understand and document your storage requirements / constraints before considering products.
  3. Create viability/success criteria when considering storage which validate the solution meets your requirements within the constraints.
  4. Do not waste time performing absurd testing of “Peak performance” or “features” which are not required/applicable.
  5. Only conduct Proof of Concepts on solutions:
    1. Where no evidence exists on the solution’s capability for your use case/s.
    2. Which fall within your constraints (Cost, Size , Power , Cooling etc).
    3. Which on paper meet/exceed your requirements!
    4. Where you have a documented PoC plan with a detailed success criteria!
  6. As long as the solution you’re considering can quickly, easily and non-disruptively scale, there is no need to oversize day 1.
    1. If the solution you’re considering CAN’T quickly, easily and non-disruptively scale, then it’s probably not worth considering.
  7. The performance of a storage solution can be impacted by many factors such as compute, network and applications.
  8. When Benchmarking, do so with tests which simulate the workload/s you plan to run, not “hero” style 100% read 4k (to achieve peak IOPS numbers) or 100% read 256k (to achieve high throughput numbers).