Integrity of I/O for VMs on NFS Datastores – Part 5 – Data Corruption

This is the fifth part of a series of posts covering how the Integrity of Write I/O is ensured for Virtual Machines when writing to VMDK/s (Virtual SCSI Hard Drives) running on NFS datastores presented via VMware’s ESXi hypervisor as a “Datastore”.

This part will focus on Data Corruption.

As a reminder from the first post, this post is not talking about presenting NFS direct to Windows.

So why am I covering data corruption? Simple: there is a misconception that SCSI commands are not properly supported for VMs running on NFS datastores, and that this leads to corruption. That misconception was covered in Part 1, so Part 5 focuses on data corruption which is not specific to NFS but can affect all storage platforms: how it occurs, and how storage solutions can mitigate the risk.

The following is a summary of the data provided in the paper "An analysis of data corruption in the storage stack".

NetApp conducted a large scale study into data corruption, which covered more than 1 million HDDs across tens of thousands of NetApp systems over 41 months (2004 – 2007). Long story short, NetApp detected a level of data corruption which surprised me and which seems to call into question figures such as the advertised MTBF for HDDs.

The following shows a breakdown of the problems found.

[Image: NetApp failure analysis pie charts]

The first thing I noticed in the above pie charts is the vast difference between the percentage of failures in Enterprise grade disks (left) and nearline based disks (right).

It also shows physical interconnects to be a large percentage of failures, which highlights the need for simplicity in the storage solution. In addition, one of the more surprising results is the level of storage protocol and performance based failures being the cause of corruption.

Note: In this study, the majority of systems deployed were FC (block storage) based. This highlights that any storage protocol, whether block or file based, can have issues if improperly implemented. So regardless of storage protocol, corruption can occur.

The below summary of corruption type and percentage of disks affected shows that SATA drives had roughly 10x more issues than Enterprise grade drives.

[Image: corruption types, nearline vs Enterprise disks]

The above also shows that bit corruptions and Torn Writes affect more disks than lost or misdirected writes, which highlights the importance of Torn I/O Protection (covered in Part 4).

The article summarizes its findings in the following points:

[Image: summary points from the article]

The main takeaways from my perspective are:

1. The requirement to have corruption handling mechanisms for any environment running workloads which require data integrity.
2. Data should be spread out (ideally across disks) to minimize the chance of issues.

The article went on to form these conclusions:

[Image: conclusions from the article]

In Summary:

1. Data corruption can occur on JBOD, enterprise grade storage solutions and everything in between.
2. SATA drives have a much higher rate (~10x) of corruption.
3. Enterprise grade drives are much better from a data integrity perspective.
4. Corruption handling via sector and ideally block based checksums is essential on writes.
5. Using a checksum on Read helps detect corrupted data.
6. Corruption can occur even when no ECC errors are reported by a physical HDD.
7. Any storage protocol implementation can have bugs which can lead to corruption.
8. Backup / Recovery solutions are essential. Reliance solely on primary storage or application level backups using disks puts your data at risk.
9. Solutions solely dependent on application level data protection on disk are at risk of corrupted data being replicated to other active/passive or backup copies.
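
Points 4 and 5 above (checksum on write, verify on read) can be illustrated with a minimal sketch. This is a toy in-memory example, not how any particular storage product implements it; the class name `ChecksummedStore`, its methods and the choice of CRC32 are all assumptions made for illustration.

```python
import zlib


class ChecksummedStore:
    """Toy in-memory block store that keeps a CRC32 alongside every block."""

    def __init__(self):
        self._blocks = {}  # block number -> (data, checksum)

    def write(self, block_no, data):
        # Compute the checksum at write time and store it with the data.
        data = bytes(data)
        self._blocks[block_no] = (data, zlib.crc32(data))

    def read(self, block_no):
        data, stored = self._blocks[block_no]
        # Recompute on read; a mismatch means the data changed after the write.
        if zlib.crc32(data) != stored:
            raise IOError(f"checksum mismatch on block {block_no}")
        return data

    def corrupt(self, block_no, offset):
        # Hypothetical helper: flip one bit to simulate silent media corruption.
        data, stored = self._blocks[block_no]
        damaged = bytearray(data)
        damaged[offset] ^= 0x01
        self._blocks[block_no] = (bytes(damaged), stored)
```

Reading a block after `corrupt()` raises an error instead of silently returning bad data, which is the behaviour point 6 calls for when the drive itself reports no ECC error. A real solution would then repair the block from another copy.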

My final point: enterprise grade storage solutions which use checksums to verify data integrity on writes and reads have a much lower risk of data corruption, regardless of media type and storage protocol.

JBOD style deployments using SATA drives have a significantly higher risk of data corruption, which is contributed to by the SATA drives' 10x higher corruption rates and the lack of the enterprise grade checksum features found in some shared storage (SAN/NAS) solutions.

Integrity of Write I/O for VMs on NFS Datastores Series

Part 1 – Emulation of the SCSI Protocol
Part 2 – Forced Unit Access (FUA) & Write Through
Part 3 – Write Ordering
Part 4 – Torn Writes
Part 5 – Data Corruption

Nutanix Specific Articles

Part 6 – Emulation of the SCSI Protocol (Coming soon)
Part 7 – Forced Unit Access (FUA) & Write Through (Coming soon)
Part 8 – Write Ordering (Coming soon)
Part 9 – Torn I/O Protection (Coming soon)
Part 10 – Data Corruption (Coming soon)

Related Articles

1. What does Exchange running in a VMDK on NFS datastore look like to the Guest OS?
2. Support for Exchange Databases running within VMDKs on NFS datastores (TechNet)
3. Microsoft Exchange Improvements Suggestions Forum – Exchange on NFS/SMB
4. Virtualizing Exchange on vSphere with NFS backed storage

Integrity of I/O for VMs on NFS Datastores – Part 2 – Forced Unit Access (FUA) & Write Through

This is the second part of this series. The focus of this post is a critical requirement for many applications, including MS SQL and MS Exchange (which is designed to work with Block based storage), to operate as designed and to ensure data integrity: support for Forced Unit Access (FUA) & Write Through.

As a reminder from the first post, this post is not talking about presenting NFS direct to Windows.

The key here is for the storage solution to honour the “Write-to-stable” media intent and not depend on potentially vulnerable caching/buffering solutions using non persistent media which may require battery backing.

Microsoft have a Knowledge Base article on the requirements for SQL Server which I would recommend reading. It details the FUA & Write Through requirements, along with other requirements covered in this series.

Key factors to consider when evaluating third-party file cache systems with SQL Server

Forced Unit Access (FUA) & Write-Through is supported by VMware, but even with this support it is up to the underlying storage to honour the request, and how this is done (or whether it is supported at all) may vary from storage vendor to storage vendor.

A key point here is that this process is delivered by the VMDK at the hypervisor level and passed on to the underlying storage. So regardless of the protocol being Block (iSCSI/FCP) or File based (NFS), honouring the request is the responsibility of the storage solution once the I/O is passed to it from the hypervisor.

Where a write cache on non persistent media (i.e. RAM) is used, the storage vendor needs to ensure that in the event of a power outage there is sufficient battery backing to enable the cache to be de-staged to persistent media (i.e. SSD / SAS / SATA).

Some solutions use Mirrored Write Cache to attempt to mitigate the risk of power outages causing issues, but this could be argued to be not in compliance with FUA, which intends the Write I/O to be committed to stable media BEFORE the I/O is acknowledged as written.

If the solution does not ensure data is written to persistent media, it is not compliant and applications requiring FUA & Write-Through will likely be impacted at some point.
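
To make the "commit before acknowledge" intent concrete, here is a minimal application-level sketch. It is only an analogy for what FUA asks of the storage stack, not how ESXi or any storage array implements it, and the function name `durable_write` is an assumption for illustration. Note that `os.fsync` only asks the operating system to flush; whether the data actually reaches stable media still depends on every layer underneath, which is exactly the point made above.

```python
import os


def durable_write(path, data):
    """Write data to path and return only after fsync reports success."""
    # O_BINARY only exists on Windows; fall back to 0 elsewhere.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the OS to flush its buffers for this file to the device
        # before we acknowledge the write to the caller.
        os.fsync(fd)
    finally:
        os.close(fd)
    return len(data)
```

The caller only sees the write as complete after the flush returns, which mirrors the requirement that the I/O is acknowledged only once it is on stable media.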

As I work for a storage vendor, I won't go into detail about any other vendor, but I will have an upcoming post on how Nutanix is in compliance with FUA & Write-Through.

In part three, I will discuss Write Ordering.


Rule of Thumb: Sizing for Storage Performance in the new world.

In the new world where storage performance is decoupled from capacity by new read/write caching and Hyper-Converged solutions, I always get asked:

How do I size the caching or Hyper-Converged solution to ensure I get the storage performance I need?

Obviously I work for Nutanix, so this question comes from prospective or existing Nutanix customers, but it's also relevant to other products in the market, such as PernixData or any Hybrid (SSD+SAS/SATA) solution.

So for indicative sizing (i.e. Presales) where definitive information is not available and/or where you cannot conduct a detailed assessment, I use the following simple Rule of Thumb.

Take your last two monthly full backups, calculate the delta between them and multiply that by 3.

So if my full backup from August was 10TB and my full backup from September is 11TB, my delta is 1TB. I then multiply that by 3 and we get 3TB, which is our assumption of the “Active Working Set” or in basic terms, the data which needs performance (because cold or inactive data can sit on any tier without causing performance issues).

Now I size my SSD tier for 3TB of usable capacity.
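
The arithmetic above can be written as a small helper. This is just a sketch of the rule of thumb as described; the function name `active_working_set_tb` and its parameters are assumptions for illustration.

```python
def active_working_set_tb(previous_full_tb, latest_full_tb, multiplier=3):
    """Estimate the active working set (in TB) from two monthly full backups."""
    delta = latest_full_tb - previous_full_tb
    if delta < 0:
        # A shrinking backup (e.g. heavy deletions) breaks the assumption
        # behind the rule, so refuse to produce a misleading number.
        raise ValueError("latest backup is smaller than the previous one")
    return delta * multiplier


# August full backup 10TB, September full backup 11TB.
ssd_tier_tb = active_working_set_tb(10, 11)
```

With the figures from the example, `ssd_tier_tb` is 3, matching the 3TB SSD tier above.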

The next question is:

Why multiply the backup data delta by 3?

This is based on an assumption (since we don’t have any hard data to go on) that the Read/Write ratio is 70% Read, 30% write.

Now those of you familiar with this thing called Maths would argue that 70/30 is 2.33333, not 3, which is true. Rounding up to 3 is essentially a buffer.

I have found this rule of thumb works very well, and customers I have worked with have effectively had All Flash Array performance because the “Active Working Set” all resides within the SSD tier.

Caveats to this rule of thumb.

1. If a customer does a significant amount of deletions during the month, the delta may be smaller and result in an undersized SSD tier.

Mitigation: Review several months of full backup logs and average the delta.

2. If the environment’s Read/Write ratio is much higher than 70/30, then the delta from the backup multiplied by 3 may again result in an undersized SSD tier.

Mitigation: Perform some investigation into your most critical workloads and validate or correct the assumption of multiplying by 3.

3. This rule of thumb is for Server workloads, not VDI.

The VDI Read/Write ratio is generally almost the opposite of server workloads, at around 30/70 Read/Write. However, the SSD tier for VDI should be sized taking into account the benefits of VAAI/VCAI cloning and things like deduplication (for Memory and SSD tiers), which some products, like Nutanix, offer.
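
The mitigation for the first caveat (averaging the delta over several months of full backups) can be sketched as follows. The function name `average_monthly_delta_tb` and the sample sizes are made up for illustration.

```python
def average_monthly_delta_tb(full_backup_sizes_tb):
    """Average month-to-month growth across a series of full backup sizes (TB).

    Sizes must be in chronological order, oldest first.
    """
    if len(full_backup_sizes_tb) < 2:
        raise ValueError("need at least two full backups to compute a delta")
    deltas = [
        later - earlier
        for earlier, later in zip(full_backup_sizes_tb, full_backup_sizes_tb[1:])
    ]
    return sum(deltas) / len(deltas)


# Four hypothetical monthly full backups, oldest first.
avg_delta = average_monthly_delta_tb([10, 11, 11.5, 13])
```

Here the monthly deltas are 1TB, 0.5TB and 1.5TB, so the average is 1TB. That average is then multiplied by 3 (or by whatever multiplier your own Read/Write investigation supports) as in the rule of thumb.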

Summary / Disclaimer

This rule of thumb works for me 90% of the time when designing Nutanix solutions, but your results may vary depending on the platform you use.

I welcome any feedback or suggestions of alternate sizing strategies which I will update the post with where appropriate.