The ATO 5-day outage, like most outages, was completely avoidable.

A while back I saw news that the Australian Tax Office (ATO) had suffered a major outage of its storage solution, and recently an article was posted titled “ATO reveals cause of SAN failure” which briefly discusses a few of the contributing factors behind the five-day outage.

The article from ITnews.com.au quoted ATO commissioner Chris Jordan as saying:

The failure of the 3PAR SAN was the result of a confluence of events: the fibre optic cables feeding the SAN were not optimally fitted, software bugs on the SAN disk drives meant stored data was inaccessible or unreadable, back-to-base HPE monitoring tools weren’t activated, and the SAN configuration was more focused on performance than stability or resilience, Jordan said.

Before we get into breaking down the issues, I want to start by saying that while this specific incident involved HPE equipment, the problem is not isolated to HPE; every vendor has had customers suffer similar issues. The major failing in this case, and in the vast majority of failures (especially extended outages), comes back to the enterprise architect/s and operations teams failing to do their job. I’ve seen this time and time again, yet only a very small percentage of so-called architects have a methodology, and an even smaller percentage follow one in any meaningful way on a day-to-day basis.

Now, back to the article. Let’s break this down into a few key points.

1. The fibre optic cables feeding the SAN were not optimally fitted.

While the statement is a bit vague, cabling issues are a common mistake which can, and should, be easily discovered and resolved prior to going into production. As per the Nutanix Platform Expert (NPX) methodology, an “Operational Verification” document should outline the tests required to be performed prior to a system going into production and/or following a change.

An example of a simple test for a host (server) or SAN dual-connected to an FC fabric is to disconnect one cable and confirm connectivity remains, then replace the cable, disconnect the other cable and again confirm connectivity.

Another simple test is to remove the power from one FC switch and confirm connectivity via the redundant switch, then restore the power and repeat on the other FC switch.
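These two tests can even be expressed as a model before touching real hardware. The sketch below is purely illustrative (the switch/cable names are assumptions, not real tooling) and simply checks that a host’s FC paths share no single point of failure:

```python
# Purely illustrative sketch: model a host's FC paths as (switch, cable)
# pairs and verify the cable-pull and switch-failure tests on paper.
# All names (switch_a, cable_1, etc.) are assumptions for the example.

def connectivity_survives(paths, failed_component):
    """Connectivity survives if at least one path avoids the failed part."""
    return any(failed_component not in path for path in paths)

host_paths = [("switch_a", "cable_1"), ("switch_b", "cable_2")]

# Cable-pull test: fail each cable in turn and confirm connectivity remains.
cable_test = all(connectivity_survives(host_paths, cable)
                 for _, cable in host_paths)

# Switch-power test: fail each switch in turn and confirm the same.
switch_test = all(connectivity_survives(host_paths, switch)
                  for switch, _ in host_paths)

print(cable_test, switch_test)  # True True for a correctly dual-pathed host
```

A host with both paths through the same switch would pass the cable-pull test but fail the switch-power test, which is exactly why both tests belong in the Operational Verification document.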

Had an Operational Verification document been created to an NPX standard, and subsequently followed prior to going live and after any changes, this cabling issue would very likely not have been a contributing factor to the outage.

This is an architectural and operational failure. The reason it’s an operational failure is because no engineer worth having would complete a change without an operational verification document/s to follow to validate a successful implementation/change.

2. Software bugs on the SAN disk drives meant stored data was inaccessible or unreadable.

In my opinion this is where the vendor is likely more at fault than the customer; however, customers and their architect/s need to mitigate against these types of risks. Again, an Operational Verification document should include tests which confirm functionality (in this case, simple read operations) from the storage during normal and degraded scenarios, such as drive pulls (simulating SSD/HDD failures) and drive shelf loss (i.e.: the loss of a bulk number of drives in a shelf, typically between 12 and 24).

Failure scenarios should be clearly documented, along with the risk/s, mitigation/s and recovery plan, all of which need to be mapped back to the business requirements, e.g.: Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
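As a purely illustrative sketch (the scenarios and hours below are invented, not ATO figures), a failure-scenario register mapped back to RTO/RPO can be checked mechanically for gaps:

```python
# Illustrative failure-scenario register (scenarios and hours are made up,
# not ATO figures) checked against the business RTO/RPO requirements.

BUSINESS_RTO_HOURS = 4
BUSINESS_RPO_HOURS = 1

scenarios = [
    {"scenario": "Single drive failure",         "est_rto_h": 0, "est_rpo_h": 0},
    {"scenario": "Drive shelf loss (24 drives)", "est_rto_h": 8, "est_rpo_h": 1},
]

def unmet_requirements(register):
    """Return scenarios whose documented recovery plan misses RTO or RPO."""
    return [s["scenario"] for s in register
            if s["est_rto_h"] > BUSINESS_RTO_HOURS
            or s["est_rpo_h"] > BUSINESS_RPO_HOURS]

print(unmet_requirements(scenarios))  # ['Drive shelf loss (24 drives)']
```

Any scenario flagged here needs either a mitigation or explicit business sign-off on the risk before the solution goes into production.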

Again, this is both an architectural and operational failure as the architect should have documented/highlighted the risks as well as mitigation and recovery strategy, while the engineers should never have accepted a solution into BAU (Business as Usual) operations without these documents.

3. “Back-to-base HPE monitoring tools weren’t activated”

There is no excuse for this, and the ATO’s architects, and to a lesser extent the operations team, need to take responsibility here. While a vendor should continually be nagging customers to enable these tools, any enterprise architect worth having mandates monitoring tools sufficient to ensure continuous operation of the solution they design. The Operational Verification document would also have steps to test the monitoring tools and ensure the alerting and call-home functionality is working, both before going into production and at scheduled intervals to ensure continued operation.

This is yet another architectural and operational failure.

4. SAN configuration was more focused on performance than stability or resilience.

This not only doesn’t surprise me, but highlights a point I have raised for many years: there is a disproportionately high focus on performance, specifically peak performance, compared to data integrity, resiliency and stability.

In 2015 I wrote “Peak Performance vs Real World Performance” after continuously having to have these discussions with customers. The post covers the topic in reasonable depth, but some of the key points are:

  1. Peak performance is rarely a significant factor for a storage solution.
  2. Understand and document your storage requirements / constraints before considering products.
  3. Create a viability/success criteria when considering storage which validates the solution meets your requirements within the constraints.

In this case the architect/s who designed the solution had tunnel vision around performance, when the solution likely didn’t need to be configured in such a way to meet the requirements, assuming they were well understood, documented and validated.

If the SAN needed to be configured the way it was to meet the performance requirements, then it was simply the wrong solution, because it was not configured to meet the other, vastly more important requirements around availability, resiliency and recoverability. The solution was certainly not validated against any meaningful criteria before going into production, or many of these issues would not have occurred; and in the unlikely event of multiple concurrent failures, the recoverability requirements were clearly not designed for or understood sufficiently.

This is again an architectural and operational failure.

ATO commissioner Chris Jordan also stated:

While only 12 of 800 disk drives failed, they impacted most ATO systems.

This means the solution was designed/configured with a tolerance for just 1.5% of drives to fail before a catastrophic failure would occur. This in my mind is so far from a minimally viable solution it’s not funny. What’s less funny is that this fact is unlikely to have been understood by the ATO, which means the failure scenarios and associated risks were not documented and mitigated in any meaningful way.

As an example, in even a small four-node Nutanix solution with just 24 drives, an entire node’s worth of drives (6) can be lost concurrently (that’s 25%) without data loss or unavailability. In a 5-node Nutanix NX-8150 cluster with RF3, up to 48 drives (of a total of 120, which is 40%) can be lost without data loss or unavailability, and the system can even self-heal without hardware replacement to restore resiliency automatically, so further failures can be tolerated. This kind of resiliency/recoverability is essential for modern datacenters and something that would have at least mitigated, or even avoided, this outage altogether.
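The percentages above are simple arithmetic, worth making explicit when comparing drive-failure tolerance between solutions:

```python
# Drive-failure tolerance as a percentage of total drives.
def tolerance_pct(failed, total):
    return 100 * failed / total

print(tolerance_pct(12, 800))  # 1.5  - ATO SAN: 12 of 800 drives caused the outage
print(tolerance_pct(6, 24))    # 25.0 - 4-node example: one node's 6 drives of 24
print(tolerance_pct(48, 120))  # 40.0 - 5-node NX-8150 RF3 example: 48 of 120 drives
```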

But this isn’t a product pitch; it’s an example of what enterprise architects need to consider when choosing infrastructure for a project, i.e.: what happens if X, Y and/or Z fails, and how does the system recover (i.e.: manually, automatically, etc.)?

Yet another thing which doesn’t surprise me is the fact that failure domains do not appear to have been considered, as the recovery tools were located on the very SAN they were required to protect.

Additionally, some of the recovery tools that were required to restore the equipment were located on the SAN that failed.

It is critical to understand failure scenarios! I’m sounding like a broken record, but the message is simply not getting through to the majority of architects.

Recovery/management tools are no use to you when they are offline. If they are hosted on the same infrastructure which requires those tools to be online in order to recover, then your solution’s recoverability is at high risk.

Yet another architectural failure followed by an operations team failure for accepting the environment and not highlighting the architecture failures.

In most, if not all, enterprise environments, a separate management cluster using storage from a separate failure domain is not a “nice to have”; it’s essential. It is very likely the five-day outage would have been shortened, or at least the cause diagnosed much faster, had the ATO had a small, isolated management cluster running the tooling required to diagnose the SAN.
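One simple way to catch this class of design error on paper is to record where each recovery tool runs versus what it protects; anything hosted inside its own failure domain is a red flag. This is an illustrative sketch with made-up tool and domain names:

```python
# Illustrative check (tool names are made up): flag any recovery tool
# hosted inside the failure domain it is meant to protect.

tools = [
    {"name": "SAN diagnostics", "hosted_on": "SAN-1",        "protects": "SAN-1"},
    {"name": "Backup console",  "hosted_on": "Mgmt-cluster", "protects": "SAN-1"},
]

at_risk = [t["name"] for t in tools if t["hosted_on"] == t["protects"]]
print(at_risk)  # ['SAN diagnostics'] -- offline exactly when it's needed
```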

The article concludes with a final quote from ATO commissioner Chris Jordan:

The details are confidential, he said, but the deal recoups key costs incurred by the ATO, and gives the agency new and “higher-grade” equipment to equip it with a “world-class storage network”.

I am pleased the vendor (in this case HPE) has taken at least some responsibility, and while the details are confidential, from my perspective “higher-grade equipment” and a “world-class storage network” mean nothing without an enterprise architect who follows a proven methodology like NPX.

If the architect/s don’t document the requirements, risks, constraints and assumptions, design a solution with supporting documentation which maps back to these areas, and then produce comprehensive Operational Verification procedures for moving into production (and for subsequent changes) before declaring a change successful, the ATO (and other customers in similar positions) are destined to repeat the same mistakes.

If anyone from the ATO happens to read this, ensure your IT team has a solid methodology for the new deployment, and if they don’t, feel free to reach out and I’ll raise my hand to get involved and lead the project to a successful outcome following the NPX methodology.

In closing, everyone involved in a project must take responsibility. If the architect screws up, the ops team should call it out; if the ops team calls it out and the project manager ignores it, the ops team should escalate. If the escalation doesn’t work, document the issues/risks and continue making your concerns known even after somebody accepts responsibility for the risk. After all, a risk doesn’t magically disappear when a person accepts responsibility; it simply creates a CV-generating event for that person when things do go wrong, and the customer is still left up the creek without a paddle.

It’s long overdue that so-called enterprise architects live up to the standard at which they are (typically) paid. Every major decision by an architect should be documented to a minimum of the standard shown in my Example Architectural Decision section of this blog, as well as mapped back to specific customer requirements, risks, constraints and assumptions.

For the ATO and any other customers, I recommend you look for architects with proven track records and portfolios of project documentation which they can share (even if redacted for confidentiality), as well as certifications like NPX and VCDX which require panel-style reviews by peers, not multiple-choice exams which are all but a waste of paper (e.g.: MCP/VCP/MCSE/CCNA etc). The skills of a VCDX/NPX are transferable to non-VMware/Nutanix environments, as it’s the methodology which forms most of the value; the product experience from these certs still has value and is also transferable, and learning new tech is much easier than finding a great enterprise architect!

And remember, when it comes to choosing an enterprise architect, cheaper is rarely better value.

The balancing act of choosing a solution based on Requirements & Budget Constraints

I am frequently asked how to architect solutions when you have constraint/s which prevent you from meeting requirements. It is not unusual for customers to have an expectation that they need, and can get, the equivalent of a Porsche 911 Turbo for the price of a Toyota Corolla.

It’s also common for less experienced architects to focus straight away on budget constraints before going through a reasonable design phase and addressing requirements.

If you are constrained to the point where you cannot afford a solution that comes close to meeting your requirements, the simple fact is that the customer is in pretty serious trouble no matter how you look at it. But this is rarely the case in my experience.

Typically customers simply need to sit down with an experienced architect and go through what business outcomes they want to achieve. Then the architect will ask a range of questions to help clarify the business goals and translate those into clearly defined Requirements.

I also find customers frequently think they need (or have been convinced by a vendor that they need) more than what they actually do to achieve the outcomes, e.g.: thinking you need active/active datacenters with redundant 40Gb WAN links when all you need is VM High Availability and async replication to DR over 1Gb WAN links.

So my point here is always start with the following:

  • Step 1: What is the business problem/s the customer is trying to solve?

Until the customer, VAR, Solution Architect and vendor/s understand this, and it is clearly documented, do not proceed any further. Without this information, and a clear record of what needs to be achieved, the project will likely fail.

At this stage it’s important to also understand any constraints, such as project timelines, CAPEX budget and OPEX budget.

  • Step 2: Research potential solutions

Once you have a clear understanding of the problem and the desired outcome/requirements, then it’s time for you to research (or engage a VAR to do this on your behalf) what potential products could provide a solution that addresses the business outcome, requirements etc.

  • Step 3: Provide the VAR and/or vendor/s with the detailed business problem/s, requirements and constraints.

This is where the customer needs to take some responsibility. A VAR or vendor who is kept in the dark is unlikely to be able to deliver anything close to the desired outcome. While customers don’t like to give out information such as budget, without it, it just creates more work for everyone, which ultimately drives up the cost to the VAR/vendor and in turn gets passed on to customers.

With a detailed understanding of the desired business outcomes, requirements and constraints the VAR/vendor should provide a high level indicative solution proposal with details on CAPEX and ideally OPEX to show a TCO.

In many cases the lowest CAPEX solution has the highest OPEX which can mean a significantly higher TCO.
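A rough TCO comparison makes this concrete. The numbers below are invented purely for illustration:

```python
# Invented numbers purely for illustration: the lowest-CAPEX option is
# not the lowest total cost of ownership once OPEX is included.

YEARS = 5

options = {
    "Solution A (low CAPEX)":    {"capex": 500_000, "opex_per_year": 250_000},
    "Solution B (higher CAPEX)": {"capex": 800_000, "opex_per_year": 120_000},
}

tco = {name: o["capex"] + YEARS * o["opex_per_year"]
       for name, o in options.items()}

for name, total in tco.items():
    print(f"{name}: {YEARS}-year TCO = ${total:,}")
# Solution A: $1,750,000 vs Solution B: $1,400,000
```

Despite a 60% higher CAPEX, the hypothetical Solution B is $350,000 cheaper over five years, which is why proposals should ideally show OPEX and TCO, not just the purchase price.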


  • Step 4: Customer evaluates High level Solution Proposal/s

The point here is to validate whether the proposed solution will provide the desired business outcome and whether it can do so within the budget. At this stage it is important to understand how the solution meets/exceeds each requirement and to be able to trace the design back to the business outcomes. This is why it’s critical to document the requirements, so the high level design (and future detailed design) can address each criterion.

If the proposed solution/s do meet/exceed the business requirements, then the question is: do they fall within the allocated budget?

If so, great! Choose your preferred vendor solution and proceed.

If not, then a proposal needs to be put to the business for additional budget. If that is approved, again great and you can proceed.

If additional CAPEX/OPEX cannot be obtained then this is where the balancing act really gets interesting and an experienced architect will be of great value.

  • Step 5: Reviewing Business outcomes/requirements (and prioritise requirements!)

So let’s say we have three requirements: R001, R002 and R003. All are important, but the simple fact at this point is the budget is insufficient to deliver them all.

This is where a good solution architect, as a trusted adviser, sets the customer’s expectations. The expectation needs to be clearly set that the budget is insufficient to deliver all the desired business outcomes, and that priorities need to be set.

Sit with the customer and put all requirements into priority order, then it’s back and forth with the vendors to provide a lower cost option (CAPEX, OPEX or both), detailing the requirements which have and have not been met and any/all implications.

In my opinion the key in these situations is not to just “buy the best you can” but to document in detail what can and cannot be achieved, with detailed explanations of the implications. For example, an implication might be that the RPO goes from 1hr to 8hrs and the RTO for a business critical application extends from 1hr to 4hrs. In day-to-day operations the customer would likely not know the difference, but if/when an outage occurred, they would, and they need to be prepared. The cost of a single outage can in many cases exceed the cost of the solution itself, which is why it’s critical to document everything for the customer.
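A sketch of this kind of record, with hypothetical requirement IDs and implications matching the style of the example above:

```python
# Hypothetical requirement register (IDs and implication are illustrative)
# showing which prioritised requirements a reduced-budget proposal fails
# to meet, and the implication that requires customer sign-off.

requirements = [
    {"id": "R001", "priority": 1, "met": True,  "implication": None},
    {"id": "R002", "priority": 2, "met": True,  "implication": None},
    {"id": "R003", "priority": 3, "met": False,
     "implication": "RPO extends from 1hr to 8hrs"},
]

unmet = [r for r in sorted(requirements, key=lambda r: r["priority"])
         if not r["met"]]

for r in unmet:
    print(f"{r['id']} not met: {r['implication']} -> requires sign-off")
```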

In my experience, where the implications are significant and clearly understood by the customer (at both a technical and business level), it is not uncommon for customers to revisit the budget and come up with additional funds for the project. It is the job of the Solution Architect (who should always be the customer advocate) to ensure the customer understands what the solution can/cannot deliver.

Where the customer understands the implications (a.k.a. risks), it is important that they sign off on the risks prior to going with a solution that does not meet all the desired outcomes. The risks should be clearly documented, ideally with examples of what could happen in situations such as failure scenarios.

Basic Flowchart of the above described process:

The below is a basic flowchart showing a simplified process from gathering business requirements/outcomes through to a decision.

Starting in the top left, we have the question “Are the business problem/s you are trying to solve and the requirements documented?”

From there it’s follow-the-bouncing-ball until we come to the dreaded “Is the solution/s within the budget constraints?”. This area is coloured grey for a reason, as this is where the balancing act occurs and delays can happen, as indicated with a delay symbol.

Once the customer fully understands what they are or are not getting AND IT IS FULLY DOCUMENTED, then and only then should a customer proceed (or a VAR/architect recommend proceeding) with a proposal.

(Flowchart: from documenting the business problem/s through to proceeding with a proposal)

The above is not an exhaustive process, but just something to inspire some thought next time you or your customers are constrained by budget.
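The decision flow described above can be sketched roughly as follows (a simplification, like the flowchart itself):

```python
# A simplified sketch of the decision flow: requirements must be
# documented, the solution must meet them, and budget shortfalls loop
# back into re-prioritisation and risk sign-off.

def next_step(documented, meets_requirements, within_budget, extra_budget):
    if not documented:
        return "Document business problem/s and requirements first"
    if not meets_requirements:
        return "Revisit proposed solutions"
    if within_budget or extra_budget:
        return "Proceed with proposal"
    return "Re-prioritise requirements and get risk sign-off"

print(next_step(documented=True, meets_requirements=True,
                within_budget=False, extra_budget=False))
# Re-prioritise requirements and get risk sign-off
```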

Summary:

  • Don’t just “Buy what you can afford and hope for the best”
  • Document how requirements are (or are not) being met (That’s “Traceability” for all you VCDX/NPX candidates)
  • Document any risks/implications and mitigation strategies
  • Get sign-off on any requirements which are not met
  • Always provide a customer with options to choose from; don’t assume they won’t invest more to address risks to the business
  • If you can’t meet all requirements Day 1, ensure the design is scalable. A.k.a Start small and scale if/when required.
  • Avoid low cost solutions that will likely end up having to be thrown out

Related Articles:

Things to consider when choosing infrastructure.

With all the choice in the compute/storage market at the moment, choosing new infrastructure for your next project is not an easy task.

In my experience most customers (and many architects) think about the infrastructure coming up for replacement and look to do a “like for like” replacement with newer/faster technology.

An example of this would be a customer with a FC SAN running Oracle workloads where the customer or architect replaces the end of life Hybrid FC SAN with an All Flash FC SAN and continues running Oracle “as-is”.

Now I’m not saying there is anything wrong with that, however if we consider more than just the one workload, we may be able to achieve our business requirements with a more standardized and cost effective approach than having dedicated infrastructure for specific workloads.

So in this post, I am inviting you to consider the bigger picture.

Let’s take an example customer with the following workload requirements:

  1. Virtual Desktop (VDI)
  2. Virtualized Business Critical Applications (e.g.: SQL / Exchange)
  3. Long Term Archive (High Capacity, low IOPS)
  4. Business Continuity and Disaster Recovery

It is unlikely any one solution from any vendor is going to be the “best” in all areas as every solution has its pros and cons.

Regarding VDI, I would say most people would agree Hyperconverged Infrastructure (HCI) / scale-out type architectures are strong for VDI; however, VDI can also be successfully deployed on traditional SAN/NAS solutions, or using non-shared local storage in the case of non-persistent desktops.

For vBCA, some people believe physical servers with JBOD storage are best for workloads like Exchange, and physical servers with local SSD are best for databases, while many people are realising the benefits of virtualizing vBCA with shared storage such as SAN/NAS or HCI.

For long term archive, cost per GB is generally one of, if not the, most critical factor, where lots of trays of SATA storage connected to a small dual-controller setup may be the most cost effective, whereas an All Flash array would be less likely to be considered for this use case.

For BC/DR, features such as a Storage Replication Adapter (SRA) for VMware Site Recovery Manager, a stretched cluster capability and some form of snapshot capability and replication would be typical requirements. Some newer technology can do per VM snapshots, whereas older style SAN/NAS technology may be per LUN, so newer technology would have an advantage here, but again, this doesn’t mean one tech should not be considered.

So what product do we choose for each workload type? The best of breed right?

Well, maybe not. Let’s have a look at why you might not want to do that.

The below graph shows an example of three vendors being compared across the four categories I mentioned above: VDI, vBCA, Long Term Archive and BC/DR.

(Example graph: Vendors A, B and C rated across VDI, vBCA, Long Term Archive and BC/DR)

The customer has determined that a score of 3 is required to meet their requirements so a solution failing to achieve a 3 or higher will not be considered (at least for that workload).

As we can see, for VDI Vendor B is the strongest, Vendor A second and Vendor C third, but when we compare BC/DR Vendor C is strongest followed by Vendor A and lastly Vendor B.

We can see for Long Term Archive Vendor A is the strongest with Vendor B and C tied for second place and finally for vBCA Vendor B is the strongest, Vendor A second and Vendor C third.

So if we chose the best vendor for each workload type (or the “best of breed” solution), we would end up with three different vendors’ equipment.

  • VDI: Vendor B
  • Long Term Archive: Vendor A
  • BC/DR: Vendor C
  • vBCA: Vendor B

Is this a problem? Not necessarily but I would suggest that there are several things to consider including:

1. Having 3 different platforms to design/install/maintain

This means 3 different sets of requirements, constraints, risks, implications need to be considered.

Some large organisations may not consider this a problem, because they have a team for each area, but isn’t the fact the customer has to have multiple teams to manage infrastructure a problem in itself? Sounds like a significant (and potentially unnecessary) OPEX to me.

2. The best BC/DR solution does not meet the minimum requirements for the vBCA workloads.

In this example, the best BC/DR solution (Vendor C) is also the lowest rated for vBCA. As a result, Vendor C is not suitable for vBCA which means it should not be considered for BC/DR of vBCA. If Vendor C was used for BC/DR of the other workloads, then another product would need to be used for vBCA adding further cost/complexity to the environment.

3. Vendor A is the strongest at Long Term Archive, but has no interoperability with Vendor B and C

Due to the lack of interoperability, while Vendor A has the strongest archiving solution, it is not suitable for this environment. In this example, the difference between the strongest Long Term Archive solution and the weakest is very small, so Vendor B and C also meet the customer’s requirements.

4. Multiple silos of infrastructure may lead to inefficient use.

Just like in the days before virtualization, when the bulk of our servers’ CPU/RAM ran at low utilization levels, we had our storage capacity carved up so that we had lots of free space in one RAID pack but very little free space in others, and we spent lots of time migrating workloads from LUN 1 to LUN 2 to free up capacity for new or existing workloads.

If we have 3 solutions, we may have many TB of available capacity in the VDI environment but be unable to share it with the Long Term Archiving. Or we may have lots of spare compute in VDI and be unable to share it with vBCA.

Now getting back to the graph, the below is the raw data.

(Raw data table: per-category ratings and totals for Vendors A, B and C, summarised below)

What we can see is:

  • Vendor B has the highest total (17.1)
  • Vendor A has the second highest total (14.8)
  • Vendor C has the lowest total (12)
  • Vendor C failed to meet the minimum requirements for VDI & vBCA
  • Vendor A and B met the minimum requirements for all areas
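The “meets the minimum” filter described above is easy to express in code. The ratings below are illustrative assumptions consistent with the orderings in the article, not the actual raw data:

```python
# Ratings are illustrative assumptions consistent with the orderings in
# the article, NOT the actual raw data. A vendor is only considered if it
# meets the minimum rating of 3 in every category.

MINIMUM = 3.0

ratings = {
    "Vendor A": {"VDI": 4.0, "vBCA": 3.8, "Archive": 4.0, "BC/DR": 4.5},
    "Vendor B": {"VDI": 4.6, "vBCA": 4.8, "Archive": 3.5, "BC/DR": 4.2},
    "Vendor C": {"VDI": 2.5, "vBCA": 2.8, "Archive": 3.5, "BC/DR": 4.7},
}

def meets_all_minimums(vendor):
    return all(score >= MINIMUM for score in ratings[vendor].values())

eligible = [v for v in ratings if meets_all_minimums(v)]
print(eligible)  # ['Vendor A', 'Vendor B'] -- Vendor C fails VDI and vBCA
```

Note that "best of breed" per category (e.g. Vendor C for BC/DR) can still be ineligible overall, which is the crux of the standardization argument that follows.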

Let’s consider the impact of choosing Vendor B for all 4 workload types.

VDI – It was the highest rated, met the minimum requirements for the customer and is best of breed, so in this case Vendor B would be a solid choice.

vBCA – Again Vendor B was the highest rated, met the minimum requirements for the customer and is best of breed, so Vendor B would be a solid choice.

Long Term Archiving: Vendor B was equal last, but importantly met the customer requirements. Vendor A’s solution may have more features and higher performance, but as Vendor B met the requirements, the additional features and/or performance of Vendor A are not required. The difference between Vendor A (Best of Breed) and Vendor B was also minimal (0.5 rating difference) so Vendor B is again a solid choice.

BC/DR: Vendor B was the lowest rated solution for BC/DR, but again, focusing on the customer’s requirements, the solution comfortably exceeded the minimum requirement of 3 with a rating of 4.2. Choosing Vendor B meets the requirements and likely avoids any interoperability and/or support issues, meaning a simpler overall solution.

Let’s think about some of the advantages for a customer choosing a standard platform for all workloads in the event a platform meets all requirements.

1. Lower Risk

Having a standard platform minimizes the chance of interoperability and support issues.

2. Eliminating Silos

As long as you can ensure performance meets requirements for all workloads (which can be difficult on centralized SAN/NAS deployments) then using a standard platform will likely lead to better utilization and higher return on investment (ROI).

3. Reduced complexity / Single Pane of Glass Management

Having one platform means not having to have SMEs in multiple technologies, or in larger organisations multiple SMEs per technology (for redundancy and/or workload) meaning reduced complexity, lower operational costs and possibly centralized management.

4. Lower CAPEX

This will largely depend on the vendor and quantity of infrastructure purchased, however many customers I have worked with have excellent pricing from a vendor as a result of standardizing.

Summary:

I am in no way saying “One size fits all” or that “every problem is a Nail” and recommending you buy a hammer. What I am saying is when considering infrastructure for your environment (or your customers), avoid tunnel vision and consider the other workloads or existing infrastructure in the environment.

In many cases the “Best of Breed” solution is not required and in fact implementing that solution may have significant implications in other areas of the environment.

In other cases, workloads may be so mission critical that a best of breed solution may be the only way to meet the business requirements, in which case using a standard platform that does not meet those requirements would not be advised.

However, if you can meet all the customer requirements with a standard platform while working within constraints such as budget, power, cooling, rack space and time to value, then I would suggest you’re doing yourself (or your customer) a disservice by not considering a standard platform for your workloads.

Related Articles:

1. Enterprise Architecture & Avoiding tunnel vision.

2. Peak Performance vs Real World Performance