If you want to keep all your eggs, you need more than one basket.
The recent four-hour outage of Amazon's AWS S3 service can't have escaped your attention.
On Tuesday morning an employee conducting maintenance tasks took out two S3 subsystems in Northern Virginia. Yes, two S3 subsystems were completely taken down. The full impact is yet to be calculated, but according to The Register, analytics firm Cyence suggests Standard & Poor's 500 companies alone lost some £122M from the downtime, while financial services companies in the US dropped an estimated £130M. It also cost a substantial amount for smartphone app creators relying on streamed ads for their revenue model, Nest security cameras were down, and the final summary will make interesting reading. Amazon themselves released a summary report into the 'service disruption', which can be viewed here.
I quote directly from the report here…
‘At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.’
So, human error was the root cause of this outage.
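Amazon's report describes changes to make the tool remove capacity more slowly and to stop it taking any subsystem below its minimum required capacity. As a rough, hypothetical sketch of that kind of guardrail (the function and fleet below are purely illustrative, not Amazon's actual tooling), a removal command can refuse any request that takes out more than a small fraction of a fleet and default to a dry run:

# Hypothetical guardrail for capacity-removal tooling; all names are illustrative.
MAX_REMOVAL_FRACTION = 0.05  # never remove more than 5% of a fleet in one command

def remove_servers(fleet: list[str], to_remove: list[str], dry_run: bool = True) -> list[str]:
    """Validate a removal request before acting on it."""
    unknown = set(to_remove) - set(fleet)
    if unknown:
        raise ValueError(f"Servers not in fleet: {sorted(unknown)}")

    fraction = len(to_remove) / len(fleet)
    if fraction > MAX_REMOVAL_FRACTION:
        raise ValueError(
            f"Refusing to remove {fraction:.0%} of the fleet "
            f"(limit is {MAX_REMOVAL_FRACTION:.0%}) - check your input"
        )

    if dry_run:
        print(f"[dry run] would remove {len(to_remove)} of {len(fleet)} servers")
        return []

    # ...the actual removal call would go here...
    return to_remove

A fat-fingered input can still happen; the point is that it fails loudly instead of quietly taking half a region offline.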
In a recent blog post we pointed out that human error is the most common cause of outages. You can catch up on that here if you missed it (inserts shameless plug).
Additionally, we have reported on recent events where people rely solely on AWS or Azure to house storage and compute for cost reasons. For archival storage, yes, the cloud is a good solution (compute is another discussion entirely), but it shouldn't hold the only copy of your data.
Compliance often demands that we use local data centres for legal reasons, but that means we forgo our cloud provider's geographical protection, and that in itself causes a problem. If you look at the SLA offered by Amazon (you did read the SLA before you signed, didn't you? Good...) you'll find they recommend geographical separation of services for exactly this reason. In fact, their SLA doesn't cover outages to a local area in many cases.
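For those who can use a second region, cross-region replication is the mechanism Amazon provide for exactly this geographical separation. Here is a minimal sketch using Python and boto3; the bucket names and IAM role ARN are placeholders, and both buckets must already exist with versioning enabled:

"""Hedged sketch: enable S3 cross-region replication so a copy of every new
object lands in a second region. Bucket names and the role ARN are placeholders."""
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "my-primary-bucket"                        # e.g. in us-east-1
DEST_BUCKET_ARN = "arn:aws:s3:::my-replica-bucket"         # e.g. in eu-west-1
REPLICATION_ROLE_ARN = "arn:aws:iam::123456789012:role/s3-replication-role"

# Replication requires versioning on the source (and destination) bucket.
s3.put_bucket_versioning(
    Bucket=SOURCE_BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Replicate every new object to the second region.
s3.put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE_ARN,
        "Rules": [
            {
                "ID": "replicate-everything",
                "Prefix": "",            # empty prefix = all objects
                "Status": "Enabled",
                "Destination": {"Bucket": DEST_BUCKET_ARN},
            }
        ],
    },
)

Bear in mind that replication on its own is not a backup: a bad delete or overwrite replicates just as faithfully as good data, which is why the rule further down still matters.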
Look at the recent cases of the Australian Tax Office and King's College London. Both customers used HP 3PAR storage, and both had an issue where the array failed (the KCL one was covered here (shameless plug number two)).
Why worry?
Surely that's easy to recover from…? - Not in this case: the firmware upgrade didn't go to plan and flattened the array.
Ok, so rebuild the array and restore from backup - Yep, that would have worked, and in most cases it would have worked very well. Except the backups were stored… you guessed it, on the same array. Admittedly, the ATO did have other backups, but much slower ones to restore from!
So, Amazon taught many people a lesson that will be hard learned. KCL and the ATO underlined it…
3-2-1
3 copies of your data
2 media types
1 offsite
Simple. Effective. Saves your backside.
I wrote about it here (lots of shameless plugs today!)
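For something concrete to hang the rule on, here is a deliberately minimal Python sketch; the paths and bucket name are assumptions, and a real scheme would add scheduling, verification and retention:

"""Minimal 3-2-1 sketch: copy 1 is the live data on the primary disk, copy 2
goes to a second local medium (e.g. a NAS or USB drive), copy 3 goes offsite
to object storage. Paths and bucket name below are illustrative."""
import shutil
from pathlib import Path

import boto3

PRIMARY = Path("/data/critical")              # copy 1: the live data
SECOND_MEDIUM = Path("/mnt/nas/backups")      # copy 2: a different media type
OFFSITE_BUCKET = "my-offsite-backup-bucket"   # copy 3: offsite

# Copy 2: mirror the data onto the second medium.
shutil.copytree(PRIMARY, SECOND_MEDIUM / "critical", dirs_exist_ok=True)

# Copy 3: push the same files offsite.
s3 = boto3.client("s3")
for path in PRIMARY.rglob("*"):
    if path.is_file():
        s3.upload_file(str(path), OFFSITE_BUCKET, str(path.relative_to(PRIMARY)))

The point is not the tooling - any decent backup product will do this for you - but that the offsite copy lives somewhere the first two copies can't take it down with them.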
This brings me neatly on to the topic of SLAs offered by cloud providers.
If you lose your storage and don't have a backup, what happens? Well, don't expect a payment from Amazon. They will only calculate and offer 'Service Credits' or free storage services for the time you were out, and that's if they apply at all.
So, lesson learned. Use the cloud on its own at your peril, and if you value your business then make sure you have a backup of everything you have there.
Had your data been on Amazon without a backup, would you still be in business today? What revenue would you have lost per hour?
Is your data on Amazon, or any cloud provider for that matter, without an air-gapped 'Restore as a Service' copy? If it is, then call us now for free, impartial advice. OK, I admit it, we may insert small plugs for our service. It works, it isn't expensive, and it'll keep you in business.