Recent outages at Amazon and Google have got me thinking about resiliency in the cloud. When you use a cloud service, whether you are consuming an application (backup, CRM, email, etc.) or just using raw compute or storage, how is your data being protected? A lot of companies assume that the provider is doing regular backups, storing data in geographically redundant locations, or even maintaining a hot site somewhere with a copy of their data. Here's a hint: ASSUME NOTHING. Your cloud provider isn't in charge of your disaster recovery plan, YOU ARE!

Yes, several cloud providers offer a fair amount of resiliency built in, but not all of them do, so it's important to ask. Even within a single provider, policies can differ by service. Amazon Web Services is a good example: with EC2, users are responsible for their own failover between zones, while with S3, data is automatically replicated between zones in the same region. Here is a short list of questions I would ask a provider about its resiliency:

  • Can I audit your business continuity/disaster recovery (BC/DR) plans?
    • Can I review your BC/DR planning documents?
  • Geographically, where are your recovery centers located?
    • In the event of a failure at one site, what happens to my data?
    • Can you guarantee that my data will not be moved outside of my country/region in the event of a disaster?
  • What kinds of service-levels can you guarantee during a disaster?
    • What are my expected/guaranteed recovery time objective (RTO) and recovery point objective (RPO)?
  • What method do you use to back up data (tape, disk, etc.)? How often do backups occur?
    • If I have data loss, what is the protocol for restoring from backup?
    • What is the retention policy for these backups?
    • Where are the backup copies being stored?
  • How resilient is your data center facility?
    • Is it a Tier III or IV equivalent according to the Uptime Institute? 
    • Has it undergone a SAS 70 Type II audit?
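
To make the RTO/RPO and backup-frequency questions a bit more concrete, here is a rough Python sketch of the arithmetic I would run on a provider's answers. The function names, the `replication_lag` parameter, and the example numbers are all hypothetical, and the model is deliberately simple: it assumes the worst case is a failure right before the next backup, plus any delay in getting that backup to a second site.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, replication_lag=timedelta(0)):
    """Return the largest window of data that could be lost.

    Simplifying assumption: a failure hits just before the next backup
    runs, and the most recent backup took `replication_lag` to reach
    the recovery site.
    """
    return backup_interval + replication_lag


def meets_rpo(backup_interval, rpo, replication_lag=timedelta(0)):
    """True if the worst-case data loss fits within the required RPO."""
    return worst_case_data_loss(backup_interval, replication_lag) <= rpo


def meets_rto(estimated_restore_time, rto):
    """True if the provider's estimated restore time fits within the RTO."""
    return estimated_restore_time <= rto


if __name__ == "__main__":
    # Hypothetical requirement: lose no more than 4 hours of data.
    required_rpo = timedelta(hours=4)

    # Nightly backups cannot satisfy a 4-hour RPO.
    print(meets_rpo(timedelta(hours=24), required_rpo))

    # Hourly backups with a 15-minute offsite copy delay can.
    print(meets_rpo(timedelta(hours=1), required_rpo,
                    replication_lag=timedelta(minutes=15)))
```

If the provider's stated backup schedule fails a check like this, that is the cue to either negotiate a different service level or layer your own backups on top.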

I'm sure there are more questions that I haven't thought of, but I think this list is a good starting place. I'd love to get input from all of you. Do you audit your cloud providers for resiliency? What other questions should we be asking?