Disaster Recovery Without Boundaries Part 2


by James Keating III, Business Technology Architect, Evolving Solutions

In my last blog post, I covered a scenario that allows for offsite backups with a simple disaster recovery mechanism in the cloud. In this post I will go into a more complicated scenario where both the recovery time objective (RTO) and the recovery point objective (RPO) are critical. Before I get into this scenario, I will again set the stage on which aspects of cloud computing we will be using to help with disaster recovery, and also more clearly define RPO and RTO.

The attributes of cloud that can be utilized for disaster recovery are as follows:

  • On demand computing
  • Containerized workload
  • Location agnostic
  • Speed of implementation

Each of these attributes lends itself nicely to improving disaster recovery capabilities in terms of cost (both capital and labor), speed of recovery, and the ability to automate.

Definitions:

RTO = Recovery Time Objective, or the total time from when a disaster is declared to when the system(s) are operational again.  Key to this concept is that it is not the entire time a system is down in a disaster, but the time from when a disaster is declared (which could be hours after the incident started) until the system is back up and running.

RPO = Recovery Point Objective, or the time difference between when the system went offline and the point in time to which the data is restored.  Boiled down, this is really the amount of data one is willing to lose.  An RPO of 30 minutes says that when we get back up and running, the data will be consistent and intact up to a point no more than 30 minutes before the system went offline, or a loss of at most 30 minutes of data.
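
To make the distinction concrete, here is a minimal sketch in Python (all timestamps are invented for illustration) showing that the two objectives are measured against different points on the timeline:

```python
from datetime import datetime

# Hypothetical timeline for a single incident (all timestamps are invented).
last_good_replica = datetime(2015, 6, 1, 9, 30)   # newest consistent copy of the data
system_went_down  = datetime(2015, 6, 1, 10, 0)   # outage begins
disaster_declared = datetime(2015, 6, 1, 11, 0)   # declaration can lag the incident
systems_restored  = datetime(2015, 6, 1, 17, 30)  # applications running again in the cloud

# The RTO clock starts at declaration, not at the moment the system went down.
rto_achieved = systems_restored - disaster_declared
# The RPO is the data lost: the gap between the last usable copy and the outage.
rpo_achieved = system_went_down - last_good_replica

print(f"RTO achieved: {rto_achieved}")  # 6:30:00 -> within an 8 hour objective
print(f"RPO achieved: {rpo_achieved}")  # 0:30:00 -> within a 60 minute objective
```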

Now onto our scenario for this blog post.

Goals:

  • Meet an RTO of 8 hours and an RPO of 60 minutes
  • Meet the above RTO and RPO with 100% of production's compute capacity and performance
  • All applications for this scenario will be ones that reside on a VMware virtual environment in production
  • Achieve all of the above without investing in the infrastructure and management of a second data center location.

Items we will need for this scenario:

  • Containerization of workload – As stated in the goals, we will be using our VMware environment as the base for our systems and applications that are part of this scenario.
  • Backup software – For this scenario we will use the backup software already in place in our environment.  This can be a mix of products, since this portion of the setup exists to protect against corruption and to provide a way to still meet the RTO if we have a corruption event (noting that in a corruption event our RPO of 60 minutes is likely not going to be met; more on this later).  For this example we will say we are using both IBM TSM and Symantec NetBackup.
  • Replication Agent – For this particular scenario we could use several methods, from VMware replication to other dedicated software.  To meet our RTO and RPO we will choose software that can act as both a replication agent and a failover automation mechanism: Zerto.  (A sketch of how replication lag could be checked against the RPO follows this list.)
  • Cloud Storage Gateway – For this we will choose a NetApp system that will serve as the backup storage target for both IBM TSM and Symantec NetBackup, and will then mirror our backup data to our cloud provider.
  • Cloud Provider – Since the goal is 100% of production's horsepower in terms of performance, and we need to meet our RTO and RPO, we will need a cloud provider that allows a much more customizable setup than we would get from the common cloud names such as AWS or Azure.  Since we are also assuming a heterogeneous environment of both Linux and Windows inside our VMware, we will choose PEAK.
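
To illustrate how the 60-minute RPO in the goals above could be watched on an ongoing basis, here is a minimal sketch. It does not use Zerto's actual interface; the VM names, timestamps, and the idea of pulling the last replicated checkpoint per VM are assumptions made for illustration only:

```python
from datetime import datetime, timedelta

RPO_TARGET = timedelta(minutes=60)

# Hypothetical inventory of protected VMs and the timestamp of their most recent
# replicated checkpoint at the cloud site. In practice these values would come
# from the replication tool's reporting interface; they are hard-coded here only
# to illustrate the check.
last_checkpoint = {
    "erp-db-01":  datetime(2015, 6, 1, 10, 45),
    "erp-app-01": datetime(2015, 6, 1, 10, 50),
    "web-01":     datetime(2015, 6, 1, 9, 10),   # deliberately stale for the example
}

now = datetime(2015, 6, 1, 11, 0)
for vm, checkpoint in sorted(last_checkpoint.items()):
    lag = now - checkpoint
    status = "OK" if lag <= RPO_TARGET else "RPO AT RISK"
    print(f"{vm}: last checkpoint {lag} old -> {status}")
```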

Since this is a much more detailed scenario than the previous blog post, I will go into some details and notes below.  This is by no means a complete listing of considerations and requirements, as the nature of the applications and backup software will play a huge factor in how this would be rolled out in a real-world environment.  If you want to know more, you can contact Evolving Solutions to go over your particular scenario.

Notes and items about our scenario and choices:

  • The ability to use software to replicate data from our on-premises data center to our cloud data center depends on network considerations and also ease of use.  Zerto was chosen precisely because it is easy to configure and maintain, it has automation features that reduce exposure to human-interaction risk, and it is optimized for asynchronous replication of VMware systems.  This will help us meet both the RTO and RPO goals.
  • We chose PEAK because we are using VMware.  We need a VMware configuration on the cloud side that can be treated just like our on-premises VMware.  PEAK can provide that in the form of a dedicated private vCenter that we can administer the same way we do our local VMware; this is not something you can get from all cloud providers.  Second, PEAK is native NetApp, which helps us get our backup sets to the cloud provider efficiently using a common set of commands.  Third, PEAK offers a 100% uptime SLA at the hypervisor level, so we can be sure our VMware in the cloud will be available at the time of a disaster.
  • Backup software and backup data, while not the first source of data for us, are still required.  While we are replicating data from our primary to our cloud environment, we run the risk of corruption: if our data gets corrupted on the primary side, it will be corrupted on the secondary side within seconds to minutes as the replication happens.  The backup copies are an insurance policy against this possibility.  Since we are using backup software, we will need that software available on both sides and a method of sharing backup sets between the sites.  This is our NetApp storage, which both IBM TSM and Symantec NetBackup will write to as disk-based backup locations.  If needed, we can then bring up the backup applications on the cloud side, read from the cloud storage, and begin restores at disk speed.
  • NetApp as the backup target – This was chosen because it fits the PEAK cloud model nicely and allows for ease of administration: the backup software writes to a CIFS or NFS share, and those shares are both snapshotted and replicated to a NetApp system sitting at our PEAK cloud location.  This makes administration of the backup storage identical on both sides of the equation.  (A small sketch of checking snapshot freshness at both sites follows this list.)
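
As referenced in the last bullet, here is a minimal sketch of how one might verify that the backup share has a recent snapshot at both the on-premises and PEAK NetApp systems. The mount points, snapshot directory layout, and one-hour freshness window are assumptions for illustration, not a description of a specific NetApp configuration:

```python
import os
from datetime import datetime, timedelta

# Hypothetical snapshot directories for the backup share on each NetApp system.
# Paths, snapshot layout, and the freshness window are illustrative only.
SITES = {
    "on-prem": "/mnt/backup_share/.snapshot",
    "peak-cloud": "/mnt/backup_share_mirror/.snapshot",
}
MAX_AGE = timedelta(hours=1)  # matches an hourly snapshot/replication schedule

def newest_snapshot_age(snapshot_dir):
    """Return the age of the newest snapshot in the directory, or None if empty."""
    entries = [os.path.join(snapshot_dir, name) for name in os.listdir(snapshot_dir)]
    if not entries:
        return None
    newest = max(os.path.getmtime(entry) for entry in entries)
    return datetime.now() - datetime.fromtimestamp(newest)

for site, path in SITES.items():
    if not os.path.isdir(path):
        print(f"{site}: snapshot directory {path} not found -- investigate")
        continue
    age = newest_snapshot_age(path)
    if age is None:
        print(f"{site}: no snapshots found -- investigate")
    elif age > MAX_AGE:
        print(f"{site}: newest snapshot is {age} old -- behind schedule")
    else:
        print(f"{site}: newest snapshot is {age} old -- OK")
```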

So, with this setup we have essentially built a secondary cloud site that looks as similar as possible to our primary site.  We have an equal amount of compute power on both sides.  We have backup software and application licensing on both sides so we can run at either site.  We have replication in two flavors to protect our data from corruption and/or loss.  We will likely have a larger storage investment in terms of capacity at the cloud location, depending on how long we wish to retain offsite backups, since the cloud side will also be used as offsite backup storage.  So you can see this scenario is very much a redundant configuration, and it will have costs in line with that.  The cost savings we do have will be related to facilities, in that we do not have to maintain cooling, power, and building facilities for our cloud location.
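
As a rough illustration of how the cloud-side capacity requirement grows with the offsite retention window, here is a back-of-the-envelope sketch; the backup sizes, change rate, and schedule are invented numbers, not figures from the scenario:

```python
# Back-of-the-envelope capacity estimate for offsite backup retention in the cloud.
# All figures below are invented for illustration.
FULL_BACKUP_TB = 10.0      # size of one weekly full backup
DAILY_CHANGE_RATE = 0.03   # fraction of data changed per day (drives incremental size)
RETENTION_WEEKS = 8        # how long offsite backup sets are kept

# One weekly set = a full backup plus six daily incrementals.
weekly_set_tb = FULL_BACKUP_TB + 6 * (FULL_BACKUP_TB * DAILY_CHANGE_RATE)
total_tb = weekly_set_tb * RETENTION_WEEKS

print(f"Each weekly backup set: {weekly_set_tb:.1f} TB")
print(f"Cloud-side capacity for {RETENTION_WEEKS} weeks of retention: {total_tb:.1f} TB")
```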

Again, this scenario is much more complex than the one described in my first blog post.  I also realize the above details may be too much to absorb while reading a blog post, so if you find yourself with questions, do not hesitate to contact Evolving Solutions to go over how your situation would fit into the above model and what real-world obstacles and limitations might exist.

________________

James Keating III is a Business Technology Architect for Evolving Solutions. James is a technology leader in the IT community and brings a combination of excellent technical skills and business acumen which allows him to guide customers in developing IT solutions to meet business requirements.