Container Disaster Recovery Strategies

Scope

This document covers DR strategies for the following Telos Alliance container platforms:

  • Cloud
  • Bare Metal On-Prem
  • Infinity VIP Appliance

All systems fail eventually. From power supplies to corrupt files, no platform, whether self-hosted or running on a Cloud provider's platform, is immune to failure. Defining an effective Disaster Recovery (DR) strategy, including your downtime tolerance (a.k.a. Recovery Time Objective), is an extremely important part of system and solution deployment.

DR tolerance is a business-level decision driven by requirements and budget: your DR strategy needs to balance system resilience targets against cost. As a general rule of thumb, the tighter your DR tolerance, the more expensive the deployment.

Key Decisions for YOU to make:

  • What is the Recovery Point Objective (RPO) for this workload? In other words, how much data can be lost in a DR event?
    • Think of this as a snapshot in time that you can restore a failed system from. Is rolling back to a snapshot from 1 day ago acceptable? Or do you need to be able to restore a backup from an hour ago?
  • What is the Recovery Time Objective (RTO) for this workload?
    • Think of this as the acceptable amount of time for a system or service to be off-air during a Disaster Recovery event. This covers the total time from a hardware or software component failing, through the failure being detected (by manual or automated monitoring) and remediation steps taking place, to the system being back on-air.

Keep in mind that the lower the RPO/RTO goal is, the more expensive the strategy generally becomes.
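
The two objectives above can be checked with simple arithmetic. The following Python snippet is a minimal sketch, not a Telos tool: the class name, function names, and all timings are hypothetical placeholders. It adds up the phases of an outage described above (detection, remediation, return to air) and compares the result to an RTO target. It also treats the backup interval as the worst-case data loss, on the assumption that a failure can occur just before the next backup runs.

```python
from dataclasses import dataclass


@dataclass
class OutageEstimate:
    """Hypothetical per-phase timings for one DR event, in minutes."""
    detection_minutes: float     # time until the failure is noticed
    remediation_minutes: float   # time spent restoring or replacing the system
    restart_minutes: float       # time until the service is back on-air

    def total_downtime_minutes(self) -> float:
        # Downtime runs from the moment of failure to being back on-air.
        return (self.detection_minutes
                + self.remediation_minutes
                + self.restart_minutes)


def worst_case_data_loss_minutes(backup_interval_minutes: float) -> float:
    # Assumption: a failure just before the next backup loses one full interval.
    return backup_interval_minutes


def meets_objectives(estimate: OutageEstimate,
                     backup_interval_minutes: float,
                     rto_minutes: float,
                     rpo_minutes: float) -> bool:
    """Return True if the estimated outage fits both the RTO and the RPO."""
    within_rto = estimate.total_downtime_minutes() <= rto_minutes
    within_rpo = worst_case_data_loss_minutes(backup_interval_minutes) <= rpo_minutes
    return within_rto and within_rpo


if __name__ == "__main__":
    # Illustrative numbers only: daily backups, one-hour RTO, one-day RPO.
    estimate = OutageEstimate(detection_minutes=5,
                              remediation_minutes=30,
                              restart_minutes=10)
    print(estimate.total_downtime_minutes())
    print(meets_objectives(estimate, backup_interval_minutes=24 * 60,
                           rto_minutes=60, rpo_minutes=24 * 60))
```

Filling in realistic timings for your own environment is a quick way to see which phase dominates your downtime. Detection time is easy to overlook when monitoring is manual.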

Backup and Restore

Definition

Backup and Restore is a disaster recovery (DR) strategy that involves periodically creating copies of data and applications, storing them separately, and then using these backups to restore the system after a disaster or outage. This is a cost-effective strategy suitable for workloads where a longer recovery time is acceptable.

Solutions

  • Telos Alliance Container Backup and Restore utility (TABR).

    • Offered free of charge to all Telos Alliance customers.
    • Only backs up the files and settings your Telos container deployment needs, so backups are Kilobytes to low Megabytes in size instead of Gigabytes, making them easier and cheaper to store.
    • The most cost-effective DR solution for supported platforms.
    • If you require assistance in recovering a system from a TABR backup, you MUST have an active SLA with Telos.
    • Offers an RPO down to one hour, but an RPO of 1 day is the default and more than enough for most Backup and Restore strategies.
    • Offers an RTO of minutes to hours (depending on configuration and the manual restore actions performed by the end user).
  • Cloud Native and Hypervisor standard backup workflows (AMI/snapshot, Cloud Provider or Hypervisor backup services, etc.). With modern Cloud Native and Enterprise Hypervisor tooling, you can take snapshots at whatever cadence you want, but this can get costly quickly, especially on Cloud platforms. Again, an RPO of 1 day is likely right for most Telos workflows.

    • Please note that Telos doesn't directly support the DR infrastructure you may choose to deploy in your Cloud provider or Enterprise Hypervisor account; however, these companies offer extensive documentation, managed services, and paid support plans to you directly at the account level.
    • Can offer an RPO equal to or better than TABR's, but with greater cost and complexity.
    • Through Cloud or Hypervisor tools and automation, this solution can offer an RTO equal to or better than TABR's, but with greater cost and complexity.
    • Snapshots and whole-system images like AWS AMIs can be Gigabytes in size. Depending on the backup frequency (RPO) and lifecycle retention policy in place, these multi-GB files can balloon costs.
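
To see how backup frequency and retention drive storage, the following Python snippet gives a rough upper-bound estimate. It is a sketch under stated assumptions: every backup is counted as a full copy (many Cloud snapshot services store incremental changes, so real usage may be lower), and the sizes used (an 8 GB image, a 2 MB settings-only backup) are hypothetical examples rather than measured values. The function names are illustrative and not part of any Telos or Cloud provider tool.

```python
import math


def retained_backup_count(backups_per_day: float, retention_days: int) -> int:
    """Number of backups held at steady state under a simple retention policy."""
    return math.ceil(backups_per_day * retention_days)


def retained_storage_gb(backup_size_gb: float,
                        backups_per_day: float,
                        retention_days: int) -> float:
    """Upper-bound storage in GB, assuming each backup is kept as a full copy."""
    return backup_size_gb * retained_backup_count(backups_per_day, retention_days)


if __name__ == "__main__":
    # Hypothetical sizes for illustration only.
    image_gb = 8.0       # whole-system image or snapshot
    settings_gb = 0.002  # settings-only backup of about 2 MB

    # Daily vs. hourly images, both retained for 30 days.
    print(retained_storage_gb(image_gb, backups_per_day=1, retention_days=30))
    print(retained_storage_gb(image_gb, backups_per_day=24, retention_days=30))

    # Daily settings-only backups retained for 30 days.
    print(retained_storage_gb(settings_gb, backups_per_day=1, retention_days=30))
```

Multiplying the result by your provider's per-GB storage price gives a first-order monthly cost. Check the provider's own pricing and snapshot documentation for how incremental storage is actually billed.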

Redundancy

Solution

Redundancy is achieved by the customer purchasing a second set of licenses for a backup system. The redundant system is fully independent, with no knowledge of the other system (e.g. config is not synced between instances). This DR strategy can have an RTO of tens of seconds to single-digit minutes, depending on config and cost. Each system has its own RPO, so RPO is not defined in this section. Each instance should employ its own Backup and Restore DR strategy, so that the failed system can be recovered and service can fall back to the primary instance.

Redundancy (backup) licenses are typically offered at a discounted price, provided the end user enters into a contract with Telos Alliance agreeing NOT to use the backup licenses outside of a failure event (e.g. they cannot be used for production on another primary system).

High Availability (HA) - NOT OFFERED BY TELOS AT THIS TIME

Telos Alliance containers do not currently offer support for High Availability (HA). This is on our development roadmap, but we do not have a release date set.


Additional Resources

https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html