Managing Disaster Recovery in an Amazon EKS Teleport Cluster

In case of an outage in your cloud provider region, you need to ensure that you can restore your Teleport cluster to working order. This guide provides an overview of a disaster recovery approach for self-hosted Teleport clusters.

The guide assumes that your self-hosted Teleport cluster runs on Amazon Elastic Kubernetes Service and uses the teleport-cluster Helm chart. The teleport-cluster Helm chart is the recommended approach for quickly getting started with self-hosting a Teleport cluster on Kubernetes, and you can read about how to get started with the chart in Deploy Teleport on Kubernetes.

How it works

In the approach we explain in this guide, AWS backs up the Teleport Auth Service backends to a secondary region. If the primary region becomes unavailable due to an outage, an admin redeploys the cluster to the secondary region, configuring the Teleport Auth Service to connect to new backends in the region.

Since the Teleport certificate authorities are already backed up in the new region, Teleport Agents and bots running outside the unavailable region do not need to rejoin the cluster. In this disaster recovery scenario, the recovery time objective depends on the time it takes to redeploy the Auth Service and Proxy Service in the new region, as well as the time to live (TTL) of the DNS records for the Teleport Proxy Service.

Prerequisites

  • Your self-hosted Teleport cluster was launched using the teleport-cluster Helm chart. We recommend reading Deploying a High Availability Teleport Cluster for a high-level architectural outline of a self-hosted Teleport cluster.
  • You are using Amazon DynamoDB for the cluster state backend and audit event backend, and using Amazon S3 for your session recording backend. For information on selecting Teleport backends, see Storage Backends.
danger

This guide is not intended to be a runbook for a regional outage.

Read this guide to help prepare your disaster recovery plan, including any runbooks and automation. We strongly recommend that you regularly test your plan to prevent issues.

Step 1/4. Back up Auth Service backends

The first step in setting up a disaster recovery procedure for your Teleport cluster on Amazon EKS is to back up your Teleport Auth Service backends to a secondary region. If the primary region becomes unavailable, the backend replicas in the secondary region will be ready for your new cluster to connect to once you redeploy it in the secondary region.

The recovery point objective for a Teleport cluster during a regional outage depends on how frequently you back up the Auth Service backend. The more frequent the backups, the fewer backend changes you will lose when you restore the cluster during a disaster.

Cluster state backend

With a backup of the cluster state backend, the Teleport Auth Service can use its existing certificate authorities to sign certificates for cluster components. If you restore a Teleport cluster with a new backend, and do not have a backup, you will need to configure cluster components, such as self-hosted databases protected by Teleport, to trust the new CAs.

Read the Amazon DynamoDB documentation to plan your backup procedure.
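
One way to keep a continuously updated copy of the cluster state table in your secondary region is to add a DynamoDB global table replica and enable point-in-time recovery. The following is a minimal sketch using the AWS CLI; the table name and regions are placeholders, and your table must meet the requirements for global tables (for example, DynamoDB Streams enabled):

# Enable point-in-time recovery on the cluster state table (assumed name).
aws dynamodb update-continuous-backups \
  --region us-east-1 \
  --table-name teleport-backend \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

# Add a replica of the cluster state table in the secondary region.
aws dynamodb update-table \
  --region us-east-1 \
  --table-name teleport-backend \
  --replica-updates '[{"Create": {"RegionName": "us-west-2"}}]'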

AWS Key Management Service users

If the Teleport Auth Service in your cluster uses Amazon Key Management Service for certificate authority private keys, you must replicate your keys to the new AWS region before reinstalling the teleport-cluster Helm chart. See the AWS documentation for information about using KMS keys in multiple regions.

Note that KMS support is only available in the teleport-cluster Helm chart through the auth.teleportConfig values field (see the chart reference), and is not recommended for most teleport-cluster users.
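
If you do rely on KMS-backed CA keys, you can replicate them with the AWS CLI. The sketch below assumes the CA key was created as a multi-region key (its key ID begins with mrk-); the key ARN and regions are placeholders:

# Replicate the multi-region KMS key into the secondary region.
aws kms replicate-key \
  --key-id arn:aws:kms:us-east-1:123456789012:key/mrk-00000000000000000000000000000000 \
  --replica-region us-west-2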

Audit event backend

In addition to the cluster state backend, you must back up the audit event backend to a secondary region in order to retain access to audit events. As with the cluster state backend, read the Amazon DynamoDB documentation to plan your backup procedure.

Session recording backend

To retain access to your session recordings in the case of region-wide unavailability, you need to back up your session recordings to the secondary region. When you redeploy the Teleport Auth Service to the new region, you can configure it to connect to an S3 bucket in the new region.

You can use S3 replication rules to create continuous backups from your primary region to the backup region. Follow the Multi-Region Blueprint guide to plan multi-region S3 bucket replication for your session recording backend.
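
As an illustration, the following sketch enables replication from a primary session recording bucket to a bucket in the secondary region using the AWS CLI. The bucket names and IAM role ARN are placeholders; both buckets must have versioning enabled, and the role must grant the S3 replication permissions described in the AWS documentation:

aws s3api put-bucket-replication \
  --bucket teleport-recordings-us-east-1 \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
    "Rules": [{
      "ID": "session-recordings-dr",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": {"Status": "Disabled"},
      "Destination": {"Bucket": "arn:aws:s3:::teleport-recordings-us-west-2"}
    }]
  }'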

Step 2/4. Stop the existing cluster

When you detect a zonal or regional outage, it is important that you cleanly stop the existing Teleport cluster. While outages at the cloud provider level disrupt the expected functioning of a Teleport cluster, some operations in your cluster may continue successfully. After the outage has concluded, services that remained functional or come back online can cause the cluster to behave in unexpected ways.

For example, Teleport Agents can remain connected to a Teleport cluster even after an outage has caused it to become unavailable. This is because agents establish long-lived gRPC connections through the Proxy Service, and maintain these even when the target Proxy Service instances have become unresponsive. You can force-close the connections between agents and the Proxy Service by stopping and restarting the Proxy Service.

If it is possible to do so, stop any Teleport Auth Service and Proxy Service pods in your cluster. This command assumes that the name of your release is teleport-cluster and that it runs in the teleport namespace:

helm --namespace teleport uninstall teleport-cluster

Note that, in the case of a full regional outage, your existing cluster may already be unavailable.

Step 3/4. Relaunch the Auth Service and Proxy Service

Now that there is a backup of the Auth Service backend, you are safe to deploy the Teleport cluster in a new region.

Update your values file

When deploying the Teleport Auth Service and Proxy Service to the new region, you need to update the following fields in the values file for the teleport-cluster Helm chart:

  • aws.region
  • aws.backendTable
  • aws.auditLogTable
  • aws.sessionRecordingBucket
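
For example, the relevant fragment of a values file for the new region might look like the following. The region, table names, and bucket name are placeholders for the backends you created or replicated in the secondary region:

aws:
  region: us-west-2
  backendTable: teleport-backend-us-west-2
  auditLogTable: teleport-events-us-west-2
  sessionRecordingBucket: teleport-recordings-us-west-2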

You may need to update other values to match third-party dependencies of the Teleport cluster. For example, if you plan to deploy cert-manager to handle Teleport Proxy Service certificates instead of Let's Encrypt, you need to set the certManager.issuerName field to match the name of the cert-manager Issuer in your Kubernetes cluster.

We recommend reading Running an HA Teleport Cluster Using AWS, EKS, and Helm so you can make sure that you have accounted for any third-party dependencies you need to manage in the new AWS region.

Update IAM configurations

You also need to ensure that the roles used by the Auth Service and Proxy Service grant the services permissions to access backends in the new region. Make sure that the trust policies associated with these roles grant access to principals in the new region.
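
Before a failover, you can sanity-check these permissions by calling the new-region backends with credentials for the role the Auth Service will use. The table and bucket names below are placeholders:

# Run these with credentials for the IAM role used by the Auth Service.
aws dynamodb describe-table --region us-west-2 --table-name teleport-backend-us-west-2
aws s3api head-bucket --bucket teleport-recordings-us-west-2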

Provision credentials to the load balancer

The teleport-cluster Helm chart deploys a Kubernetes service that, by default, has the LoadBalancer type. Reinstalling the Helm chart in your new AWS region recreates the load balancer.

If you are using AWS Certificate Manager or cert-manager to provision TLS credentials for your load balancer, you must create a new certificate and private key before installing the chart in the new region.

For guidance on configuring your EKS cluster to use ACM and cert-manager with the Teleport Proxy Service load balancer, see Configure TLS certificates for Teleport.
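
If you use ACM, for example, you can request a certificate in the secondary region ahead of time. The sketch below assumes the example.com domain used elsewhere in this guide and DNS validation:

# Request an ACM certificate in the secondary region covering the cluster
# domain and the wildcard used for registered web applications.
aws acm request-certificate \
  --region us-west-2 \
  --domain-name teleport.example.com \
  --subject-alternative-names "*.teleport.example.com" \
  --validation-method DNS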

Reinstall the Helm chart

  1. Install the teleport-cluster Helm chart using the new values:

    helm install teleport-cluster teleport/teleport-cluster \
      --version 17.5.2 \
      --values teleport-cluster-values.yaml
  2. After installing the teleport-cluster chart, wait a minute or so and ensure that both the Auth Service and Proxy Service pods are running:

    kubectl get pods
    NAME                                       READY   STATUS    RESTARTS   AGE
    teleport-cluster-auth-000000000-00000      1/1     Running   0          114s
    teleport-cluster-proxy-0000000000-00000    1/1     Running   0          114s
  3. Once the Teleport Auth Service is running on the new cluster, make sure that you have applied Teleport dynamic resources against the new cluster so the Teleport Kubernetes operator can manage them.

    You should manage your Teleport resources as a set of Kubernetes manifests applied using a GitOps tool like Flux CD so that, when you launch the new cluster, you can readily apply them.
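
    For example, if you keep these manifests in a Git repository, reapplying them can be as simple as pointing kubectl at the new EKS cluster and applying the directory. The cluster name, region, and path below are placeholders:

    # Point kubectl at the new EKS cluster in the secondary region.
    aws eks update-kubeconfig --region us-west-2 --name teleport-dr-cluster

    # Re-apply the Teleport resource manifests tracked in version control.
    kubectl --namespace teleport apply -f teleport-resources/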

warning

Note that, by the time you have redeployed the Auth Service and Proxy Service, any Machine & Workload Identity bots that have joined the cluster with a static token will likely miss their renewal periods and become locked out of the cluster until the Teleport Auth Service can issue new tokens. We recommend using delegated join methods to prevent this scenario.

Step 4/4. Update DNS records

Once you have launched your Teleport cluster in the new AWS region, you must ensure that Route 53 DNS records point to the new cluster.

Ensure that existing DNS records for the Teleport Proxy Service in your cluster have a low TTL, such as one minute. This approach assumes that DNS resolvers in your users' networks honor record TTLs; otherwise, clients may continue to resolve the old records after you fail over.
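
For instance, if the records are managed in Route 53, you can set or lower the TTL ahead of time with a change batch like the one below. The hosted zone ID and load balancer domain name are placeholders, and this assumes you use CNAME records rather than alias records (alias records do not have a configurable TTL):

aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000000000000000 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "teleport.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "00000000000000000000000000000000-0000000000.us-east-2.elb.amazonaws.com"}]
      }
    }]
  }'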

The teleport-cluster Helm chart exposes the Proxy Service to traffic from the internet using a Kubernetes service that sets up an external load balancer with your cloud provider.

Obtain the address of your load balancer by following the instructions below.

  1. Get information about the Proxy Service load balancer:

    kubectl get services/teleport-cluster
    NAME               TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
    teleport-cluster   LoadBalancer   10.4.4.73    192.0.2.0     443:31204/TCP   89s

    The teleport-cluster service directs traffic to the Teleport Proxy Service. Notice the EXTERNAL-IP field, which shows you the IP address or domain name of the cloud-hosted load balancer. For example, on AWS, you may see a domain name resembling the following:

    00000000000000000000000000000000-0000000000.us-east-2.elb.amazonaws.com
    
  2. Set up two DNS records: teleport.example.com for all traffic and *.teleport.example.com for any web applications you will register with Teleport. We are assuming that your domain name is example.com and teleport is the subdomain you have assigned to your Teleport cluster.

    Depending on whether the EXTERNAL-IP column above points to an IP address or a domain name, the records will have the following details:

    Record Type   Domain Name               Value
    A             teleport.example.com      The IP address of your load balancer
    A             *.teleport.example.com    The IP address of your load balancer

    If EXTERNAL-IP shows a domain name (as it typically does for an AWS load balancer), create CNAME records with that domain name as the value instead.
  3. Once you create the records, use the following command to confirm that your Teleport cluster is running:

    curl https://teleport.example.com/webapi/ping
    {"auth":{"type":"local","second_factor":"on","preferred_local_mfa":"webauthn","allow_passwordless":true,"allow_headless":true,"local":{"name":""},"webauthn":{"rp_id":"teleport.example.com"},"private_key_policy":"none","device_trust":{},"has_motd":false},"proxy":{"kube":{"enabled":true,"listen_addr":"0.0.0.0:3026"},"ssh":{"listen_addr":"[::]:3023","tunnel_listen_addr":"0.0.0.0:3024","web_listen_addr":"0.0.0.0:3080","public_addr":"teleport.example.com:443"},"db":{"mysql_listen_addr":"0.0.0.0:3036"},"tls_routing_enabled":false},"server_version":"17.5.2","min_client_version":"12.0.0","cluster_name":"teleport.example.com","automatic_upgrades":false}

Guidance

Assuming that you have planned your disaster recovery strategy around the steps we lay out in this guide, we recommend the following practices.

Testing your disaster recovery procedure

We strongly recommend testing your disaster recovery plan. Schedule time to stop your Teleport cluster in one region and redeploy it in another using your backup of the Auth Service cluster state backend. After redeploying your cluster, ensure that users can continue to connect to Teleport-protected resources.

Common causes of disaster recovery failures include:

  • Misconfigured IAM settings: The Teleport Auth Service in your first region can access its backend, for example, but in the second region, the Auth Service has a role with insufficient permissions.
  • Misconfigured backend connections: The Teleport Auth Service is configured with an incorrect cluster state backend URL in the new region, meaning that when it starts up, it fails to retrieve its existing CAs and bootstraps instead with a self-signed certificate.

Shortening the recovery time objective

When planning for Teleport disaster recovery, the main consideration for estimating a recovery time objective is how long it takes for the Teleport Auth Service to come online in your secondary region.

You can expect the disaster recovery procedure outlined in this guide to take at least an hour, though the precise details depend on your infrastructure and organization. To arrive at an exact benchmark, we strongly recommend testing your disaster recovery procedure.

You can take measures to shorten the recovery time objective of your disaster recovery procedure. Possibilities include:

  • Reduce the TTL of the DNS records for the Teleport Proxy Service. If this is longer than the time it takes to redeploy your cluster to a new region, clients may continue to connect to the cluster in the previous region.
  • Restore your cluster state and audit event backend tables from backup prior to reinstalling the teleport-cluster Helm chart, so the Teleport Auth Service does not need to initialize any tables itself.
  • Run the Auth Service, Proxy Service, and backend services in the secondary region before any regional outage takes place. When there are redundant Teleport cluster deployments across multiple regions, there is no need to lose availability while waiting for your cluster to deploy to a new region. Read about the architecture of a multi-region Teleport deployment in the Multi-Region Blueprint guide.

Imposing a change freeze

During a regional cloud provider outage, the procedure we outline in this guide includes stopping the Teleport cluster in the affected region. Until the cluster comes back online, it is impossible for users to update dynamic resources or rotate the cluster CAs. In some cases, though, you may need to impose a change freeze to prevent users from updating cluster resources while you restore a cluster from a backup.

You can impose a change freeze by locking Teleport roles that include permissions to edit dynamic resources. Teleport roles grant permissions to modify dynamic resources through the spec.allow.rules field.

You can use the following jq command to list all roles with one or more rules that grant permissions to modify backend resources. This example assumes that there is a role called locksmith that allows the user to create, list, read, and delete locks. It skips the locksmith role so you can remove the locks after restoring the cluster:

tctl get roles --format json | jq -r '.[] | select(.metadata.name != "locksmith" and .spec.allow.rules) | .metadata.name'
dashboard-admin
dashboard-user
device-admin
device-enroll
group-access

Teleport grants access to certificate authority rotations using the spec.allow.rules field on the cert_authority resource, so using role locking to impose a change freeze also prevents unintended certificate authority rotations during the disaster recovery procedure.

Create a lock using a tctl lock command. The following example locks the contractor role for 24 hours:

tctl lock --role=contractor --message="change freeze during disaster recovery" --ttl=24h

Note that users with permissions to execute commands directly on the Auth Service pod can still make changes to the backend using tctl.

Once it is safe for users to modify cluster resources, use the tctl rm lock/<lock_id> command to remove each lock. If you provided each lock with a similar --message flag value when creating the locks, you can remove all locks you created with a single command. This command removes all locks with a message that includes the substring change freeze:

tctl get lock --format=json \
  | jq -r '.[] | select(.spec.message and (.spec.message | contains("change freeze"))) | .metadata.name' \
  | xargs -I{} tctl rm lock/{}

Further reading

  • Read the Backup and Restore guide for general information on backing up your Teleport cluster backends.
  • For more information on preventing Teleport roles from making changes to your cluster state backend, read Session and Identity Locking.