Managing Disaster Recovery in an Amazon EKS Teleport Cluster
In case of an outage in your cloud provider region, you need to ensure that you can restore your Teleport cluster to working order. This guide provides an overview of a disaster recovery approach for self-hosted Teleport clusters.
The guide assumes that your self-hosted Teleport cluster runs on Amazon Elastic Kubernetes Service and uses the `teleport-cluster` Helm chart. The `teleport-cluster` Helm chart is the recommended approach for quickly getting started with self-hosting a Teleport cluster on Kubernetes, and you can read about how to get started with the chart in Deploy Teleport on Kubernetes.
How it works
In the approach we explain in this guide, AWS backs up the Teleport Auth Service backends to a secondary region. If the primary region becomes unavailable due to an outage, an admin redeploys the cluster to the secondary region, configuring the Teleport Auth Service to connect to new backends in the region.
Since Teleport certificate authorities are already backed up in the new region, running Teleport Agents and bots outside the unavailable region do not need to reconnect to the cluster. In this disaster recovery scenario, the recovery time objective depends on the time it takes to redeploy the Auth Service and Proxy Service in the new region, as well as the time to live (TTL) of the DNS records for the Teleport Proxy Service.
Prerequisites
- Your self-hosted Teleport cluster was launched using the `teleport-cluster` Helm chart. We recommend reading Deploying a High Availability Teleport Cluster for a high-level architectural outline of a self-hosted Teleport cluster.
- You are using Amazon DynamoDB for the cluster state backend and audit event backend, and Amazon S3 for your session recording backend. For information on selecting Teleport backends, see Storage Backends.
This guide is not intended to be a runbook for a regional outage.
Read this guide to help prepare your disaster recovery plan, including any runbooks and automation. We strongly recommend that you regularly test your plan to prevent issues.
Step 1/4. Back up Auth Service backends
The first step in setting up a disaster recovery procedure for your Teleport cluster on Amazon EKS is to back up your Teleport Auth Service backends to a secondary region. If the primary region becomes unavailable, the backend replicas in the secondary region will be ready for your new cluster to connect to once you redeploy it in the secondary region.
The recovery point objective for a Teleport cluster during a regional outage depends on how frequently you back up the Auth Service backend. The more frequent the backups, the fewer backend changes you will lose when you restore the cluster during a disaster.
Cluster state backend
With a backup of the cluster state backend, the Teleport Auth Service can use its existing certificate authorities to sign certificates for cluster components. If you restore a Teleport cluster with a new backend, and do not have a backup, you will need to configure cluster components, such as self-hosted databases protected by Teleport, to trust the new CAs.
Read the Amazon DynamoDB documentation to plan your backup procedure.
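As one illustration, you could keep a continuously replicated copy of the cluster state table by adding a DynamoDB global table replica in the secondary region. This is only a sketch of one possible approach; the table name and regions are placeholders, and you should choose a backup method that matches your recovery point objective.

```
# Sketch: add a replica of the cluster state table in a secondary region.
# Requires DynamoDB global tables (version 2019.11.21) with DynamoDB Streams
# enabled on the table. Table name and regions are placeholders.
aws dynamodb update-table \
  --region us-east-2 \
  --table-name teleport-backend \
  --replica-updates '[{"Create": {"RegionName": "us-west-2"}}]'
```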
AWS Key Management Service users
If the Teleport Auth Service in your cluster uses AWS Key Management Service for certificate authority private keys, you must replicate your keys to the new AWS region before reinstalling the `teleport-cluster` Helm chart. See the AWS documentation for information about using KMS keys in multiple regions.
Note that KMS support is only possible in the `teleport-cluster` Helm chart using the `auth.teleportConfig` values field (see the chart reference), and is not recommended for most `teleport-cluster` users.
Audit event backend
In addition to the cluster state backend, you must also back up the audit event backend to a secondary region in order to retain access to audit events. As with the cluster state backend, read the Amazon DynamoDB documentation to plan your backup procedure.
Session recording backend
To retain access to your session recordings in the case of region-wide unavailability, you need to back up your session recordings to the secondary region. When you redeploy the Teleport Auth Service to the new region, you can configure it to connect to an S3 bucket in the new region.
You can use S3 replication rules to create continuous backups from your primary region to the backup region. Follow the Multi-Region Blueprint guide to plan multi-region S3 bucket replication for your session recording backend.
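For example, a replication rule like the following copies new session recordings to a bucket in the secondary region. This is a sketch only: the bucket names and IAM role ARN are placeholders, and both buckets must have versioning enabled before replication can be configured.

```
# Sketch: replicate session recordings to a bucket in the secondary region.
# Bucket names and the replication role ARN are placeholders; versioning must
# be enabled on the source and destination buckets.
cat > replication.json <<'EOF'
{
  "Role": "arn:aws:iam::111111111111:role/teleport-s3-replication",
  "Rules": [
    {
      "ID": "teleport-session-recordings",
      "Priority": 1,
      "Status": "Enabled",
      "Filter": {},
      "DeleteMarkerReplication": {"Status": "Disabled"},
      "Destination": {"Bucket": "arn:aws:s3:::teleport-recordings-backup"}
    }
  ]
}
EOF
aws s3api put-bucket-replication \
  --bucket teleport-recordings \
  --replication-configuration file://replication.json
```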
Step 2/4. Stop the existing cluster
When you detect a zonal or regional outage, it is important that you cleanly stop the existing Teleport cluster. While outages at the cloud provider level disrupt the expected functioning of a Teleport cluster, some operations in your cluster may continue successfully. After the outage has concluded, services that remained functional or come back online can cause the cluster to behave in unexpected ways.
For example, Teleport Agents can remain connected to a Teleport cluster even after an outage has caused it to become unavailable. This is because agents establish long-lived gRPC connections through the Proxy Service, and maintain these even when the target Proxy Service instances have become unresponsive. You can force-close the connections between agents and the Proxy Service by stopping and restarting the Proxy Service.
If it is possible to do so, stop any Teleport Auth Service and Proxy Service pods in your cluster. This command assumes that the name of your release is `teleport-cluster` and that it runs in the `teleport` namespace:

```
helm --namespace teleport uninstall teleport-cluster
```
Note that, in the case of a full regional outage, your existing cluster may already be unavailable.
Step 3/4. Relaunch the Auth Service and Proxy Service
Now that there is a backup of the Auth Service backend, you are safe to deploy the Teleport cluster in a new region.
Update your values file
When deploying the Teleport Auth Service and Proxy Service to the new region, you need to update the following fields in the values file for the `teleport-cluster` Helm chart (see the sketch after this list):
- `aws.region`
- `aws.backendTable`
- `aws.auditLogTable`
- `aws.sessionRecordingBucket`
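As a minimal sketch, the values file for the secondary region might set these fields as follows. The region, table names, bucket name, and cluster name are placeholders; in practice, carry over the other settings from your existing values file rather than starting from only these fields.

```
# Sketch: a minimal values file pointing the chart at backends in the
# secondary region. All names below are placeholders.
cat > teleport-cluster-values.yaml <<'EOF'
chartMode: aws
clusterName: teleport.example.com
aws:
  region: us-west-2
  backendTable: teleport-backend-secondary
  auditLogTable: teleport-audit-secondary
  sessionRecordingBucket: teleport-recordings-backup
EOF
```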
You may need to update other values to match third-party dependencies of the Teleport cluster. For example, if you plan to deploy cert-manager to handle Teleport Proxy Service certificates instead of Let's Encrypt, you need to set the `certManager.issuerName` field to match the name of the cert-manager Issuer in your Kubernetes cluster.
We recommend reading Running an HA Teleport Cluster Using AWS, EKS, and Helm so you can make sure that you have accounted for any third-party dependencies you need to manage in the new AWS region.
Update IAM configurations
You also need to ensure that the roles used by the Auth Service and Proxy Service grant the services permissions to access backends in the new region. Make sure that the trust policies associated with these roles grant access to principals in the new region.
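For example, if the services authenticate with IAM Roles for Service Accounts, the role's trust policy must reference the OIDC provider of the EKS cluster in the new region. The following is a sketch only; the account ID, OIDC provider, role name, namespace, and service account name are all placeholders.

```
# Sketch: point an IAM role's trust policy at the new EKS cluster's OIDC
# provider so the Teleport pods can assume it via IRSA. All identifiers
# below are placeholders.
cat > teleport-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::111111111111:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE0123456789"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE0123456789:sub": "system:serviceaccount:teleport:teleport-cluster"
        }
      }
    }
  ]
}
EOF
aws iam update-assume-role-policy \
  --role-name teleport-cluster-backend-access \
  --policy-document file://teleport-trust-policy.json
```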
Provision credentials to the load balancer
The `teleport-cluster` Helm chart deploys a Kubernetes service that, by default, has the `LoadBalancer` type. Reinstalling the Helm chart in your new AWS region recreates the load balancer.
If you are using AWS Certificate Manager or cert-manager to provision TLS credentials for your load balancer, you must create a new certificate and private key before installing the chart in the new region.
For guidance on configuring your EKS cluster to use ACM and cert-manager with the Teleport Proxy Service load balancer, see Configure TLS certificates for Teleport.
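If you use ACM, for example, you can request a certificate in the secondary region ahead of time. This sketch uses the example domain names from later in this guide and assumes DNS validation; adjust it for your own domain and validation method.

```
# Sketch: request a new ACM certificate in the secondary region for the
# Proxy Service load balancer. Domain names and region are placeholders.
aws acm request-certificate \
  --region us-west-2 \
  --domain-name teleport.example.com \
  --subject-alternative-names "*.teleport.example.com" \
  --validation-method DNS
```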
Reinstall the Helm chart
- Install the `teleport-cluster` Helm chart using the new values:

  ```
  helm install teleport-cluster teleport/teleport-cluster \
    --version 17.0.0-dev \
    --values teleport-cluster-values.yaml
  ```

- After installing the `teleport-cluster` chart, wait a minute or so and ensure that both the Auth Service and Proxy Service pods are running:

  ```
  $ kubectl get pods
  NAME                                      READY   STATUS    RESTARTS   AGE
  teleport-cluster-auth-000000000-00000     1/1     Running   0          114s
  teleport-cluster-proxy-0000000000-00000   1/1     Running   0          114s
  ```

- Once the Teleport Auth Service is running on the new cluster, make sure that you have applied Teleport dynamic resources against the new cluster so the Teleport Kubernetes operator can manage them. You should manage your Teleport resources as a set of Kubernetes manifests applied using a GitOps tool like Flux CD so that, when you launch the new cluster, you can readily apply them.
Note that, by the time you have redeployed the Auth Service and Proxy Service, any Machine & Workload Identity bots that have joined the cluster with a static token will likely miss their renewal periods and become locked out of the cluster until the Teleport Auth Service can issue new tokens. We recommend using delegated join methods to prevent this scenario.
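For instance, bots running on AWS can use the IAM join method, which does not depend on a shared secret surviving the outage. The token below is a sketch; the token name, bot name, and AWS account ID are placeholders.

```
# Sketch: a join token that lets a bot rejoin using the delegated IAM join
# method. The token name, bot name, and account ID are placeholders.
cat > bot-iam-token.yaml <<'EOF'
kind: token
version: v2
metadata:
  name: example-bot-iam-token
spec:
  roles: ["Bot"]
  bot_name: example-bot
  join_method: iam
  allow:
    - aws_account: "111111111111"
EOF
tctl create -f bot-iam-token.yaml
```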
Step 4/4. Update DNS records
Once you have launched your Teleport cluster in the new AWS region, you must ensure that Route 53 DNS records point to the new cluster.
Ensure that existing DNS records for the Teleport Proxy Service in your cluster have a low TTL, e.g., one minute. We expect that DNS resolvers in your users' networks honor the TTLs of DNS records to prevent issues with propagating records.
The `teleport-cluster` Helm chart exposes the Proxy Service to traffic from the internet using a Kubernetes service that sets up an external load balancer with your cloud provider.
Obtain the address of your load balancer by following the instructions below.
- Get information about the Proxy Service load balancer:

  ```
  $ kubectl get services/teleport-cluster
  NAME               TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
  teleport-cluster   LoadBalancer   10.4.4.73    192.0.2.0     443:31204/TCP   89s
  ```

  The `teleport-cluster` service directs traffic to the Teleport Proxy Service. Notice the `EXTERNAL-IP` field, which shows you the IP address or domain name of the cloud-hosted load balancer. For example, on AWS, you may see a domain name resembling the following:

  ```
  00000000000000000000000000000000-0000000000.us-east-2.elb.amazonaws.com
  ```
- Set up two DNS records: `teleport.example.com` for all traffic and `*.teleport.example.com` for any web applications you will register with Teleport. We are assuming that your domain name is `example.com` and `teleport` is the subdomain you have assigned to your Teleport cluster.

  Depending on whether the `EXTERNAL-IP` column above points to an IP address or a domain name, the records will have the following details (see the sketch after this list for an example of creating a record with the AWS CLI):

  If `EXTERNAL-IP` is an IP address:

  | Record Type | Domain Name | Value |
  |---|---|---|
  | A | `teleport.example.com` | The IP address of your load balancer |
  | A | `*.teleport.example.com` | The IP address of your load balancer |

  If `EXTERNAL-IP` is a domain name:

  | Record Type | Domain Name | Value |
  |---|---|---|
  | CNAME | `teleport.example.com` | The domain name of your load balancer |
  | CNAME | `*.teleport.example.com` | The domain name of your load balancer |
- Once you create the records, use the following command to confirm that your Teleport cluster is running:

  ```
  $ curl https://clusterName/webapi/ping
  {"auth":{"type":"local","second_factor":"on","preferred_local_mfa":"webauthn","allow_passwordless":true,"allow_headless":true,"local":{"name":""},"webauthn":{"rp_id":"teleport.example.com"},"private_key_policy":"none","device_trust":{},"has_motd":false},"proxy":{"kube":{"enabled":true,"listen_addr":"0.0.0.0:3026"},"ssh":{"listen_addr":"[::]:3023","tunnel_listen_addr":"0.0.0.0:3024","web_listen_addr":"0.0.0.0:3080","public_addr":"teleport.example.com:443"},"db":{"mysql_listen_addr":"0.0.0.0:3036"},"tls_routing_enabled":false},"server_version":"17.0.0-dev","min_client_version":"12.0.0","cluster_name":"teleport.example.com","automatic_upgrades":false}
  ```
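If you manage Route 53 records with the AWS CLI rather than the console, an upsert like the following creates the CNAME record described in the table above with a short TTL. The hosted zone ID and load balancer domain name are placeholders.

```
# Sketch: upsert a CNAME record for the Proxy Service with a 60-second TTL.
# The hosted zone ID and load balancer domain name are placeholders.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "teleport.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "0000000000-0000000000.us-west-2.elb.amazonaws.com"}]
      }
    }]
  }'
```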
Guidance
Assuming that you have planned your disaster recovery strategy around the steps we lay out in this guide, we recommend the following practices.
Testing your disaster recovery procedure
We strongly recommend testing your disaster recovery plan. Schedule time to stop your Teleport cluster in one region and redeploy it in another using your backup of the Auth Service cluster state backend. After redeploying your cluster, ensure that users can continue to connect to Teleport-protected resources.
Common causes of disaster recovery failures include:
- Misconfigured IAM settings: The Teleport Auth Service in your first region can access its backend, for example, but in the second region, the Auth Service has a role with insufficient permissions.
- Misconfigured backend connections: The Teleport Auth Service is configured with an incorrect cluster state backend URL in the new region, meaning that when it starts up, it fails to retrieve its existing CAs and bootstraps instead with a self-signed certificate.
Shortening the recovery time objective
When putting together a Teleport disaster recovery plan, the main consideration for estimating a recovery time objective is how long it will take for the Teleport Auth Service to come online in your secondary region.
You can expect the disaster recovery procedure outlined in this guide to take at least an hour, though the precise details depend on your infrastructure and organization. To arrive at an exact benchmark, we strongly recommend testing your disaster recovery procedure.
You can take measures to shorten the recovery time objective of your disaster recovery procedure. Possibilities include:
- Reduce the TTL of the DNS records for the Teleport Proxy Service. If this is longer than the time it takes to redeploy your cluster to a new region, clients may continue to connect to the cluster in the previous region.
- Restore your cluster state and audit event backend tables from backup prior to
reinstalling the
teleport-cluster
Helm chart, so the Teleport Auth Service does not need to initialize any tables itself. - Run the Auth Service, Proxy Service, and backend services in the secondary region before any regional outage takes place. When there are redundant Teleport cluster deployments across multiple regions, there is no need to lose availability while waiting for your cluster to deploy to a new region. Read about the architecture of a multi-region Teleport deployment in the Multi-Region Blueprint guide.
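As a sketch of the restore-before-reinstall option, you could restore a DynamoDB table from an on-demand backup into the secondary region before running `helm install`. The target table name and backup ARN below are placeholders.

```
# Sketch: restore the cluster state table from a backup in the secondary
# region before reinstalling the Helm chart. Table name and backup ARN are
# placeholders.
aws dynamodb restore-table-from-backup \
  --region us-west-2 \
  --target-table-name teleport-backend-secondary \
  --backup-arn arn:aws:dynamodb:us-west-2:111111111111:table/teleport-backend/backup/01234567890123-abcdefgh
```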
Imposing a change freeze
During a regional cloud provider outage, the procedure we outline in this guide includes stopping the Teleport cluster in the affected region. Until the cluster comes back online, it is impossible for users to update dynamic resources or rotate the cluster CAs. In some cases, though, you may need to impose a change freeze to prevent users from updating cluster resources while you restore a cluster from a backup.
You can impose a change freeze by locking Teleport roles that include permissions to edit dynamic resources. Teleport roles allow access to modifying dynamic resources with the `spec.allow.rules` field.
You can use the following `jq` command to list all roles with one or more rules that grant permissions to modify backend resources. This example assumes that there is a role called `locksmith` that allows the user to create, list, read, and delete locks. It skips the `locksmith` role so you can remove the lock after restoring the cluster:
```
$ tctl get roles --format json | jq -r '.[] | select(.metadata.name != "locksmith" and .spec.allow.rules) | .metadata.name'
dashboard-admin
dashboard-user
device-admin
device-enroll
group-access
```
Teleport grants access to certificate authority rotations using the `spec.allow.rules` field (on the `cert_authority` resource), so using role locking to impose a change freeze will also prevent unintended certificate authority rotations during the disaster recovery procedure.
Create a lock using a `tctl lock` command. The following example locks the `contractor` role for 24 hours:

```
tctl lock --role=contractor --message="change freeze during disaster recovery" --ttl=24h
```
Note that users with permissions to execute commands directly on the Auth Service pod can still make changes to the backend using `tctl`.
Once it is safe for users to modify cluster resources, use the `tctl rm lock/<lock_id>` command to remove each lock. If you provided each lock with a similar `--message` flag value when creating the locks, you can remove all locks you created with a single command. This command removes all locks with a message that includes the substring "change freeze":

```
tctl get lock --format=json \
  | jq -r '.[] | select(.spec.message and (.spec.message | contains("change freeze"))) | .metadata.name' \
  | xargs -I{} tctl rm lock/{}
```
Further reading
- Read the Backup and Restore guide for general information on backing up your Teleport cluster backends.
- For more information on preventing Teleport roles from making changes to your cluster state backend, read Session and Identity Locking.