Skip to main content

Machine ID Troubleshooting Guide

This page provides resolution steps for issues that you may come across when setting up Machine ID.

A bot failed to renew a certificate due to a "generation mismatch"

Symptoms

The bot will log an error like this:

ERROR: renewable cert generation mismatch: stored=3, presented=2

Subsequent connection attempts by the bot may see errors like the following:

ERROR: failed direct dial to auth server: auth API: access denied [00]
"\tauth API: access denied [00], failed dial to auth server through reverse tunnel: Get \"https://teleport.cluster.local/v2/configuration/name\": Get \"https://example.com:3025/webapi/find\": x509: cannot validate certificate for example.com because it doesn't contain any IP SANs"
"\tGet \"https://teleport.cluster.local/v2/configuration/name\": Get \"https://example.com:3025/webapi/find\": x509: cannot validate certificate for example.com because it doesn't contain any IP SANs"

In particular, note the message auth API: access denied.

In self-hosted Teleport deployments, the Teleport Auth Service will also provide some additional context:

[AUTH]      WARN lock targeting User:"bot-example" is in force: The bot user "bot-example" has been locked due to a certificate generation mismatch, possibly indicating a stolen certificate. auth/apiserver.go:224

Explanation

note

This applies only to bots using the token join method, which makes use of one-time use shared secrets. Provider-specific join methods, such as GitHub, AWS IAM, etc will not be locked in this fashion unless another instance of the bot uses token joining.

Machine ID (with token-based joining) uses a certificate generation counter to detect potentially stolen renewable certificates. Each time a bot fetches a new renewable certificate, the Auth Service increments the counter, stores it on the backend, and embeds a copy of the counter in the certificate.

If the counter embedded in your bot certificate doesn't match the counter stored in Teleport's Auth Server, the renewal will fail and the bot user will be automatically locked.

Renewable certificates are exclusively stored in the bot's internal data directory, by default /var/lib/teleport/bot. It's possible to trigger this by accident if multiple bots are started using the same internal data directory, or if this internal data is otherwise being shared between multiple tbot processes.

Additionally, if a bot fails to save its freshly renewed certificates (for example, due to a filesystem error) and crashes, it will attempt a renewal with old certificates and trigger a lock.

Resolution

Before unlocking the bot, try to determine if either of the two scenarios described above apply. If the certificates were stolen, there may be underlying security concerns that need to be addressed.

Otherwise, first ensure only one tbot process is using the internal data directory. Multiple bots can be run on a single system, but separate data directories must be configured for each.

Additionally, ensure the internal data is not being shared with or copied to any other nodes, for example via a shared NFS volume. If you'd like to share certificates between nodes, only copy or share content from destination directories (usually /opt/machine-id) rather than the internal data directory (by default, /var/lib/teleport/bot).

Once you have addressed the underlying cause, follow these steps to reset a locked bot:

  1. Remove the lock on the bot's user
  2. Reset the bot's generation counter by creating a new bot instance

To remove the lock, first find and remove the lock targeting the bot user. For this example, we'll assume the bot is named example, which will have an associated Teleport user named bot-example:

tctl get locks
kind: lockmetadata: id: 1658359514703080513 name: 5cee949f-5203-4f3b-9805-dac35d798a16spec: message: The bot user "bot-example" has been locked due to a certificate generation mismatch, possibly indicating a stolen certificate. target: user: bot-exampleversion: v2
tctl rm lock/5cee949f-5203-4f3b-9805-dac35d798a16

Next, use tctl bots instances add to generate a new join token for the preexisting bot example:

tctl bots instances add example

Finally, reconfigure the local tbot instance with the new token and restart it. It will detect the new token and automatically reset its internal data directory. The bot will be issued a new bot instance UUID once connected, and the generation counter will be reset.

tbot shows a "bad certificate error" at startup

Symptoms

Restarting a tbot process outputs a log like the following:

INFO [TBOT]      Successfully loaded bot identity, valid: after=2022-07-21T21:49:26Z, before=2022-07-21T22:50:26Z, duration=1h1m0s | kind=tls, renewable=true, disallow-reissue=false, roles=[bot-test], principals=[-teleport-internal-join], generation=2 tbot/tbot.go:281
ERRO [TBOT]      Identity has expired. The renewal is likely to fail. (expires: 2022-07-21T22:50:26Z, current time: 2022-07-25T20:18:33Z) tbot/tbot.go:415
WARN [TBOT]      Note: onboarding config ignored as identity was loaded from persistent storage tbot/tbot.go:288
ERRO [TBOT]      Failed to resolve tunnel address Get "https://auth.example.com:3025/webapi/find": x509: cannot validate certificate for auth.example.com because it doesn't contain any IP SANs reversetunnel/transport.go:90
ERRO [TBOT]      Failed to resolve tunnel address Get "https://auth.example.com:3025/webapi/find": x509: cannot validate certificate for auth.example.com because it doesn't contain any IP SANs reversetunnel/transport.go:90
ERROR: failed direct dial to auth server: Get "https://teleport.cluster.local/v2/configuration/name": remote error: tls: bad certificate
"\tGet \"https://teleport.cluster.local/v2/configuration/name\": remote error: tls: bad certificate, failed dial to auth server through reverse tunnel: Get \"https://teleport.cluster.local/v2/configuration/name\": Get \"https://auth.example.com:3025/webapi/find\": x509: cannot validate certificate for auth.example.com because it doesn't contain any IP SANs"
"\tGet \"https://teleport.cluster.local/v2/configuration/name\": Get \"https://auth.example.com:3025/webapi/find\": x509: cannot validate certificate for auth.example.com because it doesn't contain any IP SANs"

In particular, note the log line: "Identity has expired. The renewal is likely to fail."

Explanation

Token-joined bots are unable to reauthenticate to the Teleport Auth Service once their certificates have expired. Tokens in token-based joining (as opposed to AWS IAM and other join methods) can only be used once, so when the bot's internal certificates expire, it will not be able to connect.

When a bot's identity expires, certain parameters associated with the bot on the Auth Service must be reset and a new joining token must be issued. The simplest way to accomplish this is by removing and recreating the bot, which purges all server-side data and issues a new joining token.

Resolution

Use tctl bots instances add to create a new one-time use token for the bot:

tctl bots instances add example

Copy the resulting join token into the existing bot config—either the --token CLI flag or the onboarding.token parameter in tbot.yaml—and restart the bot. It will detect the new token and rejoin the cluster as normal.

SSH connections fail with ssh: handshake failed: ssh: unable to authenticate

Symptoms

When attempting to connect to a node via SSH, connections fail with an error like the following:

ssh -F /opt/machine-id/ssh_config bob@node.example.com
ERROR: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
ERROR: unable to execute tshexecuting `tsh proxy`exit status 1
kex_exchange_identification: Connection closed by remote hostConnection closed by UNKNOWN port 65535

In particular, note the ssh: unable to authenticate message.

Explanation

This can occur when attempting to log into the node as a user not listed as a principal on the SSH certificate.

You can verify this by viewing the tbot logs and looking for the log message when impersonated certificates for the matching outputs were renewed.

In the following example, the only principal listed for the identity in /opt/machine-id is alice (via the access role):

INFO [TBOT]      Successfully renewed impersonated certificates for directory /opt/machine-id, valid: after=2022-07-21T21:49:26Z, before=2022-07-21T22:50:26Z, duration=1h1m0s | kind=tls, renewable=false, disallow-reissue=true, roles=[access], principals=[alice -teleport-internal-join], generation=0 tbot/renew.go:630

However, the SSH command attempted to log in as bob.

Resolution

Ensure the bot identity is allowed to log in as the requested user by taking any of the following actions:

  • Changing the SSH command to log in as an allowed user
  • Modifying the access role to allow the alice principal
  • Adding a role granting login via the bob principal

Note that if roles are added or modified, the certificates will need to be renewed for the changes to take effect. The bot will renew certificates on its own after the renewal interval (by default, 20 minutes), but you can trigger a renewal immediately by either restarting the tbot process or sending it a reload signal:

# If using systemd, you can restart the process:

systemctl restart machine-id

# Alternatively, you can send `tbot` a reload signal directly:

pkill -sigusr1 tbot

Database requests fail with database "example" not found, but the database exists

Symptoms

When requesting Database Access certificates, the certificate request fails with an error like the following:

ERROR: Failed to generate impersonated certs for directory /opt/machine-id: database "example" not found
database "example" not found

However, the database exists and can be seen by regular users via tsh:

tsh db ls
Name Description Allowed Users Labels Connect---------- ----------- ------------- ------- -------example [alice] env=dev

Explanation

Unlike regular Teleport users, Machine ID bot users are granted only minimal Teleport RBAC permissions and are not allowed to view or list databases by default unless granted permission via one or more roles.

Resolution

Per the Machine ID Database Access Guide, ensure at least one role providing database permissions has been granted to the output listed in the error.

For example, note the rules section in the following example role:

kind: role
version: v5
metadata:
  name: machine-id-db
spec:
  allow:
    db_labels:
      '*': '*'
    db_names: [example]
    db_users: [alice]
    rules:
      - resources: [db_server, db]
        verbs: [read, list]

Ensure the bot has a role that grants it at least these RBAC rules. If desired you can examine bot roles with tctl to ensure the necessary rules have been granted:

tctl get role/machine-id-db

If the role is missing database permissions, it can be modified in your text editor:

tctl edit role/machine-id-db

Edit the role, then save and close the file to apply your changes.

tip

You can also create and edit roles using the Web UI. Go to Access -> Roles and click Create New Role or pick an existing role to edit.

note

By default, outputs (like /opt/machine-id) are granted all roles provided to the bot via tctl bots add --roles=..., but it's possible to grant only a subset of these roles using the roles: ... parameter in tbot.yaml.

If permissions are unexpectedly missing, ensure tbot.yaml requests your database role, either by relying on default behavior or adding the role to the roles: ... list.

Once fixed, restart or reload the tbot clients for the updated role to take effect.

If the bot was not granted the role initially, the simplest solution is to delete and recreate the bot, being sure to include the role in the --roles=... flag:

tctl bots rm example
tctl bots add example --roles=foo,bar,machine-id-db

Destination kubernetes_secret: identity-output must be a directory in exec plugin mode

By default, when outputting a Kubernetes identity, tbot outputs make use of a Kubernetes exec plugin to always provide the latest version of the credentials.

When outputting a Kubernetes identity to a Kubernetes secret, however, it is important to disable the use of the exec plugin by adding disable_exec_plugin: true to the output. This means that a static kubeconfig file with embedded short-lived credentials is written instead:

outputs:
  - type: kubernetes
    # Specify the name of the Kubernetes cluster you wish the credentials to
    # grant access to.
    kubernetes_cluster: example-k8s-cluster
    # Required when outputting a Kubernetes identity to a Kubernetes secret.
    disable_exec_plugin: true
    destination:
      type: kubernetes_secret
      # For this guide, identity-output is used as the secret name.
      # You may wish to customize this. Multiple outputs cannot share the same
      # destination.
      name: identity-output

Failure to add the disable_exec_plugin flag will result in a warning being displayed: Destination kubernetes_secret: identity-output must be a directory in exec plugin mode.

Configuring tbot for split DNS proxies

When you have deployed your Proxy Service in such a way that it is accessible via two different DNS names, e.g an internal and external address, you may find that a tbot that is configured to use one of these addresses may attempt to use the other address and that this may cause connections to fail.

This is because tbot queries an auto-configuration endpoint exposed by the Proxy Service to determine the canonical address to use when connecting.

To fix this, set a variable of TBOT_USE_PROXY_ADDR=yes in the environment of the tbot process. This configures tbot to prefer using the address that you have explicitly provided. This only functions correctly in cases where TLS routing/multiplexing is enabled for the Teleport cluster.