Kubernetes etcd Backup and Restore in Practice: Standard Disaster Recovery for kubeadm Clusters

The real state of a Kubernetes cluster lives in etcd: backing up etcd means backing up the control plane's configuration and runtime state. This article focuses on kubeadm-based environments and walks through the standard snapshot, restore, rollback, and validation workflow, addressing common production questions such as which endpoint to use, why writes must stop first, and whether copying the data directory is safe.

Technical specification snapshot

Core topic: Kubernetes etcd backup and restore
Deployment method: kubeadm
Primary languages: Shell / YAML infrastructure operations
Communication protocols: HTTPS, gRPC (etcd v3 API)
Typical client port: 2379
Key paths: /etc/kubernetes/pki/, /etc/kubernetes/manifests/, /var/lib/etcd
Core dependencies: kubectl, etcdctl, kubelet

etcd is the state foundation of Kubernetes, not just another component

etcd is the distributed key-value store for the Kubernetes control plane. Pods, Deployments, Services, Nodes, authentication and authorization data, and cluster metadata are all written to it. The controllers and the API server depend on etcd to read and commit state, so an etcd failure usually means the entire cluster loses its source of truth.

For operators, backing up etcd is not simply about preserving one service’s data. It creates a disaster recovery anchor point for the entire cluster. As long as the snapshot is intact, most control plane objects have a recoverable foundation.
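Everything the API server persists lands under etcd's /registry prefix, which you can inspect directly. A minimal sketch, assuming a control plane node with the kubeadm default certificate paths (the function name is illustrative):

```shell
# Illustrative helper: list the first few Kubernetes object keys stored in
# etcd under the /registry prefix (kubeadm default certificate paths assumed).
list_registry_keys() {
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    get /registry --prefix --keys-only | head -n 10
}
```

Typical output contains keys such as /registry/pods/<namespace>/<name>, one per stored object, which makes it concrete that the snapshot captures every cluster resource.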

kubeadm directory conventions determine the recovery workflow

kubeadm automatically generates control plane certificates, static Pod manifests, and the etcd data directory. That gives backup and restore operations a standard set of paths to follow.

/etc/kubernetes/pki/: TLS certificates for the API server, etcd, and related components
/etc/kubernetes/manifests/: static Pod manifest directory continuously watched by kubelet
/var/lib/etcd: default etcd data directory

kubectl config use-context kubernetes-admin@kubernetes
# Switch to the administrator context generated by kubeadm
kubectl config current-context
# Confirm that the current context is active
kubectl config get-contexts
# List all available local contexts

Use these commands before high-risk operations to confirm that you are using administrator credentials against the correct cluster.

Backups must use etcdctl snapshots instead of copying the data directory

In a live production cluster, the recommended method is always etcdctl snapshot save. It creates a consistent snapshot through the etcd v3 API, does not require downtime, and can be validated directly with snapshot status.

Directly copying /var/lib/etcd is a cold backup approach. If etcd is writing at the time of the copy, the directory contents may be only partially written. Even if the copy succeeds, recoverability is not guaranteed.

The standard backup command relies on localhost and TLS certificates

apt install etcd-client -y
# Install the etcdctl client (Debian/Ubuntu package name; use your distribution's etcdctl package elsewhere)

ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /srv/etcd-snapshot.db
# Generate a consistent snapshot file through the local etcd endpoint

This command depends on three essentials: using the v3 API, connecting to 127.0.0.1:2379, and presenting the etcd TLS certificates for authentication.

Why not use the physical IP address of the control plane node? kubeadm runs etcd as a static Pod and configures it to listen on the local loopback address (typically alongside the node's own IP). On the current control plane node, 127.0.0.1 is always reachable, does not depend on the node's network configuration, and aligns with the default certificate trust chain, which makes it the most stable option.
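You can confirm which client URLs your etcd actually listens on by inspecting the kubeadm-generated static Pod manifest; the path below is the kubeadm default and an assumption of this sketch:

```shell
# Show etcd's client listen addresses from the kubeadm static Pod manifest.
ETCD_MANIFEST=/etc/kubernetes/manifests/etcd.yaml   # kubeadm default path
if [ -f "$ETCD_MANIFEST" ]; then
  grep -- '--listen-client-urls' "$ETCD_MANIFEST"
else
  echo "etcd manifest not found at $ETCD_MANIFEST (not a kubeadm control plane node?)"
fi
```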

ETCDCTL_API=3 etcdctl --write-out=table snapshot status /srv/etcd-snapshot.db
# Validate the snapshot version, revision, and size to confirm that the backup is usable

Use this command to verify that the snapshot is readable so you do not mistake a corrupted file for a valid recovery point.
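For routine use, the save and status steps can be combined into one hedged helper; the timestamped path under /srv/etcd-backups and the function name are assumptions, not part of the original workflow:

```shell
# Sketch: take a timestamped snapshot and immediately validate it, so a
# corrupted file is never silently kept as a recovery point.
backup_etcd() {
  local dir=/srv/etcd-backups                       # assumed backup location
  local snap="$dir/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"
  mkdir -p "$dir" &&
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    snapshot save "$snap" &&
  ETCDCTL_API=3 etcdctl --write-out=table snapshot status "$snap"
}
```

Because the function stops at the first failed step, a save that did not validate never looks like a finished backup.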

Control plane writes must stop before restore to avoid state divergence

Restoring etcd means returning the cluster to a historical point in time. If the API server and controller-manager continue writing during the restore process, the old and new states will conflict. The first step is therefore not restore, but write suspension.

In a kubeadm-based architecture, the most direct method is to move the static Pod manifest directory out of the way. Once kubelet detects that the manifests are gone, it stops recreating core components such as the API server and etcd.

The standard restore workflow has three phases: stop writes, restore data, and restart

mv /etc/kubernetes/manifests /etc/kubernetes/manifests.bak
# Move the static Pod manifests to stop control plane writes
mv /var/lib/etcd /var/lib/etcd.bak
# Back up the existing data directory and preserve a rollback path

This step freezes the control plane and protects the old data so that you can still roll back if the restore fails.
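Before touching the data directory, it is worth confirming that the API server really stopped; a small check along these lines (the function name is illustrative) avoids restoring while writes are still possible:

```shell
# Illustrative check: the restore should only proceed once the API server
# no longer answers health probes.
check_control_plane_stopped() {
  if kubectl get --raw /healthz >/dev/null 2>&1; then
    echo "API server still responding; wait for kubelet to stop the static Pods" >&2
    return 1
  fi
  echo "API server unreachable; safe to proceed with the restore"
}
```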

ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --data-dir /var/lib/etcd \
  snapshot restore /srv/etcd-snapshot.db
# Restore the snapshot into a new etcd data directory; this is an offline file operation, so the TLS flags are accepted but not strictly required

This command expands the snapshot into a new data directory that etcd will use when it starts again. The target directory must not already exist, which is why the old /var/lib/etcd was moved aside in the previous step.
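A quick sanity check on the result, assuming the default data directory: a successful restore creates a member/ subdirectory holding the snap and wal data.

```shell
# Verify the restored directory has the expected etcd layout.
RESTORED_DIR=/var/lib/etcd
if [ -d "$RESTORED_DIR/member" ]; then
  echo "restore produced $RESTORED_DIR/member"
else
  echo "member directory missing under $RESTORED_DIR; re-check the restore output"
fi
```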

mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests
# Restore the static Pod manifest directory
systemctl restart kubelet.service
# Restart kubelet to bring the control plane components back up

This step allows kubelet to recreate etcd, the API server, and the other control plane components from the manifests.
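The control plane takes some time to converge after kubelet restarts; a polling loop like this sketch (the function name and timeouts are arbitrary) avoids running validation commands too early:

```shell
# Poll the restarted API server's health endpoint before validating resources.
wait_for_apiserver() {
  local attempts=${1:-30}                 # default: 30 attempts, 5 s apart
  while [ "$attempts" -gt 0 ]; do
    if kubectl get --raw /healthz >/dev/null 2>&1; then
      echo "API server healthy"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 5
  done
  echo "API server did not become healthy in time" >&2
  return 1
}
```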

Post-restore validation must cover nodes, system Pods, and application resources

A completed restore does not automatically mean the cluster is usable. You must verify that the control plane has converged again. A practical approach is to validate three layers: nodes, system components, and application objects.

kubectl get nodes
# Check whether nodes return to Ready
kubectl get pods -n kube-system
# Check the status of the control plane and core system components
kubectl get pods -A
# Check whether application namespace objects have returned to the expected state

Use these commands to confirm that the restored control plane, system services, and application resources all match the snapshot point.

Common pitfalls and boundary conditions should be identified early

In a highly available multi-control-plane cluster, the backup strategy does not change: you should still run the snapshot against a local endpoint on one control plane node, because the etcd cluster uses Raft for consistency and any healthy member can export a consistent snapshot. Restore is more involved in that topology, however, since every member must be rebuilt from the snapshot with matching initial-cluster settings; the single-node flow shown here does not cover that case.

You cannot omit the certificate parameters. kubeadm enables etcd TLS by default, so leaving out --cacert, --cert, or --key will often cause connection failures or TLS handshake errors.
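One way to make the flags hard to forget is a small wrapper that always supplies them; the name etcdctl3 is an assumption of this sketch, not a standard tool:

```shell
# Hypothetical wrapper: bundle the endpoint and kubeadm TLS flags so every
# etcdctl invocation is authenticated consistently.
etcdctl3() {
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    "$@"
}
# Example: etcdctl3 snapshot save /srv/etcd-snapshot.db
```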

If a restore fails, roll back to the preserved originals:

rm -rf /var/lib/etcd
# Discard any partially restored data directory (safe only because the original was preserved as /var/lib/etcd.bak)
mv /var/lib/etcd.bak /var/lib/etcd
# Restore the original etcd data directory
mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests
# Restore the static Pod manifests
systemctl restart kubelet.service
# Restart kubelet to bring the original control plane back online

Use this command set for fast rollback after a failed restore, assuming you preserved backups of both the manifests directory and the original data directory.

An executable checklist reduces disaster recovery errors

Step Core action
1 Switch to the administrator context
2 Install etcdctl
3 Run snapshot save
4 Validate the snapshot with snapshot status
5 Move manifests and the old data directory before restore
6 Run snapshot restore
7 Restore manifests and restart kubelet
8 Validate nodes, kube-system, and application resources
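Steps 5 through 7 of the checklist can be condensed into a single sketch; restore_etcd is an illustrative name, and the paths are the kubeadm defaults discussed above:

```shell
# Sketch of the restore phase: stop writes, restore data, restart.
# Stops at the first failed step so the .bak copies remain usable for rollback.
restore_etcd() {
  local snap=${1:?usage: restore_etcd /path/to/snapshot.db}
  # Stop control plane writes and preserve a rollback path.
  mv /etc/kubernetes/manifests /etc/kubernetes/manifests.bak &&
  mv /var/lib/etcd /var/lib/etcd.bak &&
  # Expand the snapshot into a fresh data directory.
  ETCDCTL_API=3 etcdctl --data-dir /var/lib/etcd snapshot restore "$snap" &&
  # Bring the control plane components back.
  mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests &&
  systemctl restart kubelet.service
}
```

After it returns, the validation step (nodes, kube-system, application resources) still has to be run by hand.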

FAQ

FAQ 1: Why does etcd backup connect to 127.0.0.1:2379?

Because etcd deployed by kubeadm always listens on the local loopback address (typically alongside the node's own IP), and local access best matches the default TLS certificate trust configuration. If you run the same command on another node, 127.0.0.1 points to that machine itself, not to the etcd instance on the control plane node.

FAQ 2: Why can’t I directly copy /var/lib/etcd for routine backups?

Because etcd data changes continuously while the service is running. Copying the directory directly does not guarantee consistency and usually produces an unverifiable cold backup. In production, you should use etcdctl snapshot save to create an officially supported hot snapshot.

FAQ 3: Why must I stop control plane writes before restoring etcd?

Because restore rolls the cluster state back to a historical point in time. If the API server and controllers continue writing, the new state will conflict with the snapshot state, leading to state divergence, component instability, or even restore failure.

Core summary

This article systematically reconstructs the etcd backup and restore workflow for kubeadm environments. It explains why etcd is the core of Kubernetes state, provides standard commands for snapshot, restore, write suspension, rollback, and validation, and clarifies the critical role of 127.0.0.1, TLS certificates, and the static Pod manifest directory.