This article focuses on KingbaseES high availability disaster recovery. It systematically breaks the topic into three major tracks: cluster failover, physical backup repair, and emergency data recovery. It addresses common production issues such as outages, split-brain, archive failures, accidental deletion, and corrupted pages. Keywords: KingbaseES, high availability cluster, disaster recovery.
Technical Specifications Snapshot
| Parameter | Description |
|---|---|
| Database | KingbaseES |
| Architecture Pattern | Read/write splitting high availability cluster |
| Core Protocols / Mechanisms | Streaming Replication, WAL archiving, PITR |
| Cluster Management | repmgr, repmgrd |
| High Availability Daemon | kbha |
| Secure Communication | sys_securecmdd |
| Backup Tool | sys_rman |
KingbaseES disaster recovery must be designed around RTO and RPO
KingbaseES carries core transactional data in critical sectors such as finance, government, and energy. The goal of disaster recovery is not simply to “bring the database back up.” The real goal is to restore service as quickly as possible while minimizing data loss.
From an operations perspective, disaster recovery falls into three categories: cluster recovery, physical backup recovery, and emergency data recovery. These correspond to service availability, recovery executability, and data integrity. In production, you must build all three lines of defense together.
You should align on several core terms first
- Primary node: The current primary database that provides read/write services.
- Standby node: A replica that provides read-only access or disaster recovery capacity.
- RTO: Recovery Time Objective, the maximum acceptable time to restore service.
- RPO: Recovery Point Objective, the maximum acceptable amount of data loss, expressed as a point in time.
- repmgr: A cluster management and failover tool.
- kbha: A high availability inspection and startup process.
```bash
# Check cluster service status
repmgr cluster show
repmgr service status

# Check critical processes
ps -ef | grep kingbase
ps -ef | grep repmgrd
ps -ef | grep kbha
```
Use these commands to quickly confirm whether the database instance, cluster services, and daemon processes are in a recoverable state.
Recovery in a read/write splitting cluster should start by ruling out infrastructure issues
Before taking any recovery action, confirm that the servers themselves are available. If the network, disks, or passwordless communication are unhealthy, later failover and rebuild operations will usually fail.
A practical baseline inspection should cover six items: network connectivity, available memory, disk capacity and write capability, firewall policy, passwordless communication between nodes, and whether the kbha scheduled tasks are running correctly.
A small set of baseline health-check commands covers most prerequisites
```bash
# Check network connectivity
ping <peer_ip>

# Check memory
free -h

# Check disk capacity and mount points
df -h

# Check firewall status
systemctl status firewalld

# Check scheduled tasks
cat /etc/cron.d/KINGBASECRON
```
These checks help rule out false failures, where the recovery commands themselves are correct but the environment they run in is broken.
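The checks above do not cover the passwordless communication item from the list. A minimal sketch for that check is below, assuming ssh-based trust for the cluster OS user; clusters that route inter-node commands through sys_securecmdd should instead confirm that its daemon is running and reachable on both nodes.
```bash
# Verify passwordless trust to the peer node without an interactive prompt
# (<peer_ip> is a placeholder; assumes ssh-based trust for the cluster OS user)
ssh -o BatchMode=yes -o ConnectTimeout=5 <peer_ip> hostname

# If inter-node communication goes through sys_securecmdd, confirm the daemon is up
ps -ef | grep sys_securecmdd
```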
Automatic failover and recovery can cover most common node-level failures
KingbaseES automatic recovery depends on repmgr.conf. Typical settings are recovery = automatic or recovery = standby. The former favors a higher level of automation. The latter is more conservative and is commonly used when you want direct failover after a primary failure while preventing the old primary from rejoining automatically.
Automatic recovery can usually handle primary outages, standby disconnects, process failures, partial network jitter, and some storage-level issues. If automatic recovery is not enabled in production, the failure window expands significantly.
```ini
# repmgr.conf
recovery = automatic
# Or recover standby nodes only
# recovery = standby
```
This setting determines whether a node automatically enters the recovery workflow after a failure occurs.
You must clearly understand the boundaries of automatic recovery
Automation is not universal. In scenarios such as dual-primary, multi-primary, full-cluster outage, or split-brain, you still need manual judgment to identify the real primary, then rebuild and re-register the other nodes. At that point, incorrect operations are more dangerous than the failure itself.
Manual disaster recovery must follow a closed loop of scenario assessment, primary selection, rebuild, and validation
The first step in manual recovery is not restarting services. It is identifying the current failure type in the cluster. Common cases fall into single-primary, no-primary, multi-primary, partial node failure, and full-cluster failure, and this classification works well in real incident response.
In particular, in a multi-primary scenario, you must first decide which primary should be preserved. The decision usually relies on timeline, WAL LSN, and overall data volume. In most cases, prefer the node with the more advanced timeline and log position.
Primary selection and validation are the two most critical steps
```sql
-- Check the timeline to determine which node is closer to the latest consistent state
select timeline_id from sys_control_checkpoint();
-- Check the current WAL LSN; a larger LSN usually indicates more advanced data changes
select sys_current_wal_lsn();
-- Use database size as an additional signal
select sys_size_pretty(sys_database_size(oid)) from sys_database;
```
Use these SQL statements to identify the primary that should be retained after a multi-primary incident or split-brain event.
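In practice it helps to run the same checks against every candidate from a single workstation and compare the results side by side. A minimal sketch, assuming the ksql client is available; the host list, port, user, and database are placeholders to adjust for your environment.
```bash
# Compare timeline and current WAL position across candidate primaries
for host in <candidate_ip_1> <candidate_ip_2>; do
  echo "== $host =="
  ksql -h "$host" -p 54321 -U system -d test -c \
    "select timeline_id, sys_current_wal_lsn() as current_lsn from sys_control_checkpoint();"
done
```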
Prefer rejoin when returning a failed node to the cluster
```bash
# Stop the failed node instance
sys_ctl -D $data_path stop

# Rejoin the failed node to the new primary
repmgr -h <new_primary_ip> -U esrep -d esrep node rejoin --force-rewind

# If rejoin fails, clone the standby again
repmgr -h <new_primary_ip> -U esrep -d esrep -D $data_path standby clone -F --fast-checkpoint
sys_ctl -D $data_path start
repmgr standby register -F
```
This workflow safely returns an old primary or lagging node to the cluster and prevents it from continuing as a divergent branch.
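Close the loop with a validation pass once the node is back. A minimal sketch is shown below; sys_stat_replication is assumed to mirror PostgreSQL's pg_stat_replication view, and the ksql connection details are placeholders.
```bash
# Confirm every node reports the expected role and a healthy status
repmgr cluster show

# On the new primary, confirm the rejoined node is streaming and inspect its replay position
ksql -U system -d test -c \
  "select application_name, state, sync_state, replay_lsn from sys_stat_replication;"
```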
The stability of the physical backup pipeline defines the upper limit of disaster recovery
KingbaseES physical backup relies on sys_rman. In many production incidents, what makes recovery impossible is not the database failure itself, but a backup pipeline that has been unhealthy for a long time without detection.
Five issues appear most frequently: incorrect repo_ip, incompatible system parameters, out-of-memory termination, WAL archiving not enabled, and backup connections interrupted by timeouts. These may look like small details, but they directly determine whether PITR is possible.
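Most of these failures can be triaged with ordinary operating-system checks before touching the backup tool itself. A minimal sketch follows; the repository path is a placeholder, and log locations vary by distribution.
```bash
# Was a backup or archive process terminated by the OOM killer?
dmesg -T | grep -i "out of memory"

# Is a sys_rman backup or archive-push still running, or stuck?
ps -ef | grep sys_rman

# Does the backup repository still have space and remain writable?
df -h <backup_repo_path>
touch <backup_repo_path>/.write_test && rm <backup_repo_path>/.write_test
```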
WAL archiving is the most important prerequisite for physical recovery
```bash
# Edit the database configuration
vi $data_path/kingbase.conf

# Enable WAL archiving
archive_mode = ON
archive_command = 'sys_rman archive-push %p'

# Restart the instance to apply the configuration
sys_ctl -D $data_path restart
```
This configuration ensures continuous WAL archiving and provides the log foundation required for point-in-time recovery.
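After the restart, confirm that archives are actually being produced instead of assuming the setting took effect. A minimal sketch, assuming sys_switch_wal() and sys_stat_archiver follow the same sys_ naming convention as the functions used earlier; the ksql connection details are placeholders.
```bash
# Force a WAL segment switch so there is something to archive immediately
ksql -U system -d test -c "select sys_switch_wal();"

# Check that the archiver has recent successes and no accumulating failures
ksql -U system -d test -c \
  "select archived_count, last_archived_wal, last_archived_time, failed_count from sys_stat_archiver;"
```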
Emergency data recovery should start by determining whether backups and logs exist
Accidental table deletion, accidental database deletion, unintended operations on the data directory, and page corruption all fall into the category of emergency data recovery. The most important action here is not a technical operation. It is to freeze writes immediately so you do not overwrite evidence.
The recoverability assessment is straightforward: if you have a physical backup and complete WAL files, you can perform PITR. If you have only a backup but no logs, you can restore only to the backup point. If you have neither backup nor logs, recovery is usually impossible.
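When both a backup and complete WAL exist, the recovery target is expressed through the recovery configuration. The sketch below assumes a PostgreSQL-12-style setup (restore_command plus a recovery.signal file), which KingbaseES generally follows; restore the base backup into the data directory with sys_rman first, and treat the archive path and target time as placeholders.
```bash
# Point recovery at the archived WAL and the desired point in time
cat >> $data_path/kingbase.conf <<'EOF'
restore_command = 'cp <archive_dir>/%f "%p"'   # replace with how your archived WAL is fetched
recovery_target_time = '2024-01-01 10:00:00'   # placeholder target time
EOF

# recovery.signal tells the instance to enter recovery mode on startup
touch $data_path/recovery.signal
sys_ctl -D $data_path start
```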
Corrupted-page emergency parameters should be used only for short-term rescue
```sql
-- Ignore checksum errors and try to read recoverable data
set ignore_checksum_failure = on;
select * from damaged_table;
-- Zero out damaged pages for emergency extraction only
set zero_damaged_pages = on;
select count(*) from damaged_table;
```
These two parameters are suitable only for emergency data extraction. They are not appropriate for long-term operation. After using them, return to the standard repair workflow immediately.
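A common pattern during such an extraction is to copy whatever is still readable into an intact table before switching the parameters back off. A minimal sketch, with placeholder table names and ksql connection details:
```bash
# Salvage readable rows from the damaged table in a single emergency session,
# then turn the emergency parameters off again
ksql -U system -d test <<'EOF'
set ignore_checksum_failure = on;
set zero_damaged_pages = on;
create table damaged_table_salvage as select * from damaged_table;
set ignore_checksum_failure = off;
set zero_damaged_pages = off;
EOF
```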
Production operations should shift recovery capability left into daily practice
High availability does not mean “we can fix it when it breaks.” It means “we rehearse it before it breaks.” A practical strategy is to enable automatic recovery by default, run physical backups daily, archive WAL continuously, and simulate typical scenarios every month, including primary failure, multi-primary incidents, corrupted pages, and backup failure.
Before any recovery, always back up the data directory and logs. Any direct repair without preserving on-site evidence can destroy clues, narrow the recovery path, and even escalate a logical failure into a physical one.
The core value of this recovery methodology lies in standardization rather than tricks
KingbaseES disaster recovery best practices are not inherently complicated. The key is to standardize scenario classification, diagnostic commands, primary selection rules, node reintegration, and recovery validation. That is how you compress complex incidents into executable steps.
For DBA and operations teams, three capabilities matter most: identifying the failure type, executing standard commands, and verifying final consistency. When you do these three things well, RTO and RPO become controllable.
FAQ
1. In production, should recovery be set to automatic or standby?
If your goal is to minimize manual intervention, choose automatic. If you want a more conservative strategy that still fails over after a primary failure but does not recover the old primary automatically, choose standby.
2. Why should you not simply restart every node during a multi-primary failure?
Because a multi-primary condition usually means data divergence may already exist. Restarting every node can amplify the conflict. The correct approach is to compare the timeline, LSN, and data volume first, identify the single primary to preserve, and then let the other nodes rejoin or reclone.
3. What should you do first after accidental data deletion?
Stop application writes immediately and preserve the current state. Then verify whether the physical backup and WAL files are complete. If both exist, PITR is usually the preferred path. If you have only logs, log parsing may still help. If you have neither, the probability of successful recovery is very low.