KingbaseES High Availability Architecture: Single-Node Hardening, Self-Healing Clusters, and Backup and Recovery Best Practices

KingbaseES provides a four-layer high availability framework: single-node protection, primary-standby replication, shared-storage clusters, and backup and recovery. Together, these layers address business continuity, self-healing after failures, and reliable data recovery for mission-critical systems. Keywords: KingbaseES, High Availability, Backup and Recovery.

Technical specifications provide a quick deployment snapshot

Database Engine: KingbaseES
Primary Languages / Environment: Shell, INI, SQL/JDBC configuration
Core Protocols / Mechanisms: Streaming Replication, VIP Failover, Corosync, Pacemaker, PITR
Cluster Management Components: repmgr, sys_monitor.sh
Backup Components: sys_backup.sh, sys_rman
Default Service Port: 54321
Applicable Scenarios: Mission-critical workloads in finance, government, energy, telecommunications, and similar sectors

KingbaseES high availability covers the full production lifecycle

KingbaseES is built around a maximum-availability design and organizes its capabilities into four layers: single-node high availability, primary-standby read/write splitting clusters, shared-storage clusters, and backup and recovery. It is not a single component but a closed-loop solution that spans prevention, failover, and recovery.

The first layer targets development, testing, or non-critical systems, with the goal of making a standalone instance stable and recoverable. The second layer is a one-primary, multiple-standby streaming replication cluster that supports both read/write splitting and automated failover. The third layer uses shared storage and resource orchestration to reduce RTO even further. The fourth layer acts as the final safeguard, covering accidental deletion, corruption, and site-level disasters.

The selection logic across the four layers is straightforward

If the business can tolerate short interruptions, single-node hardening is sufficient. If you need to balance cost and availability, choose a read/write splitting cluster first. If the system is highly sensitive to failover time and requires recovery within seconds, then evaluate Clusterware. Regardless of the architecture you choose, backup and recovery must be built as an independent capability.

Single-node high availability starts with a stable operational foundation

Single-node mode is not a “lighter version” of high availability. It is the starting point for every architecture. Operating system settings, resource limits, kernel parameters, WAL archiving, and service auto-start determine whether an instance is truly recoverable.

# Disable SELinux to prevent security policies from blocking database file access
sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
setenforce 0

# Increase file descriptors and process limits to support high concurrency
cat >> /etc/security/limits.conf << 'EOF'
* soft nofile 65536
* hard nofile 65536
* soft nproc 65536
* hard nproc 65536
EOF

# Configure kernel semaphores to stabilize inter-process communication
echo 'kernel.sem = 5010 64128000 50100 1280' >> /etc/sysctl.conf
sysctl -p

This script hardens a standalone environment and provides the minimum prerequisites for stable database operation.
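To confirm the hardening took effect, a quick verification pass helps; note that the limits.conf changes only apply to new login sessions:

# Verify the hardened settings from a fresh login session
getenforce               # expect Permissive (Disabled after a reboot)
ulimit -n                # expect 65536
ulimit -u                # expect 65536
sysctl kernel.sem        # expect 5010 64128000 50100 1280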

Database parameters define the upper bound of recoverability

Archiving, WAL settings, connection limits, and log collection are the four most critical parameter groups in a single-node environment. If archiving is not enabled, physical backup and point-in-time recovery are effectively impossible later.

listen_addresses = '*'
port = 54321
max_connections = 1200               # example sizing: roughly 1.2x peak business connections
shared_buffers = 8GB                 # example for a 24GB host: about 1/3 of physical memory
wal_level = replica                  # WAL level required for replication and recovery
archive_mode = on                    # enable archiving
archive_command = 'test ! -f /archive/%f && cp %p /archive/%f'
wal_keep_segments = 512              # retain enough WAL to prevent standby catch-up failure
logging_collector = on               # enable log collection
log_destination = 'csvlog'           # structured csv logs simplify analysis

These parameters give a standalone instance the baseline capabilities required to be observable, recoverable, and portable.
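A quick way to prove that archiving actually works is to force a WAL switch and watch the archive directory. This is a minimal sketch: it assumes the ksql client, the default system superuser and test database, and KingbaseES's sys_-prefixed equivalents of the familiar PostgreSQL admin functions.

# Force a WAL segment switch, then confirm segments land in /archive
ksql -p 54321 -U system -d test -c "SELECT sys_switch_wal();"
ksql -p 54321 -U system -d test -c "SELECT archived_count, failed_count FROM sys_stat_archiver;"
ls -lt /archive | head   # the newest archived segment should appear here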

Read/write splitting clusters are the mainstream production model for KingbaseES

This model builds a one-primary, multiple-standby topology on top of streaming replication. The primary handles writes, while the standbys serve read traffic. repmgr manages node monitoring, failure detection, failover, and rejoin operations. On the application side, cluster awareness is typically implemented through a VIP or JDBC routing.

Before deployment, make sure all nodes are consistent: clock skew should ideally remain below 2 seconds, port 54321 and management ports must be open, primary and standby nodes should use similar hardware specifications, and passwordless communication between nodes must be configured.
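These prerequisites are easy to script as a pre-flight check; the node names and addresses below are illustrative:

# Pre-flight checks before cluster deployment (node names are examples)
for node in node1 node2 node3; do
  ssh -o BatchMode=yes $node date +%s.%N   # verifies passwordless SSH and gives a rough clock comparison
done
chronyc tracking | grep 'System time'      # local clock offset; keep skew under ~2 seconds
nc -z -w 3 node2 54321 && echo "database port reachable"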

Streaming replication parameters form the consistency baseline

wal_level = replica
max_wal_senders = 32
hot_standby = on
hot_standby_feedback = on
synchronous_commit = remote_apply
wal_log_hints = on
full_page_writes = on

These settings enable read-only standby access, strengthen consistency, and reduce data risk after a failover.
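On the primary, replication state can be confirmed directly; this sketch assumes KingbaseES exposes the streaming-replication view and LSN functions under the sys_ prefix used elsewhere in the product:

# On the primary: confirm every standby is streaming and check its replay gap
ksql -p 54321 -U esrep -d esrep -c "
  SELECT application_name, state, sync_state,
         sys_wal_lsn_diff(sys_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
  FROM sys_stat_replication;"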

Cluster management settings directly control failover behavior

node_id = 1
node_name = 'node1'
conninfo = 'host=192.168.2.129 user=esrep dbname=esrep port=54321 connect_timeout=10'
failover = 'automatic'
recovery = 'automatic'
monitor_interval_secs = 2
reconnect_attempts = 3
reconnect_interval = 5
virtual_ip = '192.168.2.200/24'
net_device = 'eth0'

This repmgr configuration defines node identity, health check intervals, automated failover, and VIP migration behavior.

JDBC can provide transparent read/write splitting at the application layer

USEDISPATCH=true
HOST=192.168.2.129
PORT=10001
DBNAME=TEST
SLAVE_ADD=192.168.2.130,192.168.2.131
SLAVE_PORT=10002,10003
nodeList=node1,node2,node3
HOSTLOADRATE=50

This configuration allows the application to distribute read requests to standby nodes with minimal code changes.

After the primary fails, the cluster usually elects a new primary based on LSN, priority, and node health. Once VIP migration succeeds, business connections automatically switch to the new primary. After the original primary is repaired, it can rejoin the cluster, which avoids a manual full rebuild.

# Show cluster status
repmgr cluster show

# Perform a manual switchover
repmgr standby switchover

# Rejoin a failed node to the new primary cluster
repmgr node rejoin -h NewPrimaryIP -U esrep -d esrep --force-rewind

# Start or stop the high-availability monitoring service
sys_monitor.sh start
sys_monitor.sh stop

These commands cover the core operational tasks for cluster inspection, switchover, and failed-node recovery.

Shared-storage clusters fit core systems with extremely low RTO requirements

Clusterware uses Corosync for membership management and Pacemaker for resource scheduling. When needed, it also works with qdevice quorum to prevent split-brain in two-node environments. It is better understood as a resource-level high availability platform rather than a simple primary-standby database topology.

Quorum and resource ordering determine cluster reliability

totem {
  version: 2
  token: 60000
  cluster_name: 2nodes
}
nodelist {
  node {
    ring0_addr: node1
    nodeid: 1
  }
  node {
    ring0_addr: node2
    nodeid: 2
  }
}
quorum {
  provider: corosync_votequorum
  expected_votes: 3
  device {
    votes: 1
    model: disk
    disk {
      label: "2nodes"
    }
  }
}

This Corosync configuration defines heartbeat behavior, the node list, and quorum disk strategy, all of which directly affect split-brain protection.
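Once the cluster is up, membership and quorum can be verified with the standard Corosync and Pacemaker tooling:

# Check membership, votes, and quorum state
corosync-quorumtool -s     # expect Quorate: Yes with 3 total votes
corosync-cfgtool -s        # per-ring heartbeat status
crm status                 # Pacemaker's view of nodes and resources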

# Configure the VIP resource
crm configure primitive FIP1 ocf:heartbeat:IPaddr2 params ip="192.168.4.135" cidr_netmask="24" nic="bond0"

# Configure the shared filesystem resource
crm configure primitive FILESYSTEM1 ocf:heartbeat:Filesystem params device="-U DiskUUID" directory="/sharedata/data1" fstype="ext4"

# Configure the database resource
crm configure primitive DB1 ocf:kingbase:kingbase params kb_data="/sharedata/data1/data" kb_port=54321 op monitor interval=9s

# Group resources to enforce startup order and avoid resource drift
crm configure group DB1_GROUP FIP1 FILESYSTEM1 DB1

These commands bind the IP address, filesystem, and database into a single resource group to ensure consistent startup order and failover behavior.
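Before going live, it is worth rehearsing a failover with a controlled resource move:

# One-shot status, then a controlled failover drill for the resource group
crm_mon -1r                           # shows current placement of DB1_GROUP
crm resource move DB1_GROUP node2     # migrate the whole group to the peer node
crm resource unmove DB1_GROUP         # drop the location constraint after the drill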

Backup and recovery remain the final safety net for every HA design

High availability solves service continuity; backup and recovery solve data survivability. KingbaseES supports full, incremental, and differential backup as well as point-in-time recovery, covering scenarios from accidental deletion and storage corruption to datacenter-level failures.

Backup configuration must clearly define roles, repository, and retention

_target_db_style="cluster"
_one_db_ip="192.168.1.11"
_repo_ip="192.168.1.66"
_stanza_name="kingbase"
_os_user_name="kingbase"
_repo_path="/home/kingbase/back_dir"
_repo_retention_full_count=5
_crond_full_days=7
_crond_incr_days=1
_use_scmd=on

This configuration defines the backup target, repository location, retention policy, and automation model.

# Initialize the backup system
sys_backup.sh init

# Start scheduled backup jobs
sys_backup.sh start

# Run a full backup
sys_rman --config=/back_dir/sys_rman.conf --stanza=kingbase --type=full backup

# Restore to a specific point in time
sys_rman --config=/back_dir/sys_rman.conf --stanza=kingbase --type=time --target='2026-04-29 10:00:00' restore

These commands cover the full backup workflow from initialization and scheduled execution to PITR restoration.
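Since sys_rman follows the pgBackRest command model, backup health can be reviewed after each run; the info subcommand shown here is assumed from that lineage:

# Review the repository: backup sets, types, timestamps, and WAL coverage
sys_rman --config=/back_dir/sys_rman.conf --stanza=kingbase info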

Long-term stability depends on standardized operations

During patch upgrades, roll through the standby nodes first, then upgrade the primary. Apply hot parameters with reload, and schedule a maintenance window for cold parameters that require a restart. To add a node, clone and register it through repmgr; before removing a node, complete the role transition first and then clean up its replication slot, as sketched below.
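The add and remove workflows map onto a few repmgr calls; this is a sketch in which the host, names, and slot number are placeholders, and the slot-drop function assumes the sys_ prefix convention:

# Add a node: clone from the current primary, then register it (run on the new node)
repmgr -h 192.168.2.129 -U esrep -d esrep standby clone
repmgr standby register

# Remove a node: move the primary role away first if needed, then unregister
repmgr standby switchover        # run on a standby to take over the primary role first
repmgr standby unregister
ksql -p 54321 -U esrep -d esrep -c "SELECT sys_drop_replication_slot('repmgr_slot_2');"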

From a monitoring perspective, track at least five categories of metrics: whether each node is running, whether the primary is unique, whether the database process is online, whether replication is in streaming state, and whether replication lag stays below 10 seconds. An HA system that has never been tested in drills is not truly available.
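These five checks condense into a short probe; the lag query assumes the sys_-prefixed replay-timestamp function and should run on a standby:

# Minimal health probe covering the five metric groups
repmgr cluster show                                    # node status and primary uniqueness
ksql -p 54321 -U esrep -d esrep -Atc \
  "SELECT count(*) FROM sys_stat_replication WHERE state = 'streaming';"
ksql -p 54321 -U esrep -d esrep -Atc \
  "SELECT extract(epoch FROM now() - sys_last_xact_replay_timestamp());"   # on a standby; alert above 10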

FAQ provides structured answers for production decisions

Q1: Which high-availability architecture should you choose first in production?

A: In most scenarios, a read/write splitting cluster should be your first choice. It offers the best balance of cost, complexity, scalability, and automated failover, making it the default production architecture for most enterprises.

Q2: Why do you still need a backup system if primary-standby replication is already in place?

A: Primary-standby replication only protects against node-level failures. It does not solve accidental deletion, logical corruption, ransomware damage, or historical point-in-time rollback. Without backups, you do not have a true data protection safety net.

Q3: How can you tell whether a cluster has hidden failure risks?

A: Focus on replication lag, streaming status, the current VIP location, primary uniqueness, and quorum status. Then validate the full self-healing chain with primary-failure drills to confirm that failover works as expected.

AI Readability Summary

This article reconstructs the KingbaseES high-availability design into a practical production guide. It covers single-node hardening, read/write splitting clusters, Clusterware-based shared-storage clusters, and backup and recovery. It also distills the essential configuration patterns, failover logic, operational standards, and monitoring indicators required for real-world deployment of a domestic enterprise database platform.