MariaDB Galera troubleshooting is a critical sub-discipline of managing high availability (HA) database clusters. In modern stacks, particularly cloud platforms, energy management systems, and high-load network infrastructure, the database serves as the authoritative state engine. When a MariaDB Galera cluster suffers a network partition or a “split-brain” event, the impact on throughput and concurrency can be catastrophic. This manual provides a framework for diagnosing and remediating cluster failures, with a focus on the “Problem-Solution” context in which infrastructure architects must restore service without compromising data integrity. Replication relies on synchronous write-set certification, so latency or packet loss in the underlying network fabric directly degrades the database layer. Effective troubleshooting requires an understanding of how the wsrep (Write Set Replication) provider interacts with the Linux kernel and the local file system so that every node applies the same write-sets in the same order and the global cluster state stays consistent.
Technical Specifications
| Requirement | Default Port / Operating Range | Protocol / Standard | Impact Level (1-10) | Recommended Resources |
| :--- | :--- | :--- | :--- | :--- |
| SQL Client Traffic | 3306 | TCP/IP | 10 | 4 vCPU / 8GB RAM Minimum |
| Galera Replication | 4567 | GCS (Galera Comm. Sys.) | 9 | Low Latency Interconnect |
| State Transfer (IST) | 4568 | TCP | 7 | High Disk I/O (NVMe preferred) |
| State Snapshot (SST) | 4444 | rsync / mariabackup | 8 | 10Gbps Network Interface |
| Quorum Maintenance | 3-Node Minimum | Paxos-like Consensus | 10 | ECC Memory (Strict parity) |
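The ports in the table can be probed from any cluster member before digging into database logs. The following is a minimal sketch using only the Python standard library; the function names and the `GALERA_PORTS` mapping are illustrative, not part of any MariaDB tooling. Note that the IST and SST ports typically only listen while a transfer is in progress, so a closed 4568 or 4444 on an idle node is not by itself a fault.

```python
import socket
from typing import Dict

# Port numbers taken from the table above; the labels are for display only.
GALERA_PORTS = {
    3306: "SQL client traffic",
    4567: "Galera replication",
    4568: "IST",
    4444: "SST",
}


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_node(host: str, timeout: float = 2.0) -> Dict[int, bool]:
    """Probe every port in GALERA_PORTS on one node."""
    return {port: port_is_open(host, port, timeout) for port in GALERA_PORTS}
```

Calling `check_node("10.0.0.2")` (a placeholder address) from each peer gives a quick picture of which paths a firewall may be blocking.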
The Configuration Protocol
Environment Prerequisites:
Successful MariaDB Galera troubleshooting requires a stable operating environment. All nodes should run MariaDB 10.4 or higher, which ships Galera 4; older releases lack some of the recovery features this manual relies on. Keep clocks synchronized via NTP or Chrony; Galera orders transactions by sequence number rather than wall-clock time, but clock drift makes it hard to correlate log entries across nodes during an incident. Service management requires root or sudo access, and state transfers require a dedicated database account (referred to here as sstuser). SELinux should either run in permissive mode or be updated with specific policies that allow mysqld_t to use the replication ports.
Section A: Implementation Logic:
Galera is designed around virtually synchronous replication. Unlike traditional primary-replica setups, every transaction's write-set is broadcast to all nodes and certified in the same global order before the commit returns, so all nodes converge on identical data; the remaining nodes may apply the write-set a short moment later. If a node cannot apply a certified write-set to its local data, it leaves the cluster to prevent divergence. The “Why” behind the manual bootstrap logic is to establish a “Primary Component.” Without one, the cluster remains in a non-primary state and refuses writes, which prevents “split-brain” scenarios where two nodes might accept different writes for the same record.
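The Primary Component rule can be illustrated with a small majority check. This is a simplified sketch that assumes every node carries equal weight; the function name is hypothetical and the real provider also handles weighted nodes and graceful departures.

```python
def has_quorum(reachable_nodes: int, previous_component_size: int) -> bool:
    """Return True if a partition holds a strict majority of the last
    known Primary Component (all nodes assumed to have equal weight)."""
    return 2 * reachable_nodes > previous_component_size


# A three-node cluster split 2/1: only the two-node side keeps quorum.
majority_side = has_quorum(2, 3)
minority_side = has_quorum(1, 3)

# A two-node cluster split 1/1: neither side has a strict majority,
# so both halves drop to non-primary.
two_node_split = has_quorum(1, 2)
```

The two-node case shows why an even split leaves no side able to continue, and why an odd member count or an arbitrator is normally used.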
Step-By-Step Execution
1. Identify the Failure State and Process Health
Execute systemctl status mariadb on all nodes to determine which processes are active and which have terminated unexpectedly. Check the systemd journal via journalctl -u mariadb for immediate exit codes related to memory allocation or file permissions.
System Note: This command interacts with the systemd service manager to poll the PID (Process ID) status and the exit signal from the mysqld binary; if the service is “failed,” it indicates a crash or a forced stop by the kernel OOM (Out Of Memory) killer.
2. Inspect Cluster Status via SQL Interface
If the process is running, log into the database and run SHOW STATUS LIKE 'wsrep_cluster_status'; followed by SHOW STATUS LIKE 'wsrep_connected';. A healthy node returns “Primary” and “ON” respectively.
System Note: This query accesses the internal wsrep API variables that communicate with the galera.so library; it reveals whether the node is part of a functional quorum or if it is in “Non-primary” mode due to network isolation.
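The status check in this step can be turned into a small classifier. The sketch below assumes the status rows have already been fetched (for example from SHOW GLOBAL STATUS LIKE 'wsrep_%') into a dictionary. The function name is hypothetical, and the extra `wsrep_ready` check is an addition not mentioned in the step above.

```python
from typing import Dict, List


def diagnose_wsrep_status(status: Dict[str, str]) -> List[str]:
    """Return a list of detected problems; an empty list means the node
    looks healthy according to these three status variables."""
    problems = []
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("node is not part of the Primary Component")
    if status.get("wsrep_connected") != "ON":
        problems.append("node is not connected to the cluster")
    if status.get("wsrep_ready") != "ON":
        problems.append("node is not ready to accept queries")
    return problems


# Example input shaped like the rows returned by SHOW GLOBAL STATUS.
isolated_node = {
    "wsrep_cluster_status": "non-Primary",
    "wsrep_connected": "ON",
    "wsrep_ready": "OFF",
}
problems = diagnose_wsrep_status(isolated_node)
```

A missing key is treated as a problem on purpose, since a node without the wsrep provider loaded reports none of these variables.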
3. Determine the Most Advanced Node
In a full cluster shutdown, navigate to the data directory, typically /var/lib/mysql/, and examine the grastate.dat file. Locate the seqno (Sequence Number) and the safe_to_bootstrap flag.
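System Note: The grastate.dat file is a persistent record of the last committed transaction; bootstrapping from the node with the highest seqno prevents data loss. A seqno of -1 means the node did not shut down cleanly, in which case the position must first be recovered (for example with galera_recovery or mysqld --wsrep-recover) before the nodes are compared.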
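Comparing grastate.dat across nodes is easy to get wrong by eye. The following sketch parses the file contents and picks a bootstrap candidate. It assumes the file text has already been collected from each node, and both function names and the node labels are hypothetical. Nodes with seqno -1 are never selected because their position is unknown.

```python
from typing import Dict, Optional


def parse_grastate(text: str) -> Dict[str, str]:
    """Parse the 'key: value' lines of a grastate.dat file."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def pick_bootstrap_candidate(states: Dict[str, str]) -> Optional[str]:
    """states maps a node label to the text of its grastate.dat.
    Returns the node to bootstrap from, or None if no seqno is known."""
    best_node, best_seqno = None, -1
    for node, text in states.items():
        fields = parse_grastate(text)
        if fields.get("safe_to_bootstrap") == "1":
            return node  # Galera already marked this node as safe
        seqno = int(fields.get("seqno", "-1"))
        if seqno > best_seqno:
            best_node, best_seqno = node, seqno
    return best_node


example_states = {
    "node1": "# GALERA saved state\nseqno: 1041\nsafe_to_bootstrap: 0\n",
    "node2": "# GALERA saved state\nseqno: 1057\nsafe_to_bootstrap: 0\n",
    "node3": "# GALERA saved state\nseqno: -1\nsafe_to_bootstrap: 0\n",
}
candidate = pick_bootstrap_candidate(example_states)
```

This sketch does not compare the uuid field; nodes reporting different state UUIDs should be investigated manually before any bootstrap.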
4. Force Primary Component Creation
If all nodes show safe_to_bootstrap: 0, identify the node with the highest seqno and edit its grastate.dat file (for example with vi /var/lib/mysql/grastate.dat) to set safe_to_bootstrap: 1. Then run galera_new_cluster on that node only.
System Note: Manual intervention in grastate.dat overrides the safety checks of the Galera replication plugin; it declares this node the authoritative source of truth, and the new Primary Component continues from that node's saved state, to which every other node then synchronizes.
5. Rejoin Secondary Nodes and Monitor SST
Start the MariaDB service on the remaining nodes, one at a time, with systemctl start mariadb. Monitor the progress of the state transfer (IST or SST) by watching the log file at /var/log/mysql/error.log.
System Note: During a join, the provider initiates either an IST (incremental) or an SST (full), depending on whether the donor's GCache still holds the write-sets the joiner is missing. An SST consumes significant throughput because the entire data set is copied via the rsync or mariabackup utility; an IST transfers only the missing write-sets.
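The IST-versus-SST decision can be modeled roughly as follows. This is a simplified sketch of the idea, not the provider's actual logic. It assumes the donor's oldest cached seqno is known (the `wsrep_local_cached_downto` status variable reports it), and the function and parameter names are hypothetical.

```python
def choose_state_transfer(
    joiner_uuid: str,
    joiner_seqno: int,
    donor_uuid: str,
    donor_cached_downto: int,
) -> str:
    """Return 'IST' if the donor's GCache can cover the joiner's gap,
    otherwise 'SST'."""
    # A different state UUID or an unknown position forces a full copy.
    if joiner_uuid != donor_uuid or joiner_seqno < 0:
        return "SST"
    # The joiner needs every write-set starting at joiner_seqno + 1,
    # so the donor's cache must reach back at least that far.
    if donor_cached_downto <= joiner_seqno + 1:
        return "IST"
    return "SST"


# A joiner at seqno 900 against a donor whose cache starts at 850: IST.
short_outage = choose_state_transfer("abc", 900, "abc", 850)
# The same joiner against a donor whose cache starts at 950: SST.
long_outage = choose_state_transfer("abc", 900, "abc", 950)
```

The practical consequence is that a larger GCache on the donors widens the outage window a node can survive without a full SST.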
Section B: Dependency Fault-Lines:
Troubleshooting often fails when engineers overlook the networking layer. Packet-loss on port 4567 is the primary cause of nodes flapping between “Joined” and “Dropped” states. High signal-attenuation in long-distance fiber links can cause timeouts during the certification phase. Furthermore, disk I/O bottlenecks can create an overhead so high that the node cannot keep up with the replication stream, triggering “Flow Control.” If the wsrep_cluster_address variable in /etc/mysql/mariadb.conf.d/60-galera.cnf contains incorrect IP addresses, the nodes will fail to form a mesh network, resulting in a “Connection Refused” error.
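A malformed wsrep_cluster_address is easy to catch before restarting a node. The sketch below validates a gcomm:// value under stated assumptions: it handles IPv4 literals only (no hostnames or IPv6), ignores any options after a `?`, and uses a hypothetical function name.

```python
import ipaddress
from typing import List, Tuple

DEFAULT_GALERA_PORT = 4567


def parse_cluster_address(value: str) -> List[Tuple[str, int]]:
    """Split a gcomm:// address list into (ip, port) pairs.
    Raises ValueError on a bad prefix, IP address, or port."""
    prefix = "gcomm://"
    if not value.startswith(prefix):
        raise ValueError("wsrep_cluster_address must start with gcomm://")
    # Drop any trailing '?option=value' section.
    hosts_part = value[len(prefix):].split("?", 1)[0]
    members = []
    for entry in hosts_part.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.partition(":")
        ipaddress.IPv4Address(host)  # raises a ValueError subclass if malformed
        members.append((host, int(port) if sep else DEFAULT_GALERA_PORT))
    return members


members = parse_cluster_address("gcomm://10.0.0.1,10.0.0.2:4567,10.0.0.3")
```

An empty list is a valid result: a bare `gcomm://` is what a bootstrapping node uses, so the caller decides whether that is expected.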
THE TROUBLESHOOTING MATRIX
Section C: Logs & Debugging:
The primary diagnostic tool is the MariaDB error log. Look for the string “WSREP: Gap in state sequence” which indicates that the node has missed too many transactions to perform an IST and must perform a full SST. If you see “WSREP: Certification failure,” it suggests high concurrency conflicts where two nodes attempted to modify the same row simultaneously.
Visual Cues and Fault Codes:
1. Error Code 1146 (Table doesn’t exist): Often seen during a failed SST where the schema was partially created. Solution: Clear the data directory and restart the joiner.
2. Log entry “Conflicting state transfer”: Occurs when multiple nodes try to act as the donor. Solution: Explicitly set the wsrep_sst_donor variable on the joiner node.
3. Port Conflicts: Inspect using lsns or lsof -i :4567 to ensure the port is not being held by a defunct process or a containerized instance of the database.
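Scanning the error log for the strings named in this section can be automated. The sketch below matches on the substrings exactly as quoted in the text; the wording in a real log may differ between versions, and the hint messages and function name are illustrative.

```python
from typing import Iterable, List, Tuple

# Substrings quoted in this section, mapped to a short interpretation.
LOG_HINTS = {
    "WSREP: Gap in state sequence": "IST not possible; expect a full SST",
    "WSREP: Certification failure": "write conflict between nodes on the same rows",
    "Conflicting state transfer": "donor conflict; consider setting wsrep_sst_donor",
}


def scan_error_log(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, hint) for every line containing a known string."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for needle, hint in LOG_HINTS.items():
            if needle in line:
                findings.append((number, hint))
    return findings
```

In practice the function would be fed an open file handle, for example `scan_error_log(open("/var/log/mysql/error.log"))`, using the log path mentioned earlier in this manual.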
OPTIMIZATION & HARDENING
Performance Tuning:
To maximize throughput, adjust the wsrep_slave_threads variable to match the number of available CPU cores. This allows the node to apply write-sets in parallel. Fine-tune the gcs.fc_limit to prevent unnecessary Flow Control pauses; this is especially important in environments with high latency variations.
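Before changing tuning variables, it helps to confirm that Flow Control is actually engaging. The sketch below reads two status variables that are not named above, `wsrep_flow_control_paused` and `wsrep_local_recv_queue_avg`. The thresholds are arbitrary illustrative values rather than recommendations, and the function name is hypothetical.

```python
from typing import Dict, List


def flow_control_warnings(
    status: Dict[str, str],
    max_paused: float = 0.05,      # illustrative threshold, tune per workload
    max_recv_queue: float = 10.0,  # illustrative threshold, tune per workload
) -> List[str]:
    """Flag signs that a node is throttling the cluster via Flow Control."""
    warnings = []
    # Fraction of time replication was paused since the counters were reset.
    paused = float(status.get("wsrep_flow_control_paused", 0))
    # Average length of the queue of write-sets waiting to be applied.
    recv_queue = float(status.get("wsrep_local_recv_queue_avg", 0))
    if paused > max_paused:
        warnings.append(f"replication paused {paused:.0%} of the time")
    if recv_queue > max_recv_queue:
        warnings.append(f"receive queue average is {recv_queue:.1f}")
    return warnings


slow_node = {
    "wsrep_flow_control_paused": "0.25",
    "wsrep_local_recv_queue_avg": "42.5",
}
warnings = flow_control_warnings(slow_node)
```

Running the check on every node identifies which member is the slow applier, which is usually the one that needs more applier threads or faster storage.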
Security Hardening:
Replication traffic should be encrypted using SSL/TLS by setting wsrep_provider_options="socket.ssl_key=…;socket.ssl_cert=…;". Firewall rules must be strictly defined; use iptables or firewalld to restrict access to ports 4567, 4568, and 4444 to cluster member IPs only. This reduces the attack surface and prevents unauthorized state injection.
Scaling Logic:
When expanding the cluster, account for the thermal load in the server room, as peak replication loads can spike CPU temperatures. Use ProxySQL or MaxScale to distribute client connections across the nodes. This allows for horizontal scaling while masking node maintenance or troubleshooting activities from the application layer.
THE ADMIN DESK
How do I fix a “Node not safe to bootstrap” error?
Locate the node with the highest seqno in /var/lib/mysql/grastate.dat. Set safe_to_bootstrap: 1 in that file, then execute galera_new_cluster on that node. This forces the node to become the primary component.
What causes frequent cluster “Flow Control” pauses?
This is typically caused by disk I/O bottlenecks or network latency. Check the wsrep_local_recv_queue; if it is high, the node is failing to apply writes. Upgrade storage to NVMe or increase wsrep_slave_threads to improve throughput.
Why is my SST (State Snapshot Transfer) failing?
SST failures are usually due to firewall blocks on port 4444 or incorrect sstuser credentials. Verify the donor and joiner can communicate over port 4444 and that the mariabackup or rsync package is installed on both nodes.
Can I run a Galera cluster with only two nodes?
It is not recommended due to “split-brain” risks. In a two-node setup, a network failure leaves both nodes unable to determine who holds the quorum. Always use a third node or a “Galera Arbitrator” (garbd) to maintain a majority.
How does network packet-loss affect the cluster?
Even 1 percent packet-loss can trigger repeated membership changes and node evictions. Use tc -s qdisc show to inspect per-interface drop counters, alongside tools such as ping or mtr to measure loss between nodes. In consistently high-loss environments, increase evs.suspect_timeout and evs.inactive_timeout in wsrep_provider_options to give nodes more time before they are declared dead.
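Because wsrep_provider_options is a single semicolon-separated string, raising the EVS timeouts means editing it without losing the options already set. The sketch below merges overrides into an existing value. The function name is hypothetical, and the durations shown (ISO 8601 style, as the provider expects) are illustrative rather than recommended settings.

```python
from typing import Dict


def merge_provider_options(current: str, overrides: Dict[str, str]) -> str:
    """Merge key=value overrides into a wsrep_provider_options string,
    keeping existing options and their order."""
    options: Dict[str, str] = {}
    for item in current.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate trailing or doubled semicolons
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    options.update(overrides)
    return ";".join(f"{key}={value}" for key, value in options.items())


# Illustrative values only; choose timeouts that match the network.
new_options = merge_provider_options(
    "gcache.size=1G;evs.suspect_timeout=PT5S;",
    {"evs.suspect_timeout": "PT10S", "evs.inactive_timeout": "PT30S"},
)
```

The resulting string can be placed in the Galera configuration file as the new wsrep_provider_options value on every node, so that all members use the same timeouts.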



