PostgreSQL Streaming Replication

Building a Real Time Replica for Your PostgreSQL Data

PostgreSQL streaming replication is a cornerstone of high availability and disaster recovery for critical digital infrastructure. In industrial sectors such as energy grid management or municipal water distribution, data persistence is not merely a software requirement but a safety mandate. The primary challenge in these environments is preventing data loss during hardware failures or network partitions. Streaming replication addresses this by continuously transferring Write-Ahead Log (WAL) records from a primary server to one or more standby replicas, so that each standby remains a transactionally consistent copy of the primary with minimal delay. It also removes a single point of failure and enables read scaling: the primary can focus on write-heavy ingestion while replicas serve read-only analytical queries. This guide walks through the steps required to deploy a resilient replication cluster.

TECHNICAL SPECIFICATIONS

| Requirement | Default Port/Operating Range | Protocol/Standard | Impact Level (1-10) | Recommended Resources |
| :--- | :--- | :--- | :--- | :--- |
| PostgreSQL 12 to 16 | 5432 (TCP) | TCP/IP / WAL | 10 | 16GB RAM / 8 vCPU |
| Network Throughput | 1 Gbps Minimum | IEEE 802.3ab | 8 | Low-latency Fiber |
| Disk I/O (IOPS) | 5000+ IOPS | NVMe/SSD | 9 | RAID 10 Array |
| Kernel Version | Linux 5.x or higher | POSIX | 7 | 64-bit Architecture |
| Secure Access | Port 22 (SSH) | OpenSSH 8.0+ | 6 | RSA 4096-bit Keys |

THE CONFIGURATION PROTOCOL

Environment Prerequisites:

Deployment requires two separate Linux instances (primary and standby) with clocks synchronized via chrony or ntp. PostgreSQL must be installed on both nodes, and both must run the same major version, because streaming replication ships binary WAL that is not compatible across major versions. Administrative access via sudo or the root user is required, but database operations should be executed as the postgres system user. Ensure that the firewall lets the standby reach the primary on the database port (5432 by default) and that the link between the nodes is stable enough to keep the replication connection alive.

Section A: Implementation Logic:

PostgreSQL streaming replication is built on Write-Ahead Logging (WAL). Every change to the database is recorded in the log before it is applied to the data files. In a streaming setup, the primary acts as a log server and the standby acts as a consumer. The design is resumable: if the standby loses its connection, it reconnects and continues from the last Log Sequence Number (LSN) it received, provided the primary still retains that WAL. Because the primary streams WAL records over a persistent connection as they are generated, rather than waiting for completed segment files to be shipped, the gap between primary and replica state stays small.
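The resume behaviour described above can be sketched with a toy model. This is only an illustration, not how PostgreSQL is implemented: the `ToyPrimary` and `ToyStandby` classes are hypothetical, the "WAL" is a plain byte buffer, and a position in that buffer stands in for an LSN.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPrimary:
    """Hypothetical stand-in for a primary: the log is just a byte buffer."""
    wal: bytearray = field(default_factory=bytearray)

    def write(self, record: bytes) -> int:
        """Append a record and return the new end-of-log position."""
        self.wal.extend(record)
        return len(self.wal)

    def stream_from(self, position: int) -> bytes:
        """Return every byte logged at or after the given position."""
        return bytes(self.wal[position:])


@dataclass
class ToyStandby:
    """Hypothetical stand-in for a standby that remembers what it received."""
    received: bytearray = field(default_factory=bytearray)

    @property
    def last_received(self) -> int:
        return len(self.received)

    def catch_up(self, primary: ToyPrimary) -> int:
        """Request only the bytes not yet received; return how many arrived."""
        chunk = primary.stream_from(self.last_received)
        self.received.extend(chunk)
        return len(chunk)


primary, standby = ToyPrimary(), ToyStandby()
primary.write(b"INSERT-1;")
standby.catch_up(primary)            # standby is level with the primary
primary.write(b"INSERT-2;")          # the link is down during these writes
primary.write(b"INSERT-3;")
resumed = standby.catch_up(primary)  # link restored: resume from last position
```

The standby never re-requests data it already holds. It asks only for what follows its last received position, which is why a brief outage does not require a new base backup as long as the primary still has the missing log.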

Step-By-Step Execution

1. Primary Node Connectivity Configuration

On the primary server, open postgresql.conf in the configuration directory (/etc/postgresql/15/main/ in this example) and set the following parameters:
listen_addresses = '*'
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10

System Note: Changes to listen_addresses, wal_level, max_wal_senders, and max_replication_slots take effect only after a server restart; a reload is not enough. Setting wal_level to replica ensures that the WAL contains enough information to support archiving and replication, at the cost of somewhat more WAL volume than the minimal level.
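Before restarting, it can help to sanity-check the file. The following is a minimal sketch, assuming simple `name = value` lines. The helper names `parse_conf` and `replication_problems` are hypothetical, and the fallback values in the code reflect the defaults of recent PostgreSQL releases as an assumption you should confirm for your version.

```python
def parse_conf(text):
    """Parse 'name = value' lines, ignoring comments and blank lines."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        name, _, value = line.partition("=")
        settings[name.strip().lower()] = value.strip().strip("'")
    return settings


def replication_problems(conf_text):
    """Return a list of human-readable problems; empty means none found."""
    settings = parse_conf(conf_text)
    problems = []
    # Assumed defaults: wal_level=replica, max_wal_senders=10,
    # listen_addresses=localhost.
    if settings.get("wal_level", "replica") not in ("replica", "logical"):
        problems.append("wal_level must be replica or logical")
    if int(settings.get("max_wal_senders", "10")) < 1:
        problems.append("max_wal_senders must be at least 1")
    if settings.get("listen_addresses", "localhost") == "localhost":
        problems.append("listen_addresses accepts local connections only")
    return problems


sample = """
listen_addresses = '*'
wal_level = replica        # enough detail for streaming
max_wal_senders = 10
max_replication_slots = 10
"""
print(replication_problems(sample))  # prints []
```

The parser is deliberately naive: it does not handle the `name value` form without an equals sign, include directives, or a `#` inside a quoted value.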

2. Authentication and Access Control

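Edit the pg_hba.conf file on the primary node to permit the standby server to connect for replication, then reload the configuration so the rule takes effect. Rules are evaluated from top to bottom and the first match wins, so make sure no earlier line already matches this connection. Add the following entry, where 192.168.1.50 is the standby's address: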
host replication replication_user 192.168.1.50/32 scram-sha-256

System Note: The pg_hba.conf file is the gatekeeper for client authentication. The /32 mask makes the rule match exactly one address, so this entry admits replication connections from the standby only and not from other hosts on adjacent network segments.
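To see why a /32 mask admits only one host, the address match can be reproduced with Python's standard `ipaddress` module. This is a sketch of the CIDR comparison only; the function name `rule_matches` is hypothetical, and it does not model the rest of pg_hba.conf evaluation (database, user, or rule ordering).

```python
import ipaddress


def rule_matches(client_ip, hba_address):
    """Return True if client_ip falls inside the CIDR range of an hba rule."""
    network = ipaddress.ip_network(hba_address, strict=False)
    return ipaddress.ip_address(client_ip) in network


print(rule_matches("192.168.1.50", "192.168.1.50/32"))  # prints True
print(rule_matches("192.168.1.51", "192.168.1.50/32"))  # prints False
print(rule_matches("192.168.1.51", "192.168.1.0/24"))   # prints True
```

The last call shows what a wider mask such as /24 would allow: any host in the subnet could attempt a replication connection, which is why the narrower rule is preferable here.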

3. Creation of the Replication Identity

Open psql as a superuser and create a dedicated replication role:
CREATE ROLE replication_user WITH REPLICATION LOGIN PASSWORD 'secure_password';

System Note: This command creates a role with the REPLICATION attribute, which allows it to open replication connections and stream WAL. The role receives no superuser status, which keeps the setup in line with the principle of least privilege.

4. Standby Node Data Sanitization

On the standby server, stop the PostgreSQL service and clear the existing data directory to allow for a fresh synchronization. This step is destructive, so confirm that you are on the standby before running it:
systemctl stop postgresql
rm -rf /var/lib/postgresql/15/main/

System Note: Stopping the service via systemctl ensures that no PostgreSQL process still holds files open in the data directory. Clearing the directory is necessary because pg_basebackup requires its target directory to be empty or nonexistent.

5. Executing the Base Backup

While still on the standby server, run the pg_basebackup utility as the postgres user to pull a full copy of the primary database:
pg_basebackup -h 192.168.1.10 -D /var/lib/postgresql/15/main/ -U replication_user -P -R

System Note: The -R flag is critical. It creates the standby.signal file and writes the primary's connection details (primary_conninfo) to postgresql.auto.conf. This removes a manual step that is easy to get wrong: without standby.signal, the restored server would start as an independent primary instead of following the original one.

6. Verification of Standby Readiness

Start the PostgreSQL service on the standby and check for the existence of the signal file:
systemctl start postgresql
ls /var/lib/postgresql/15/main/standby.signal

System Note: The presence of standby.signal tells PostgreSQL to start in standby mode. Instead of accepting writes, the server connects to the primary, continuously receives and replays WAL, and remains read-only.

Section B: Dependency Fault-Lines:

Replication most often fails because of version mismatches, unstable links, or WAL retention limits. Clock drift does not break WAL replay, which is ordered by LSN rather than wall-clock time, but it does distort time-based lag measurements, so keep the clocks synchronized. An unreliable network can cause the WAL sender process to time out and drop the connection (see wal_sender_timeout). Ensure that max_wal_senders on the primary exceeds the number of standbys that will connect, with headroom for base backups, or the primary will refuse new replication connections. Finally, if wal_keep_size (wal_keep_segments on PostgreSQL 12) is too small and no replication slot is in use, the primary may recycle log segments before the standby can fetch them, necessitating a full re-synchronization of the database.
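A rough way to size wal_keep_size is to estimate how long a standby can stay disconnected before the WAL it needs is recycled. The sketch below is back-of-the-envelope arithmetic under the assumption of a steady WAL generation rate; the function name and the example figures are hypothetical, not measurements.

```python
def outage_budget_seconds(wal_keep_size_mb, wal_rate_mb_per_s):
    """Estimate how long a standby may be offline before WAL is recycled.

    Assumes WAL is generated at a constant rate and that wal_keep_size is
    the only thing retaining it (no replication slot, no archive).
    """
    if wal_rate_mb_per_s <= 0:
        raise ValueError("WAL rate must be positive")
    return wal_keep_size_mb / wal_rate_mb_per_s


# Hypothetical figures: 1024 MB retained, 2 MB of WAL written per second.
print(outage_budget_seconds(1024, 2))  # prints 512.0
```

With these example numbers the standby has under nine minutes to reconnect. Real workloads are bursty, so the true margin during a bulk load would be smaller than this average suggests.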

THE TROUBLESHOOTING MATRIX

Section C: Logs & Debugging:

The primary tool for debugging replication is the PostgreSQL log file, typically located at /var/log/postgresql/postgresql-15-main.log. The log records specific error messages such as “requested WAL segment has already been removed,” which means that the primary recycled a WAL segment before the standby could fetch it.
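That error message names a WAL segment file, so it is useful to know how an LSN maps to a file name. The sketch below reproduces the naming arithmetic under the assumption of the default 16 MB segment size and timeline 1; the function name `wal_segment_name` is hypothetical. On a live server, the built-in `pg_walfile_name()` function is the authoritative source.

```python
def wal_segment_name(lsn, timeline=1, segment_size=16 * 1024 * 1024):
    """Map an LSN such as '16/B374D848' to its WAL segment file name.

    Assumes the default 16 MB segment size unless told otherwise.
    """
    high, low = lsn.split("/")
    position = (int(high, 16) << 32) | int(low, 16)   # LSN as a byte offset
    segment_number = position // segment_size
    segments_per_id = 0x100000000 // segment_size     # 256 for 16 MB segments
    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_id,
        segment_number % segments_per_id,
    )


print(wal_segment_name("0/3000060"))    # prints 000000010000000000000003
print(wal_segment_name("16/B374D848"))  # prints 0000000100000016000000B3
```

If the file named in the error is no longer present in the primary's pg_wal directory, and no WAL archive holds it, the standby cannot catch up and needs a new base backup.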

To verify the replication state in real time, execute this query on the primary:
SELECT * FROM pg_stat_replication;

If the result set is empty, the standby is not connected. Check for firewall blocks using tcpdump -i eth0 port 5432. If the state column shows “streaming” but replay_lsn is far behind sent_lsn, you are experiencing replication lag. This is often caused by disk I/O bottlenecks on the standby or insufficient network throughput. Inspect the standby's storage with iostat -x: a %util value close to 100 on the device indicates a physical limit on how fast the standby can write the incoming stream.
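The checks above can be expressed as a small function over rows fetched from pg_stat_replication. This is a sketch, not a monitoring tool: the rows are plain dictionaries that you would fill from your own database client, the function names are hypothetical, the 16 MB threshold is an arbitrary assumption, and the code assumes `sent_lsn` and `replay_lsn` are non-null.

```python
def lsn_to_int(lsn):
    """Convert an LSN such as '0/5000000' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def diagnose(rows, max_lag_bytes=16 * 1024 * 1024):
    """Return findings for a list of pg_stat_replication rows (as dicts)."""
    if not rows:
        return ["no standby connected"]
    findings = []
    for row in rows:
        name = row["application_name"]
        if row["state"] != "streaming":
            findings.append(f"{name}: state is {row['state']}")
            continue
        # Bytes sent by the primary but not yet replayed on the standby.
        lag = lsn_to_int(row["sent_lsn"]) - lsn_to_int(row["replay_lsn"])
        if lag > max_lag_bytes:
            findings.append(f"{name}: replay is {lag} bytes behind")
    return findings


rows = [{
    "application_name": "standby1",
    "state": "streaming",
    "sent_lsn": "0/5000000",
    "replay_lsn": "0/3000000",
}]
print(diagnose(rows))  # prints ['standby1: replay is 33554432 bytes behind']
```

An empty findings list means every connected standby is streaming and within the threshold. Pick a threshold that reflects how much stale data your read workloads can tolerate.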

OPTIMIZATION & HARDENING

– Performance Tuning: WAL replay on a standby is carried out by a single startup process, so max_worker_processes and max_parallel_workers in postgresql.conf influence parallel query execution on the replicas rather than replication throughput. Tune them for the analytical workload the replicas serve. Streaming replication is asynchronous by default, which means commits on the primary do not wait for the standby. Enable synchronous replication (synchronous_standby_names) only where the durability guarantee justifies the added commit latency.

– Security Hardening: Never use the database superuser for replication. Use TLS to encrypt replication traffic in transit and prevent packet sniffing. Apply iptables or nftables rules so that only specific internal IP addresses can connect to the primary node, isolating the replication traffic from the public-facing application layer.

– Scaling Logic: As the workload grows, consider a cascading replication model. In this setup, the primary streams to a “lead” standby, which in turn streams to several downstream standbys. This reduces the CPU and network overhead on the primary node, which then manages a single WAL sender process regardless of the number of final replicas.

THE ADMIN DESK

How do I check if my standby is in read-only mode?
Connect to the standby via psql and run SELECT pg_is_in_recovery();. If the result is t (true), the server is operating as a replica and will reject all direct write operations.

What happens if the primary server fails?
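You must manually promote the standby by running pg_ctl promote or by calling SELECT pg_promote(); from a SQL session. Older releases also supported a trigger file (promote_trigger_file), but that setting was removed in PostgreSQL 16. Once promoted, the standby becomes a primary and begins accepting write transactions. PostgreSQL does not fail over on its own, so promotion has to be triggered by an operator or by external tooling.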

Can I replicate between different major versions?
No. PostgreSQL streaming replication requires the same major version on both nodes (e.g., both must be v15). For cross-version migration, you must use logical replication, which operates on a per-table basis rather than the block-level WAL streaming used here.

How do I minimize replication lag on high-traffic nodes?
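Make sure your standby hardware matches the primary. Lag is most often a product of disk wait on the standby, and moving to NVMe storage can substantially improve the speed at which the replica applies WAL records. Note that wal_receiver_status_interval only controls how often the standby reports its progress to the primary. Raising it does not speed up replay and makes the figures in pg_stat_replication less current.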

Why is my pg_wal directory growing so fast?
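When a replication slot is in use, the primary retains WAL files until the standby that owns the slot confirms it has received them. If that standby is disconnected, the primary accumulates WAL files and can fill the disk. Query the pg_replication_slots view and look for slots that are inactive or have an old restart_lsn to identify stalled consumers. Without a slot, retention is governed by wal_keep_size instead.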
