Stuck Replication Slot
Issue
A PostgreSQL replication slot is stuck.
Environment
Connector: PostgreSQL
Recommendations
A PostgreSQL replication slot may be stuck due to one of the following reasons:
- The replication slot returned no items in the 24-hour timeout Fivetran defines after requesting the WAL stream.
- The size of the replication slot is in the hundreds of GB, approaching 1 TB.
- PostgreSQL cannot replay the WAL in memory and must write to disk before streaming the output to Fivetran. Learn more in AWS' Tuning memory parameters for Aurora PostgreSQL article. It's written for Aurora, but the concepts apply to all PostgreSQL-hosting platforms.
Set max_slot_wal_keep_size to a value that is large enough to handle any bulk updates done on the source database, but small enough that a problem replication slot will not bloat and exceed the size of the disk. See the PostgreSQLCO.NF site for a more detailed description. We recommend a minimum of two days' worth of retention. See the section below on tuning max_wal_size for how to measure the amount of WAL generated in a given time.
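For example, on PostgreSQL 13 or later (where max_slot_wal_keep_size is available), you can set the limit and then monitor how much WAL each slot is retaining. The 100 GB value below is only an illustration; size it for your own WAL volume and disk capacity:
ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';  -- example value, not a recommendation
SELECT pg_reload_conf();
-- See how much WAL each slot is currently holding back:
SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;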
logical_decoding_work_mem specifies the maximum amount of memory to be used by logical decoding. Set this value high enough to allow PostgreSQL to hold the LSN range requested by Fivetran in memory. The default is 64 MB, which is insufficient for high-transaction production databases. Setting it to 4 GB or higher helps ensure that PostgreSQL can replay transactions in the requested LSN range in memory.
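As an illustration, assuming a superuser session on PostgreSQL 13 or later (where the parameter is available), the setting can be raised and applied with a configuration reload:
ALTER SYSTEM SET logical_decoding_work_mem = '4GB';  -- 4 GB or higher, per the guidance above
SELECT pg_reload_conf();
SHOW logical_decoding_work_mem;  -- confirm the new setting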
Check if the bgwriter (background writer) process is healthy by inspecting its stats. Execute the following query:
SELECT * FROM pg_stat_bgwriter;
If the maxwritten_clean value is excessively high for your environment, increase the bgwriter_lru_maxpages parameter. In a well-tuned environment, the maxwritten_clean value should be 0.
If the buffers_backend value is larger than the buffers_clean value, increase the bgwriter_lru_multiplier parameter and decrease the bgwriter_delay parameter.
NOTE: The above condition may also indicate insufficient shared buffers, forcing the hot part of your data to travel between RAM and disk.
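The values below are illustrative starting points only; the right numbers depend on your workload, and all three parameters take effect after a configuration reload:
ALTER SYSTEM SET bgwriter_lru_maxpages = 500;    -- default is 100; raise it if maxwritten_clean keeps growing
ALTER SYSTEM SET bgwriter_lru_multiplier = 4.0;  -- default is 2.0
ALTER SYSTEM SET bgwriter_delay = '100ms';       -- default is 200ms
SELECT pg_reload_conf();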
The buffers_backend_fsync value indicates whether backends were forced to make their own fsync requests to synchronize the buffers with storage. A value above 0 points to problems with the storage, as it means the fsync queue was completely filled.
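On PostgreSQL 16 and earlier, where these counters live in pg_stat_bgwriter, a query like the following pulls out just the columns discussed above:
SELECT maxwritten_clean,
       buffers_clean,
       buffers_backend,
       buffers_backend_fsync
FROM pg_stat_bgwriter;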
If you change the values of any of the above parameters, reset the bgwriter stats. Execute the following query:
SELECT pg_stat_reset_shared('bgwriter');
Check the stats again the next day.
Check if the max_wal_size value is high enough that it is rarely reached within your defined checkpoint_timeout window.
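One way to check this (on PostgreSQL 16 and earlier, where checkpoint counters are exposed in pg_stat_bgwriter) is to compare requested checkpoints with timed ones; if checkpoints_req dominates, max_wal_size is usually being reached before checkpoint_timeout elapses:
SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;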
For PostgreSQL versions 9.6 and below, do the following:
1. Get LSN1. Execute:
postgres=# SELECT pg_current_xlog_insert_location();
2. Wait for the length of the checkpoint_timeout value.
3. Get LSN2. Execute:
postgres=# SELECT pg_current_xlog_insert_location();
4. Get the total amount of WAL written during the checkpoint period. Execute:
postgres=# SELECT pg_xlog_location_diff('LSN2_VALUE', 'LSN1_VALUE');
5. Multiply the result from Step 4 by three.
6. Collect a few data points. Repeat Steps 1 to 5 during a period of heavy database activity.
7. Update max_wal_size to the result of Step 5.
For PostgreSQL versions 10 and above, do the following:
1. Get LSN1. Execute:
postgres=# SELECT pg_current_wal_insert_lsn();
2. Wait for the length of the checkpoint_timeout value.
3. Get LSN2. Execute:
postgres=# SELECT pg_current_wal_insert_lsn();
4. Get the total amount of WAL written during the checkpoint period. Execute:
postgres=# SELECT pg_wal_lsn_diff('LSN2_VALUE', 'LSN1_VALUE');
5. Multiply the result from Step 4 by three (see the worked example after this list).
6. Collect a few data points. Repeat Steps 1 to 5 during a period of heavy database activity.
7. Update max_wal_size to the result of Step 5.
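For instance, the LSN values below are placeholders captured in Steps 1 and 3; the difference can be converted into a readable size and tripled in a single query (on 9.6 and below, substitute pg_xlog_location_diff):
SELECT pg_size_pretty(pg_wal_lsn_diff('0/9A538000', '0/7B000000') * 3) AS suggested_max_wal_size;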
Increase wal_buffers. On very busy, high-core machines, it is useful to increase the wal_buffers value to as high as 128 MB. Consult your DBA before increasing this value because the appropriate buffer size depends on the database's hardware capacity. Fivetran recommends a minimum of 16 MB.
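As a sketch (128 MB matches the upper bound mentioned above; adjust it for your hardware), note that wal_buffers only takes effect after a server restart:
ALTER SYSTEM SET wal_buffers = '128MB';  -- example value; requires a PostgreSQL restart
-- After restarting, verify:
SHOW wal_buffers;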
If the above steps do not resolve the stuck replication slot, you must perform the following steps:
- Pause the connector.
- Drop the replication slot (see the example queries after this list).
- Wait for a checkpoint (dependent on the checkpoint_timeout value).
- Recreate the replication slot.
- Ensure replication is in place. Execute:
SELECT * FROM pg_logical_slot_peek_changes('slot_name', NULL, 1);
TIP: If the query fails, it indicates that replication is not in place.
- Go to the Setup tab of the connector, and click the Re-sync All Historical Data link.
- Unpause the connector.
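A minimal sketch of the drop and recreate steps above, assuming a logical slot named fivetran_pgoutput_slot that uses the pgoutput plugin; substitute the slot name and decoding plugin your connector is actually configured with:
-- Drop the stuck slot:
SELECT pg_drop_replication_slot('fivetran_pgoutput_slot');
-- After the next checkpoint, recreate the slot with the same name and plugin:
SELECT pg_create_logical_replication_slot('fivetran_pgoutput_slot', 'pgoutput');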