Stuck Replication Slot
Issue
A PostgreSQL replication slot is stuck.
Environment
Connector: PostgreSQL
Recommendations
A PostgreSQL replication slot may be stuck due to one of the following reasons:
- The replication slot returned no items in the 24-hour timeout Fivetran defines after requesting the WAL stream.
- The size of the replication slot is in the hundreds of GB, approaching 1 TB.
- PostgreSQL cannot replay the WAL in memory and must write to disk before streaming the output to Fivetran. Learn more in AWS' Tuning memory parameters for Aurora PostgreSQL article. It's written for Aurora, but the concepts apply to all PostgreSQL-hosting platforms.
Set max_slot_wal_keep_size to a value that is large enough to handle any bulk updates done on the source database, but small enough that a problem replication slot will not bloat and exceed the size of the disk. See the PostgreSQLCO.NF site for a more detailed description. We recommend a minimum of two days' worth of retention. See the section below on tuning max_wal_size for how to measure the amount of WAL generated in a given time.
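For example, on PostgreSQL 13 or later (where max_slot_wal_keep_size is available), you can set the limit and then monitor how much WAL each slot is retaining. The 100 GB value below is only an illustration; size it for your own WAL volume and disk capacity:
ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';  -- example value, not a recommendation
SELECT pg_reload_conf();
-- See how much WAL each slot is currently holding back:
SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;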
logical_decoding_work_mem specifies the maximum amount of memory to be used by logical decoding. Set this value high enough to allow PostgreSQL to hold the LSN range requested by Fivetran in memory. The default is 64 MB, which is insufficient for high-transaction production databases. Setting it to 4 GB or higher helps ensure that PostgreSQL can replay transactions in the requested LSN range in memory.
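As an illustration, assuming a superuser session on PostgreSQL 13 or later (where the parameter is available), the setting can be raised and applied with a configuration reload:
ALTER SYSTEM SET logical_decoding_work_mem = '4GB';  -- 4 GB or higher, per the guidance above
SELECT pg_reload_conf();
SHOW logical_decoding_work_mem;  -- confirm the new setting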
Check if the bgwriter (background writer) process is healthy by inspecting its stats. Execute the following query:
SELECT * FROM pg_stat_bgwriter;
If the maxwritten_clean value is excessively high for your environment, increase the bgwriter_lru_maxpages parameter. In a well-tuned environment, the maxwritten_clean value should be 0.
If the buffers_backend value is larger than the buffers_clean value, increase the bgwriter_lru_multiplier parameter and decrease the bgwriter_delay parameter.
NOTE: The above condition may also indicate insufficient shared buffers, forcing the hot part of your data to travel between RAM and disk.
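The values below are illustrative starting points only; the right numbers depend on your workload, and all three parameters take effect after a configuration reload:
ALTER SYSTEM SET bgwriter_lru_maxpages = 500;    -- default is 100; raise it if maxwritten_clean keeps growing
ALTER SYSTEM SET bgwriter_lru_multiplier = 4.0;  -- default is 2.0
ALTER SYSTEM SET bgwriter_delay = '100ms';       -- default is 200ms
SELECT pg_reload_conf();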
The buffers_backend_fsync value indicates whether backends were forced to make their own fsync requests to synchronize the buffers with storage. A value above 0 points to problems with the storage, as it means the fsync queue was completely filled.
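On PostgreSQL 16 and earlier, where these counters live in pg_stat_bgwriter, a query like the following pulls out just the columns discussed above:
SELECT maxwritten_clean,
       buffers_clean,
       buffers_backend,
       buffers_backend_fsync
FROM pg_stat_bgwriter;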
If you change the values of any of the above parameters, reset the bgwriter stats. Execute the following query:
SELECT pg_stat_reset_shared('bgwriter');
Check the stats again the next day.
Check if the max_wal_size value is high enough that it is rarely reached within your defined checkpoint_timeout window.
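One way to check this (on PostgreSQL 16 and earlier, where checkpoint counters are exposed in pg_stat_bgwriter) is to compare requested checkpoints with timed ones; if checkpoints_req dominates, max_wal_size is usually being reached before checkpoint_timeout elapses:
SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;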
For PostgreSQL versions 9.6 and below, do the following:
1. Get LSN1. Execute:
postgres=# SELECT pg_current_xlog_insert_location();
2. Wait for the length of the checkpoint_timeout value.
3. Get LSN2. Execute:
postgres=# SELECT pg_current_xlog_insert_location();
4. Get the total amount of WAL written during the checkpoint period. Execute:
postgres=# SELECT pg_xlog_location_diff('LSN2_VALUE', 'LSN1_VALUE');
5. Multiply the result from Step 4 by three.
6. Collect a few data points. Repeat Steps 1 to 5 during a period of heavy database activity.
7. Update max_wal_size to the result of Step 5.
For PostgreSQL versions 10 and above, do the following:
1. Get LSN1. Execute:
postgres=# SELECT pg_current_wal_insert_lsn();
2. Wait for the length of the checkpoint_timeout value.
3. Get LSN2. Execute:
postgres=# SELECT pg_current_wal_insert_lsn();
4. Get the total amount of WAL written during the checkpoint period. Execute:
postgres=# SELECT pg_wal_lsn_diff('LSN2_VALUE', 'LSN1_VALUE');
5. Multiply the result from Step 4 by three (see the worked example after this list).
6. Collect a few data points. Repeat Steps 1 to 5 during a period of heavy database activity.
7. Update max_wal_size to the result of Step 5.
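For instance, the LSN values below are placeholders captured in Steps 1 and 3; the difference can be converted into a readable size and tripled in a single query (on 9.6 and below, substitute pg_xlog_location_diff):
SELECT pg_size_pretty(pg_wal_lsn_diff('0/9A538000', '0/7B000000') * 3) AS suggested_max_wal_size;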
Increase wal_buffers. On very busy, high-core machines, it is useful to increase the wal_buffers value to as high as 128 MB. Consult your DBA before increasing this value because the appropriate buffer size depends on the database's hardware capacity. Fivetran recommends a minimum of 16 MB.
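As a sketch (128 MB matches the upper bound mentioned above; adjust it for your hardware), note that wal_buffers only takes effect after a server restart:
ALTER SYSTEM SET wal_buffers = '128MB';  -- example value; requires a PostgreSQL restart
-- After restarting, verify:
SHOW wal_buffers;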
If the above steps do not resolve the stuck replication slot, you must perform the following steps:
- Pause the connector.
- Drop the replication slot (see the example queries after this list).
- Wait for a checkpoint (dependent on the checkpoint_timeout value).
- Recreate the replication slot.
- Ensure replication is in place. Execute:
SELECT * FROM pg_logical_slot_peek_changes('slot_name', NULL, 1);
TIP: If the query fails, it indicates that replication is not in place.
- Go to the Setup tab of the connector, and click the Re-sync All Historical Data link.
- Unpause the connector.
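A minimal sketch of the drop and recreate steps above, assuming a logical slot named fivetran_pgoutput_slot that uses the pgoutput plugin; substitute the slot name and decoding plugin your connector is actually configured with:
-- Drop the stuck slot:
SELECT pg_drop_replication_slot('fivetran_pgoutput_slot');
-- After the next checkpoint, recreate the slot with the same name and plugin:
SELECT pg_create_logical_replication_slot('fivetran_pgoutput_slot', 'pgoutput');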