You can’t see earlier messages: You don’t have permission to view messages from before you joined.
Fortunately, I’ve been in the #osmf-operations channel the entire time, so I tried to understand the situation by reviewing the chat logs. Here’s what I found (please correct me if I’m wrong anywhere).
On June 26th, 2025, the OpenStreetMap planet database’s replication pipeline came to an unexpected halt. At around 21:32 UTC, replication stopped because the PostgreSQL cluster hit a hard limit while processing its logical replication stream. The system encountered a record in the replication data that exceeded PostgreSQL’s built-in 1 GB per-field size limit, which caused the pg_logical_slot_peek_changes function to fail each time it attempted to decode new changes.
Logical replication streams row-level changes (INSERTs, UPDATEs, DELETEs) from the main database to downstream consumers using replication slots. When a single record or transaction becomes unusually large, the decoder must hold that data in memory to rebuild the changes — but PostgreSQL enforces a hard cap of 1 GB for any single field. Once this limit is exceeded, replication fails. In this incident, manual attempts to push the backlog forward worked for a while but consistently failed when they reached the oversized record.
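To make that failure mode concrete, here is a minimal Python sketch of how an operator might peek at a stuck slot without consuming anything from it. The connection string and slot name (osm_repl) are placeholders I made up, and the exact option arguments depend on the output plugin the slot was created with.

```python
# Sketch: peek at pending changes on a logical replication slot without
# consuming them. Connection string and slot name ("osm_repl") are made up
# for illustration; options depend on the slot's output plugin.
import psycopg2

conn = psycopg2.connect("dbname=openstreetmap")
try:
    with conn.cursor() as cur:
        # Ask the decoder for at most ten pending changes. When an oversized
        # record sits at the head of the backlog, this call errors out
        # instead of returning rows -- the same failure the ops team saw.
        cur.execute(
            "SELECT lsn, xid, left(data, 200) "
            "FROM pg_logical_slot_peek_changes(%s, NULL, 10)",
            ("osm_repl",),
        )
        for lsn, xid, preview in cur.fetchall():
            print(lsn, xid, preview)
except psycopg2.Error as exc:
    # e.g. an "invalid memory alloc request size" error once the limit is hit
    print("decoding failed:", exc)
finally:
    conn.close()
```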
A closer look revealed that this is not an isolated problem. Another PostgreSQL user reported a similar issue on StackExchange: logical replication crashed with an “out of memory” error inside the ReorderBuffer — the internal structure that holds decoded changes before they are sent downstream. Their logs showed a huge allocation request:
ERROR: out of memory
DETAIL: Failed on request of size 261265456 in memory context "ReorderBuffer".
Both issues share the same root cause: logical replication depends on storing row-level changes and metadata in memory. Unexpectedly large rows or runaway metadata can easily push the system over its built-in memory limits. Some users tried increasing wal_decode_buffer_size — which controls how much memory PostgreSQL uses to decode WAL data — from the default 512KB to 512MB. But this sometimes made things worse, triggering errors like “invalid memory alloc request size 1.4GB” instead.
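As a side note, PostgreSQL 14 and later expose per-slot decoder statistics in the pg_stat_replication_slots view, which is one way to watch how much data logical decoding is chewing through before it hits a wall. A quick sketch, with the connection string assumed:

```python
# Sketch: inspect per-slot logical decoding statistics (PostgreSQL 14+).
# Connection string is an assumption; run against the cluster hosting the slot.
import psycopg2

conn = psycopg2.connect("dbname=openstreetmap")
with conn.cursor() as cur:
    cur.execute("""
        SELECT slot_name,
               total_txns,                          -- transactions decoded for this slot
               pg_size_pretty(total_bytes) AS decoded,
               spill_txns,                          -- transactions spilled to disk
               pg_size_pretty(spill_bytes) AS spilled
        FROM pg_stat_replication_slots
    """)
    for row in cur.fetchall():
        print(row)
conn.close()
```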
So what made this more likely to happen now?
The suspected culprit is a recent patch, backported to PostgreSQL 15.13 and deployed on May 30, that fixed a longstanding bug which could silently lose changes in logical replication. The original issue occurred when certain DDL operations — like ALTER PUBLICATION or ALTER TYPE — modified the system catalog without properly invalidating cached metadata used by concurrent transactions. Without that invalidation, logical decoding could reuse stale metadata and skip changes that should have been replicated. The patch improved correctness by broadcasting invalidation messages to all in-progress transactions so they refresh their caches properly.
However, the fix came with an unintended side effect. Each transaction now needed to track more invalidation messages in memory. More critically, a flaw in the patch caused transactions to redistribute not just their own invalidation messages but also those they had received from others. This created an exponential feedback loop: invalidation messages multiplied rapidly across transactions, ballooning memory usage in the ReorderBuffer and triggering enormous allocations or records that exceeded the 1 GB per-field limit.
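To get a feel for why re-forwarding received messages blows up, here is a toy Python model of the feedback loop as I understand it from the description above. It is purely illustrative, not PostgreSQL code.

```python
# Toy model of the invalidation-message feedback loop -- an illustration
# only, not PostgreSQL internals. Each committing transaction sends its own
# messages to every still-running transaction, and (the buggy part) also
# re-forwards everything it received earlier from peers.
def simulate_buggy(n_txns, own_msgs=10):
    received = [0] * n_txns                          # messages queued per transaction
    for committer in range(n_txns):
        outgoing = own_msgs + received[committer]    # re-forward received messages too
        for other in range(committer + 1, n_txns):
            received[other] += outgoing
    return max(received)

for n in (5, 10, 20, 30):
    print(f"{n} concurrent txns -> peak queue of {simulate_buggy(n)} messages")
```

Even with only ten messages per transaction, the peak queue roughly doubles with every additional concurrent transaction, which is exactly the kind of growth that ends in gigabyte-sized allocations.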
PostgreSQL developers addressed this in a follow-up commit (d87d07b7ad3b7). That fix stops transactions from redistributing invalidation messages they received from peers, breaking the exponential growth loop. It also adds a safeguard that caps the total size of distributed invalidation messages per transaction at 8MB; if that limit is exceeded, PostgreSQL invalidates all caches outright instead of hoarding metadata indefinitely. This keeps logical decoding within safe memory bounds and prevents the runaway growth that causes exactly this kind of failure. According to the release schedule, however, the next minor release containing this fix is not due until August 14th, 2025.
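Continuing the toy model, here is a sketch of the fixed behaviour: transactions forward only their own messages, and a per-transaction budget (standing in for the real 8MB cap) falls back to invalidating everything instead of queueing yet more metadata. Again, an illustration of the idea, not the actual patch.

```python
# Toy continuation: the fixed behaviour. Transactions forward only their own
# invalidation messages, and a per-transaction budget (standing in for the
# real 8MB cap) triggers a blanket "invalidate all caches" instead of
# queueing more metadata. Illustration only, not PostgreSQL internals.
CAP = 100  # arbitrary budget standing in for the 8MB limit

def simulate_fixed(n_txns, own_msgs=10):
    received = [0] * n_txns        # queued messages per transaction
    full_reset = [False] * n_txns  # transactions told to drop all caches
    for committer in range(n_txns):
        for other in range(committer + 1, n_txns):
            if full_reset[other]:
                continue                          # already doing a blanket reset
            if received[other] + own_msgs > CAP:
                full_reset[other] = True          # over budget: reset everything
            else:
                received[other] += own_msgs       # forward own messages only
    return max(received), sum(full_reset)

for n in (5, 10, 20, 30):
    peak, resets = simulate_fixed(n)
    print(f"{n} txns -> peak queue {peak}, blanket cache resets {resets}")
```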
"did you see joto’s latest in #osm-dev? seems like we can try that but will need to take the site readonly…
“that works for me”
After logical replication crashed on the oversized record, the OSM operations team needed to get the database and replication system back to a consistent state without corrupting the downstream minutely diffs or accepting edits that could later be lost. To do this, they performed a controlled recovery process.
First, they temporarily put the site into read-only mode, stopping new edits while the fix was applied. This ensured no further changes would land in the database while they rebuilt the replication state. The staff then created a fresh full database dump (osm-2025-06-23.dmp) and used internal tools to process it on another server, generating a clean snapshot of the database contents as the known-good starting point.
Next, they generated “fake logs” and carefully edited the first two to remove any duplicate data that might have been introduced during the replication gap.
When replication breaks mid-stream (like when the logical replication slot hits the 1 GB limit and crashes), the system ends up in a partially replayed state. Some changes might already exist in the primary database, but the replication logs that downstream consumers rely on (like minutely diffs) may not have fully recorded them — or worse, may have partially recorded them before the crash. When the OSM team rebuilt the missing diffs from a fresh full database dump, they effectively re-exported all the current data in the database, which included edits that might have already appeared in earlier logs. So the “fake logs” were generated to fill the replication gap safely, but the first parts of these logs needed to be manually edited to remove any overlapping data that had already been published before the crash. This step ensures there’s exactly one clean, continuous history of changes — no gaps, no missing edits, but also no duplicates that would break the downstream consumers’ state.
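I have not seen the actual scripts they used, but the dedup idea can be sketched against the public osmChange XML format: collect the element versions already published in the last good diffs, then strip those from the rebuilt logs. The file names below are placeholders I chose for illustration.

```python
# Sketch of the dedup idea, not the actual OSM tooling: drop from a rebuilt
# osmChange file any element version that was already published in the last
# good diffs before the crash. File names are made up for illustration.
import gzip
import xml.etree.ElementTree as ET

def published_versions(paths):
    """Collect (element type, id, version) triples from already-published diffs."""
    seen = set()
    for path in paths:
        with gzip.open(path) as f:
            root = ET.parse(f).getroot()
        for action in root:                      # <create>/<modify>/<delete>
            for elem in action:                  # <node>/<way>/<relation>
                seen.add((elem.tag, elem.get("id"), elem.get("version")))
    return seen

def strip_duplicates(fake_log_path, seen, out_path):
    with gzip.open(fake_log_path) as f:
        tree = ET.parse(f)
    for action in tree.getroot():
        for elem in list(action):
            if (elem.tag, elem.get("id"), elem.get("version")) in seen:
                action.remove(elem)              # already published before the crash
    with gzip.open(out_path, "wb") as f:
        tree.write(f)

# Assumed: 664/665 were the last diffs published before the crash.
seen = published_versions(["664.osc.gz", "665.osc.gz"])
strip_duplicates("fake-666.osc.gz", seen, "666.osc.gz")
```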
These cleaned logs were then synced to S3 storage to make them available to other systems that consume the diffs. After that, they rebuilt and published the missing minutely diffs, pushing them sequentially (diff 666 up through 674) until the replication history was complete and up to date on S3.
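The publishing step amounts to uploading the rebuilt files in sequence. A rough boto3 sketch, with a made-up bucket name and key layout (the real pipeline uses OSM's own tooling and naming scheme):

```python
# Sketch of publishing the rebuilt diffs in order. Bucket name and key layout
# are invented for illustration; the real pipeline differs.
import boto3

s3 = boto3.client("s3")
BUCKET = "osm-replication-example"      # hypothetical bucket

for seq in range(666, 675):             # diffs 666 through 674
    filename = f"{seq}.osc.gz"
    key = f"replication/minute/{filename}"
    s3.upload_file(filename, BUCKET, key)
    print("published", key)
```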
Once all the repaired diffs were published, the team re-enabled the replication system and ran a test replication cycle to confirm there were no errors. Seeing no issues, they lifted the read-only restriction and reopened the database for edits. They closely watched the system as the first new diff appeared (675.osc.gz), confirming that minutely replication and downstream tile rendering were catching up properly.
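Watching for that first post-recovery diff is something you could script as well. A tiny sketch with a placeholder URL, since the real minutely replication endpoint uses a different directory layout:

```python
# Sketch: poll until the first post-recovery diff shows up. The URL is a
# placeholder, not the real replication endpoint.
import time
import urllib.error
import urllib.request

URL = "https://example.org/replication/minute/675.osc.gz"   # placeholder

while True:
    try:
        req = urllib.request.Request(URL, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            print("diff 675 is live, status", resp.status)
            break
    except urllib.error.HTTPError as exc:
        if exc.code != 404:
            raise
        time.sleep(30)   # not there yet, check again shortly
```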
The OSM team’s approach of making the database read-only, taking a fresh full dump, and generating “fake logs” serves as an effective workaround for the PostgreSQL logical replication bug because it sidesteps the fragile part of the system: the stuck replication slot. Logical replication works by streaming row-level changes in order, and if any record becomes too large or the in-memory decoding metadata grows excessively — as happened here due to the invalidation message bug — the replication slot gets stuck and cannot continue processing.
By taking the site read-only, the team ensures there are no new edits while they rebuild. Then, by creating a clean, consistent dump of the entire database, they capture a known-good snapshot that includes all changes, even those that may have partially gone through the broken slot. Rebuilding the missing diffs from this dump allows the team to recreate the full replication history without relying on the corrupted slot. Editing the first few logs removes any duplicate data that might have already been published before the crash, which avoids double-applying changes downstream.
Once the replacement diffs are published, the stuck slot can safely be discarded and replication can restart from a clean state, free of oversized transactions that would trigger the same bug.
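In PostgreSQL terms, that last step boils down to dropping the wedged slot and creating a fresh one so decoding resumes from a clean point. A sketch, with the slot name and output plugin as assumptions (OSM's production setup uses its own plugin and naming):

```python
# Sketch: discard the stuck slot and start a fresh one. Slot name
# ("osm_repl") and plugin ("test_decoding") are assumptions for illustration.
import psycopg2

conn = psycopg2.connect("dbname=openstreetmap")
conn.autocommit = True
with conn.cursor() as cur:
    # Drop the slot that is wedged on the oversized record...
    cur.execute("SELECT pg_drop_replication_slot(%s)", ("osm_repl",))
    # ...and create a new logical slot starting at the current WAL position.
    cur.execute(
        "SELECT * FROM pg_create_logical_replication_slot(%s, %s)",
        ("osm_repl", "test_decoding"),
    )
    print(cur.fetchone())   # (slot name, consistent point LSN)
conn.close()
```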