Is OSM partially down? New data can't be queried on the map and isn't being rendered

I’m finding that I can pull data into JOSM and iD that I know exists (due to recently pushed edits), but a query on the map doesn’t list it. It also isn’t being rendered; however, in my experience, features that are merely awaiting rendering can still be queried.

For example, this query should list building=toilets, but it doesn’t. If I edit that location in iD I can see the aforementioned feature, and I can also pull it into JOSM, but it isn’t being rendered and doesn’t show up in a map query.
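For anyone who wants to reproduce the comparison, here is a rough Python sketch of what I mean: ask the editing API and a public Overpass instance for the same object and compare version numbers. The node ID is a placeholder (not the feature above), and overpass-api.de is just one public instance, not necessarily the one osm.org’s query tool uses.

```python
# Rough sketch: compare one element's version as seen by the OSM editing API
# vs. an Overpass instance. If Overpass lags behind, the versions differ.
# NODE_ID is a placeholder; substitute the element you are checking.
import json
import urllib.parse
import urllib.request

NODE_ID = 123456789  # placeholder ID

# Current version straight from the editing API (what JOSM/iD see)
with urllib.request.urlopen(
    f"https://api.openstreetmap.org/api/0.6/node/{NODE_ID}.json"
) as resp:
    api_version = json.load(resp)["elements"][0]["version"]

# Version as known to a public Overpass instance (what map queries use)
query = f"[out:json];node({NODE_ID});out meta;"
req = urllib.request.Request(
    "https://overpass-api.de/api/interpreter",
    data=urllib.parse.urlencode({"data": query}).encode(),
)
with urllib.request.urlopen(req) as resp:
    elements = json.load(resp)["elements"]
    overpass_version = elements[0]["version"] if elements else None

print(f"editing API: v{api_version}, Overpass: v{overpass_version}")
if api_version != overpass_version:
    print("Overpass has not caught up with the latest edit yet.")
```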

Update: I can see the edits in the history, but I still can’t query them or see them rendered.

2 Likes

Same problem here :frowning:

1 Like

Data replication is broken (see Grafana).

The Overpass API, tile rendering, OSMCha, and other tools depend on it. The map’s /query feature also uses the Overpass API under the hood.

2 Likes

I don’t get it. Were all my changes deleted? Or are my latest changes in the database but not accessible?

2 Likes

No, just updates to third-party sites and internal services (rendering, Nominatim and the internal Overpass instance) are not working (assuming it is just replication that is broken).

Yes, and a qualified no: you should be able to access them via the editing API, which, as the name says, is used by editors.

PS: we’re likely 2-3 hours away from operations being able to do anything.

Hmm, it seems like this is no longer the case. I pushed a change, and I can’t see it reflected in iD: is_in is still present in this query, and it’s still visible in iD. Per the aforementioned changeset, it should not be.

Well, the changeset seems to be fine, so maybe changeset ingestion is broken? If so, it might mean that the edits will be lost if they’re all based on stale data. I have no idea if changesets include the old state (I guess they somehow do, since conflicts are detected?). Maybe they can replay some changesets, but once you start having conflicts, I’m not sure which options there are…

Speculation here: if the replication (nothing to do with the replication via diff files) between the database instances is broken, the read-only database(s) will be out of sync*. IIRC this can be a problem for iD, as it doesn’t update the loaded data in place after an upload but discards the current data and reloads it, potentially retrieving said out-of-sync data from the read-only DB.

* However, Grafana would seem to indicate that that is fine.

The updating of Carto Standard, the OSM site’s reference rendering, does indeed seem sloooow. Fortunately, with the Better-OSM-Org browser addon you get a visualization of what was uploaded, so at least you know.
This one was taken about 7 minutes ago, after another Ctrl+F5 in the Windows browser.

The different zoom levels update at a different pace, but that has always been the case. This one, after 10 minutes at the 50 m zoom level, still hasn’t updated despite the forced view refresh.

Anyway, Better-OSM-Org demonstrates it’s all in.

But way/959011955 was not changed by your changeset. If you download osmChange XML, it will not contain such an element.

Judging by the fact that you have exactly 1000 objects in the changeset, did you happen to upload the changes in chunks? Maybe the upload was interrupted?

Confirmation of the issue: OpenStreetMap planet diffs are not currently updating.

The Ops team are trying to get this resolved. The issue is due to an unexpected problem with PostgreSQL and how osmdbt pulls the changes from the database.

The PostgreSQL database is operating normally and all map changes are being saved in the database. Our PostgreSQL site-to-site replication remains working as expected.

18 Likes

OpenStreetMap.org is expected to be offline briefly for maintenance later. We have to stop some systems for a short time to get the OpenStreetMap planet diffs (minutely / hourly / daily) back into sync with the database.

I will try my best to announce the maintenance window via the OpenStreetMap Ops Team account (@osm_tech@en.osm.town on OSM Town | Mapstodon for OpenStreetMap) and possibly here.

12 Likes

Progress is being made on restoring the publishing of planet diffs, but care is being taken to ensure that only the expected data is published and we don’t miss anything.

Background: planet diffs (minutely, hourly, daily) are used by many downstream services. E.g. the rendered map on OSM.org (tile.osm.org), Nominatim, Overpass (the query tool) and most other services use the OSM planet diffs to receive a live feed of changes from the main OSM database.
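For anyone curious what consuming those diffs looks like in practice, here is a minimal Python sketch of the pattern downstream services follow: read the published state file for the latest sequence number, map it to the replication directory path, then fetch the corresponding minutely .osc.gz. The URL layout is the standard planet.openstreetmap.org one; error handling and actually applying the diff are left out.

```python
# Minimal sketch of following the minutely replication diffs:
# 1) read state.txt for the newest sequence number,
# 2) turn that number into the AAA/BBB/CCC directory path,
# 3) download and inspect the .osc.gz osmChange file.
import gzip
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://planet.openstreetmap.org/replication/minute"

def latest_sequence() -> int:
    with urllib.request.urlopen(f"{BASE}/state.txt") as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith("sequenceNumber="):
                return int(line.split("=", 1)[1])
    raise RuntimeError("no sequenceNumber in state.txt")

def diff_url(seq: int) -> str:
    s = f"{seq:09d}"                      # e.g. 6543210 -> "006543210"
    return f"{BASE}/{s[0:3]}/{s[3:6]}/{s[6:9]}.osc.gz"

seq = latest_sequence()
with urllib.request.urlopen(diff_url(seq)) as resp:
    osc = gzip.decompress(resp.read())

root = ET.fromstring(osc)                 # <osmChange> with create/modify/delete
counts = {"create": 0, "modify": 0, "delete": 0}
for action in root:
    if action.tag in counts:
        counts[action.tag] += len(list(action))
print(f"diff {seq}: {counts}")
```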

7 Likes

OpenStreetMap.org will be offline for the next hour due to maintenance while we fix an issue with OSM planet replication diff publishing. Sorry for the short notice. We will return services as soon as possible.

10 Likes

OpenStreetMap.org maintenance has been completed. OpenStreetMap planet replication diffs are being published again. Ops will continue to monitor.

26 Likes

The OSM tile rendering (tile.osm.org) & Overpass (the OSM query tool) have now caught up to current OSM edits (via diffs).

Nominatim is still catching up.

Sources:

  1. Grafana
  2. Grafana
  3. Grafana

Other services, including those hosted by others, may take longer to catch up.

11 Likes

Thank you for the fix.

I wonder if the true root cause has been identified yet. Is a postmortem report planned?

See “You’re invited to talk on Matrix”.

You can’t see earlier messages: “You don’t have permission to view messages from before you joined.”

Fortunately, I’ve been in the #osmf-operations channel the entire time, so I tried to understand the situation by reviewing the chat logs. Here’s what I found (please correct me if I’m wrong anywhere).


On June 26th, 2025, the OpenStreetMap planet database’s replication pipeline came to an unexpected halt. At around 21:32 UTC, replication stopped because the PostgreSQL cluster hit a hard limit while processing its logical replication stream. The system encountered a record in the replication data that exceeded PostgreSQL’s built-in 1 GB per-field size limit, which caused the pg_logical_slot_peek_changes function to fail each time it attempted to decode new changes.

Logical replication streams row-level changes (INSERTs, UPDATEs, DELETEs) from the main database to downstream consumers using replication slots. When a single record or transaction becomes unusually large, the decoder must hold that data in memory to rebuild the changes — but PostgreSQL enforces a hard cap of 1 GB for any single field. Once this limit is exceeded, replication fails. In this incident, manual attempts to push the backlog forward worked for a while but consistently failed when they reached the oversized record.
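As a rough illustration of the step that was failing (this is not osmdbt’s actual code; the DSN and slot name below are placeholders, and osmdbt uses its own output plugin and a C++ code path), peeking a logical replication slot from Python with psycopg2 looks roughly like this. When the decoder hits the oversized record, it is this kind of call that errors out every time it is retried:

```python
# Rough illustration of peeking a logical replication slot. Placeholder DSN
# and slot name; requires a role with REPLICATION privileges.
import psycopg2

conn = psycopg2.connect("dbname=openstreetmap")  # placeholder DSN
conn.autocommit = True

with conn.cursor() as cur:
    # Peek (don't consume) up to 100 pending changes from the slot. If a
    # decoded record exceeds PostgreSQL's 1 GB per-field limit, this call
    # fails on every retry and replication stalls at that point.
    cur.execute(
        "SELECT lsn, xid, data "
        "FROM pg_logical_slot_peek_changes(%s, NULL, 100)",
        ("replication_slot_name",),  # placeholder slot name
    )
    for lsn, xid, data in cur.fetchall():
        print(lsn, xid, data[:80])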

A closer look revealed that this is not an isolated problem. Another PostgreSQL user reported a similar issue on StackExchange: logical replication crashed with an “out of memory” error inside the ReorderBuffer — the internal structure that holds decoded changes before they are sent downstream. Their logs showed a huge allocation request:

ERROR: out of memory
DETAIL: Failed on request of size 261,265,456 in memory context “ReorderBuffer”.

Both issues share the same root cause: logical replication depends on storing row-level changes and metadata in memory. Unexpectedly large rows or runaway metadata can easily push the system over its built-in memory limits. Some users tried increasing wal_decode_buffer_size — which controls how much memory PostgreSQL uses to decode WAL data — from the default 512KB to 512MB. But this sometimes made things worse, triggering errors like “invalid memory alloc request size 1.4GB” instead.
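If you want to see where your own server stands on these knobs, here is a small hedged sketch (again psycopg2 with a placeholder DSN, PostgreSQL 15+) that just reads the relevant settings: logical_decoding_work_mem, which is the threshold at which the ReorderBuffer spills decoded changes to disk, and the wal_decode_buffer_size setting mentioned above.

```python
# Hedged sketch: read the memory-related settings discussed above from a
# running server. Placeholder DSN; wal_decode_buffer_size needs PostgreSQL 15+.
import psycopg2

conn = psycopg2.connect("dbname=openstreetmap")  # placeholder DSN

with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT name, setting, unit, short_desc
        FROM pg_settings
        WHERE name IN ('logical_decoding_work_mem', 'wal_decode_buffer_size')
        """
    )
    for name, setting, unit, desc in cur.fetchall():
        print(f"{name} = {setting}{unit or ''}  -- {desc}")
```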

So what made this more likely to happen now?

A suspected recent patch, backported to PostgreSQL 15.13 and deployed on May 30, fixed a longstanding bug that could silently lose changes in logical replication. The original issue occurred when certain DDL operations — like ALTER PUBLICATION or ALTER TYPE — modified the system catalog without properly invalidating cache data used by concurrent transactions. Without this invalidation, logical decoding could reuse stale metadata and skip changes that should have been replicated. The patch improved correctness by broadcasting invalidation messages to all in-progress transactions so they refresh their caches properly. However, this fix came with an unintended side effect. Each transaction now needed to track more invalidation messages in memory. More critically, a flaw in that patch caused transactions to redistribute not just their own invalidation messages but also those they had received from others. This created an exponential feedback loop: invalidation messages multiplied rapidly across transactions, ballooning memory usage in the ReorderBuffer and triggering large memory allocations or records that exceeded the 1 GB per-field limit.
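To make that feedback loop concrete, here is a toy Python model (not PostgreSQL code, just an illustration of the arithmetic): a handful of concurrent transactions broadcast invalidation messages to each other. If each transaction rebroadcasts the messages it received as well as its own, the totals explode; if it rebroadcasts only its own, growth stays linear.

```python
# Toy model, not PostgreSQL code: why rebroadcasting *received* invalidation
# messages (the flawed backpatch) blows up, while rebroadcasting only a
# transaction's *own* messages (the follow-up fix) keeps totals small.

def simulate(rounds: int, n_txns: int, rebroadcast_received: bool) -> int:
    own = [1] * n_txns      # messages each transaction generated itself
    queued = [1] * n_txns   # messages each transaction currently buffers
    for _ in range(rounds):
        new_queued = list(queued)
        for i in range(n_txns):
            outgoing = queued[i] if rebroadcast_received else own[i]
            for j in range(n_txns):
                if j != i:
                    new_queued[j] += outgoing
        queued = new_queued
    return sum(queued)

for rounds in (1, 2, 3, 4, 5):
    buggy = simulate(rounds, n_txns=5, rebroadcast_received=True)
    fixed = simulate(rounds, n_txns=5, rebroadcast_received=False)
    print(f"rounds={rounds}: buggy={buggy:>8}  fixed={fixed}")
```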

PostgreSQL developers addressed this in a follow-up commit (d87d07b7ad3b7). That fix stops transactions from redistributing invalidation messages they received from peers, breaking the exponential growth loop. It also adds a safeguard that caps the total size of distributed invalidation messages per transaction at 8MB. If that limit is exceeded, PostgreSQL invalidates all caches outright instead of hoarding metadata indefinitely. This keeps logical decoding within safe memory boundaries and prevents runaway growth that can cause exactly this kind of failure. But according to the schedule, the new release that will include this fix will be on August 14th, 2025.

"did you see joto’s latest in #osm-dev? seems like we can try that but will need to take the site readonly…
“that works for me”

After the logical replication crashed due to the oversized record, the OSM operations team needed to get the database and replication system back to a consistent state without corrupting downstream minutely diffs or allowing edits that could be lost. To do this, they performed a controlled recovery process.

First, they decided to temporarily put the site into read-only mode, stopping new edits from users while the fix was applied. This ensured no further changes would be added to the database while they rebuilt the replication state. The staff then created a fresh full database dump (osm-2025-06-23.dmp) and used internal tools to process it on another server, which generated a clean snapshot of the database contents as the known good starting point.

Next, they generated “fake logs” and carefully edited the first two to remove any duplicate data that might have been introduced during the replication gap.

When replication breaks mid-stream (like when the logical replication slot hits the 1 GB limit and crashes), the system ends up in a partially replayed state. Some changes might already exist in the primary database, but the replication logs that downstream consumers rely on (like minutely diffs) may not have fully recorded them — or worse, may have partially recorded them before the crash. When the OSM team rebuilt the missing diffs from a fresh full database dump, they effectively re-exported all the current data in the database, which included edits that might have already appeared in earlier logs. So the “fake logs” were generated to fill the replication gap safely, but the first parts of these logs needed to be manually edited to remove any overlapping data that had already been published before the crash. This step ensures there’s exactly one clean, continuous history of changes — no gaps, no missing edits, but also no duplicates that would break the downstream consumers’ state.
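As a very rough sketch of that deduplication idea (illustrative only: the real recovery edited osmdbt’s own log files by hand, and their format is not osmChange), one could drop from a rebuilt diff anything already present in the diffs published before the crash, keyed by element type, ID and version:

```python
# Illustrative only: remove elements from a rebuilt osmChange diff that were
# already published in earlier diffs, keyed by (type, id, version).
import xml.etree.ElementTree as ET

def seen_versions(published_paths):
    """Collect (element type, id, version) triples from already-published diffs."""
    seen = set()
    for path in published_paths:
        root = ET.parse(path).getroot()       # <osmChange>
        for action in root:                   # <create>/<modify>/<delete>
            for el in action:                 # <node>/<way>/<relation>
                seen.add((el.tag, el.get("id"), el.get("version")))
    return seen

def dedupe(rebuilt_path, seen, out_path):
    """Write a copy of the rebuilt diff with already-published elements removed."""
    tree = ET.parse(rebuilt_path)
    for action in tree.getroot():
        for el in list(action):
            if (el.tag, el.get("id"), el.get("version")) in seen:
                action.remove(el)
    tree.write(out_path, encoding="UTF-8", xml_declaration=True)

# Hypothetical file names, just to show the flow:
seen = seen_versions(["664.osc", "665.osc"])
dedupe("666-rebuilt.osc", seen, "666-clean.osc")
```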

These cleaned logs were then synced to S3 storage to make them available to other systems that consume the diffs. After that, they rebuilt and published the missing minutely diffs, pushing them sequentially (diff 666 up through 674) until the replication history was complete and up to date on S3.

Once all the repaired diffs were published, the team re-enabled the replication system and ran a test replication cycle to confirm there were no errors. Seeing no issues, they lifted the read-only restriction and reopened the database for edits. They closely watched the system as the first new diff appeared (675.osc.gz), confirming that minutely replication and downstream tile rendering were catching up properly.

The OSM team’s approach of making the database read-only, taking a fresh full dump, and generating “fake logs” serves as an effective workaround for the PostgreSQL logical replication bug because it sidesteps the fragile part of the system: the stuck replication slot. Logical replication works by streaming row-level changes in order, and if any record becomes too large or the in-memory decoding metadata grows excessively — as happened here due to the invalidation message bug — the replication slot gets stuck and cannot continue processing.

By taking the site read-only, the team ensures there are no new edits while they rebuild. Then, by creating a clean, consistent dump of the entire database, they capture a known-good snapshot that includes all changes, even those that may have partially gone through the broken slot. Rebuilding the missing diffs from this dump allows the team to recreate the full replication history without relying on the corrupted slot. Editing the first few logs removes any duplicate data that might have already been published before the crash, which avoids double-applying changes downstream.

Once the replacement diffs are published, the stuck slot can safely be discarded and replication can restart from a clean state, free of oversized transactions that would trigger the same bug.

3 Likes

A few correction notes:

  1. During the event, planetdump-ng was already creating a new weekly planet dump as normal; this was not used as part of the replication diff recovery. The weekly pg_dump backups are the input to the weekly planetdump-ng planet exports. That week’s pg_dump backup started early Monday and finished late Wednesday evening, ending approximately an hour before the osmdbt failure.

  2. fakelog is part of osmdbt; it was used to create the state log files used in osmdbt’s diff generation process. The output logs from fakelog had to be manually edited to remove changesets which had already been published in replication diffs.

2 Likes