My bet: between 10 and 11 o'clock today it starts to move. Typically you either get spare parts within 5 hours, 24x7, or "next business day", so I guess somebody at the ISP saved some money on spare-part handling.
Just a friendly reminder that high availability and redundancy cost extra, so if you can, go to Donate – OpenStreetMap Foundation and donate what you can, so that maybe in the future this can be prevented.
Let the team behind the scenes work in peace. They’ll know how to get out of this mess.
It’s going to take time and I trust them. Good success takes time …
And instead of trying to ‘make a name for yourself’ with ‘clever’ advice, you could help the family decorate the Christmas tree.
The Operations Team has decided to wait for the restoration of ISP services at our primary site in Amsterdam. Our expectation is that services should be restored on Wednesday. We don't have an ETA from the ISP, so the time estimate is based on our own predictions from our communication with them.
While manually recovering services (Postgres + planet diffs) to Dublin remains an option, we have decided for now not to activate this disaster recovery scenario. The risks involved are not yet justified. Data integrity is our priority.
In parallel we are finalising the provisioning of new ISP services in Amsterdam and Dublin.
Just curious, who is operating the “admin” account?
Grant. My login token expired and I used a special admin login method. OAuth2 via OSM.org isn't functioning because the tokens cannot be stored in a read-only database.
The ISP is express-shipping the equipment from California.
By ship?
The good news here is that eventually all of our tokens will expire and you won't have to hear complaints on the forum anymore.
SSO - Single Silence, Offline.
Out of curiosity, can you explain why we can read the database and not write to it? There probably are lessons here for those of us who are involved in managing other platforms.
Considering how many services use our data, one would expect the infrastructure to work a little more efficiently. I’m curious how long it will take to fix, and even more curious what lessons will be learned.
We have a primary (Amsterdam) and a follower (Dublin). The primary's data is synced to the followers (we have multiple). We use asynchronous replication because latency between Amsterdam and Dublin can vary a lot, and synchronous replication would affect the speed at which changes upload. In addition, we have state data for planet diffs and internal diff-tracking state.
When the uplink in Amsterdam failed, not all the data had been synced to Dublin. A small amount of map changes therefore exists only in Amsterdam (we also have a follower database in Amsterdam). If we force Dublin live, that data will be lost. Alternatively, we could manually sync over a 4G + VPN link, but we have deemed this too high-risk for the moment.
Summary: We are running Read-Only in Dublin because not all the map changes had been copied to Dublin when our Amsterdam connection went down. If we force Dublin to Read-Write, we will lose some mapping data.
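To make the asynchronous-replication gap concrete: PostgreSQL tracks write-ahead-log positions as LSNs (log sequence numbers) written as `XXX/YYYYYYYY` hex pairs, and the follower's lag is just the byte distance between the primary's and follower's LSNs. A minimal sketch (not OSM's actual tooling; the LSN values below are made up):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '16/B374D848' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def replication_lag_bytes(primary_lsn: str, follower_lsn: str) -> int:
    """Bytes of WAL written on the primary but not yet applied on the follower."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(follower_lsn)

# Hypothetical positions: primary (Amsterdam) vs. follower (Dublin)
lag = replication_lag_bytes("16/B374D848", "16/B2F01000")
print(f"follower is {lag} bytes ({lag / 1024:.0f} KiB) behind")
```

With synchronous replication that lag would be forced to (near) zero before each commit returns, which is exactly the upload-speed cost described above; asynchronous replication accepts a nonzero lag, and whatever WAL sits in that gap when the link dies exists only on the primary.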
Tiny budget, tiny team. We do the best we can with the resources we have. We’d appreciate more help.
Is this the point at which we should think about no longer running this part of OSM on a volunteer basis?
Wow, my estimate wasn't too far off. A tough time for anyone here who relies on OpenStreetMap in their daily life, and especially for those who were planning to update public transport lines for the annual schedule change for 2025 (which happened to fall on the exact day the database went down).
Out of curiosity, what are the data integrity risks involved?
I am the only full-time employee of the OpenStreetMap Foundation. The OSMF is interested in hiring additional staff, but I believe it is constrained financially.
Manual Postgres WAL recovery and manual osmdbt state recovery, while ensuring it continues from the right logical DB segment.
We’d also have to hack the 4G link to get it working for this purpose. The 4G modem link has a tiny bandwidth allowance, but can be topped up in 500MB blocks.
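Since the 4G allowance is topped up in fixed 500 MB blocks, the cost of the manual sync is easy to estimate. A small illustrative sketch (the transfer size and remaining allowance are hypothetical, not actual figures from the incident):

```python
import math

BLOCK_MB = 500  # 4G top-up block size mentioned above

def topup_blocks_needed(transfer_mb: float, allowance_mb: float = 0) -> int:
    """Number of 500 MB top-up blocks needed to move `transfer_mb`
    of data over the link, given any allowance already remaining."""
    remaining = max(0.0, transfer_mb - allowance_mb)
    return math.ceil(remaining / BLOCK_MB)

# Hypothetical: syncing ~3.2 GB of WAL with 100 MB of allowance left
print(topup_blocks_needed(3200, allowance_mb=100))  # -> 7
```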
Though some mapping data is also lost through long downtime: edits that simply never get made.
On the other hand, throwing away some unsynchronized data would, I guess, greatly increase the risk of inconsistencies: some edits left only partly applied, some reverts discarded and the vandalism they removed restored, and perhaps other data lost as well (registrations, blocks, etc.).
(note: I am not a highly experienced sysadmin)
I can confirm this.
(though note that I am no longer on the OSMF board, but I expect no dramatic changes here)