UPDATE 17 December 2024 1:45pm (GMT/UTC): We are back up and running. Happy Mapping.
OpenStreetMap.org and a number of related services are currently offline. The outage started at approximately 4:00AM (GMT/UTC) on 15 December 2024.
We have an ISP outage affecting our servers in Amsterdam. Our ISP has an engineer en route to fix the issue.
We will continue to monitor. If the ISP gives us an ETA / Estimated Time of Resolution, I will post the additional information here.
UPDATE 16 December 2024 12:00 Midday (GMT/UTC): We are hoping to have services fully restored on Wednesday 18 December 2024, based on our expectation of when our ISP will have restored service. We have chosen to wait for the ISP to restore service rather than activating our higher-risk disaster recovery plan. In parallel we are provisioning new ISP services from a separate ISP.
UPDATE 17 December 2024 12:00 Midday (GMT/UTC): Our new ISP is up and running and we have started migrating our servers across to it. If all goes smoothly we hope to have all services back up and running this evening. Our old ISP is still down. They have shipped replacement equipment, which is expected to arrive tomorrow; they then need to install and configure it before they can restore connectivity to us.
OpenStreetMap.org website and API are still read-only. You will not be able to save map changes while the website and API are read-only.
We are waiting for our Internet Service Provider (ISP) to fix their equipment at our primary hosting location. They have been unable to provide us with any time estimate for repairing the connection. This is an unusual situation for us. All of our own equipment is fine.
Having led several major-incident service recovery teams, I can say the details of a real-world failure are always more complex than anyone expects.
I’d suggest patience until the hosting provider issues a root-cause report, which can then be used to review the overall architecture against availability goals and, yes, costs.
It is easy to judge, but we tried to build something reliable, cost-effective and simple to manage. We have learnt.
We have dual redundant links via separate physical hardware from our side to our Tier 1 ISP. We unexpectedly discovered that their equipment is a single point of failure. Their extended outage is an extreme disappointment to us.
We are an extremely small team. The OSMF budget is tiny and we could definitely use more help. Real world experience.
We don’t run BGP because we aren’t comfortable with the leap in requirements, the cost (proper BGP-capable routers), and because we as a team don’t have deep experience with it. We also don’t have RIPE membership, our own ASN (lease?), or our own IPv4 subnet (lease?).
Ironically, we signed a contract with a new ISP in the last few days. The install is ongoing (fibre runs, modules & patching) and we expect to run old and new side by side for 6 months. It gives us significantly better resilience (redundant ISP-side equipment, VRRP both ways, multiple upstream peers… 2x diverse 10G fibre links).
I come from the ISP world (I was with Telefonica in Germany for 15 years, doing access and core routing work, etc.).
I am pretty astonished that even in the case of a hardware failure we have been affected for 18 hours now. Typically you have spare-part handling contracts (24x7, 365 days a year, with a 5-hour response), so your service partner such as Computacenter/Dimension Data/NTT has 5 hours to get replacement hardware on site.
Our Tier 1 ISP has just confirmed a hardware failure of the upstream router (after 16 hours!) and will provision a replacement device once it has been shipped to site.
Splutter - I damn hope not. We are looking at the practicality of syncing the last few writes out of our primary database (Amsterdam) to the secondary (Dublin) over our small out-of-band 4G modem + VPN backup link, so that we can do a full site failover and re-enable writes. No decision has been made yet.
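For anyone wondering what "syncing the last few writes" would involve, here is a minimal sketch assuming PostgreSQL streaming replication between the two sites; the hostnames, database names and monitoring role are placeholders, not the real OSM setup. It compares WAL positions to show how many bytes would still have to trickle over the 4G + VPN link before writes could be re-enabled in Dublin.

```python
#!/usr/bin/env python3
"""Minimal sketch: how far behind the Dublin replica is, in bytes of WAL,
relative to the Amsterdam primary. Assumes PostgreSQL streaming replication;
hostnames, database names and the monitoring role are placeholders."""

import psycopg2

PRIMARY_DSN = "host=db-primary.example.net dbname=openstreetmap user=monitor"  # placeholder
REPLICA_DSN = "host=db-replica.example.net dbname=openstreetmap user=monitor"  # placeholder


def query_one(dsn, sql, params=()):
    """Open a connection, run a single-row query and return the first column."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchone()[0]


# Current write position on the primary and last position replayed on the replica.
primary_lsn = query_one(PRIMARY_DSN, "SELECT pg_current_wal_lsn()::text")
replica_lsn = query_one(REPLICA_DSN, "SELECT pg_last_wal_replay_lsn()::text")

# Byte distance between the two positions: what still has to cross the backup link.
lag_bytes = query_one(
    PRIMARY_DSN,
    "SELECT pg_wal_lsn_diff(%s::pg_lsn, %s::pg_lsn)",
    (primary_lsn, replica_lsn),
)

print(f"primary at {primary_lsn}, replica replayed to {replica_lsn}")
print(f"~{lag_bytes} bytes of WAL still to ship over the 4G/VPN link")
```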
Look on the bright side!
I hear you have new candidates volunteering for the team, as @flohoff and @ramthelinefeed have shown an interest in helping you out with their experience!
I am sure many of us can help with that; if nobody else, you can always ping me.
Full BGP-capable routers are not necessarily expensive; you just don’t have to assume that “business-grade black box equipment” is the only way to go (without mentioning Brand Names™; though it’s possible to find some used kit for reasonable prices as well). There is a lot of interesting stuff out there, ranging from Linux-based routers using professional network cards to pretty cheap Mikrotik-based routers (unless you push an extreme amount of traffic through them, which you don’t, according to your graphs on prometheus.osm.org). And you may not even need full BGP. (And of course doing redundancy in-house is also possible, either with a spare on a shelf or live.)
Getting an ASN is pretty simple and may even be free, and you do not have to be a LIR. The same goes for an IP subnet (again, you do not need to be a LIR, but it may be expensive unless you get space from a friendly entity; you can always get an “infinite” amount of IPv6 for free).
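On the traffic point above: the claim is easy to sanity-check against the graphs on prometheus.osm.org by querying the Prometheus HTTP API directly. Here is a rough sketch; the metric and label names assume node_exporter data is scraped there, so treat them as illustrative rather than the exact OSM configuration.

```python
#!/usr/bin/env python3
"""Rough sanity check of total egress traffic against router capacity.
Assumes the Prometheus instance exposes node_exporter metrics via the
standard HTTP API; metric and label names are guesses, adjust as needed."""

import requests

PROMETHEUS = "https://prometheus.osm.org"  # instance referenced in the post above
# Total transmit rate across all non-loopback interfaces, in bits per second.
QUERY = 'sum(rate(node_network_transmit_bytes_total{device!="lo"}[5m])) * 8'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=30)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    gbps = float(sample["value"][1]) / 1e9
    print(f"current egress ≈ {gbps:.2f} Gbit/s")
```

Anything comfortably below 10 Gbit/s backs up the point that a modest BGP-capable box would cope with the load.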
I do actually earn my living as a compliance manager, cheeky chops.
I keep trying to persuade the infrastructure guys where I work to let me get them ISO 22301 certified, but as yet they keep wriggling out of it.
My bet: it starts to move between 10 and 11 o’clock today. Typically you either get spare parts within 5 hours, 24x7, or “next business day”, so I guess somebody at the ISP saved some money on spare-part handling.