UPDATE 17 December 2024 1:45pm (GMT/UTC): We are back up and running. Happy Mapping.
OpenStreetMap.org and a number of related services are currently offline. The outage started at approximately 4:00AM (GMT/UTC) on 15 December 2024.
We have an ISP outage affecting our servers in Amsterdam. Our ISP has an engineer en route to fix the issue.
We will continue to monitor. If the ISP gives us an ETA / Estimated Time of Resolution, I will post the additional information here.
UPDATE 16 December 2024 12:00 Midday (GMT/UTC): We are hoping to have services fully restored on Wednesday 18 December 2024, based on our expectation of when our ISP will have restored service. We have chosen to wait for the ISP to restore service rather than activating our higher-risk disaster recovery plan. In parallel we are provisioning new ISP services from a separate ISP.
UPDATE 17 December 2024 12:00 Midday (GMT/UTC): Our new ISP is up and running and we have started migrating our servers across to it. If all goes smoothly we hope to have all services back up and running this evening. Our old ISP is still down. They have shipped replacement equipment, which is expected to arrive tomorrow; they then need to install and configure it before they can restore connectivity to us.
OpenStreetMap.org website and API are still read-only. You will not be able to save map changes while the website and API are read-only.
We are waiting for our Internet Service Provider (ISP) to fix their equipment at our primary hosting location. They have been unable to provide us with any time estimate for repairing the connection. This is an unusual situation for us. All of our own equipment is fine.
Having led several major-incident service recovery teams, I can say the details of a real-world failure are always more complex than anyone expects.
I’d suggest patience until the hosting provider issues a root-cause report, which can then be used to review the overall architecture against availability goals and, yes, costs.
It is easy to judge, but we tried to build something reliable, cost-effective and simple to manage. We have learnt.
We have dual redundant links via separate physical hardware from our side to our Tier 1 ISP. We unexpectedly discovered that their equipment is a single point of failure. Their extended outage is an extreme disappointment to us.
We are an extremely small team. The OSMF budget is tiny and we could definitely use more help. Real world experience.
We don’t run BGP because we aren’t comfortable with the leap in requirements, the cost (proper BGP-capable routers), and because we as a team don’t have deep experience with it. We also don’t have RIPE membership, our own ASN (lease?), or our own IPv4 subnet (lease?).
Ironically, we signed a contract with a new ISP in the last few days. The install is ongoing (fibre runs, modules & patching) and we expect to run old and new side by side for 6 months. It gives us significantly better resilience (redundant ISP-side equipment, VRRP both ways, multiple upstream peers… 2x diverse 10G fibre links).
I come from the ISP world (I was with Telefonica in Germany for 15 years, doing access and core routing work, etc.).
I am pretty astonished that even in the case of a hardware failure we have been affected for 18 hours now. Typically you have spare-part handling contracts (24x7, 365 days a year, with a 5-hour response), so your service partner such as Computacenter/Dimension Data/NTT has 5 hours to get replacement hardware on site.
Our Tier 1 ISP has just confirmed a hardware failure of the upstream router (after 16 hours!) and will provision a replacement device once it has been shipped to site.
Splutter - I damn hope not. We are looking at the practicality of syncing the last few writes out of our primary database (Amsterdam) to the secondary (Dublin) over our small out-of-band 4G modem + VPN backup link, so that we can do a full site failover and re-enable writes. No decision has been made yet.
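For anyone wondering what "syncing the last few writes" would involve, here is a minimal sketch assuming PostgreSQL streaming replication between the two sites; the hostnames, database names and monitoring role are placeholders, not the real OSM setup. It compares WAL positions to show how many bytes would still have to trickle over the 4G + VPN link before writes could be re-enabled in Dublin.

```python
#!/usr/bin/env python3
"""Minimal sketch: how far behind the Dublin replica is, in bytes of WAL,
relative to the Amsterdam primary. Assumes PostgreSQL streaming replication;
hostnames, database names and the monitoring role are placeholders."""

import psycopg2

PRIMARY_DSN = "host=db-primary.example.net dbname=openstreetmap user=monitor"  # placeholder
REPLICA_DSN = "host=db-replica.example.net dbname=openstreetmap user=monitor"  # placeholder


def query_one(dsn, sql, params=()):
    """Open a connection, run a single-row query and return the first column."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchone()[0]


# Current write position on the primary and last position replayed on the replica.
primary_lsn = query_one(PRIMARY_DSN, "SELECT pg_current_wal_lsn()::text")
replica_lsn = query_one(REPLICA_DSN, "SELECT pg_last_wal_replay_lsn()::text")

# Byte distance between the two positions: what still has to cross the backup link.
lag_bytes = query_one(
    PRIMARY_DSN,
    "SELECT pg_wal_lsn_diff(%s::pg_lsn, %s::pg_lsn)",
    (primary_lsn, replica_lsn),
)

print(f"primary at {primary_lsn}, replica replayed to {replica_lsn}")
print(f"~{lag_bytes} bytes of WAL still to ship over the 4G/VPN link")
```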
Look on the bright side!
I hear you have new candidates volunteering for the team, as @flohoff and @ramthelinefeed have shown an interest in helping you out with their experience!
I am sure many of us can help with that; if nobody else, you can always ping me.
Full BGP-capable routers are not necessarily expensive; you just don’t have to assume that “business-grade black box equipment” is the only way to go (without mentioning Brand Names™; though it’s possible to find some used kit for reasonable prices as well). There is a lot of interesting stuff out there, ranging from Linux-based routers using professional network cards to pretty cheap Mikrotik-based routers (unless you push an extreme amount of traffic through them, which you don’t, according to your graphs on prometheus.osm.org). And you may not even need full BGP. (And of course doing redundancy in-house is also possible, either with a spare on a shelf or live.)
Getting an ASN is pretty simple and may even be free, and you do not have to be a LIR. The same goes for an IP subnet (again, you do not need to be a LIR, but it may be expensive unless you get space from a friendly entity; you can always get an “infinite” amount of IPv6 for free).
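On the traffic point above: the claim is easy to sanity-check against the graphs on prometheus.osm.org by querying the Prometheus HTTP API directly. Here is a rough sketch; the metric and label names assume node_exporter data is scraped there, so treat them as illustrative rather than the exact OSM configuration.

```python
#!/usr/bin/env python3
"""Rough sanity check of total egress traffic against router capacity.
Assumes the Prometheus instance exposes node_exporter metrics via the
standard HTTP API; metric and label names are guesses, adjust as needed."""

import requests

PROMETHEUS = "https://prometheus.osm.org"  # instance referenced in the post above
# Total transmit rate across all non-loopback interfaces, in bits per second.
QUERY = 'sum(rate(node_network_transmit_bytes_total{device!="lo"}[5m])) * 8'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=30)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    gbps = float(sample["value"][1]) / 1e9
    print(f"current egress ≈ {gbps:.2f} Gbit/s")
```

Anything comfortably below 10 Gbit/s backs up the point that a modest BGP-capable box would cope with the load.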
I do actually earn my living as a compliance manager, cheeky chops.
I keep trying to persuade the infrastructure guys where I work to let me get them ISO 22301 certified, but as yet they keep wriggling out of it.
My bet: it starts to move between 10 and 11 o’clock today. Typically you either get spare parts within 5 hours, 24x7, or “next business day”, so I guess somebody at the ISP saved some money on spare-part handling.