Yikes, that’s a good thing to watch out for when doing this kind of crawling. I could also imagine some services throwing up a CAPTCHA or similar that would foil the kind of keyword checking that KeepRight apparently did.
The Wayback Machine has an API for getting the latest crawl’s HTTP status, which could be useful if you don’t need the actual page contents. Maybe they publish a batch API or dataset somewhere.
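The Availability API mentioned here is simple enough to sketch. Below is a minimal example, assuming the documented endpoint at `archive.org/wayback/available`; the response shape (`archived_snapshots.closest` with `status`, `timestamp`, and `url` fields) is from the API's public documentation, so verify against the current docs before relying on it. The sample payload is illustrative, not a real capture.

```python
# Minimal sketch of querying the Wayback Machine Availability API
# for the latest crawl's HTTP status of a page.
import json
from urllib.parse import urlencode

API = "https://archive.org/wayback/available"

def availability_url(page_url):
    """Build the Availability API request URL for a page."""
    return API + "?" + urlencode({"url": page_url})

def latest_status(response_body):
    """Extract (HTTP status, snapshot URL) from an API response body.

    Returns None when no snapshot is available.
    """
    closest = json.loads(response_body).get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["status"], closest["url"]

# Abridged example of the documented response shape:
sample = ('{"archived_snapshots": {"closest": {"available": true,'
          ' "status": "200", "timestamp": "20240101000000",'
          ' "url": "https://web.archive.org/web/20240101000000/https://example.com/"}}}')
print(latest_status(sample))
```

In a real run you would fetch `availability_url(...)` with any HTTP client and pass the body to `latest_status`. This only tells you what the archive saw on its last crawl, not whether the site is up right now.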
I have had success with this, but that doesn’t mean all MapRoulette tasks will. See MapRoulette (you’ll need to authenticate via OAuth with your OSM credentials): I created a task to add passenger platforms to rail lines in the USA. There were over 5000 tasks to complete; I got a few dozen done to “seed” it, then left it alone. I had almost forgotten about it, but when I returned (yes, years later), it had been substantially completed by many other (wonderful!) OSM contributors. Your mileage may vary, but this IS the power of crowd-tasking / gamification applications like MapRoulette. Especially for larger efforts (thousands of tasks), it really can take months or years, but it is sort of magical how this works. Of course, the crowdsourcing aspects of OSM itself work much the same way; I continue to be delighted.
Give the medium to long term (months to years) a chance to work in OSM, and be amazed by the results.
Depends what you mean by “serious push”. For example, in the last several months, TomTom has created some country-specific tailored challenges and announced them in the appropriate community categories on Discourse; the Croatian ones have all been solved by now (even those that were not very good due to many false positives). They were relatively small and targeted, and TomTom employees picked up the slack in a few cases where the Croatian community only partly solved them. But in all of them there was local activity, even with a community as small as the Croatian one, and even in such relatively short timeframes.
Still, targeting a specific community seems to pay off, even with minimal advertising (i.e. one post per challenge).
I for one would be interested in solving URL bitrot in Croatia and perhaps a few of our neighbors (depending on the number of tasks), but it would need to be brought to my attention. (Also, once I completed the task, I’d likely forget about it completely, unless there were some way, like an RSS feed, to alert me when new links break.)
I second this. In the Munich community we have had a success story with this: there were over 5000 links in the various website tags that did not return a 200 HTTP status. With the help of the local community we reduced this to below 500 links.
For that link checking I also wrote a tool, because KeepRight had been disabled. When a website was no longer reachable, I searched for the business on the internet and created a note if I could not find a successor website for it. With this process I opened several hundred notes, and I got so much help from the OSM community in resolving many of them. This was and still is awesome. It was also very cool that the rate of false positives among the opened notes (website down, but business still active) was very low (I did not measure it, but that is my impression).
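The core of such a link-checking tool can be sketched in a few lines. This is not the tool described above, just a minimal stand-in: it fetches each URL with a browser-like User-Agent (some servers reject Python’s default) and distinguishes “reachable but broken” (404, 410, …) from “unreachable” (DNS failure, refused connection, timeout).

```python
# Minimal sketch of a website-tag link checker.
import urllib.error
import urllib.request

def check(url, timeout=15.0):
    """Return the HTTP status code, or None if the host is unreachable."""
    req = urllib.request.Request(
        url, headers={"User-Agent": "Mozilla/5.0 (OSM link check)"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # server answered, but e.g. 404 or 410
    except (urllib.error.URLError, TimeoutError, OSError):
        return None    # DNS failure, connection refused, timeout

# Usage: broken = [u for u in website_tag_values if check(u) != 200]
```

Anything that returns `None` or a non-200 code is a candidate for a manual survey, as described above; it is not by itself proof that the business is gone.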
For anyone embarking on such a task: in Norway I have noticed that quite a few small businesses have not maintained their website and have transitioned to Facebook profiles.
So dead websites do not necessarily mean they’re no longer in business.
For completeness, Osmose also has a URL syntax check. It does not verify that the URLs are actually reachable, but it still finds a surprising number of completely broken URLs.
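A syntax-only check of this kind is cheap because it needs no network access at all. The sketch below is not Osmose’s actual implementation, just an illustration of the idea: parse the tag value and flag anything without an http(s) scheme and a plausible hostname.

```python
# Sketch of a syntax-only URL check for website tag values
# (illustrative only, not Osmose's actual rule set).
from urllib.parse import urlparse

def looks_valid(value):
    """True if the value parses as an absolute http(s) URL with a host."""
    try:
        parts = urlparse(value.strip())
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and "." in parts.netloc

assert looks_valid("https://example.com/contact")
assert not looks_valid("htttps//example.com")  # mangled scheme
assert not looks_valid("index.html")           # relative path, no host
```

Catching typos like a missing colon or a pasted relative path is exactly the class of “completely broken URLs” such a check surfaces, while a well-formed URL to a dead server passes untouched.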
That’s a good point. I’ve seen that a lot here in the U.S. over the past few years. The only thing slowing this trend has been the pandemic, which forced some restaurants to add online ordering (but they usually contracted that out to a third-party website anyways).
Along with website, some of the social media keys like twitter/contact:twitter and facebook/contact:facebook sometimes go away too. Typically when a shop closes, the account just sits dormant, frozen in time; in my experience, inaccessible accounts tend to be due to name changes or an incorrect tag in the first place.
In case the shop has closed, the last post timestamp of its dormant social media account or the last successful Wayback Machine archive of its website can be useful for determining the shop’s end_date in OpenHistoricalMap.
This was one of my first thoughts when reading this thread.
But it seems that this is not an issue, as Overpass turbo doesn’t find any such links: http://overpass-turbo.eu/s/1xcB.
It gets refreshed when the challenge creator (disclaimer: me) clicks the Rebuild tasks button, which I plan to do from time to time; feel free to ping me if you need it before I get to it myself.
Remember that a lot of websites will not open if you are not in their country. Big shops in the USA, big and tiny businesses in the UK, big shops and government sites in Russia: they don’t even give you an answer, and you are simply disconnected by a timeout.
I thought it would be interesting to run through my http→https bot’s logs and look for websites that are not just non-functional, but don’t even have DNS records set up for them. So not just a case of a website being down when the bot queried it, but a website that doesn’t even have the most basic internet configuration in place.
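That DNS-level distinction can be checked without making any HTTP request at all. A minimal sketch (not the bot’s actual code): resolve the URL’s hostname and treat a resolution failure as “no DNS records”, which is a much stronger signal of abandonment than a timeout on one crawl.

```python
# Sketch of a DNS-level check: does the URL's hostname resolve at all?
import socket
from urllib.parse import urlparse

def has_dns_record(url):
    """True if the URL's hostname resolves to at least one address."""
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False
```

Note that a transient resolver outage also raises `gaierror`, so for a bot run you would want to retry failures on a later day before flagging a domain as truly gone.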
I put together a MapRoulette Challenge from a partial run. I’ll update it next week when the run completes and it should be reasonably easy to keep this updated on a regular (quarterly?) basis if the challenge gets any traction.
Cool! It’s showing me a furniture shop in Wales. I confirm the website is down. I search the name on the internet. Their Twitter was last active in 2019, but nothing about them closing. No news articles about it either. I check on Mapillary, no images of that street. Any ideas what else to try?
I could leave a note, but if this is done on a large scale, wouldn’t that effectively be the same as a bot creating the notes?