So do ways 958597825 and 211410136. To determine whether a website is obsolete, the following conditions are checked (here):
a. the response to a GET request has status code 404, AND
b. the response text contains the string "404", AND
c. the response text contains the string "page not found" OR "url was not found" (case-insensitive search). A sketch of this check is shown below.
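For clarity, here is a minimal sketch of that check (my own illustration in Python; the complete, authoritative code is in the repository linked just below):

```python
# Minimal sketch of the obsolescence check described above (illustrative only;
# the bot's actual code is in the zabop/osm-url-screener repository).
import requests

def looks_obsolete(url: str) -> bool:
    try:
        response = requests.get(url, timeout=30)
    except requests.RequestException:
        return False  # network errors are not treated as "obsolete" in this sketch
    text = response.text.lower()
    return (
        response.status_code == 404                # (a) status code is 404
        and "404" in text                          # (b) body contains "404"
        and ("page not found" in text              # (c) body contains one of these
             or "url was not found" in text)       #     phrases, case-insensitively
    )
```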
Complete code for the bot: GitHub - zabop/osm-url-screener. Currently it is scheduled to run twice a day, and it is unlikely to modify more than a few features during each run. (The previous 3 runs, which I supervised, affected only 2, 1 and 3 contact:website values respectively. My aim for this bot is high quality, not high quantity. If it works properly, over time it will help us clean up the DB.)
If you have any concerns about anything above, please raise them! (You can also message me or email me at 404remover@protonmail.com, but public communication is almost always better, I think.) If you want to stop the bot, open an issue on GitHub. That will halt all further invocations.
Thanks! You’re not the first to propose this; see previous threads at [1] and (in German) [2].
The downside to this is that a broken link is a fantastic indicator that the shop might not exist any more, and would benefit from an in-person visit (survey) to confirm what’s there now.
Removing the website tag without actually surveying the shop in some ways makes the real problem (the outdated POI) worse, because it resets the last modified date. For POIs that don’t have a check_date tag, the last modified date is what outdoor mapping apps (StreetComplete, Every Door) rely on to decide whether to ask their users to resurvey the POI. So the concern is that removing the website actually makes it less likely that someone will go and have a look what’s there now.
I would suggest you look for POIs with broken websites, but then find a way to feed that into surveying workflows. There are some suggestions in the linked threads.
Do you have an opinion on restricting the automated fix to features which already have a check_date value? There are quite a lot of those as well (see London example).
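To make that restriction concrete, here is an illustrative sketch of how one might fetch features that carry both contact:website and check_date via the Overpass API (my own example; the bounding box values are placeholders, not the linked "London example"):

```python
# Illustrative sketch: fetch features that have both a contact:website and a
# check_date tag inside a bounding box, via the Overpass API.
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

def features_with_checkdate(bbox: str) -> list:
    query = f"""
    [out:json][timeout:60];
    nwr["contact:website"]["check_date"]({bbox});
    out tags center;
    """
    response = requests.post(OVERPASS_URL, data={"data": query}, timeout=120)
    response.raise_for_status()
    return response.json()["elements"]

# bbox order is south,west,north,east; placeholder values roughly around London
print(len(features_with_checkdate("51.4,-0.2,51.6,0.0")))
```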
find a way to feed that into surveying workflows. There are some suggestions in the linked threads
Thanks! Oh, I see, you’ve had a check_date-related idea in the German thread (translation using Google Translate) in the context of an MR challenge:
This is precisely what gave me the idea to limit the challenge to POIs whose check_date has a recent value, because that means a StreetComplete or Every Door user has visited recently. This makes it much more likely that I can simply Google the “solution” from home. And it avoids the problem of “resetting” the last-edited date for a POI that has long since disappeared, which would make the data appear more recent than it actually is.
It is also mentioned in that thread that there are very few check_dates in the DB. Their number is rapidly rising (thanks to StreetComplete I believe):
Hey, just poking in to say that I really appreciate the way you’re going about this: an automated-edit proposal wiki page, an OSM forum thread, public code, and a deployment delay (via a GitHub issue) to allow for community feedback. This is a model approach. Thanks for clearly putting in the effort to be a “good citizen”.
For the reasons given by osmuser63783, I’m also not convinced that removing broken links is the best thing to do. But certainly having a tool that flags them for people to review could be very useful.
A few comments:
Be careful with automated tools that check websites. Some servers may be configured to present different things to clients they believe to be bots.
It would probably be useful to consider more than just 404 errors (including 403s, and various server and DNS errors). But see also the point above. Redirects (particularly to different domains) can also be a useful signal; a rough sketch of such a classification follows after this list.
The website=* tag is much more widely used than the contact:website=* tag (220,705 vs 35,669 in the UK). So whatever tests you’re doing on contact:website=* you might want to also do on website=*. The tag url=* also appears on around 7,000 objects.
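Purely as an illustration of the point about looking beyond 404s (my own sketch, not the bot’s code; the category labels are invented), distinguishing DNS failures, connection problems, 4xx/5xx responses and cross-domain redirects could look roughly like this:

```python
# Rough sketch of classifying link-check outcomes beyond a plain 404.
# The category names are invented for this example.
import socket
from urllib.parse import urlparse

import requests

def classify(url: str) -> str:
    host = urlparse(url).hostname
    if host is None:
        return "unparseable-url"
    try:
        socket.getaddrinfo(host, None)          # does the hostname resolve at all?
    except socket.gaierror:
        return "dns-failure"
    try:
        r = requests.get(url, timeout=30, allow_redirects=True)
    except requests.RequestException:
        return "connection-error-or-timeout"
    if r.history and urlparse(r.url).hostname != host:
        return "redirected-to-other-domain"     # often a useful signal in itself
    if 400 <= r.status_code < 500:
        return f"client-error-{r.status_code}"  # 403, 404, 410, ...
    if r.status_code >= 500:
        return "server-error"
    return "looks-ok"
```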
I have been running a link checker every night on all the websites I maintain for more than a decade now, and I have found it is not as easy as it looks. A 404 is only one of many results you might get; redirects, or a website simply not answering any more, are also common. I suggest checking a URL several times over several days, because intermittent problems are quite common. I have seen websites that reproducibly do not work at certain times of the day (when my checker was scheduled to run) but work fine at all other times.
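To make the “check several times over several days” idea concrete, here is a sketch under my own assumptions (the JSON file name and the three-failing-days threshold are arbitrary choices for this example):

```python
# Illustrative sketch: only treat a URL as broken after it has failed on
# several separate days; one successful check resets the counter.
import datetime
import json
import pathlib

STATE_FILE = pathlib.Path("failure_log.json")
FAILURES_NEEDED = 3  # distinct days a URL must fail before we flag it

def record_result(url: str, failed: bool) -> bool:
    """Record today's result; return True once the URL has failed on enough days."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    today = datetime.date.today().isoformat()
    days = set(state.get(url, []))
    if failed:
        days.add(today)
    else:
        days.clear()  # one success resets the counter
    state[url] = sorted(days)
    STATE_FILE.write_text(json.dumps(state, indent=2))
    return len(days) >= FAILURES_NEEDED
```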
I would always see a broken website only as a hint that something is wrong with that entry, not as something that should be fixed automatically.
I once proposed an automatic cleanup of website tags for values that were so egregiously wrong that the DNS for the website couldn’t even resolve to an IP address. This isn’t just a broken website where the URL changed; it’s so fundamentally broken that having the value is worthless. Enough people argued about a wholesale automated change that I put together an MR challenge instead. In the last 18 months, only 2-3% of the tags have been fixed. IMHO, MapRoulette is not a good solution here: automation is clearly appropriate and would greatly improve the data quality in OSM. But it’s exhausting trying to fight that fight with the anti-automationists.
If I remember correctly, Osmose also had a warning for “website is no longer valid, verify if POI is still there”. (Not finding it now… was I just imagining it?)
Something like Osmose might be a better fit for this data, since it can be viewed by local mappers who are in a position to survey the POI. On MapRoulette you’d have to know about the challenge and then find a task close to you that you can survey.
MR and website values that aren’t even in the DNS: I think asking humans to check websites whose hostnames don’t resolve, and to remove them if they don’t find them (they won’t), is in many respects equivalent to automation, only cumbersome and time-consuming. We create lots of work while simultaneously losing the advantage of having the broken links as a hint:
Plus, we create a challenge where MR users can just mindlessly click through, which is not a good habit to encourage in general (I’m a bit biased against MR after reading some of this).
I think it would be better not to have broken links, but if the price of getting there is human edits, then I don’t feel it’s worth the price. There is so much other high-impact stuff to do.
I think I can see both sides of the argument; I lean more towards removing links that aren’t even in the DNS than towards keeping them. Not gonna die on this hill or let it interfere with my OSM fun.
Sounds good! If someone feels keen to lead on this, I’m happy to assist them if I’m needed and can be useful. My main topic of interest in OSM is power. I thought of this link removal as a nice, easy bot experiment to do (and I had a boring weekend). Not entirely unexpectedly, it turns out this is more complicated than I initially thought!
So I’m returning to devote most of my OSM time to power-related stuff. (Sorry for the wasted reading time I caused.)