Hey, I was wondering if there is an agreed-upon way to prevent link rot for features with the website tag. I was thinking of using the Internet Archive to preserve links, but I wanted to ask the community first. I couldn’t find a wiki page on it.
Globally, I don’t think there is such a procedure at the moment, but maybe some local communities have developed their own solutions.
Relatedly, @b-jazz-bot goes around updating
website tags from insecure HTTP URLs to more secure HTTPS URLs based on what the HTTP URL redirects to. Presumably the source code could be adapted to check for broken or squatted links as well. Rather than just removing the
website tag automatically, a MapRoulette challenge could encourage mappers to double-check whether the business has closed or changed its name.
This does not seem extremely useful - linking to an archived copy of a POI’s dead website is not as useful as, for example, preserving reference links on Wikipedia.
Though if someone did this, it would not hurt.
I wonder how useful a service would be that checks websites and, if a website is down two weeks in a row, adds a Note asking whether the entity has closed or has a new website, so the local community can resolve the problem one way or another.
Uh, I’d be against an automated tool adding OSM Notes. That sounds annoying (imagine if Osmose, for example, tried to open all the issues it finds as Notes) and is generally a potential disaster. Notes should be kept as they are: written by Humans for other Humans.
I’d personally prefer a checker which would generate an RSS feed with such website/URL problems in a specified bbox (I have a few places I follow and try to fix soon, so I’d be able to fix those fast, as opposed to being presented with the whole world to fix, which I’d likely just skip instead).
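As a rough illustration of the bbox filtering such a feed generator would need (Overpass-style (south, west, north, east) order assumed; all names hypothetical):

```python
def in_bbox(lat: float, lon: float, bbox: tuple) -> bool:
    """bbox as (south, west, north, east), the order Overpass uses."""
    south, west, north, east = bbox
    return south <= lat <= north and west <= lon <= east

def feed_items(issues: list, bbox: tuple) -> list:
    """Keep only the broken-website issues inside the area I follow."""
    return [i for i in issues if in_bbox(i["lat"], i["lon"], bbox)]
```

Each surviving item would then become one RSS entry, so a feed reader surfaces new breakage in your area as it appears.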
Actually, it was tingling in the back of my head that I’d seen this before… A quick look reveals that Keepright was checking those website issues: Keep Right/410 websites - OpenStreetMap Wiki, but:
You are right, QA tools make much more sense than Notes.
About a year ago, I processed all of the log files from the http->https bot and gathered hundreds of thousands of websites that are tagged in OSM. I wrote a quick little script to fetch (I don’t remember if just a HEAD request, or a full GET) all of those sites, 100 at a time in parallel. It wasn’t long before Akamai tagged my IP as nefarious, and my regular internet needs, like banking, were blocked because so many big sites use Akamai’s database of bad actors.
I could revive that work, make some changes to avoid Akamai mischaracterizing me as a “hacker”, and put together a MapRoulette challenge with severely broken links (DNS doesn’t resolve, non-200 response code, etc.). But I wonder how often random MapRoulette projects actually get worked on without a serious push to get others interested. I’ve had a couple of other projects that not a single soul touched, so it makes me question whether my work will be all for naught.
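One way to make a crawl like that less likely to trip CDN abuse detection is to parallelize across hosts while staying strictly sequential, with a delay, within each host. A sketch under those assumptions (names hypothetical; the `check` callable would do the real HEAD or GET):

```python
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

def group_by_host(urls):
    """Bucket URLs by hostname so each host is hit sequentially;
    parallelism then only happens across *different* hosts."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[urlparse(url).hostname or "<invalid>"].append(url)
    return dict(buckets)

def crawl(urls, check, per_host_delay=1.0, workers=10):
    """Run check(url) for every URL, politely, returning {url: result}."""
    results = {}
    def worker(bucket):
        for url in bucket:
            results[url] = check(url)
            time.sleep(per_host_delay)  # throttle per host, not globally
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() submits one worker per host bucket; __exit__ waits for all
        list(pool.map(worker, group_by_host(urls).values()))
    return results
```

This doesn’t guarantee you stay off a shared-reputation blocklist like Akamai’s, since many hosts sit behind the same CDN, but it removes the “100 requests at once” signature.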
Yikes, that’s a good thing to watch out for when doing this kind of crawling. I could also imagine some services throwing up a CAPTCHA or similar that would foil the kind of keyword checking that KeepRight apparently did.
The Wayback Machine has an API for getting the latest crawl’s HTTP status, which could be useful if you don’t need the actual page contents. Maybe they publish a batch API or dataset somewhere.
in my experience: basically never
I would likely start from some small sample (100 cases?) and post a link here to check whether there is any actual interest in such a cleanup at all.
I have had success with this, but that doesn’t mean all Maproulette tasks will. See MapRoulette (you’ll need to OAuth authenticate with your OSM credentials) and note that I created a task to add passenger platforms to rail lines in the USA; there were over 5000 tasks to complete and I got a few dozen done, “seeding” this, and then left it alone. I almost forgot about it, but when I returned to it (yes, years later), it had been substantially completed by many other (wonderful!) OSM contributors. Your mileage may vary, but this IS the power of crowd-tasking / gamification applications like Maproulette. Especially for larger (thousands of tasks) efforts, it really can take months or years, but it is sort of magical how this works. Of course, crowdsourcing aspects of OSM itself are much the same thing, I continue to be delighted.
Give the medium- to longer-term (months to years) a chance to work in OSM, be amazed by the results.
Depends what you mean by “serious push”. For example, in the last several months, TomTom has created some country-specific tailored challenges and announced them in the appropriate community categories on Discourse; the Croatian ones have all been solved now (even the ones which were not very good due to many false positives). But they were relatively small and targeted, and TomTom employees have picked up the slack in the few cases where the Croatian community only partly solved them - in any case, all of them saw local activity, even with a community as small as Croatia’s, and even in such relatively short timeframes.
Still, targeting a specific community seems to pay off, even with minimal advertising (i.e. one post per challenge).
I for one would be interested in solving URL bitrot in Croatia and perhaps a few of our neighbors (depending on the number of tasks), but it would need to be brought to my attention (also, once I completed the task, I’d likely forget about it completely, unless there were some way, like RSS or similar, to alert me when new things become broken).
As well as geographic communities, I wonder if one could do thematic communities: “Are all the hackerspaces etc. up to date?”
Another tool for checking
Unfortunately, the website checks were disabled with commit keepright / Code / Commit [r834]
I second this. In the Munich community we have had a success story with this. There were over 5000 links in different website tags which did not return a 200 HTTP status. With the help of the local community, we were able to reduce this to below 500 links.
For that link checking I also wrote a tool, because Keepright was disabled. When a website was not reachable anymore, I searched for that business on the internet and created a note in case I could not find a successor website for it. With this process I opened several hundred notes, and I got so much help from the OSM community in resolving many of them. This was and still is awesome. It was also very cool that the false-positive rate among the opened notes (website is down, but business is still active) was very low (I did not measure this, but that is my impression).
For anyone embarking on such a task: in Norway I have noticed that quite a few small businesses have not maintained their website and have transitioned to Facebook profiles.
So dead websites do not necessarily mean they’re no longer in business.
For completeness, Osmose also has a URL syntax check; it does not verify actual reachability, but it still finds a surprising number of completely broken URLs.
That’s a good point. I’ve seen that a lot here in the U.S. over the past few years. The only thing slowing this trend has been the pandemic, which forced some restaurants to add online ordering (but they usually contracted that out to a third-party website anyway).
Besides website, some of the social media keys like
contact:facebook sometimes go away too. Typically when a shop closes, the account just sits dormant, frozen in time; in my experience, inaccessible accounts tend to be due to name changes or an incorrect tag in the first place.
In case the shop has closed, the last post timestamp of its dormant social media account or the last successful Wayback Machine archive of its website can be useful for determining the shop’s
end_date in OpenHistoricalMap.
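Wayback Machine capture timestamps come back in YYYYMMDDhhmmss form (for example from the CDX API), so deriving an approximate end_date value is just a slice; a trivial sketch:

```python
def wayback_ts_to_end_date(ts: str) -> str:
    """Convert a Wayback capture timestamp (YYYYMMDDhhmmss)
    into the YYYY-MM-DD form used by end_date tags."""
    return f"{ts[0:4]}-{ts[4:6]}-{ts[6:8]}"
```

Of course the last successful archive only bounds the closing date from below; the shop may have survived its website for a while.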
There is a MapRoulette to find URL shorteners (and fix them)…
Not exactly link rot but extremely close.
Cleaned some of the