Is there a procedure to prevent link rot?

Yikes, that’s a good thing to watch out for when doing this kind of crawling. I could also imagine some services throwing up a CAPTCHA or similar that would foil the kind of keyword checking that KeepRight apparently did.

The Wayback Machine has an API for getting the latest crawl’s HTTP status, which could be useful if you don’t need the actual page contents. Maybe they publish a batch API or dataset somewhere.
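
For example, here’s a minimal sketch against the Availability API (https://archive.org/wayback/available) — `example.com` is just a placeholder:

```python
import requests

def wayback_status(url: str) -> str | None:
    """Return the HTTP status the Wayback Machine recorded for its most
    recent crawl of `url`, or None if the page was never archived."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},
        timeout=30,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["status"] if closest else None

print(wayback_status("example.com"))  # e.g. "200"
```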

3 Likes

In my experience: basically never.

I would likely start from a small sample (100 cases?) and post a link here to check whether there is any actual interest in such a cleanup.

I have had success with this, but that doesn’t mean all MapRoulette tasks will. See MapRoulette (you’ll need to authenticate via OAuth with your OSM credentials) and note that I created a task to add passenger platforms to rail lines in the USA; there were over 5,000 tasks to complete and I got a few dozen done, “seeding” it, and then left it alone. I almost forgot about it, but when I returned (yes, years later), it had been substantially completed by many other (wonderful!) OSM contributors. Your mileage may vary, but this IS the power of crowd-tasking / gamification applications like MapRoulette. Especially for larger efforts (thousands of tasks), it really can take months or years, but it is sort of magical how this works. Of course, the crowdsourced aspects of OSM itself work much the same way; I continue to be delighted.

Give the medium to long term (months to years) a chance to work in OSM, and be amazed by the results.

Depends what you mean by “serious push”. For example, in the last several months, TomTom has created some country-specific tailored challenges and announced them in the appropriate community categories on Discourse; the Croatian ones have all been solved now (even the ones which were not very good due to many false positives). They were relatively small and targeted, and TomTom employees picked up the slack in a few cases where the Croatian community only partly solved them — but in all of them there was local activity, even with a community as small as Croatia’s, and even in such relatively short timeframes.

Still, targeting a specific community seems to pay off, even with minimal advertising (i.e. one post per challenge).
I for one would be interested in solving URL bitrot in Croatia and perhaps a few of our neighbors (depending on the number of tasks), but it would need to be brought to my attention (also, once I completed the task, I’d likely forget about it completely, unless there were some mechanism like an RSS feed to alert me when new links break).

2 Likes

As well as geographic communities, I wonder if one could do thematic communities: “Are all the hackerspaces (etc.) up to date?”

5 Likes

Another tool for checking website tags:

1 Like

Unfortunately, the website checks were disabled with commit keepright / Code / Commit [r834]

2 Likes

I second this. In the Munich community we have had a success story with this: there were over 5,000 links in various website tags that did not return a 200 HTTP status. With the help of the local community, we reduced this to below 500 links.

For link checking I also wrote a tool, because KeepRight was disabled. When a website was no longer reachable, I searched for the business on the internet and created a note if I could not find a successor website for it. Through this process I opened several hundred notes, and I got so much help from the OSM community in resolving many of them. This was, and still is, awesome. It was also very cool that the rate of false positives among the opened notes (website down, but business still active) was very low (I did not measure this, but that is my impression).
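
My tool isn’t the only way to do this; here’s a rough sketch of the general approach. The Overpass endpoint is real, but the bounding box is a placeholder, and a real run should throttle harder and retry with GET where servers reject HEAD:

```python
import time
import requests

OVERPASS = "https://overpass-api.de/api/interpreter"
# Placeholder bounding box (south,west,north,east) -- adjust to your area.
QUERY = """
[out:json][timeout:60];
nwr["website"](48.0,11.3,48.3,11.8);
out tags;
"""

def broken_websites():
    elements = requests.post(
        OVERPASS, data={"data": QUERY}, timeout=120
    ).json()["elements"]
    for el in elements:
        url = el["tags"]["website"]
        try:
            # HEAD is cheap; some servers reject it (405), so a real
            # tool should retry those with GET before flagging them.
            status = requests.head(url, allow_redirects=True, timeout=10).status_code
        except requests.RequestException:
            status = None  # DNS failure, timeout, TLS error, ...
        if status != 200:
            yield el["type"], el["id"], url, status
        time.sleep(1)  # be polite to the sites you are probing

for osm_type, osm_id, url, status in broken_websites():
    print(f"{osm_type}/{osm_id}: {url} -> {status}")
```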

4 Likes

For anyone embarking on such a task: in Norway I have noticed that quite a few small businesses have not maintained their website and have transitioned to Facebook profiles.

So dead websites do not necessarily mean they’re no longer in business 🙂

5 Likes

For completeness, Osmose also has a URL syntax check. It does not verify actual reachability, but it still finds a surprising number of completely broken URLs.
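
For illustration, a purely syntactic check (no network access at all) might look something like this — not Osmose’s actual rule, just the general idea:

```python
from urllib.parse import urlparse

def looks_like_valid_url(value: str) -> bool:
    """Purely syntactic check: http(s) scheme, a host with a dot,
    no spaces. Catches values like 'http//example.com' or free text
    without making any network request."""
    if " " in value:
        return False
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and "." in parsed.netloc

for value in ("https://example.com", "http//example.com", "just some text"):
    print(value, "->", looks_like_valid_url(value))
```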

4 Likes

That’s a good point. I’ve seen that a lot here in the U.S. over the past few years. The only thing slowing this trend has been the pandemic, which forced some restaurants to add online ordering (though they usually contracted that out to a third-party website anyway).

Along with website, some of the social media keys like twitter/contact:twitter and facebook/contact:facebook sometimes go stale too. Typically, when a shop closes, the account just sits dormant, frozen in time; in my experience, accounts that are actually inaccessible tend to be due to name changes or an incorrect tag in the first place.

If the shop has closed, the last post timestamp of its dormant social media account or the last successful Wayback Machine archive of its website can be useful for determining the shop’s end_date in OpenHistoricalMap.
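
One way to look up that last capture is the Wayback CDX API; here’s a minimal sketch (the statuscode filter and the negative limit follow my reading of the CDX server docs):

```python
import requests

def last_ok_capture(url: str) -> str | None:
    """Timestamp (YYYYMMDDhhmmss) of the most recent HTTP-200 Wayback
    Machine capture of `url`, or None if there is none."""
    resp = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={
            "url": url,
            "output": "json",
            "filter": "statuscode:200",
            "limit": "-1",  # negative limit = count from the newest capture
        },
        timeout=30,
    )
    rows = resp.json() if resp.text.strip() else []
    # rows[0] is the column header; the timestamp is the second column.
    return rows[1][1] if len(rows) > 1 else None

print(last_ok_capture("example.com"))  # e.g. "20240115123456"
```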

3 Likes

There is a MapRoulette challenge to find URL shorteners (and fix them)…

Not exactly link rot but extremely close.

Cleaned some of them already.

2 Likes

That’s a good one since a certain billionaire seems to have killed the public availability of t.co links…

4 Likes

This was one of my first thoughts when reading this thread.
But it seems that this is not an issue, as Overpass turbo doesn’t find any such links: http://overpass-turbo.eu/s/1xcB.
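
For anyone who wants to re-run this later without the saved query, here’s roughly the same check via the Overpass API — the tag list and regex are my reconstruction, not necessarily the exact saved query behind the link:

```python
import requests

# Reconstruction of that kind of query -- the tag list and the regex
# are assumptions, not necessarily the saved query linked above.
QUERY = r"""
[out:json][timeout:120];
(
  nwr["website"~"^https?://t\\.co/"];
  nwr["contact:website"~"^https?://t\\.co/"];
  nwr["url"~"^https?://t\\.co/"];
);
out tags center;
"""

elements = requests.post(
    "https://overpass-api.de/api/interpreter",
    data={"data": QUERY},
    timeout=180,
).json()["elements"]
print(f"{len(elements)} objects still carry t.co links")
```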

5 Likes

Is this an ongoing MapRoulette task? I mean, does the data get refreshed at any point, so that new shortened URLs get added to the challenge?

1 Like

It gets refreshed when the challenge creator (disclaimer: me) clicks the Rebuild tasks button, which I plan to do from time to time; feel free to ping me if you need it before I get to it myself.

Apart from that, there is not yet support for continuous / auto-refresh challenges in MapRoulette, although that feature is being discussed (so feel free to add your support there!): Continuous Challenges · Issue #1910 · maproulette/maproulette3 · GitHub

8 Likes

Remember that there are a lot of websites that do not open if you are not in their country. Big shops in the USA, businesses big and tiny in the UK, big shops and government sites in Russia: they don’t even give you an answer, and you are disconnected by timeout.

4 Likes

I thought it would be interesting to run through my http->https bot’s logs and look for websites that are not just non-functional, but don’t even have DNS records set up for them. So it’s not just a case of a website being down when the bot queried it, but a website that doesn’t even have the most basic internet configuration in place.
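
Not the bot’s actual code, but the distinction is easy to check; a sketch:

```python
import socket
from urllib.parse import urlparse

def has_dns_record(url: str) -> bool:
    """True if the URL's hostname resolves to any A/AAAA record.
    A host can resolve and still be down; failing this check means
    not even the most basic configuration is in place."""
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

print(has_dns_record("https://example.com"))      # True
print(has_dns_record("https://no-such.invalid"))  # False
```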

I put together a MapRoulette Challenge from a partial run. I’ll update it next week when the run completes and it should be reasonably easy to keep this updated on a regular (quarterly?) basis if the challenge gets any traction.

4 Likes

So what were the results?

Even a rough percentage from a partial run would be interesting to know.

Cool! It’s showing me a furniture shop in Wales. I confirm the website is down. I search the name on the internet: their Twitter was last active in 2019, but nothing about them closing, and no news articles about it either. I check Mapillary: no images of that street. Any ideas what else to try?

I could leave a note, but if this is done on a large scale, wouldn’t that effectively be the same as a bot creating the notes?

1 Like