Is there a way to discover broken wikimedia_commons=Category:* links?

Hi, we have tools to find broken URLs (such as Keep Right), and tools to find broken wikimedia_commons=File:* links (such as Category:Commons pages with broken file links), but is there a way to check for wikimedia_commons=Category:* values that link to non-existent categories (so we can fix them)?

Are you looking for broken links on the wiki, or in OSM data?

In OSM data. Let’s say someone made a typo, or the category has been deleted; is there a way to discover that and fix the value in OSM?


I don’t know of any automated way, but you can use Overpass to find all objects and then write a script to check them:

[out:csv(::type, ::id, wikimedia_commons)][timeout:120];
(
  node["wikimedia_commons"~"Category:"];
  way["wikimedia_commons"~"Category:"];
  relation["wikimedia_commons"~"Category:"];
);
out body;

And as a script something like this:

#!/usr/bin/env bash
# Read the saved Overpass CSV (tab-separated) line by line.
while IFS='' read -r line; do
  category=$(cut -f 3 <<<"${line}")
  # Skip the header line that Overpass adds to the CSV by default.
  if [ "${category}" = "wikimedia_commons" ]; then
    continue
  fi
  # Some exports quote fields that contain special characters; strip the quoting with jq.
  if [ "${category:0:1}" = '"' ]; then
    category=$(jq -r . <<<"${category}")
  fi
  # A 404 from Commons means the category page does not exist.
  code=$(curl -s -o /dev/null -I -w "%{http_code}" "https://commons.wikimedia.org/wiki/${category}")
  if [ "${code}" -eq 404 ]; then
    echo "Broken: ${line}"
  fi
done < "interpreter.csv"

Where interpreter.csv is your saved Overpass data. Not very efficient, but maybe worth a shot?
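To produce interpreter.csv in the first place, one option is to post the query to the public Overpass API. Here is a minimal sketch, assuming the query above is saved in a file named query.overpass (a made-up filename):

#!/usr/bin/env python3
# Sketch: fetch the Overpass result and save it as interpreter.csv.
# 'query.overpass' is an assumed filename holding the query shown above.
import urllib.parse
import urllib.request

with open('query.overpass') as fh:
    query = fh.read()

req = urllib.request.Request(
    'https://overpass-api.de/api/interpreter',
    data=urllib.parse.urlencode({'data': query}).encode('utf-8'),
)
with urllib.request.urlopen(req) as resp, open('interpreter.csv', 'wb') as out:
    out.write(resp.read())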

I’m not sure if I did something wrong, but it seems like the script is giving me back valid categories that contain umlauts, so most of the German categories:

[screenshot of the script output]

Looks like you got a semicolon-separated file, not a tab-separated one. In that case cut -f 3 <<<"${line}" becomes cut -d ';' -f 3 <<<"${line}"

Getting all wikimedia_commons=Category:* values from OpenStreetMap using osmium and a .pbf file is a matter of:

osmium tags-filter area.osm.pbf -R nwr/wikimedia_commons=Category* -o commons_Category.opl

Then I typically use a python script to parse this output text file:

#!/usr/bin/python3

import re

categories = []

# https://osmcode.org/opl-file-format/#encoding
def unencode_opl_char(match):
    """Given a match of r'%([0-9A-Fa-f]+)%', return the corresponding Unicode character."""
    return chr(int(match.group(1), 16))

fn = 'commons_Category.opl'
with open(fn, 'r') as fh:
    for line in fh:
        # n1300148235 v3 dV c68983837 t2019-04-07T18:39:20Z i207581 uHjart Thistoric=memorial,name=Seemannsgrab,wikimedia_commons=Category:Seemannsgrab x9.6021629 y54.8759099
        opl_match = re.search(r'[,T]wikimedia_commons=Category:([^ ,]+)', line)
        if opl_match is None:
            continue
        # Decode OPL %xxxx% escapes back to Unicode characters.
        category = re.sub(r'%([0-9A-Fa-f]+)%', unencode_opl_char, opl_match.group(1))
        if category not in categories:
            categories.append(category)

print('Found', len(categories), 'categories')

For a planet file, that currently gives back 41617 categories.

You should take care to verify all of those without overloading the MediaWiki API.
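One way to keep the load down (just a sketch, not something from this thread) is to ask the MediaWiki API about the titles in batches of up to 50 per request and pause between requests; this assumes the categories list built by the script above.

#!/usr/bin/env python3
# Sketch: check category existence against the MediaWiki API in batches of 50,
# with a pause between requests. Assumes the 'categories' list built above.
import json
import time
import urllib.parse
import urllib.request

API = 'https://commons.wikimedia.org/w/api.php'

def missing_titles(titles):
    """Return the titles in this batch that the API reports as missing."""
    params = urllib.parse.urlencode({
        'action': 'query',
        'format': 'json',
        'formatversion': 2,
        'titles': '|'.join(titles),  # up to 50 titles per request
    })
    with urllib.request.urlopen(API + '?' + params) as resp:
        pages = json.load(resp)['query']['pages']
    # The API normalizes titles (e.g. '_' to ' '), so it returns the normalized form.
    return [p['title'] for p in pages if p.get('missing')]

broken = []
for i in range(0, len(categories), 50):
    batch = ['Category:' + c for c in categories[i:i + 50]]
    broken.extend(missing_titles(batch))
    time.sleep(1)  # be polite to the API

print('Broken:', broken)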

Downloading the all-titles dump first at dumps.wikimedia.org may be more efficient than many curls.
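As a sketch of that approach (assuming the usual layout of commonswiki-latest-all-titles.gz: tab-separated page_namespace and page_title columns, underscores instead of spaces, categories in namespace 14, plus the categories list from the script above):

#!/usr/bin/env python3
# Sketch: compare the extracted categories against the commonswiki all-titles dump
# instead of making one request per category. Assumes two tab-separated columns
# (page_namespace, page_title) with underscores instead of spaces.
import gzip

CATEGORY_NAMESPACE = '14'

existing = set()
with gzip.open('commonswiki-latest-all-titles.gz', 'rt', encoding='utf-8') as fh:
    for line in fh:
        fields = line.rstrip('\n').split('\t')
        if len(fields) == 2 and fields[0] == CATEGORY_NAMESPACE:
            existing.add(fields[1])

for category in categories:
    # The dump uses underscores where the OSM tag usually has spaces.
    if category.replace(' ', '_') not in existing:
        print('Broken: Category:' + category)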


FYI: I wrote CommonsChecker4OSM to find broken wikimedia_commons links using osmium and the Wikimedia data dumps, and created MapRoulette challenges for Commons categories and files.
