Is there a tool to compare OSM and Wikimedia Commons coordinates?

On OpenStreetMap, it is possible to link Wikimedia Commons files using this tag: wikimedia_commons=File:xyz.jpg

Is there a way/tool to check for files whose coordinates differ significantly between Wikimedia Commons and OpenStreetMap? Such a tool could be used for quality assurance, as it may indicate either 1. that someone has linked the wrong image on OSM or 2. that the coordinates on Wikimedia Commons need to be corrected.

An example: Way: Santuario de la Naturaleza Rocas de Constitución (788001966) | OpenStreetMap
This way links to an image whose coordinates are almost 1.880 km away! (Here Wikimedia is wrong.)

Another example: Way History: Passage de l'Ancre (263190246) | OpenStreetMap
Here a building passage in Paris links to an image of a building passage in Narbonne, 620 km away. (Here OSM is wrong.)
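The core check such a tool would perform is just a great-circle distance between the two coordinate pairs. A minimal sketch in Python, where the city-centre coordinates and the 1 km threshold are illustrative assumptions, not values taken from either project:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def flag_mismatch(osm_coord, commons_coord, threshold_km=1.0):
    """Flag a wikimedia_commons link whose OSM and Commons coordinates
    differ by more than the tolerance threshold."""
    return haversine_km(*osm_coord, *commons_coord) > threshold_km

# Illustrative city-centre coordinates for the Paris/Narbonne example above.
paris = (48.8566, 2.3522)
narbonne = (43.1837, 3.0042)
print(flag_mismatch(paris, narbonne))  # prints True: a ~630 km gap far exceeds 1 km
```

A real tool would feed in each OSM element's coordinate and the geotag of the file named in its wikimedia_commons=* tag, then report only the pairs above the threshold.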

[This post was crossposted on Wikimaps]

P.S. If you are interested in Wikimedia Commons QA, I sometimes update these MapRoulette challenges: Check and fix wikimedia_commons syntax and Check Commons files that don’t exist.

Those coordinates are not the same: in OSM we record the place where something is, while for photographs you usually record the position from which the picture was taken (but they also shouldn’t be thousands of kilometers apart, I agree).


That’s why I specified “differ significantly”. You would need a really good camera to capture Narbonne from Paris! :stuck_out_tongue: Of course such a tool would need a tolerance threshold.


[Key:wikimedia_commons - OpenStreetMap Wiki]:

Adding wikimedia_commons=* to an object that has wikidata=* can be considered redundant because Wikidata itself often links to Wikimedia Commons. The vast majority of less experienced (newbies and occasional) users however is way more likely to understand the concept and importance of Wikimedia Commons than Wikidata. Even for experienced users, the human-readable value of wikimedia_commons=* is easier to interpret.

I don’t think it’s useful to invest effort in maintaining and curating yet another redundant tag, but your mileage may vary.

The examples I gave happen to have a wikidata=* tag, but wikimedia_commons=* is used by a lot of users on elements that are not notable enough for Wikidata entries (e.g. guideposts). Also, I know most people wouldn’t be happy to mass-delete wikimedia_commons=* from elements that have wikidata=*, so it would be cool at least to detect and fix the wrong ones.

An example: this element, Node: 11210057115 | OpenStreetMap, has the image of another element, Node: 11200420856 | OpenStreetMap, which is 900 m away and isn’t notable enough for a Wikidata entry.


It’s only redundant if the object has a wikidata entry though? Is it likely that most wikimedia commons images used in OSM are linked from wikidata? E.g. how would an individual hiking guidepost get linked from wikidata?


Wikidata entries almost always have a link to corresponding Commons category, and often a selected photo from it. Those entries exist only for, let’s say, “notable” features; an ordinary hiking trail or a guidepost seldom has one. So ok, it makes sense to link an individual Commons image from an OSM object (similar as a Mapillary photo).

Taginfo, however, reveals that out of 211,000 objects tagged with wikimedia_commons, 118,500 also have wikidata, presumably redundantly. But fair enough, that still leaves some 100,000 which are missing wikidata, for one reason or another.

It could at least be checked whether the tag is actually redundant.

In theory, Sophox and QLever would be the perfect tools for the job. But it’s a bit more difficult than it should be. I haven’t figured it out yet, but maybe this is enough information for you to get to the finish line.

Like Wikidata, Wikimedia Commons has its own SPARQL endpoint. For example, you can query for geotagged images based on the location of either the camera or the depicted object. This relies on each file’s description page to be tagged with structured data. In theory, you could write a federated query in Sophox that nests a Commons query inside an OSM query. Unfortunately, it’s currently difficult to federate the Commons endpoint with another SPARQL endpoint, because it requires an OAuth token, presumably to prevent abuse by image scrapers. If not for that limitation, this Sophox query might stand a chance of returning OSM elements joined with Commons coordinates.
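For illustration, here is one way to shape such an authenticated request in Python. The endpoint URL and the bearer-token header reflect my reading of the Commons Query Service setup and should be verified; the query itself just pulls files carrying a coordinates of depicted place (P9149) statement, assuming the usual Wikidata wdt: prefix is predeclared:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Commons SPARQL endpoint (assumed URL; it requires an OAuth token, as noted above).
ENDPOINT = "https://commons-query.wikimedia.org/sparql"

def depicted_place_query(limit=10):
    """SPARQL for Commons files with a coordinates-of-depicted-place (P9149) statement."""
    return f"SELECT ?file ?coord WHERE {{ ?file wdt:P9149 ?coord . }} LIMIT {limit}"

def build_request(oauth_token, limit=10):
    """Prepare (but do not send) the authenticated HTTP request."""
    qs = urlencode({"query": depicted_place_query(limit), "format": "json"})
    return Request(f"{ENDPOINT}?{qs}",
                   headers={"Authorization": f"Bearer {oauth_token}"})

req = build_request("YOUR_OAUTH_TOKEN")  # placeholder token, not a real credential
print(req.full_url.split("?")[0])  # prints https://commons-query.wikimedia.org/sparql
```

A federated Sophox query would nest something like the SELECT above inside a SERVICE clause, which is exactly the step the OAuth requirement currently blocks.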

There’s an older way to geotag Commons files, by adding a {{Location}} template to the file description page. Fortunately, Commons has a MediaWiki extension installed that exposes these templates as part of the MediaWiki API. For example, you can query for pages by title with their distance from a given coordinate, or for pages within a certain distance of a given coordinate. Sophox can call the MediaWiki API directly, but unfortunately not for these particular queries. If it did, we could even get the geotagged coordinates without a distance, then use the around service to filter the results within Sophox.
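As a sketch of the second kind of request (pages within a certain distance of a coordinate), this builds a GeoData geosearch URL against the Commons API. The parameter names follow my understanding of the GeoData extension and are worth double-checking against the live API; the coordinate is illustrative:

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"

def geosearch_url(lat, lon, radius_m=500, limit=50):
    """Build a GeoData geosearch URL for files near a point."""
    params = {
        "action": "query",
        "list": "geosearch",
        "gscoord": f"{lat}|{lon}",   # latitude|longitude
        "gsradius": radius_m,        # search radius in metres
        "gsnamespace": 6,            # restrict to the File: namespace
        "gslimit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"

# Illustrative coordinate, not taken from the examples above.
print(geosearch_url(48.86, 2.35, radius_m=250))
```

If Sophox could call this endpoint, the returned file titles and coordinates could then be filtered against the OSM element's position, as described above.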

QLever indexes both OSM and Wikimedia Commons in the same triplestore, which in theory would make it possible to join the two datasets more performantly without leaving QLever. This already works quite well for OSM–Wikidata and OpenHistoricalMap–Wikidata queries, because OSM and OHM data have been postprocessed to resolve wikidata=* and wikipedia=* tags to the referenced entities. Unfortunately, wikimedia_commons=* isn’t being resolved yet:

As a workaround, we can transform the wikimedia_commons=* tag value into a URL and match it against the URLs of Commons files. Unfortunately, merely querying for any files with their coordinates and URLs produces an out-of-memory error, likely because the 27.3 million coordinates of the point of view (P1259) statements are far too many to simply join on another property without more filtering. I assume some bot went around adding this property to every image with EXIF coordinates. Fortunately, coordinates of depicted place (P9149) appears on a much more manageable 8.2 million files. This modified query still times out, but at least it doesn’t balk right away. We could probably filter the OSM part of the query to certain kinds of features or other criteria to make it more performant.
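The tag-to-URL transformation can be sketched like this; whether the canonical file-page URL form below matches the exact IRIs QLever stores for Commons files is an assumption to verify, and the filenames are made up for illustration:

```python
from urllib.parse import quote

def commons_tag_to_url(tag_value):
    """Turn a wikimedia_commons=* value such as 'File:Foo bar.jpg' into a
    Commons page URL: spaces become underscores, the rest is percent-encoded.
    The colon in the 'File:' prefix is kept as-is."""
    return "https://commons.wikimedia.org/wiki/" + quote(tag_value.replace(" ", "_"), safe=":")

# Hypothetical filenames for illustration:
print(commons_tag_to_url("File:Example photo.jpg"))
# prints https://commons.wikimedia.org/wiki/File:Example_photo.jpg
print(commons_tag_to_url("File:Café.jpg"))
# prints https://commons.wikimedia.org/wiki/File:Caf%C3%A9.jpg
```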