Is there a tool to compare OSM and Wikimedia Commons coordinates?

On OpenStreetMap, it is possible to link Wikimedia Commons files using this tag: wikimedia_commons=File:xyz.jpg

Is there a way/tool to check for files whose coordinates differ significantly between Wikimedia Commons and OpenStreetMap? Such a tool could be used for quality assurance, as it may indicate either 1. that someone has linked the wrong image on OSM or 2. that the coordinates on Wikimedia Commons need to be corrected.

An example: Way: Santuario de la Naturaleza Rocas de Constitución (788001966) | OpenStreetMap
This way links to an image whose coordinates are almost 1.880 km away! (Here Wikimedia is wrong.)

Another example: Way History: Passage de l'Ancre (263190246) | OpenStreetMap
Here a building passage in Paris links to an image of a building passage in Narbonne, 620 km away. (Here OSM is wrong.)
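The core check such a tool would perform is just a great-circle distance between the two coordinate pairs. A minimal sketch in Python, where the city-centre coordinates and the 1 km threshold are illustrative assumptions, not values taken from either project:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def flag_mismatch(osm_coord, commons_coord, threshold_km=1.0):
    """Flag a wikimedia_commons link whose OSM and Commons coordinates
    differ by more than the tolerance threshold."""
    return haversine_km(*osm_coord, *commons_coord) > threshold_km

# Illustrative city-centre coordinates for the Paris/Narbonne example above.
paris = (48.8566, 2.3522)
narbonne = (43.1837, 3.0042)
print(flag_mismatch(paris, narbonne))  # prints True: a ~630 km gap far exceeds 1 km
```

A real tool would feed in each OSM element's coordinate and the geotag of the file named in its wikimedia_commons=* tag, then report only the pairs above the threshold.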

[This post was crossposted on Wikimaps]

P.S. If you are interested in Wikimedia Commons QA, I sometimes update these MapRoulette challenges: Check and fix wikimedia_commons syntax and Check Commons files that don’t exist.

Those coordinates are not the same: in OSM we record the place where something is, while for photographs you usually record the position from which the picture was taken (but they also shouldn’t be thousands of kilometers apart, I agree).


That’s why I specified “differ significantly”. You would need a really good camera to capture Narbonne from Paris! :stuck_out_tongue: Of course such a tool would need a tolerance threshold.


[Key:wikimedia_commons - OpenStreetMap Wiki]:

Adding wikimedia_commons=* to an object that has wikidata=* can be considered redundant because Wikidata itself often links to Wikimedia Commons. The vast majority of less experienced (newbies and occasional) users however is way more likely to understand the concept and importance of Wikimedia Commons than Wikidata. Even for experienced users, the human-readable value of wikimedia_commons=* is easier to interpret.

I don’t think it’s useful to invest effort in maintaining and curating yet another redundant tag, but your mileage may vary.

The examples I gave happen to have a wikidata=* tag, but wikimedia_commons=* is used by a lot of users on elements that are not notable enough for Wikidata entries (e.g. guideposts). Also, I know most people wouldn’t be happy to mass-delete wikimedia_commons=* from elements that have wikidata=*, so it would be cool at least to detect and fix the wrong ones.

An example: this element, Node: 11210057115 | OpenStreetMap, has the image of another element, Node: 11200420856 | OpenStreetMap, which is 900 m away and isn’t notable enough for a Wikidata entry.


It’s only redundant if the object has a wikidata entry though? Is it likely that most wikimedia commons images used in OSM are linked from wikidata? E.g. how would an individual hiking guidepost get linked from wikidata?


Wikidata entries almost always have a link to corresponding Commons category, and often a selected photo from it. Those entries exist only for, let’s say, “notable” features; an ordinary hiking trail or a guidepost seldom has one. So ok, it makes sense to link an individual Commons image from an OSM object (similar as a Mapillary photo).

Taginfo, however, reveals that out of 211,000 objects tagged with wikimedia_commons, 118,500 also have wikidata, presumably redundantly. But fair enough, that still leaves some 100,000 which are missing wikidata, for one reason or another.

It could at least be checked whether the tag is actually redundant.

In theory, Sophox and QLever would be the perfect tools for the job. But it’s a bit more difficult than it should be. I haven’t figured it out yet, but maybe this is enough information for you to get to the finish line.

Like Wikidata, Wikimedia Commons has its own SPARQL endpoint. For example, you can query for geotagged images based on the location of either the camera or the depicted object. This relies on each file’s description page to be tagged with structured data. In theory, you could write a federated query in Sophox that nests a Commons query inside an OSM query. Unfortunately, it’s currently difficult to federate the Commons endpoint with another SPARQL endpoint, because it requires an OAuth token, presumably to prevent abuse by image scrapers. If not for that limitation, this Sophox query might stand a chance of returning OSM elements joined with Commons coordinates.
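For illustration, here is one way to shape such an authenticated request in Python. The endpoint URL and the bearer-token header reflect my reading of the Commons Query Service setup and should be verified; the query itself just pulls files carrying a coordinates of depicted place (P9149) statement, assuming the usual Wikidata wdt: prefix is predeclared:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Commons SPARQL endpoint (assumed URL; it requires an OAuth token, as noted above).
ENDPOINT = "https://commons-query.wikimedia.org/sparql"

def depicted_place_query(limit=10):
    """SPARQL for Commons files with a coordinates-of-depicted-place (P9149) statement."""
    return f"SELECT ?file ?coord WHERE {{ ?file wdt:P9149 ?coord . }} LIMIT {limit}"

def build_request(oauth_token, limit=10):
    """Prepare (but do not send) the authenticated HTTP request."""
    qs = urlencode({"query": depicted_place_query(limit), "format": "json"})
    return Request(f"{ENDPOINT}?{qs}",
                   headers={"Authorization": f"Bearer {oauth_token}"})

req = build_request("YOUR_OAUTH_TOKEN")  # placeholder token, not a real credential
print(req.full_url.split("?")[0])  # prints https://commons-query.wikimedia.org/sparql
```

A federated Sophox query would nest something like the SELECT above inside a SERVICE clause, which is exactly the step the OAuth requirement currently blocks.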

There’s an older way to geotag Commons files, by adding a {{Location}} template to the file description page. Fortunately, Commons has a MediaWiki extension installed that exposes these templates as part of the MediaWiki API. For example, you can query for pages by title with their distance from a given coordinate, or for pages within a certain distance of a given coordinate. Sophox can call the MediaWiki API directly, but unfortunately not for these particular queries. If it did, we could even get the geotagged coordinates without a distance, then use the around service to filter the results within Sophox.
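As a sketch of the second kind of request (pages within a certain distance of a coordinate), this builds a GeoData geosearch URL against the Commons API. The parameter names follow my understanding of the GeoData extension and are worth double-checking against the live API; the coordinate is illustrative:

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"

def geosearch_url(lat, lon, radius_m=500, limit=50):
    """Build a GeoData geosearch URL for files near a point."""
    params = {
        "action": "query",
        "list": "geosearch",
        "gscoord": f"{lat}|{lon}",   # latitude|longitude
        "gsradius": radius_m,        # search radius in metres
        "gsnamespace": 6,            # restrict to the File: namespace
        "gslimit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"

# Illustrative coordinate, not taken from the examples above.
print(geosearch_url(48.86, 2.35, radius_m=250))
```

If Sophox could call this endpoint, the returned file titles and coordinates could then be filtered against the OSM element's position, as described above.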

QLever indexes both OSM and Wikimedia Commons in the same triplestore, which in theory would make it possible to join the two datasets more performantly without leaving QLever. This already works quite well for OSM–Wikidata and OpenHistoricalMap–Wikidata queries, because OSM and OHM data have been postprocessed to resolve wikidata=* and wikipedia=* tags to the referenced entities. Unfortunately, wikimedia_commons=* isn’t being resolved yet:

As a workaround, we can transform the wikimedia_commons=* tag value into a URL and match it against the URLs of Commons files. Unfortunately, merely querying for any files with their coordinates and URLs produces an out-of-memory error, likely because the 27.3 million coordinates of the point of view (P1259) statements are far too many to simply join on another property without more filtering. I assume some bot went around adding this property to every image with EXIF coordinates. Fortunately, coordinates of depicted place (P9149) appears on a much more manageable 8.2 million files. This modified query still times out, but at least it doesn’t balk right away. We could probably filter the OSM part of the query to certain kinds of features or other criteria to make it more performant.
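The tag-to-URL transformation can be sketched like this; whether the canonical file-page URL form below matches the exact IRIs QLever stores for Commons files is an assumption to verify, and the filenames are made up for illustration:

```python
from urllib.parse import quote

def commons_tag_to_url(tag_value):
    """Turn a wikimedia_commons=* value such as 'File:Foo bar.jpg' into a
    Commons page URL: spaces become underscores, the rest is percent-encoded.
    The colon in the 'File:' prefix is kept as-is."""
    return "https://commons.wikimedia.org/wiki/" + quote(tag_value.replace(" ", "_"), safe=":")

# Hypothetical filenames for illustration:
print(commons_tag_to_url("File:Example photo.jpg"))
# prints https://commons.wikimedia.org/wiki/File:Example_photo.jpg
print(commons_tag_to_url("File:Café.jpg"))
# prints https://commons.wikimedia.org/wiki/File:Caf%C3%A9.jpg
```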