Bot edit proposal: update wikidata tag redirects, where updated value would match present wikipedia tag

How would you feel about automatically updating the wikidata tag in cases where it points to a redirect?

It would be limited to the USA and to cases where a wikipedia tag is present and matches the redirect target (the internal report id is "wikipedia wikidata mismatch - follow wikidata redirect").
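
The eligibility rule above could be sketched roughly as follows. This is only an illustration, not the actual bot code: the function name and the `redirects` and `sitelinks` mappings are hypothetical stand-ins for data that would be fetched from Wikidata, and the item IDs and article title are made up.

```python
from typing import Optional


def proposed_wikidata_value(
    tags: dict[str, str],
    redirects: dict[str, str],
    sitelinks: dict[str, set[str]],
) -> Optional[str]:
    """Return the new wikidata value, or None if the element is left alone.

    redirects: hypothetical map from a redirecting item ID to its target ID.
    sitelinks: hypothetical map from an item ID to its "lang:Title" sitelinks.
    """
    wikidata = tags.get("wikidata")
    wikipedia = tags.get("wikipedia")
    if wikidata is None or wikipedia is None:
        return None  # both tags must be present
    target = redirects.get(wikidata)
    if target is None:
        return None  # the wikidata tag does not point to a redirect
    if wikipedia not in sitelinks.get(target, set()):
        return None  # the wikipedia tag does not match the redirect target
    return target


# Made-up data for illustration only.
redirects = {"Q1000001": "Q1000002"}
sitelinks = {"Q1000002": {"en:Example Lake"}}
tags = {"wikidata": "Q1000001", "wikipedia": "en:Example Lake"}
print(proposed_wikidata_value(tags, redirects, sitelinks))  # prints Q1000002
```

A real implementation would also need to normalize titles (for example underscores versus spaces) before comparing; that step is left out here.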

See changeset 143409952 on OpenStreetMap for an example of such an edit, done as part of an approved bot edit in Poland.

See the way history of Haven Lake (198517296) on OpenStreetMap for a manual edit of this type in the USA.

Edits would be grouped, with one changeset per US state. The edit would be repeated if new cases are found or appear in the future.
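
As a minimal sketch of the grouping, assuming each candidate edit already carries a `state` value (that key, the other keys and the sample data are hypothetical, not taken from the actual tool):

```python
from collections import defaultdict


def group_by_state(candidates: list[dict]) -> dict[str, list[dict]]:
    """Group candidate edits so that each state gets its own changeset."""
    groups = defaultdict(list)
    for candidate in candidates:
        groups[candidate["state"]].append(candidate)
    return dict(groups)


# Made-up candidates for illustration only.
candidates = [
    {"state": "Texas", "osm_id": "way/1", "new_wikidata": "Q1000002"},
    {"state": "California", "osm_id": "node/2", "new_wikidata": "Q1000004"},
    {"state": "Texas", "osm_id": "node/3", "new_wikidata": "Q1000006"},
]
for state, edits in sorted(group_by_state(candidates).items()):
    print(f"{state}: one changeset with {len(edits)} edit(s)")
```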

https://matkoniecz.github.io/OSM-wikipedia-tag-validator-reports/USA%20-%20obvious.html#wikipedia%20wikidata%20mismatch%20-%20follow%20wikidata%20redirect has a sample of the edits (not all states are processed by OSM-wikipedia-tag-validator-reports right now, but that can change in the future).

It appears that most cases are related to the cleanup of Cebuano botpedia-related entries on Wikidata.

This was also discussed in Slack.

I agree with your proposed edits. There’s little to no risk of the redirect ever getting deleted, but Overpass queries don’t have the necessary context to resolve these redirects automatically. (Sophox and the Wikidata Query Service can, but you have to remember to add /owl:sameAs? to the property path.)

That said, I think we should proceed carefully with anything more aggressive based on Wikipedia article titles, because Wikipedia articles intentionally conflate related topics when the topics aren’t notable enough on their own. Sometimes Wikipedians have incorrectly extended this approach to Wikidata out of habit, merging two items together even though they are distinct concepts.

For example, your first sample edit would replace Q107920115 with Campbell (Q34647950). This was the result of someone merging the two items together last year. Originally, one item representing Campbell as a populated place was imported from GNIS via the Cebuano Wikipedia, while another item representing Campbell as a census-designated place was imported from the U.S. Census via the English Wikipedia. In OSM, this is the difference between a place=hamlet node and a boundary=census relation. Ideally, the node should be linked to a different item than the relation, because many statements like GNIS Feature ID (P590) and inception (P571) necessarily differ between them.

I don’t blame anyone for wanting to deduplicate items that come from the Cebuano Wikipedia. The Wikidata community is actively looking into the problem. Nevertheless, Wikidata considers it good practice to unmerge items in cases where there are multiple overlapping concepts. In the past, some mappers have been critical of Wikidata precisely because it conflates a populated place with a government and with a statistical boundary.

Fortunately, your proposed edits won’t exacerbate the problem, but to the extent that Wikidata does deconflate these items in the future, we’ll need a way to keep track of cases where a place node links to an item about a CDP and a boundary relation links to an item about a populated place. Hopefully that’ll be easier for you to query in the future.

That may be doable. Are you (or someone else) interested in fixing cases from reports where one of the following holds?

  • place node links to an item about a CDP (which Wikidata item is that?)
  • place node links to an item about a CDP, not about a populated place
  • place node links to an item not about a populated place
  • boundary relation links to an item about a populated place (which Wikidata item is that?)
  • boundary relation links to an item about a populated place and CDP
  • boundary relation links to an item about a populated place which is not stated to be a CDP

I can try to add them to the reports, but it would take some time and is only worth doing if someone is invested in such cleanup.

(Also, https://matkoniecz.github.io/OSM-wikipedia-tag-validator-reports/ finds over 12 000 cases requiring human review in just part of the USA. Search for, say, USA/Texas/California; if the state you are interested in is not there, feel free to comment and I will enable it. See also MapRoulette if you prefer MR. MR is organized by error type, not by area, but you can start from a USA case and request that the next cases be nearby rather than random.)

That would be census-designated place in the United States (Q498162), as opposed to unincorporated community in the United States (Q17343829). Before a challenge like this would be feasible, someone would need to split out separate items about the unincorporated communities. However, it’s unclear to me if the Wikidata community wants this additional precision about unincorporated communities before similarly splitting incorporated place items. At the moment, cases that might be considered pedantic can already be modeled by dual-tagging the item as an instance of both a CDP and an unincorporated community, with qualifiers on various statements.
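
To make the distinction concrete, here is a rough sketch of how two of the report categories discussed above might be derived from an item's instance of (P31) values. It is an assumption that "populated place" can be approximated by Q17343829 alone; the function name, the `element_kind` labels and the category strings are hypothetical.

```python
from typing import Optional

CDP = "Q498162"  # census-designated place in the United States
UNINCORPORATED = "Q17343829"  # unincorporated community in the United States

PLACE_NODE_ON_CDP = "place node links to an item about a CDP, not about a populated place"
BOUNDARY_ON_COMMUNITY = "boundary relation links to an item about a populated place which is not stated to be a CDP"


def report_category(element_kind: str, instance_of: set[str]) -> Optional[str]:
    """Classify one OSM element given the P31 values of its linked item.

    element_kind: hypothetical label, "place_node" or "boundary_relation".
    Returns None when neither category applies, for example for a
    dual-tagged item that is both a CDP and an unincorporated community.
    """
    is_cdp = CDP in instance_of
    is_community = UNINCORPORATED in instance_of
    if element_kind == "place_node" and is_cdp and not is_community:
        return PLACE_NODE_ON_CDP
    if element_kind == "boundary_relation" and is_community and not is_cdp:
        return BOUNDARY_ON_COMMUNITY
    return None


print(report_category("place_node", {CDP}))
print(report_category("boundary_relation", {CDP, UNINCORPORATED}))
```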

Eventually, the distinction may get even blurrier: since last year, county and state governments have been allowed to define their own CDP boundaries and names without the Census Bureau’s input, which would presumably be based on the traditional notions of where an unincorporated community is located. I don’t know if we have a straightforward way of knowing which CDPs are created or modified through this process.

Anyhow, to be clear, CDPs are just one example of the kinds of questionable merges that Wikipedians sometimes perform on Wikidata. The nice thing about the redirects is that it’s easier to track down and undo this conflation, but we’d have to weigh it against the convenience of linking directly to the canonical item in other cases where items got merged correctly. Maybe it would help to somehow quantify how many of these redirects belong to certain classes of items.
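
One way to quantify this, sketched under the assumption that the instance of (P31) values of each redirect target have already been collected into a mapping (the function name, the mapping and the item IDs used as keys are made up for illustration):

```python
from collections import Counter


def count_redirect_targets_by_class(target_classes: dict[str, set[str]]) -> Counter:
    """Count how many redirect targets fall into each P31 class.

    target_classes: hypothetical map from a redirect target's item ID to
    its P31 values. An item with several classes is counted once per class.
    """
    counts = Counter()
    for classes in target_classes.values():
        counts.update(classes or {"(no class stated)"})
    return counts


# Made-up data for illustration only.
target_classes = {
    "Q1000002": {"Q498162"},
    "Q1000004": {"Q498162", "Q17343829"},
    "Q1000006": set(),
}
for item_class, count in count_redirect_targets_by_class(target_classes).most_common():
    print(item_class, count)
```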