Proposed bulk removal of brand:wikipedia tag from United States POIs

Not really, I have not processed the entire USA for https://matkoniecz.github.io/OSM-wikipedia-tag-validator-reports/ .

Yep, this highlights the challenge of having more than one link into wiki-land, you need scripts and tools to keep it all in sync. I spent the last week building tools to synchronize US boundaries with wikidata and it was quite a bit of work. And, it’s the kind of work I’d rather not have anyone unnecessarily waste time doing. Thankfully we have Name Suggestion Index to automatically handle brand and brand:wikidata to keep them in sync with little user intervention.


This is the kind of question that a SPARQL engine like Sophox or QLever should be able to answer within a matter of seconds using a federated query. However, Sophox is in disrepair, and for now I’m not very familiar with QLever, which stores and queries OSM data rather differently than Sophox, despite the common query language.

Based on taginfo, it’s not only used by the iD editor but also by Vespucci and JOSM user presets (primarily Russian and Korean).
https://taginfo.openstreetmap.org/keys/brand:wikipedia#projects

Based on OSM Changeset data, we can identify editors in the USA who did not use the iD editor to add such tags (brand:wikipedia) and seek their opinions.

At least Vespucci’s developer doesn’t seem to be a fan:


For context, brand:wikipedia was added begrudgingly in response to concerns that brand:wikidata=* wasn’t human-readable enough, and that brand=* was too ambiguous. Later, for consistency, the same pattern was extended to operator/network/flag:wikipedia=* to complement operator/network/flag:name=*.

In particular, there was concern about multiple brands sharing the same name in different geographies. This happens a lot with the names of banks, for example. For technical reasons, the Wikipedia article about each brand would have a unique title, even if both brands would have the same brand=*. In theory, this makes it easier to verify that the brand:wikidata=* tag corresponds to the correct brand.

Since then, name-suggestion-index has been able to clean up many local and regional brands that had originally been classified as global brands, making it much harder for mappers to accidentally tag the wrong brand based on its shared name. It has also gained the ability to scope a brand to a specific geometry rather than a whole country. For example, the scope of the multinational Burger King fast food chain excludes the famously unaffiliated Burger King in Mattoon, Illinois. Finally, iD added support for the not:brand:wikidata=* key so that mappers can affirm a very subtle distinction that would otherwise escape notice.

Since the beginning, the name-suggestion-index developers have caught a lot of flak for the inclusion of brand:wikipedia=* tags in its presets. Many mappers view the presence of three different keys to track the same information as a bit excessive. brand:wikipedia=* is the least stable of the three keys, since Wikipedia doesn’t consider its article titles to be even somewhat stable identifiers.

The raw brand:wikipedia=* values were never particularly good at telling you whether the brand=* and brand:wikidata=* referred to the right brand anyways. At the time that name-suggestion-index removed *:wikipedia=* from its presets, two-thirds of the presets (11,810 of 17,992) had brand/operator/network/flag:name=* values that didn’t match their *:wikipedia=* values. Excluding disambiguators in parentheses in the Wikipedia article titles, that still comes to 61% (10,958).

Instructions for reproducing this analysis
git clone https://github.com/osmlab/name-suggestion-index.git
cd name-suggestion-index
git checkout 82b4751e6c141bf112656423d5c99863e4247b0b^
npm install
npm run build
jq '.presets | map(.addTags) | map((.brand // .operator // .network // .["flag:name"]) as $name | (.["brand:wikipedia"] // .["operator:wikipedia"] // .["network:wikipedia"] // .["flag:wikipedia"]) as $wikipedia | select($wikipedia) | select($name != ($wikipedia | split(":")[1]))) | length' dist/presets/nsi-id-presets.json
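
For anyone who would rather not parse the jq one-liner, the same comparison can be sketched in Python. This is only an illustration, not the script behind the numbers above: the function names are hypothetical, the sample presets are invented rather than taken from name-suggestion-index, and it assumes the comparison is against the *:wikipedia keys (as the prose describes) with values of the form "lang:Title". It also shows the variant that ignores a trailing disambiguator in parentheses.

```python
import re

NAME_KEYS = ("brand", "operator", "network", "flag:name")
WIKIPEDIA_KEYS = ("brand:wikipedia", "operator:wikipedia",
                  "network:wikipedia", "flag:wikipedia")

def first_present(tags, keys):
    """Return the value of the first key in `keys` that has a non-empty value."""
    for key in keys:
        if tags.get(key):
            return tags[key]
    return None

def article_title(wikipedia_value, strip_disambiguator=False):
    """Drop the language prefix; optionally drop a trailing '(disambiguator)'."""
    # Split on the first colon only, so titles that contain colons survive.
    _, _, title = wikipedia_value.partition(":")
    if strip_disambiguator:
        title = re.sub(r"\s*\([^()]*\)$", "", title)
    return title

def count_mismatches(presets, strip_disambiguator=False):
    """Return (mismatched, total) over presets that carry a *:wikipedia tag."""
    total = mismatched = 0
    for tags in presets:
        wikipedia = first_present(tags, WIKIPEDIA_KEYS)
        if wikipedia is None:
            continue
        total += 1
        if first_present(tags, NAME_KEYS) != article_title(wikipedia, strip_disambiguator):
            mismatched += 1
    return mismatched, total

# Invented sample data, for illustration only.
sample = [
    {"brand": "Burger King", "brand:wikipedia": "en:Burger King"},
    {"brand": "Subway", "brand:wikipedia": "en:Subway (restaurant)"},
    {"brand": "Circle K", "brand:wikipedia": "en:Alimentation Couche-Tard"},
    {"brand": "No Article"},
]
print(count_mismatches(sample))                            # (2, 3)
print(count_mismatches(sample, strip_disambiguator=True))  # (1, 3)
```

The second call corresponds to the "excluding disambiguators" figure: the Subway-style entry stops counting as a mismatch, while the brand-versus-parent-company entry still does.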

Many of these discrepancies arose because Wikipedia chooses to conflate some brands with the company that owns or operates the brand, or even with the company that historically owned the brand before selling it off. This is especially common with oil distribution companies and convenience store companies. In these cases, Wikidata would ideally have a more specific item about the brand proper. In the meantime, the brand=* tag indicates what OSM prefers to record on the feature. When Wikidata splits out a brand item, name-suggestion-index sometimes needs to replace the *:wikidata=* tag, and it previously would remove the *:wikipedia=* tag at the same time, since there’s no exact match on Wikipedia. In other words, one way or another, these *:wikipedia=* tags would become irrelevant anyways.


While the eventual removal of brand:wikipedia seems like a good idea, does anybody have some feel (or numbers!) for how common it is to have a correct brand:wikipedia and an incorrect brand:wikidata at the same time? That’s a hard-to-detect case where this kind of edit would resolve an inconsistency in favour of a falsehood.

I have no idea how common such a thing would be, but at least data loss can be imagined.


According to QLever, out of the 385,369 POIs in the U.S. that are tagged with both brand:wikipedia=en:* and brand:wikidata=*, 24,825 or 6.4% have brand:wikipedia=en:* tags that don’t match the canonical title of the English Wikipedia article per brand:wikidata=*.

The mismatches are geographically well-distributed and concentrated in populous cities:

Some caveats:

  • The error rate only counts features tagged with brands that are listed in name-suggestion-index. Unlisted brands are less likely to have directly relevant Wikipedia articles anyways.
  • The error rate doesn’t account for the 1,831 instances of brand:wikipedia=* referring to articles on non-English editions of Wikipedia.
  • The error rate doesn’t account for features that are tagged with brand:wikipedia=* even though brand:wikidata=* refers to a brand that lacks a Wikipedia article.
  • This query doesn’t resolve Wikipedia article redirects; it only compares brand:wikipedia=* to the canonical article title. Resolving redirects would likely decrease the error rate dramatically.
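
The comparison behind these caveats can be sketched in Python. This is a hypothetical illustration, not the QLever query itself: the function names, the canonical-title argument, and the redirect table are all assumptions, and the sample values are for illustration only. It treats underscores as spaces and ignores the case of the first letter, as English Wikipedia does for page titles, and it shows why resolving redirects would lower the measured mismatch rate.

```python
def normalize_title(title):
    """Treat underscores as spaces and capitalize the first letter."""
    title = title.replace("_", " ").strip()
    return title[:1].upper() + title[1:]

def matches_canonical(tag_value, canonical_title, redirects=None):
    """Compare a brand:wikipedia value with the canonical English article title.

    Returns None when the tag is not an en: link, since such tags are
    outside the scope of the comparison. `redirects` is an optional,
    hypothetical mapping from redirect titles to their target titles.
    """
    lang, sep, title = tag_value.partition(":")
    if not sep or lang != "en":
        return None
    title = normalize_title(title)
    if redirects:
        title = redirects.get(title, title)
    return title == normalize_title(canonical_title)

# Illustrative redirect table; a real one would come from Wikipedia itself.
redirects = {"Kentucky Fried Chicken": "KFC"}

print(matches_canonical("en:Kentucky Fried Chicken", "KFC"))             # False
print(matches_canonical("en:Kentucky Fried Chicken", "KFC", redirects))  # True
print(matches_canonical("fr:KFC", "KFC"))                                # None
```

Without the redirect table, the first tag counts as a mismatch even though it points at the same article, which is the situation the last caveat describes.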

Beyond the statistics, a mapper is somewhat unlikely to intentionally change brand:wikipedia=* to something entirely unrelated but leave brand:wikidata=* untouched. iD would’ve warned about any departure from name-suggestion-index’s recommended tags up until NSI stopped recommending brand:wikipedia=*, and I think JOSM has a validator warning of some sort about mismatching Wikipedia and Wikidata tags.

(Thanks to @Danysan95 for help with these queries!)


Would it be better to have iD (and other editors) remove brand:wikipedia when a different NSI preset is applied? I don’t think it really needs to be a validator item that’s suggested for existing, unchanged POIs. I’m not opposed to doing away with it entirely via a mass edit though, and I remove it from POIs I edit.

Tangentially, is name:etymology:wikipedia the next ‘wikipedia’ tag in line for erasure, at 73K uses versus 1.2M for name:etymology:wikidata?

I only learned of this tag when someone came to my turf and planted it on an item I’d mapped. To boot, more than a few times I’ve found there to be a Wikipedia article but no Wikidata reference, and I’ve also seen people going around adding the Wikidata etymology reference where there was only the Wikipedia one, so the item ends up with both. So there are three variations:

name:etymology:wikidata
name:etymology:wikipedia
both

I’ve even seen a few name:etymology:description tags holding free text (some 10K+ uses), with no wiki page documenting the tag that I could find.


Just asking /OT

Given that it is easy for anyone to change a wikidata item, and less easy to change wikipedia, could it be a good idea to keep the latter?

Any idea if name:etymology:wikipedia=* is being added organically or via a tool of some sort? Neither name-suggestion-index nor id-tagging-schema know anything about this key. In the U.S., there have been a few sudden spikes, suggesting a bulk or organized edit of some sort:

From what I can tell, some mappers use name:etymology:description=* as a scratchpad for details about the namesake, as an aid to anyone who might come along later and create a Wikidata item or write a Wikipedia article about the subject. This keeps name:etymology=* focused on the namesake’s name without getting polluted by disambiguators and such. It can be thought of as a very specific kind of fixme=* or note=* tag.

That said, there might be a good reason for keeping at least some of these tags as properties of the namesakes rather than turning them into fix-mes. Some signs about namesakes include a brief description and might not even include enough other information for a more-than-perfunctory Wikidata item. For example, I’m not sure an item could accurately describe “Chris” as a trail builder by trade without more local knowledge:

Wikipedia articles and Wikidata items are equally easy to change – they’re both wikis. But Wikipedia article titles change much more frequently than Wikidata item QIDs, and for many more reasons.


I’ve not seen it auto-offered in a preset of any kind by any tool. I only started adding it when I name a street, building, or whatever with a name that I’ve not come across before and read up on who or what carried that name, learning a bit of history on the go. Curiously, I’ve seen cases where Mr Osmose was flagging a wikipedia tag for a person still alive. Haven’t seen it lately.

On those spikes, one edit here added 39 wikidata tags to streets that had only the name:etymology:wikipedia tag, some 2 years ago. Maybe there was a broader discussion under my radar; I don’t know.

I think more specifically, wikidata items represent distinct, stable (by rule) concepts that are intended to be reliable. Whereas, wikipedia articles can get renamed, split, merged, etc., in order to maximize their encyclopedic value. So a *:wikipedia tag might represent the wikipedia page about a thing, but it might also represent the wikipedia page about something that includes the thing or just a portion of the thing. Over time, decisions are made about notability and content gets shifted around. This is all subject to human maintenance of course, but conceptually a wikidata QID is a better attribute of an OSM object than a wikipedia page. And, if a QID and wiki page are linked, data consumers can access it via the QID anyways.


Forgive me here @Minh_Nguyen . My understanding of the editability of those wikis is lacking. I was thinking that a brand like “McDonald’s” would be more likely to have a protected page on Wikipedia. Does that concept exist on the wikidata site? In my own (very limited) experience, I have not had permissions issues editing wikidata.

Wikidata does have page protection. Items are often protected for the same reasons as Wikipedia articles, but at a smaller scale because Wikidata is more obscure than Wikipedia.

To be clear, the concern isn’t that the Wikipedia article’s contents can be vandalized or changed arbitrarily. Rather, the Wikipedia article’s title can and does change deliberately for reasons that are often irrelevant to OSM, such as the sudden need to disambiguate the article from a non-brand-related article. And in too many cases, the article is a poorer fit for the brand we’re trying to tag in the first place.


Just saving time on manual removal from POIs would be useful (if manual removal is being done and is considered a good idea).

For the record: this edit seems OK, as brand is the human-readable form of this info.

There are some visualisations showing whether streets were named after women/men - some powered by name:etymology tags.

Maybe it is the result of people mapping for that?

The main visualization I’m aware of is Open Etymology Map, which only considers name:etymology, name:etymology:description, and name:etymology:wikidata, but not name:etymology:wikipedia. The MapComplete Etymology theme similarly considers name:etymology:wikidata but not name:etymology:wikipedia. But maybe someone included it in an import or bulk edit just in case.


By the way, there are 1,061 elements in the U.S. with brand:wikipedia=* but not brand:wikidata=*. Needless to say, we should keep these occurrences of brand:wikipedia=* for now, or replace them with brand:wikidata=* if there’s a straightforward way to do that en masse. At a glance, most are for brands that are in name-suggestion-index, so the brand:wikidata=* tags will be added eventually as mappers encounter validator warnings about them being missing.
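
As a rough sketch of how a bulk edit could respect that distinction, the triage might look like the following in Python. The function name and the category labels are hypothetical, and it assumes each element’s tags are available as a plain dict; it is not taken from any existing tool.

```python
def triage_brand_wikipedia(tags):
    """Classify an element by which of the two brand link tags it carries."""
    has_wikipedia = "brand:wikipedia" in tags
    has_wikidata = "brand:wikidata" in tags
    if not has_wikipedia:
        return "nothing to do"
    if has_wikidata:
        # brand:wikidata already identifies the brand, so the
        # brand:wikipedia tag is a candidate for removal.
        return "removal candidate"
    # brand:wikipedia is the only link to the brand; keep it for now.
    return "keep for now"

# Invented tag sets, for illustration only.
print(triage_brand_wikipedia({"brand": "X", "brand:wikipedia": "en:X",
                              "brand:wikidata": "Q1"}))  # removal candidate
print(triage_brand_wikipedia({"brand": "X", "brand:wikipedia": "en:X"}))  # keep for now
print(triage_brand_wikipedia({"brand": "X"}))  # nothing to do
```

Only the first category would be touched by the proposed removal; the second is the set of 1,061 elements mentioned above.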


There is also EqualStreetNames

According to https://github.com/EqualStreetNames/module-process/blob/39b77bb3feefdd2fa8d8887b7f84f1e956d47f2f/docs/geojson.md?plain=1#L15-L17 the tags used are:

  • name
  • wikidata
  • name:etymology:wikidata