Proposed bulk removal of brand:wikipedia tag from United States POIs

I am proposing a bulk removal of the brand:wikipedia tag from United States POIs.

Problem Statement

  1. brand:wikipedia is no longer automatically maintained by iD and thus is likely to become out of date over time. In the words of @Minh_Nguyen:

“if there’s any mismatch between the brand:wikipedia and brand:wikidata tags, brand:wikipedia is almost guaranteed to be wrong.”

  2. brand:wikipedia provides no data beyond what brand and brand:wikidata already provide. Those two tags are maintained in Name Suggestion Index and updated automatically by iD, while brand:wikipedia remains untouched. The brand tag provides a human-readable brand name, so there is no need to keep brand:wikipedia around merely for human readability.

  3. The existence of this tag on current objects creates an implied need for mappers to maintain it, but that maintenance is wasted effort: the tag can simply be removed, freeing mappers to focus on more important work.

  4. Removing this tag removes the risk that an unaware data consumer will consume the brand:wikipedia tag and ingest wrong information from OSM.

Background

When tagging a brand (let’s use the most American example, McDonald’s), iD allows you to select a preset for that brand in the “Feature Type” pull-down:

image

Selecting “McDonald’s” automatically populates six tags on that POI, as follows:

image

Thus, brand information is expressed in two tags:

  1. The brand tag, which is a human-readable text string containing the brand
  2. The brand:wikidata tag, which points to the wikidata item for McDonald’s.

With this tagging, data consumers have a choice in consuming brand information: take the brand text string from the brand tag, or query the wikidata item, from which they can follow links to various language wikipedias, download a logo icon, or obtain other linked information about the POI not normally kept in OSM.
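
To illustrate how a data consumer could get from brand:wikidata to a Wikipedia article without any brand:wikipedia tag, here is a minimal Python sketch. It assumes the consumer has already downloaded the Wikidata entity document (for example from Special:EntityData) and parsed it into a dict; the function name and the hand-written sample entity are hypothetical, and only simple language codes such as "en" are handled.

```python
def wikipedia_tag_from_entity(entity, lang="en"):
    """Derive an OSM-style '<lang>:<title>' value from a Wikidata entity's
    sitelinks, or return None if the item has no article in that language."""
    sitelink = entity.get("sitelinks", {}).get(f"{lang}wiki")
    if sitelink is None:
        return None
    return f"{lang}:{sitelink['title']}"


# Hand-written stand-in for the relevant part of an entity document
# (illustrative only, not fetched from Wikidata).
sample_entity = {
    "id": "Q38076",
    "sitelinks": {"enwiki": {"site": "enwiki", "title": "McDonald's"}},
}

print(wikipedia_tag_from_entity(sample_entity))
```

Because the title is looked up at the time of use, it follows any later rename of the article, which a stored brand:wikipedia value cannot do.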

For about three years, from 2018 to 2021, iD also automatically added an additional tag, brand:wikipedia, via the Name Suggestion Index to POIs added by mappers; this behavior was then discontinued. During this timeframe, 1.3 million brand:wikipedia objects were added globally. Since then, brand:wikipedia has been on a slow decline, as shown in this taginfo chronology:

Proposed Action

Via a series of geographically compact mechanical edits (for example, California alone has 22,000 cases of brand:wikipedia and would take a minimum of three changesets), remove the brand:wikipedia tag on all objects where a brand:wikidata and brand tag are already present.
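
As a rough illustration of the selection rule above, here is a minimal Python sketch. It assumes each object's tags are available as a plain dict, and that a changeset holds at most 10,000 changes (the OSM API limit, which is where a three-changeset minimum for 22,000 objects would come from). The function names are hypothetical; this is not the tooling that would actually perform the edit.

```python
import math

MAX_CHANGES_PER_CHANGESET = 10_000  # assumption: OSM API changeset size limit


def strip_brand_wikipedia(tags):
    """Return a copy of tags without brand:wikipedia, but only when both
    brand and brand:wikidata are present; otherwise return an unchanged copy."""
    if "brand:wikipedia" in tags and "brand" in tags and "brand:wikidata" in tags:
        return {k: v for k, v in tags.items() if k != "brand:wikipedia"}
    return dict(tags)


def min_changesets(object_count):
    """Smallest number of changesets needed to touch object_count objects."""
    return math.ceil(object_count / MAX_CHANGES_PER_CHANGESET)


print(min_changesets(22_000))
```

Objects that carry brand:wikipedia but lack either of the other two tags are left alone, since for them the tag may be the only link to the brand.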

I support this edit. How many mismatches are there currently between the wikipedia and wikidata tags? I would be interested to see the current impact.

I think @Mateusz_Konieczny knows how to get those statistics.

Not really, I have not processed the entire USA for https://matkoniecz.github.io/OSM-wikipedia-tag-validator-reports/.

Yep, this highlights the challenge of having more than one link into wiki-land: you need scripts and tools to keep it all in sync. I spent the last week building tools to synchronize US boundaries with wikidata and it was quite a bit of work. And it’s the kind of work I’d rather not have anyone waste time doing unnecessarily. Thankfully we have Name Suggestion Index to automatically handle brand and brand:wikidata and keep them in sync with little user intervention.

This is the kind of question that a SPARQL engine like Sophox or QLever should be able to answer within a matter of seconds using a federated query. However, Sophox is in disrepair, and for now I’m not very familiar with QLever, which stores and queries OSM data rather differently than Sophox, despite the common query language.

Based on taginfo, it’s not only used by the iD editor but also by Vespucci and JOSM user presets (primarily Russian and Korean).
https://taginfo.openstreetmap.org/keys/brand:wikipedia#projects

Based on OSM Changeset data, we can identify editors in the USA who did not use the iD editor to add such tags (brand:wikipedia) and seek their opinions.

At least Vespucci’s developer doesn’t seem to be a fan:

For context, brand:wikipedia was added begrudgingly in response to concerns that brand:wikidata=* wasn’t human-readable enough, and that brand=* was too ambiguous. Later, for consistency, the same pattern was extended to operator/network/flag:wikipedia=* to complement operator/network/flag:name=*.

In particular, there was concern about multiple brands sharing the same name in different geographies. This happens a lot with the names of banks, for example. For technical reasons, the Wikipedia article about each brand would have a unique title, even if both brands would have the same brand=*. In theory, this makes it easier to verify that the brand:wikidata=* tag corresponds to the correct brand.

Since then, name-suggestion-index has been able to clean up many local and regional brands that had originally been classified as global brands, making it much harder for mappers to accidentally tag the wrong brand based on its shared name. It has also gained the ability to scope a brand to a specific geometry rather than a whole country. For example, the scope of the multinational Burger King fast food chain excludes the famously unaffiliated Burger King in Mattoon, Illinois. Finally, iD added support for the not:brand:wikidata=* key so that mappers can affirm a very subtle distinction that would otherwise escape notice.

Since the beginning, the name-suggestion-index developers have caught a lot of flak for the inclusion of brand:wikipedia=* tags in its presets. Many mappers view the presence of three different keys to track the same information as a bit excessive. brand:wikipedia=* is the least stable of the three keys, since Wikipedia doesn’t consider its article titles to be even somewhat stable identifiers.

The raw brand:wikipedia=* values were never particularly good at telling you whether the brand=* and brand:wikidata=* referred to the right brand anyways. At the time that name-suggestion-index removed *:wikipedia=* from its presets, two-thirds of the presets (11,810 of 17,992) had brand/operator/network/flag:name=* values that didn’t match their *:wikipedia=* values. Excluding disambiguators in parentheses in the Wikipedia article titles, that still comes to 61% (10,958).

Instructions for reproducing this analysis
git clone https://github.com/osmlab/name-suggestion-index.git
cd name-suggestion-index
# Check out the parent of the referenced commit
git checkout 82b4751e6c141bf112656423d5c99863e4247b0b^
npm install
npm run build
# Count presets whose name tag differs from the article title in their *:wikipedia tag
jq '.presets | map(.addTags) | map((.brand // .operator // .network // .["flag:name"]) as $name | (.["brand:wikipedia"] // .["operator:wikipedia"] // .["network:wikipedia"] // .["flag:wikipedia"]) as $wikipedia | select($wikipedia) | select($name != ($wikipedia | split(":")[1]))) | length' dist/presets/nsi-id-presets.json
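
The same comparison, including the variant that ignores parenthetical disambiguators, could be sketched in Python roughly as follows. This is only an illustration of the counting logic described above, comparing the name tags against the *:wikipedia values; it assumes the presets have been loaded as an iterable of dicts with an addTags member (if the file stores them as an object keyed by ID, pass its values), and the helper names and sample presets are made up for the example. Unlike the jq command, it keeps everything after the first colon as the article title.

```python
import re

NAME_KEYS = ("brand", "operator", "network", "flag:name")
WIKIPEDIA_KEYS = (
    "brand:wikipedia",
    "operator:wikipedia",
    "network:wikipedia",
    "flag:wikipedia",
)


def first_present(tags, keys):
    """Return the value of the first key in keys that has a non-empty value."""
    for key in keys:
        if tags.get(key):
            return tags[key]
    return None


def count_mismatches(presets, ignore_disambiguators=False):
    """Count presets whose name tag differs from their Wikipedia article title."""
    mismatches = 0
    for preset in presets:
        tags = preset.get("addTags", {})
        wikipedia = first_present(tags, WIKIPEDIA_KEYS)
        if wikipedia is None:
            continue  # no *:wikipedia tag, so nothing to compare
        # Drop the language prefix, e.g. "en:Title" -> "Title".
        title = wikipedia.split(":", 1)[1] if ":" in wikipedia else wikipedia
        if ignore_disambiguators:
            # Strip a trailing "(...)" disambiguator from the title.
            title = re.sub(r"\s*\([^)]*\)$", "", title)
        if first_present(tags, NAME_KEYS) != title:
            mismatches += 1
    return mismatches


# Made-up sample presets, for illustration only.
sample_presets = [
    {"addTags": {"brand": "Burger King", "brand:wikipedia": "en:Burger King"}},
    {"addTags": {"brand": "Target", "brand:wikipedia": "en:Target Corporation"}},
    {"addTags": {"brand": "Subway", "brand:wikipedia": "en:Subway (restaurant)"}},
    {"addTags": {"brand": "Example Brand"}},
]
print(count_mismatches(sample_presets))
print(count_mismatches(sample_presets, ignore_disambiguators=True))
```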

Many of these discrepancies arose because Wikipedia chooses to conflate some brands with the company that owns or operates the brand, or even with the company that historically owned the brand before selling it off. This is especially common with oil distribution companies and convenience store companies. In these cases, Wikidata would ideally have a more specific item about the brand proper. In the meantime, the brand=* tag indicates what OSM prefers to record on the feature. When Wikidata splits out a brand item, name-suggestion-index sometimes needs to replace the *:wikidata=* tag, and it previously would remove the *:wikipedia=* tag at the same time, since there’s no exact match on Wikipedia. In other words, one way or another, these *:wikipedia=* tags would become irrelevant anyways.

While the eventual removal of brand:wikipedia seems like a good idea, does anybody have some feel (or numbers!) for how common it is to have a correct brand:wikipedia and an incorrect brand:wikidata at the same time? That’s a hard-to-detect case where this kind of edit would replace inconsistency with falsehood.

I have no idea how common such a thing would be, but at least data loss can be imagined.

According to QLever, out of the 385,369 POIs in the U.S. that are tagged with both brand:wikipedia=en:* and brand:wikidata=*, 24,825 or 6.4% have brand:wikipedia=en:* tags that don’t match the canonical title of the English Wikipedia article per brand:wikidata=*.

The mismatches are geographically well-distributed and concentrated in populous cities:

Some caveats:

  • The error rate only counts features tagged with brands that are listed in name-suggestion-index. Unlisted brands are less likely to have directly relevant Wikipedia articles anyways.
  • The error rate doesn’t account for the 1,831 instances of brand:wikipedia=* referring to articles on non-English editions of Wikipedia.
  • The error rate doesn’t account for features that are tagged with brand:wikipedia=* even though brand:wikidata=* refers to a brand that lacks a Wikipedia article.
  • This query doesn’t resolve Wikipedia article redirects; it only compares brand:wikipedia=* to the canonical article title. Resolving redirects would likely decrease the error rate dramatically.
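
For readers who want to see what such a comparison involves, here is a minimal Python sketch of a per-feature check and of the rate arithmetic. It is not the QLever query itself, and whether that query normalized titles in this way is not stated; the sketch assumes the usual English Wikipedia conventions that underscores are equivalent to spaces and that the first letter of a title is case-insensitive. Like the query, it does not resolve redirects. All function names are hypothetical.

```python
def normalize_title(title):
    """Treat underscores as spaces and uppercase the first character."""
    title = title.replace("_", " ").strip()
    return title[:1].upper() + title[1:]


def matches_canonical(brand_wikipedia, canonical_title, lang="en"):
    """Compare a brand:wikipedia value with the canonical article title.

    Returns None when the tag points at a different language edition,
    so that callers can leave those features out of the count.
    """
    prefix = f"{lang}:"
    if not brand_wikipedia.startswith(prefix):
        return None
    tagged_title = brand_wikipedia[len(prefix):]
    return normalize_title(tagged_title) == normalize_title(canonical_title)


def mismatch_rate(mismatched, total):
    """Share of compared features whose tag does not match."""
    return mismatched / total


# Figures quoted above: 24,825 mismatches out of 385,369 features.
print(f"{mismatch_rate(24_825, 385_369):.1%}")
```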

Beyond the statistics, a mapper is somewhat unlikely to intentionally change brand:wikipedia=* to something entirely unrelated but leave brand:wikidata=* untouched. iD would’ve warned about any departure from name-suggestion-index’s recommended tags up until NSI stopped recommending brand:wikipedia=*, and I think JOSM has a validator warning of some sort about mismatching Wikipedia and Wikidata tags.

(Thanks to @Danysan95 for help with these queries!)

Would it be better to have iD (and other editors) remove brand:wikipedia when a different NSI preset is applied? I don’t think it really needs to be a validator item that’s suggested for existing, unchanged POIs. I’m not opposed to doing away with it entirely via a mass edit though, and I remove it from POIs I edit.

Tangentially, is name:etymology:wikipedia the next ‘wikipedia’ tag targeted for erasure, at 73K uses versus 1.2M for name:etymology:wikidata?

I only learned of this tag when someone came to my turf and planted it on an item I’d mapped. To boot, more than a few times I’ve found a Wikipedia article with no Wikidata reference, and I’ve also seen people going around adding the Wikidata etymology reference to items that only had the Wikipedia one, ending up with both on the item. So there are 3 variations:

name:etymology:wikidata
name:etymology:wikipedia
both

I’ve even seen a few name:etymology:description tags containing free text, some 10K+ uses, with no wiki page discussing this tag that I could find.

Just asking /OT

Given that it is easy for anyone to change a wikidata item, and less easy to change wikipedia, could it be a good idea to keep the latter?

Any idea if name:etymology:wikipedia=* is being added organically or via a tool of some sort? Neither name-suggestion-index nor id-tagging-schema know anything about this key. In the U.S., there have been a few sudden spikes, suggesting a bulk or organized edit of some sort:

From what I can tell, some mappers use name:etymology:description=* as a scratchpad for details about the namesake, as an aid to anyone who might come along later and create a Wikidata item or write a Wikipedia article about the subject. This keeps name:etymology=* focused on the namesake’s name without getting polluted by disambiguators and such. It can be thought of as a very specific kind of fixme=* or note=* tag.

That said, there might be a good reason for keeping at least some of these tags as properties of the namesakes rather than turning them into fix-mes. Some signs about namesakes include a brief description and might not even include enough other information for a more-than-perfunctory Wikidata item. For example, I’m not sure an item could accurately describe “Chris” as a trail builder by trade without more local knowledge:

Wikipedia articles and Wikidata items are equally easy to change – they’re both wikis. But Wikipedia article titles change much more frequently than Wikidata item QIDs, and for many more reasons.

I’ve not seen it offered automatically in a preset of any kind by any tool. I only started adding it when I tag a street, building, or whatever with a name I’ve not come across before and read up on who or what carries or carried that name, learning a bit of history on the go. Curiously, I’ve seen cases where Mr Osmose was flagging a wikipedia tag for a person still alive. I haven’t seen that lately.

On those spikes: one edit here added 39 wikidata tags to streets that only had the name:etymology:wikipedia tag, some 2 years ago. Maybe there was a broader discussion below my radar; I don’t know.

I think more specifically, wikidata items represent distinct, stable (by rule) concepts that are intended to be reliable, whereas wikipedia articles can get renamed, split, merged, etc., in order to maximize their encyclopedic value. So a *:wikipedia tag might represent the wikipedia page about a thing, but it might also represent the wikipedia page about something that includes the thing or just a portion of the thing. Over time, decisions are made about notability and content gets shifted around. This is all subject to human maintenance of course, but conceptually a wikidata QID is a better attribute of an OSM object than a wikipedia page. And, if a QID and wiki page are linked, data consumers can access it via the QID anyways.

Forgive me here, @Minh_Nguyen. My understanding of the editability of those wikis is lacking. I was thinking that a brand like “McDonald’s” would be more likely to have a protected page on Wikipedia. Does that concept exist on the wikidata site? In my own (very limited) experience, I have not had permissions issues editing wikidata.

Wikidata does have page protection. Items are often protected for the same reasons as Wikipedia articles, but at a smaller scale because Wikidata is more obscure than Wikipedia.

To be clear, the concern isn’t that the Wikipedia article’s contents can be vandalized or changed arbitrarily. Rather, the Wikipedia article’s title can and does change deliberately for reasons that are often irrelevant to OSM, such as the sudden need to disambiguate the article from a non-brand-related article. And in too many cases, the article is a poorer fit for the brand we’re trying to tag in the first place.

Just saving time on manual removal from POIs would be useful (if manual removal is being done and is considered a good idea).

For the record: this edit seems OK, as brand is the human-readable form of this info.

There are some visualisations showing whether streets were named after women/men - some powered by name:etymology tags.

Maybe it is the result of people mapping for that?