Tag synonym cleanup/consolidation

watmildon · January 29, 2023, 10:26pm

I would love to know about any historical discussion/efforts to do tag cleanup of the form: TagA, TagB, TagC etc are agreed to be synonyms (via wiki?) so let’s shift all tags to be TagA (most prominent on wiki, taginfo etc). My motivation is that many objects have more than one synonym attached to them which has created cases where the data in the tags disagree. Either because an editor changed one without the other or two objects with different synonyms were merged incorrectly. Having only one tag may help reduce this class of error.

Beyond that, there’s my general feeling that the principle of “one feature, one element” is naturally extended to “one element feature, one tag”. In cases where we cannot determine the most prominent/“best” name for a tag then we wait for consensus.

If it helps discussion I’ve been working on review of things with/needing GNIS IDs. There are 6 agreed upon synonyms on the wiki (“gnis:feature_id”, “gnis:id”, “tiger:PLACENS”, “NHD:GNIS_ID”, “nhd:gnis_id”, “ref:gnis”) with gnis:feature_id being the dominant tag in use.
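
To make the idea concrete, here is a rough Python sketch of what folding those six keys onto gnis:feature_id could look like for one element's tags. The function name `consolidate_gnis` and the plain-dict representation of tags are assumptions for illustration, not an existing tool, and values are compared as exact strings. It deliberately gives up when the keys disagree, since those cases need a human.

```python
# Keys the wiki lists as synonyms; gnis:feature_id is treated as canonical.
CANONICAL = "gnis:feature_id"
SYNONYMS = ("gnis:id", "tiger:PLACENS", "NHD:GNIS_ID", "nhd:gnis_id", "ref:gnis")


def consolidate_gnis(tags):
    """Return a copy of tags with GNIS synonyms folded into the canonical key.

    Returns None when the keys hold different values (a hypothetical
    policy for this sketch: leave such elements for manual review).
    """
    present = {k: tags[k] for k in (CANONICAL, *SYNONYMS) if k in tags}
    if not present:
        return dict(tags)  # nothing to do
    if len(set(present.values())) > 1:
        return None  # values disagree
    value = next(iter(present.values()))
    cleaned = {k: v for k, v in tags.items() if k not in SYNONYMS}
    cleaned[CANONICAL] = value
    return cleaned
```

For example, `consolidate_gnis({"name": "X", "gnis:id": "123"})` gives back the same tags with the ID moved to gnis:feature_id, while an element carrying gnis:id=1 and gnis:feature_id=2 yields `None`.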

I probably owe some time to write up a specific proposal for GNIS ids, but wanted to get the lay of the land a bit before proceeding.

ZeLonewolf · January 29, 2023, 10:35pm

That’s actually quite the can of worms, though in the US there tends to be a lot more acceptance of this kind of tag cleanup. Of course, it should be discussed first.

A while back I converted over 20,000 waterway=riverbank tags to natural=water + water=river in the US. I started a river modernization project to track and document the same cleanup globally. It was pretty non-controversial here, but it caused quite a bit of friction in quite a few places worldwide.
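
For readers unfamiliar with that retagging, the per-element change amounts to something like the sketch below. This is only an illustration of the tag transformation, not the process or tooling actually used for the edit. The function name `modernize_riverbank` is made up, and the rule of skipping elements with conflicting natural= or water= values is an assumption added here.

```python
def modernize_riverbank(tags):
    """Return tags with waterway=riverbank rewritten as natural=water + water=river."""
    if tags.get("waterway") != "riverbank":
        return dict(tags)
    # Be conservative: skip elements whose existing natural= or water=
    # values would be overwritten by the new scheme.
    if tags.get("natural", "water") != "water" or tags.get("water", "river") != "river":
        return dict(tags)
    updated = {k: v for k, v in tags.items() if k != "waterway"}
    updated["natural"] = "water"
    updated["water"] = "river"
    return updated
```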

watmildon · January 29, 2023, 10:38pm

Mercifully for me the only tags currently causing me annoyance are US import related (GNIS id or NHD).

Thanks for the link!

stevea · January 29, 2023, 11:25pm

Please understand my reply here is “mild,” I’m not “pounding my shoe on the podium” about this.

There really are synonyms in both natural language and the resulting tags that end up in OSM. The reasons for this range from the obvious: somebody decided to tag “graveyard” instead of “cemetery” for example, and didn’t read our wiki on whether TagA or TagB would be more appropriate for a given semantic, to the obscure (regionalisms in English dialect, very slight differences in how something works mechanically, the meaning of a sign giving rise to different key values…).

But if we “blithely” mechanically conflate TagA and TagB (and there are many examples, even some including TagC and TagD), we would lose such subtlety. “But,” you say, “how will a renderer grab hold of a (single) tag and properly render?! Oh, my!” The answer is, conflation of multiple tags “down to” a single tag isn’t the job of OSM (Contributors) to “make easy by being a single tag”; it is the job of the renderer to say “well, graveyards and cemeteries are going to be rendered identically in my renderer, so I must gather both and assign them the same icon or border-fill, or whatever graphic / semiotic they’re going to get.”
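
As a toy illustration of that renderer-side gathering: the sketch below maps two distinct tags onto one style. It assumes the usual OSM tags landuse=cemetery and amenity=grave_yard for the two concepts. The style name `burial_ground`, the table and the function are invented for the example and do not come from any particular renderer.

```python
# Hypothetical style table: several distinct tags can share one style
# without the tags themselves being merged in the database.
STYLE_GROUPS = {
    ("landuse", "cemetery"): "burial_ground",
    ("amenity", "grave_yard"): "burial_ground",
}


def style_for(tags):
    """Return the style name for an element's tags, or None if unstyled."""
    for (key, value), style in STYLE_GROUPS.items():
        if tags.get(key) == value:
            return style
    return None
```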

If there is another reason you wish to “consolidate” or “clean up” such “synonyms,” I’m curious to hear what it is. I’m in listening mode, not “mild admonishment that might not be a good idea to do that” mode.

Thank you for seeking the “lay of the land.” That’s all I’m offering with this post, a simple sketch. And your initial reasons are good, though I do wonder if how things render affects why you are asking.

Edit: Our “whatever tag you wish” tenet of “plastic tagging” notwithstanding, I do not mean to convey by this post that OSM Contributors can lazily tag however they want to. Of course, especially if you are new to OSM mapping and there IS a tag for what you are mapping, it is correct (even preferred) to use that tag. This often means consulting our wiki for proper practice, a designed-to-be-easy-and-painless method of documenting our current tagging practices (and sometimes, future strategies to better tag). As time goes by, newer, fresh mapping/tagging (especially by novice mappers who might have a tendency to introduce “non-standard” keys/tags) better aligns with existing tagging, while simultaneously, “best tagging practices” evolve. Eventually (someday), it will be clear and obvious what to tag nearly all of the time, while we continue to allow plastic, novel tagging. Today, we’re only part-way there, but it does get better.

watmildon · January 29, 2023, 11:59pm

Thankfully the cleanup I’d actually like is an unrendered tag! More like tagging for the un-rendered (fields for tools trying to scan/parse/cleanup).

I’ll spend some time over the next few days writing up something more specific about my concerns and how a specific synonym cleanup may help address them. As mentioned, the multitude of tags that can contain a GNIS ID leads to object errors that (maybe?) could be noticed and avoided (is this hubris?) if there was only one way to tag this bit of info.

A reasonable example is this node. Originally an imported GNIS Populated Place, then (version 6) merged with another imported node for a church. Now we have a node with mismatched gnis:id and gnis:feature_id and a host of other tags that need to be disentangled. We can’t stop folks from conflating incorrect things together but it seems possible some (maybe small?) percentage of these could be avoided.

Absolutely not as critical but a nice side effect of reduction is the removal of duplicated info. There’s a ton of “duplicate” data in the db with many many many nodes having more than one synonym for GNIS ID and the same data in it (gnis:feature_id and gnis:id being the most common pair). For example, this node was originally added with gnis:feature_id and along the way managed to collect 3 more synonyms with identical data in them from various.
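
To get a feel for how common each situation is, a tally along these lines could be run over a set of elements. This is a sketch under assumptions: elements are represented as plain tag dicts, the category names are my own, and values are compared as exact strings.

```python
from collections import Counter

GNIS_KEYS = ("gnis:feature_id", "gnis:id", "tiger:PLACENS",
             "NHD:GNIS_ID", "nhd:gnis_id", "ref:gnis")


def classify(tags):
    """Label one element by how its GNIS ID keys relate to each other."""
    values = [tags[k] for k in GNIS_KEYS if k in tags]
    if not values:
        return "none"
    if len(values) == 1:
        return "single"
    # More than one key present: same value everywhere, or a disagreement?
    return "duplicate" if len(set(values)) == 1 else "mismatch"


def tally(elements):
    """Count elements per category across an iterable of tag dicts."""
    return Counter(classify(tags) for tags in elements)
```

Elements in the "mismatch" bucket are the ones like the merged church node above, and "duplicate" covers the harmless but redundant copies.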

stevea · January 30, 2023, 12:40am

I don’t believe this is hubris at all, but thanks for asking, even rhetorically!

I also admire your ambition, these seem to me to be quite worthy data cleanup efforts. I wish you the best in them and perhaps I can help with ongoing dialog. Nice community we have here, nice map we have here, and both seem to be on an ever-upward spiral of “it’s getting better all the time” (as the Beatles once said).

Minh_Nguyen · January 30, 2023, 2:08am

As the namespaces imply, these keys came in from several massive imports of federal government datasets, including TIGER, NHD, and GNIS itself. Since GNIS is the federal government’s canonical gazetteer, each of these imports coincidentally brought in overlapping subsets of GNIS data with corresponding IDs, but there apparently wasn’t an attempt to harmonize the feature ID key. Perhaps there was a fear that these IDs only appeared to be GNIS feature IDs but could be something slightly different. (That’s the case with tiger:zip_left and tiger:zip_right, which are technically ZCTAs.)

When I finally got around to rewriting the wiki’s documentation about these keys, I chose the most common one to be the canonical key, leaving the others as synonyms because I wasn’t prepared to think about mass edits. gnis:feature_id is far and away the most common, because the GNIS import was the only general import of that dataset that included a variety of feature classes. Later, iD added a GNIS Feature ID field; since a field of this kind can only operate on a single key, the most common key, gnis:feature_id, was chosen as well.

At one point, some JOSM users were advocating for ref:gnis, which is more consistent with other government ID schemes, but I don’t think it’s worth the trouble to touch even more features for a small measure of consistency.

Right, if both features had had gnis:feature_id on them, iD would’ve prevented the mapper from casually merging the two features before examining why the IDs differ.
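
I haven’t checked how iD implements this, so treat the following as a simplified model of the idea, not iD’s actual code: a merge guard that only looks at one key. The function name `can_merge` is hypothetical.

```python
def can_merge(tags_a, tags_b, key="gnis:feature_id"):
    """Allow a merge unless both elements carry the key with different values."""
    a, b = tags_a.get(key), tags_b.get(key)
    return a is None or b is None or a == b
```

With this model, `can_merge({"gnis:id": "1"}, {"gnis:feature_id": "2"})` returns `True`, so the conflict slips through because the two IDs sit under different keys.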

watmildon · January 30, 2023, 4:07am

Brian has been very kind in the OSMUS Slack and helpfully sent me “homework” of various older threads. It’s quite interesting.

The NHD:GNIS_ID and nhd:gnis_id tags are interesting in at least 3 ways. The first Minh has mentioned: they are almost exclusively from imports, most very old. The second (and one that may only annoy me, ha) is that we have two “acceptable” keys that differ only in casing… which I don’t think I’ve seen anywhere else. My impression is that if I started adding WIKIDATA and Wikidata tags, people would be totally in the right to bulk correct them under the “typos are fine for mechanical edit” principle. Last, but maybe most important for my goals about data consistency, is that the values were imported with leading zeros, which makes querying/conflating/correlating even more of a hassle.
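
A small sketch of the normalization that querying currently has to do because of this. It assumes each value holds a single numeric ID, and both helper names are made up for the example.

```python
def normalize_gnis_value(value):
    """Strip whitespace and leading zeros so '00123' and '123' compare equal."""
    stripped = value.strip().lstrip("0")
    return stripped or "0"  # keep a lone zero instead of an empty string


def is_nhd_gnis_key(key):
    """Match the NHD key regardless of casing (NHD:GNIS_ID or nhd:gnis_id)."""
    return key.lower() == "nhd:gnis_id"
```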

Getting a bit out of scope… user b1tw153 in the OSM US Slack has been doing heroic work building out the capability to do feature matching from the GNIS dataset to search for invalid features, unadded features, etc. From sample runs I’ve been looking through to improve the matching, the NHD flavors of GNIS tags have been spot on. For whatever that is worth.

Getting broader and even more off topic: I suspect there’s a huge class of “nice to have” work to be done bringing older imports up to modern standards (in order of hopefully least to most controversy): source tags go in changeset descriptions, harmonization with current tags, dropping fields we would never import today etc. One presumes we’ve learned a lot in the past 15 years about what makes a successful import!

stevea · January 30, 2023, 4:13am

We have: Import/Guidelines - OpenStreetMap Wiki . There’s more (as lore, hints, experience…), but that wiki is a community-consensus good start!

stevea · January 27, 2024, 8:54am

Thank you for flagging a nasty post, community — and within minutes, that was quick. I received a rude, targeted message in Arabic, and I don’t appreciate that. Such noise is harassing, designed to instill fear and to be intimidating and/or fingerprinting. Yuck. We don’t put up with that in OSM. My shields are up but I know something nasty when it comes my way and bounces off as harmless. The originator’s account seems created for such purposes immediately before the attempted spoof, if it is helpful to point that out.

Again, I am deeply grateful to all who are watchers and crusaders against vandalism, spam and nefarious purpose in our project. I do some of that proactively when it’s in my yard (area of the map) and it is eternal vigilance. Stand strong together and we’re fine.

But I definitely just got probed. (Not in my posterior, I think you know what I mean).

And now, back to the (closed) dialog. Oh, it had gone quiet for a year. Bye!