Cleanup and normalization of GNIS imports to only use gnis:feature_id for the id tag

Finally continuing from the general discussion here: Tag synonym cleanup/consolidation - #4 by stevea. I would love to hear any feedback.

Background
I’ve been working with other mappers building tools to better utilize GNIS data in OSM. We’ve also just finished a big cleanup of features related to Secretarial Order 3404 aided by this tooling and cleaned up issues related to bad mergers etc.

I have a lot more experience with the db and this set of features than I did in January and would love to proceed with a broad but simple cleanup: normalize the tagging to one tag.

The problems
We currently have 6 tags in wide use that all contain the identifier from the GNIS database. This is an annoyance for consumers of that data and also makes it easy for mappers to inadvertently merge things that should not be merged. Here are some of the easier ones to detect, as they have mismatched ids in the various tag alternates.

We also have a huge number of id tags that contain leading zeros. ex: 0012345. This makes comparisons in common tools (overpass) more difficult than necessary.
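
To illustrate the comparison problem, here is a minimal Python sketch. The function names are hypothetical and not part of any existing tool; it only assumes that ids are stored as plain strings in the tag value.

```python
def normalize_gnis_id(value: str) -> str:
    """Strip surrounding whitespace and leading zeros from a GNIS id string."""
    stripped = value.strip().lstrip("0")
    # An all-zero value would strip to ""; keep a single "0" so the result is never empty.
    return stripped or "0"


def same_gnis_id(a: str, b: str) -> bool:
    """Compare two GNIS id tag values, ignoring leading zeros."""
    return normalize_gnis_id(a) == normalize_gnis_id(b)
```

A plain string comparison treats "0012345" and "12345" as different values, while `same_gnis_id("0012345", "12345")` treats them as the same id.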

The proposed solution
Move all OSM tagging to the most common GNIS tag of gnis:feature_id. The tags gnis:id, tiger:PLACENS, NHD:GNIS_ID, nhd:gnis_id, ref:gnis will all be deprecated and the wiki updated to reflect this. (~370k items)

All identifiers will have leading zeros stripped from their values.

For each item with only one of the above tags, the value will be moved to gnis:feature_id.
For each item with more than one tag but all the tags agree on the id, move the value to gnis:feature_id.
For each item with more than one tag but differing values of the id, manual review and cleanup. (~250 items)
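
The three cases above could be sketched in Python roughly as follows. This is only an illustration of the decision logic, not the JOSM workflow I plan to use; `plan_gnis_cleanup` and the action labels are made-up names, and the sketch does not handle semicolon-separated multiple values.

```python
from typing import Dict, Optional, Tuple

TARGET_KEY = "gnis:feature_id"
# The synonym keys listed in the proposal above.
DEPRECATED_KEYS = ("gnis:id", "tiger:PLACENS", "NHD:GNIS_ID", "nhd:gnis_id", "ref:gnis")


def plan_gnis_cleanup(tags: Dict[str, str]) -> Tuple[str, Optional[Dict[str, str]]]:
    """Decide what to do with one element's tags.

    Returns (action, new_tags), where action is one of the hypothetical labels
    "skip", "unchanged", "update", or "manual_review".
    """
    present = {k: tags[k] for k in (TARGET_KEY, *DEPRECATED_KEYS) if k in tags}
    if not present:
        return "skip", None  # no GNIS id tag at all

    # Compare values with leading zeros stripped.
    ids = {v.strip().lstrip("0") or "0" for v in present.values()}
    if len(ids) > 1:
        return "manual_review", None  # the tags disagree on the id

    # One tag, or several that agree: keep only the target key.
    new_tags = {k: v for k, v in tags.items() if k not in DEPRECATED_KEYS}
    new_tags[TARGET_KEY] = ids.pop()
    if new_tags == tags:
        return "unchanged", None  # already normalized
    return "update", new_tags
```

For example, `plan_gnis_cleanup({"gnis:id": "0012345", "name": "X"})` yields an "update" with `{"name": "X", "gnis:feature_id": "12345"}`, while an element whose tags carry two different ids is flagged for manual review.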

Edits will be tagged as #gnisIdUpdate so they can be reviewed/tracked/reverted as necessary.

I would do all of these edits in JOSM and probably work state by state, reviewing as I go. I have a lot of experience with tag updates in JOSM and feel very comfortable in that environment.

I will also work with the JOSM maintainers to get their relation template updated from ref:gnis to gnis:feature_id.

I would love to start this work in the beginning of August.

Some questions I have

  1. For items with nhd:gnis_id and NHD:GNIS_ID, is it helpful for folks if I ensure they have source=nhd?
  2. Would folks prefer if I do this in a more programmatic way? I’m happy to write some code for it but seems unnecessary right now.
  3. Should we take this opportunity to normalize to ref:gnis? That would need to touch more objects (~1100k more).
  4. What am I missing?
  5. Any arguments against doing this update?

(edit: just updating with some numbers and another question)

Good point about preventing accidental incorrect merges. That to me would be reason enough to undertake this change.

I don’t see a problem with doing this programmatically, since any GNIS-, NHD-, or TIGER-imported feature would’ve been touched by many bots by this point. However, if you’re confident that you’ll be able to complete this task as part of a more manual cleanup and review, then more power to you.

How important is it that the external ID fall under the ref:* namespace? Should it also adhere to the convention of country namespaces, as in ref:US:gnis?

I’m not sure that JOSM’s preference for ref:gnis should be the sole deciding factor, since the editors that use id-tagging-schema (iD, Rapid, Go Map!!, etc.) recognize gnis:feature_id but not ref:gnis.

Most of the keys stem from an outdated practice of tagging import-specific metadata under an import-specific namespace, hence the references to TIGER and NHD. If we wish to avoid this practice, there are also other potential keys to clean up, such as NHD:GNIS_Name and gnis:name (name:GNIS?).

I am happy to “modernize” the tagging of older imports however folks like. Most of the old tags don’t provide much value and can probably be removed (looking at you, gnis:state_id, gnis:ST_num, gnis:ST_alpha et al). This proposal was scoped to keep things tidy, but if there’s an appetite for broader cleanup while I’m here, I’m happy to do so.

As to which to pick as the “winner” id tag, I am almost completely indifferent to where we land. I think landing on gnis:feature_id is the path of least resistance, but ultimately I don’t really care what the text of the tag name is. That said, having to update only one editor (JOSM) feels like a winning argument to me.

Once we figure out a solution, this seems like a good case that could be used to engage new mappers. This issue has a pretty straightforward history of too many similar tags that are now being merged into a simpler, coherent structure. We could explain some of the common processes we use and how we arrived at a solution in this situation.

Pretty much condense this thread. Add some links so the user can look for additional context as needed. Plug it all into a quest and allow people to learn about OSM while improving OSM.

While I’m more in favor of ref:gnis (or with a country namespace as well like ref:US:gnis as Minh suggested) as this is probably the opportunity to get this to be consistent with other ref:*=* based codes, ultimately I’ll support anything which allows us to have one key instead of the 6 we have today.

I’m absolutely in favor of normalizing the key and values for GNIS Feature IDs in OSM. Like the others here, I don’t think it matters which key we use.

Normalizing the keys and values does solve some real problems with the usability of the data. Moving from the gnis:feature_id key to a more consistent key like ref:US:gnis might be nice, but I don’t know that it actually solves any problems.

:+1: for the use of country-specific name spaces. I’m not aware of “GNIS” meaning something else in another country, but if we’re picking a value it makes sense to ensure that there won’t be an accidental name collision down the line.

I agree on this. +1 for ref:US:gnis. Additionally, it would be good to have good documentation, so normal mappers know when to remove it. For example, if I merge two segments or split a segment of a waterway, what should I do with that ID?

But from my point of view the question is: What is the benefit of having the ID at all? Yes, they are in OSM now. But take someone adding a new river/stream/… from aerial imagery or an on-the-ground survey. Do we expect them to research this data and add it? Or would anyone else add this data? What is the benefit for OSM of doing so?

If no one is going to add this kind of data, or we don’t even recommend adding it, why keep the stuff in and even modify it?

Unlike a lot of imported IDs, a GNIS feature ID has meaning on its own. The GNIS feature ID is the only permanent ID that every federal government agency uses to refer to a populated place, and it’s the primary ID used for natural features, hence its inclusion in the NHD import.

Mappers do go around manually adding gnis:feature_id to features by looking them up in GNIS. This isn’t discouraged as far as I know. Especially for natural features, GNIS can be a good source of names (as the official national gazetteer), so the GNIS feature ID acts as a citation for the name. That said, the GNIS feature ID may become less important over time as Wikidata imports more of GNIS and we continue to link features to Wikidata.

For reference, here’s an Ohsome API chronology of all the popular GNIS feature ID keys. This is just a raw count of elements, not unique values:

The ID tag is extremely useful, as various datasets do receive updates and it lets us apply those to OSM much more easily. Here’s a set of MR tasks using the “summit” data from the federal government to find unmapped or mismapped features: MapRoulette. Without the ID, matching things up for an update is a huge pain and leads to lots of false positives.

You can read more about matching/updating/improving here: Improving the quality of OSM using the GNIS data set

That part I understand. But what should I do with it in OSM? Take wikidata, for example: I know from the tag that it refers to Wikidata, so I go to Wikidata, search for the ID, and can see whether it belongs to the Starbucks or to the building the Starbucks is in. So if the Starbucks is gone, I can decide whether the wikidata ID should be removed or not.

I can’t find anything for gnis… How am I supposed to find out what 00635966 is? If there is no description of what the ID is about and how a normal mapper can check it, the ID will be useless pretty soon, all the updating will be rather random, and a more advanced mapper would rather remove it than wait for some script to do something to an object he verified on the ground.

That part is easy. Just search the Geographic Names Information System operated by the US Geological Survey.

You can search using the ID, name, or any of several other attributes.

Edit: If you’d like to know more about GNIS, there’s lots of information online! https://www.usgs.gov/faqs/what-geographic-names-information-system-gnis

I’m good with it. Just saying, this information should be in a place where the typical mapper will look for it. :wink:

Ok, gnis:feature_id=* has that kind of description; gnis:id does not.

Thanks everyone for your thoughts and responses. Here’s my summary of the discussion so far. Please let me know if you feel like this is incomplete or inaccurate.

  • There is a solid majority of support for moving to a single GNIS id tag.
  • No objections/concerns raised yet to how I plan to do this work.
  • There is general support for migration to either the de facto standard of gnis:feature_id OR moving to a completely new but more “modern” ref:US:gnis tag.
  • The pros for gnis:feature_id are that it already has widespread software support in editors and consumers. It also requires the fewest elements to be touched. JOSM will need to be updated to stop using ref:gnis.
  • The pros for ref:US:gnis are that it moves to the more modern way of handling these identifiers. The cons are that it may involve a long tail of support from various editors and consumers (iD, JOSM, Go Map!! etc)

There are secondary concerns that, while important, I would prefer to move to other threads if folks would like to continue. I consider them out of scope for the purposes of my proposal here.

  • What does a complete and proper cleanup of the gnis and nhd import tags look like?
  • Improvements to the current GNIS wiki documentation.

I have posted in the OSM US Slack #iD channel asking for guidance about how tag migration works with respect to the iD tagging schema and how best to minimize disruptions. That schema seems to be used by a good chunk of editors, so if it’s not a huge hassle then we can consider moving to ref:us:gnis=*. Otherwise, I’ll start the cleanup by moving to the de facto gnis:feature_id tag, and moving to ref: can be another exercise.

I will probably start mid next week with one state just to get my sea legs under me and see what else pops up or if I need to change strategies.

Does that seem sensible to everyone?

I spent a very brief time looking at this for WA this afternoon. To my horror I have discovered a large number of elements that have one of the GNIS ID tags but do not have a name= tag. The few I clicked through were from folks moving the name to another (presumably more accurate?) element and abandoning the GNIS tags on the old element. Ugh.
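
For anyone who wants to look for the same thing in their own state, here is a rough Python sketch of the check. It assumes elements shaped like Overpass JSON output (dicts with an optional "tags" object) that have already been downloaded; the function name is made up for illustration.

```python
from typing import Any, Dict, Iterable, List

# The GNIS id keys discussed in this thread.
GNIS_ID_KEYS = (
    "gnis:feature_id", "gnis:id", "tiger:PLACENS",
    "NHD:GNIS_ID", "nhd:gnis_id", "ref:gnis",
)


def unnamed_gnis_elements(elements: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the elements that carry any GNIS id tag but have no name tag."""
    result = []
    for element in elements:
        tags = element.get("tags", {})
        has_gnis_id = any(key in tags for key in GNIS_ID_KEYS)
        if has_gnis_id and "name" not in tags:
            result.append(element)
    return result
```

Each element it returns still needs a human look to decide whether the id belongs on that element or on whichever element the name was moved to.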

This is orthogonal to my proposed project but I wanted y’all to get to follow along in my slow descent into the abyss.

Sounds like we could use some better Wiki documentation on mapping with GNIS, especially since GNIS released a new dataset and the ID key is changing. Maybe I can put something together soon.

Rather than a ref, maybe years into the future just as a source=GNIS (source where the name, coordinates came from).

I like the use of wikidata as an identifier aggregator; it allows an OSM object to have a one:many relationship to an index of identifiers with “q” as the index value and skips having so many identifiers (ref=*) in OSM itself.

Many entries in GNIS are inaccurate anyway: many places are missing, colonial names are used instead of Indigenous ones, etc. Again, it is better as a source than as an authoritative, indisputable, always-accurate, go-to gazetteer. Why just use some name picked up by a GNIS informant or steward, assigned by some 1800s surveyor, chosen by mining claim operators, etc., and maybe even inaccurately placed by some USGS cartographer long ago in the days before GPS and DEMs? Again, GNIS is a useful source but not always right (all models are wrong but some are useful?)

A general overhaul of identifiers (names, ref etc) is definitely interesting and we’re seeing some projects move to wikidata as the source for names etc.

I don’t have an opinion on it one way or the other, but I suspect any later overhaul will be easier once we have a single tag on the OSM side for GNIS ids.

Catching up here.

After the discussion here and a bit of experimenting on WA state, I decided to migrate to the de facto tag (gnis:feature_id) and treat the move to a “ref:” tag as a secondary exercise. I have started this work and vanquished one of the synonyms (nhd:gnis_id), and have made some progress on the others. It’s a bit involved, as I also end up fixing import issues (overlapping ways etc.) and need to dig into cases where the various synonyms disagree.

Anyhow, I should make reasonable progress in the next few weeks. If someone spots something I’ve broken, please let me know!
