Announcing📣: US boundary tagging QA checker

I’m pleased to announce a new utility for finding issues with boundary relations in the United States. I operate a data consumer that uses boundary data, so having good boundary data is important to me. This utility inspects boundaries with admin_level values between 7 and 9, as well as CDPs tagged boundary=census. I also cross-reference these boundaries with wikidata and, in the case of CDPs, the US Census Bureau.
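To make the scope concrete, here is a rough sketch of the tag filter described above. It is not taken from the tool's source: the function name `in_scope` is hypothetical, and it assumes the admin_level boundaries are also tagged boundary=administrative.

```python
def in_scope(tags: dict) -> bool:
    """Return True if a relation's tags match the scope described above.

    Assumption (not confirmed by the tool): admin_level boundaries
    are tagged boundary=administrative.
    """
    boundary = tags.get("boundary")
    if boundary == "census":
        # CDPs tagged boundary=census are always inspected.
        return True
    if boundary == "administrative":
        try:
            level = int(tags.get("admin_level", ""))
        except ValueError:
            # Missing or non-numeric admin_level: out of scope.
            return False
        return 7 <= level <= 9
    return False


print(in_scope({"boundary": "administrative", "admin_level": "8"}))
print(in_scope({"boundary": "aboriginal_lands"}))
```

Under these assumptions, a relation such as boundary=aboriginal_lands would be skipped unless the filter were extended.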

The state-by-state validator findings can be found here: Boundary Quality Assurance Checks
The source code for this validator can be found here: GitHub - ZeLonewolf/wikidata-qa: Wikidata quality assurance scripts

These results will update daily.

At this point, I am most interested in feedback on the utility itself: whether the checkers are picking up the right things and, if not, how the logic should work instead (the more specific, the better). I am also interested in any additional checks that I might implement at the tagging and correlation level, as there are other tools for checking geometry validity. If you are motivated to correct any issues identified by this tool, that is of course most welcome!

This tool makes use of external data from wikidata and the US Census Bureau. We may find that a problem flagged in this tool is a wikidata issue rather than an OSM issue, so I expect that state-by-state, someone knowledgeable will need to do some manual inspection on these findings to figure out how they should be resolved, or if the tool is reporting a false positive.

Awesome! I’m definitely going to use this to shake down some SoCal boundaries.

Could you possibly add “boundary=aboriginal_lands” to this as well, for reservations?

Great tool! I looked at some of the missing ones - they are already entered as ‘place=neighborhood’ nodes. Would the proposed correction be to find the latest CDP boundary from TIGER and convert the node to some type of area polygon?

Since we’ve mapped CDPs (initially as a mistake but over time we’ve kept them), I would recommend bringing in the latest CDP boundaries from the census bureau and maintaining them, as they can change over time. However, the place node may still be appropriate if it represents a named population center.

I wouldn’t know what checks would be appropriate to apply in those cases, but if you have a good handle on what the logic should be, I would welcome a writeup describing it.

Relatively few CDPs represent what we’d tag as a place=neighbourhood, so that tagging might also be worth looking into.

I wonder if that’s an artifact of the old TIGER data. The new CDP boundaries are supposed to be distinct from “incorporated” places, as recently discussed on Slack.

Just to put it out there (this is a pet peeve of mine): the TIGER boundaries are on the original NAD83 and don’t “fit” GPS data or imagery. There is a noticeable offset (this was the case with the original import as well). I have the 2023 TIGER boundaries converted to WGS84(G2139) up on Google Drive at TIGER 2023 Places - WGS84(G2139) - Google Drive

Thanks for this QA tool! I’ll be chipping away at Vermont boundaries.

Why do you recommend this? CDPs don’t seem a great fit for OSM to me as they have no on the ground evidence of existence (at least that I’ve seen). The place name corresponding with a CDP will have on the ground evidence, but the actual polygon designated by the Census Bureau is generally not something marked in the real world unless it is an exact match for an admin boundary (in which case we’d map that).

It’s also kind of unclear what to do when a CDP boundary in TIGER seems to want to follow some real-world feature but doesn’t. Sometimes it’s just the usual TIGER exaggerations; other times, an adjacent municipality has annexed into the CDP, but the CDP hasn’t been officially updated yet.

Retagging the imported CDPs as boundary=census was a concession for the CDPs that fit a popular notion of a named place’s extent, such as Bethesda, Maryland. We didn’t want the boundary to seem authoritative and administrative, but it wasn’t necessarily a call to action for completeness. For those named places with well-defined boundaries, there’s also boundary=place, which applies even where there’s no CDP, like in this Buffalo neighborhood.

What I mean to say is that since, in the 15 years since TIGER 2008 when these all showed up, we haven’t managed to form any consensus to remove them, we may as well maintain them if they’re going to be in the database.

With my data consumer hat on, I will say that they’re useful in places where you need municipal boundaries where they don’t otherwise exist (for example, Hawaii and Maryland). This is also not a specific argument to keep them, but I did want to point out that they are selectively in use.

It would be easy enough to tune the QA tool if we decided to do away with these, but I’d advise that discussion to live on a dedicated thread.

That does make some sense. I guess I was stuck in my “municipal boundaries are everywhere” New England centric thinking again. :grinning:

The US continues to be polished to a high gloss in OSM: I find it exciting that we have such enthusiasm to improve / clean-up both TIGER-imported boundaries (sometimes boundary=census data makes sense to include in OSM, sometimes less so, especially as these do not age well at all), as well as boundary polygon relations in general.

I thank all the tool builders, mappers and dialog / discussion participants on how we best do this, as it both has resulted in and continues to impress many that OSM can produce really high-quality data that emerges with both consensus among us (we, the contributors / owners of these data) as well as virtually continual improvement that makes our data better and better as the years go by. As someone who watches, participates, maps, wiki-writes and tool-builds myself (e.g. MapRoulette), this isn’t simply self-congratulatory (I don’t like to “toot my own horn”), I really mean to say “thanks to many” here. Yeah!

Love this work! I found some issues already.

There are at least two boundaries in Maryland that are both a military base and a census designated place. How should this be handled on the Wikidata side? Example

Aberdeen Proving Ground

A US Army test facility (base).

The issue is that the wikidata link is for the base, not the CDP. Should I create a separate wikidata item for the CDP and use P1889 to flag it as different?

Bug report? One relation is flagged as not being on the CDP list for Maryland (Relation: Woodlawn (133521) | OpenStreetMap). I think this is a bug, because this item is on that list. There are two CDPs with the same name, so the QA tool may be missing the second occurrence of that name.

Lol. What the hell, Maryland…

Okay, so the issue is that in the census bureau API (which is what I actually access rather than the linked table), it comes in as:

Woodlawn CDP (Baltimore County), Maryland

and

Woodlawn CDP (Prince George’s County), Maryland

So now that I know how the census bureau API names duplicates, I think I can code in a workaround…
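As a rough illustration of what such a workaround might look like, the sketch below splits a census name string into place name, optional county disambiguator, and state. It assumes the names follow exactly the two shapes shown above ("Name CDP, State" and "Name CDP (County), State"); the regex and the function name `parse_cdp_name` are hypothetical and not taken from the tool's source.

```python
import re

# Assumed shapes: "Name CDP, State" or "Name CDP (Some County), State".
CDP_NAME = re.compile(
    r"^(?P<name>.+?) CDP(?: \((?P<county>[^)]+)\))?, (?P<state>.+)$"
)


def parse_cdp_name(label):
    """Return (name, county, state), or None if the label is not a CDP.

    county is None when the label carries no county disambiguator.
    """
    match = CDP_NAME.match(label)
    if match is None:
        return None
    return match.group("name"), match.group("county"), match.group("state")


for label in (
    "Woodlawn CDP (Baltimore County), Maryland",
    "Woodlawn CDP (Prince George’s County), Maryland",
):
    print(parse_cdp_name(label))
```

Keying the lookup on the (name, county) pair rather than the name alone would keep the two Woodlawn entries distinct.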

Why do I have the feeling that as Brian does this, he (and the rest of us, by osmosis of using his tool) are going to get some serious schooling in many other examples of “what the hell, (fill in the place with its very own boundary quirks)”?

But, (to quote Martha Stewart): “that’s a good thing.”

Here’s another one. Maybe a wikidata expert can help. Bennsville, Maryland is misspelled by the US Census. This is even noted in the article. Is there a wikidata property to say ‘this is often misspelled as’? The flag will remain, since wikidata has the correct spelling but USCB is wrong.

Bennsville [2] (spelled Bensville by the United States Census Bureau[3])

I’ve said this many times in the context of OSM (-US), as it is true and widely acknowledged: there is “what is” and there is “what the Census Bureau says there is.” Now, I realize that sometimes what the Census Bureau says is quite helpful, especially in the (sometimes quite narrow) context in which it states its data, and so what the differences are can be helpful to be pointed out. Even at the granularity of a case-by-case basis, it frequently behooves OSM to carefully ask ourselves “well, so what? that the Census Bureau says (fill in the blank).” And then, there are times and places where Census Bureau data are quite valuable to OSM.

My father served in the US Army and was once told by his commanding officer, “There’s the right way, the wrong way, and the Army way.” OSM data and Census Bureau data (as we compare and consider whether correct or appropriate or not) are kind of like that.